by Pablo de Castro

Hey, it’s been a while since I wrote my last post! We’ve been quite preoccupied with hh → bbbb data analysis tasks and internal CMS presentations. I have several topics I want to talk about here, so I will try to post more regularly in the coming weeks. For today, I want to start a discussion about one important aspect of our job as scientists: software development.

Advice for raw pointer lovers: most segfaults in C++ caused by pointers can be avoided by using RAII, smart pointers and STL containers. No need to use new and delete again!
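To make that advice a bit more concrete, here is a minimal sketch of what it could look like in practice. The `Event` struct and the `make_event` and `sum_pt` helpers are hypothetical names invented for this illustration, not part of ROOT or CMSSW, and the code assumes a C++14 compiler (for `std::make_unique`).

```cpp
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical event record, used only for illustration.
struct Event {
    std::vector<double> jet_pt;  // the vector allocates and frees its own storage
};

// Returns an owning smart pointer: the Event is destroyed automatically
// when the unique_ptr holding it goes out of scope, so no delete is needed.
std::unique_ptr<Event> make_event(std::vector<double> jet_pt) {
    auto event = std::make_unique<Event>();
    event->jet_pt = std::move(jet_pt);
    return event;
}

// Takes a const reference: the caller keeps ownership and nothing is copied.
double sum_pt(const Event& event) {
    return std::accumulate(event.jet_pt.begin(), event.jet_pt.end(), 0.0);
}
```

A caller would write something like `auto event = make_event({30.0, 45.5, 60.0});` followed by `sum_pt(*event)`. There is no `new` or `delete` anywhere, so there is nothing to forget to free and no dangling raw pointer left behind.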

Let’s consider the archetype of a member of this network, which we could define as an experimental particle physicist working at one of the LHC experiments, who is interested in statistical methods and machine learning. We will call him or her Andrea, which is a male name in Italy and a female name in Spain, in order to avoid cumbersome gender-neutral pronouns.

What is Andrea’s daily job like? Andrea spends most of Andrea’s time in front of a computer: replying to emails, attending video meetings, reading papers, putting together some results in a set of slides or a note and sometimes writing blog posts. Andrea is involved in several data analyses, which are not straightforward and benefit from advanced statistical learning techniques.

Andrea is also a proficient programmer in a sense: Andrea learned Fortran and C a while ago and is able to code almost anything using C-style C++ and the ROOT libraries. Andrea knows Andrea’s way around old-school Unix commands and bash scripting; however, modern tools and the Python language do not seem very appealing to Andrea. Andrea works on data analysis code weekly and contributes to the experiment framework every so often.

However, most of Andrea’s code is somewhat monolithic; it is not under version control or continuous integration, and it does not include unit tests. It might also lack proper documentation and be hard for Andrea’s colleagues to comprehend. In other words, Andrea is not a software developer, but a scientist.
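In case unit testing sounds abstract, here is a minimal sketch of the idea. The `transverse_momentum` helper and its test are hypothetical examples made up for this post; a real project would more likely use a dedicated testing framework than a hand-written check like this one.

```cpp
#include <cmath>

// Hypothetical analysis helper: transverse momentum from the px and py components.
double transverse_momentum(double px, double py) {
    return std::hypot(px, py);
}

// A unit test checks one small piece of behaviour against a known answer.
// Returns true only when every check passes.
bool test_transverse_momentum() {
    const double tolerance = 1e-12;
    // A 3-4-5 triangle gives a result we know in advance.
    const bool known_value =
        std::abs(transverse_momentum(3.0, 4.0) - 5.0) < tolerance;
    // Flipping the signs of the components must not change the result.
    const bool sign_invariant =
        std::abs(transverse_momentum(-3.0, 4.0) - transverse_momentum(3.0, -4.0)) < tolerance;
    return known_value && sign_invariant;
}
```

The point is that a check like `test_transverse_momentum()` can be rerun automatically after every change, so a bug introduced later is caught before it reaches a plot.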

Software is an essential component of research for every scientific discipline nowadays. Well-developed and openly accessible scientific software libraries and programs can indeed accelerate scientific progress, especially for large-scale and collaborative projects such as the LHC. For example, in the high energy physics field, a mixture of software packages (e.g. MadGraph, Pythia and Geant) is used for physics-driven simulations of the high energy processes and the interaction of their products with the detectors.

For most data analyses, researchers reuse the data definition format and utilities provided by ROOT. Furthermore, each experiment has its own custom software framework, which integrates many libraries and tools with specific code and configurations. For example, the software framework of CMS, referred to as CMSSW, is the largest scientific software product on GitHub (a popular hosting service for software source code) by number of commits (i.e. changes).

In addition to the experiment framework, each small analysis group usually has some shared code, which is typically what the researchers actually work on most of the time. Therefore, a rather complex software ecosystem is required for carrying out data analysis at the LHC, and that is without even considering the tools used to manage resources, data and jobs on the Grid (distributed computing infrastructure around the world) and local clusters.

CMSSW is the largest scientific project on GitHub by number of commits. While it is a bit messy due to the large number of contributions, good software development practices (version control, fork and pull request model, automated builds and tests) make it manageable!

If most physicists do not follow software development best practices, how is a scientific software ecosystem such as the one described above sustainable?

It turns out that Andrea is not an archetypal member of this network, nor of any LHC experiment. The level of expertise in software development practices within a scientific project varies greatly. Typically, some of the people involved are awesome software developers who are able to integrate and manage code written by others, as well as teach them.

Nevertheless, I really think that being a physicist is not an excuse for not following good programming style and practice when working with others, especially given the large number of learning resources currently available online. I am especially fond of two non-profit projects that focus on providing resources and organizing events to improve computing skills in scientific research. One is led by Software Carpentry and the other is led by Mozilla Science Lab. There you can find some nicely curated lessons on basic software development practices.

I am also thinking about writing a small series of posts/tutorials about some computing tools and practices that make my life easier as a researcher, and which might also be of use to you.

I would like to hear your opinion about this post, scientific software development or any other related topic, so don’t be shy in the comment section!