At the end of last week, the 1st AMVA4NewPhysics Workshop took place in Venice. It was quite useful for coordinating future network activities and for meeting the other ESR fellows and members. The workshop program also included several short scientific talks related to the different work packages. I had a small slot (15 minutes) for a talk myself, titled “Higgs pair production searches: a data scientist’s perspective”.

The aim of my talk was to explain the objective and design of a CMS/ATLAS data analysis without wearing the high energy physicist hat, so that the workshop attendees who work in other fields (e.g. statistics or industry) could get an idea of the type of statistical problems we face in HEP and how we usually handle them.

Looking at things differently!

I thought this approach could also be interesting for the experimental high energy physicists in the room, because looking at the same problem from a different point of view can sometimes provide new insights. Given that we are currently working on it and that it will be a relevant research topic for several other ESRs, I chose the study of Higgs pair production as the main analysis example.

Furthermore, I also wanted to explain two issues that are of particular importance for the hh → bbbb analysis (i.e. jet combinatorics and multijet background estimation) and some methods we developed to deal with them.

It is all about statistical inference, not a classification problem! Image adapted from the rather awesome PhD Comics: The Higgs Boson Explained.

That was a lot of material, but while I was writing the abstract for the workshop, I was confident my objectives for the talk were achievable. However, when I was actually composing it, I found that explaining CMS/ATLAS analyses to people who do not work in the field is not an easy task, especially given the time constraints of the talk.

I was able to put together something for the workshop, but it was not exactly the approach I initially had in mind. Nevertheless, the exercise of trying to describe HEP analyses from a non-domain-specific perspective was stimulating and led me to reconsider the way we do things. Here are some examples:

  • Statistical inference is the ultimate goal of every CMS/ATLAS data analysis. I have not found any counter-example; please tell me if you can think of one.
  • Why do we pose the problem as signal vs background discrimination if, as the previous point suggests, the aim is not to classify events but to infer properties of the underlying model?
  • Some words have HEP-specific meanings and should be used with care when talking to muggles. For example: model, efficiency, pdf, event, process, sample, Monte Carlo, etc.
  • How outdated/improvable are the tools and techniques we use? How do they compare with those used in other fields (e.g. applied statistics, experimental cosmology or machine learning)?

But there are many more. I could write a post on each of those, and the good thing is that I cannot think of a better place for this type of discussion than this blog. Do not hesitate to comment; I am eager to hear your opinion on these matters.

Data Science = Statistics?

Data Science is a buzzword these days, and its meaning depends on who you ask. Some people would say it is just a rebranded name for Statistics. There are even some jokes about it around the Internet:

“Data Science is statistics on a Mac.”

“A data scientist is a statistician who lives in San Francisco.”

Therefore, you might ask why I chose a data scientist’s perspective and not a statistical perspective. For a physicist, claiming a statistician’s perspective would have been weird, so why did a data scientist’s perspective feel right?

I do not have a strong opinion on what Data Science is, in the sense that the exact definition does not matter much. It might be that Statistics is more about studying the techniques used to extract information from data, while Data Science is more about extracting useful information. I like the concept depicted by the following Venn diagram:

Data Science Venn Diagram: an intuitive way to characterize the interdisciplinarity of the field. Created by Drew Conway.

In that paradigm, data science is the field that emerges where statistical knowledge, computing skills and expertise in the domain of the data you are analyzing overlap. It matches well with current experimental research in many scientific fields, including high energy physics. It is also well suited to an increasingly large number of positions in data-driven industries. For an interesting reflection on the interplay between academia and Data Science, I recommend this blog post by Jake Vanderplas.

There is an initiative called Data Science @ LHC that promotes collaboration between the high energy physics community and other data-driven fields (especially machine learning). I had the pleasure of attending a workshop they organized at CERN in November, before I joined this project, and it was quite insightful. You can check out the recorded videos of the talks online. Enjoy!