by AMVA4NewPhysics press office

And here it is: the second scientific deliverable of our network, published essentially at the same time as the first. Deliverable 4.1, titled “Report of the Performance of Algorithms for Data-Driven Background Shape Modeling”, is a report of studies performed by network members operating within Work Package 4, also known as “New Statistical Learning Tools for HEP Analysis”.

The research presented in this document aims at constructing a precise representation of the background processes that affect searches for small signals in hadron collider data. Specifically, we focused on the multijet QCD background, a process whereby proton-proton collisions yield a large number (four or more) of hadronic jets. Since the same signature is relevant to the study of many rare phenomena in high-energy physics at the LHC, and since the absolute rate of background processes is much larger than that of all those potential signals, the precise modeling of the multijet background is a very important tool for us.

In this report we considered three background-estimation methods based entirely on real data, thus removing the need to deal with the limitations of Monte Carlo simulations, which struggle with the complexity of the final state (introducing systematic uncertainties) and with its large production rate (where CPU limits introduce statistical uncertainties). The first one, known as “matrix-based b-tagging rate parametrization”, is not a Statistical Learning algorithm per se, and is only discussed as a reference point and as a historical introduction to the topic; the other two belong to the area of supervised learning.

The report discusses in the greatest detail a method which is entirely new and has been developed specifically for the needs of the AMVA4NewPhysics programme. It is called “hemisphere mixing”, and is based on the construction of an artificial dataset by exploiting the approximate independence of the two final-state partons emitted, in the leading-order approximation, in QCD collisions; these are “two-to-two” processes – either space-like or time-like. At leading order, the two partons are emitted back-to-back in azimuth with equal transverse momenta, thus defining a “transverse thrust axis” which can also be effectively recognized at the detector reconstruction level and used as a reference direction to split the event into two almost independent parts. The intricacies of color radiation, multiple gluon emission, forward and backward evolution, pileup effects, multiple parton scattering, and what have you, all conspire to make a multijet event quite complex to handle. And yet, if one can see the simplicity within the complexity by considering the transverse thrust axis, one can devise a method to exploit it.

The graphs below give a quick visual illustration of the procedure, which consists in creating a library of hemispheres (half events constructed by cutting data events orthogonally to their transverse thrust axis), which is then parsed to create “mixed events”. The mixed sample retains all the interesting kinematical features of the original one, allowing for a very precise and fully data-driven model. I do not venture to explain its working in detail here though, so you will have to download the report if you want to know the details!
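For the curious, the library-and-mixing step can be caricatured in a few lines of Python. This is only a toy sketch under loud assumptions: each hemisphere is summarized by a small vector of made-up variables (here a jet count and a summed transverse momentum), and each hemisphere is replaced by its nearest neighbour in the library taken from a different event. The actual matching variables and rules are those described in the report, not these.

```python
import numpy as np

def mix_events(features, event_ids):
    """Pick, for each hemisphere, the closest library hemisphere from another event.

    features:  (n_hemispheres, n_features) summary variables (hypothetical choice)
    event_ids: (n_hemispheres,) id of the event each hemisphere was cut from
    Returns one library index per hemisphere.
    """
    features = np.asarray(features, dtype=float)
    event_ids = np.asarray(event_ids)
    # Rescale each variable so that none dominates the distance
    scale = features.std(axis=0)
    scale[scale == 0] = 1.0
    z = features / scale
    # Pairwise squared Euclidean distances between all hemispheres
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    # Forbid matches within the same original event (including self-matches)
    d2[event_ids[:, None] == event_ids[None, :]] = np.inf
    return d2.argmin(axis=1)

# Toy library: three events, two hemispheres each (made-up numbers)
library = [[2, 100.0], [2, 95.0],    # event 0
           [2, 101.0], [2, 96.0],    # event 1
           [3, 300.0], [3, 310.0]]   # event 2
ids = [0, 0, 1, 1, 2, 2]
replacement = mix_events(library, ids)
```

A mixed event is then assembled from the two replacement hemispheres found for the two halves of an original event. The brute-force distance matrix is fine for a toy like this one; a real library with many hemispheres would call for a proper nearest-neighbour search structure.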



The method is verified to work with the help of a formal statistical hypothesis testing approach. The test is quite complex, and interesting to read about in itself. The method is found to be quite suitable for the applications we have in mind – namely, the search for Higgs pair production in the four b-quark final state, a process that is also the focus of the other deliverable just released by the network, D1.1 (you can read about that one, and download it here).

Below you can see a comparison of background shapes for some kinematic variables in the original dataset (blue) and the artificial one (in red – almost invisible as it’s completely overlaid!). The densities are compared in the left panels, and the ratio between them is shown in the right panels, highlighting the effectiveness of the model.


Congratulations to all AMVA4NewPhysics members involved! And on to the next reports, which are due at the end of August!