Today I’m very happy as Pablo and I have been able to prove – to ourselves, and to our colleagues in Padova – that a novel technique we devised to model the background from multijet production works.

One of the challenges one faces when searching for small signals in hadron collider experiments is the QCD background. You can get rid of most of it if your signal contains isolated leptons, as isolated muons and electrons are rarely produced in quark and gluon interactions; but if your final state of interest includes only hadronic jets, as is the case in the search for Higgs pairs where each Higgs decays to a b quark-antiquark pair, you have to learn to live with it.

It’s not by chance that the proposal of the AMVA4NewPhysics network contains, among others, the promise of a specific deliverable: a new technique for background modeling. But I must confess that when I wrote that part of the proposal I had no clue how we’d pull off an innovative technique for QCD background modeling! I only knew it had to be done.

When you have to model a background you can usually rely on Monte Carlo simulations. But with QCD it is harder: the processes have huge cross sections, and only a tiny fraction of the simulated events survives the selection. So no matter how much CPU you put to work to simulate your events, you end up with too few good events to do a proper modeling job.
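To give a feeling for the orders of magnitude, here is a back-of-the-envelope sketch in Python. The cross section, luminosity, selection efficiency and number of bins are made-up round numbers chosen for illustration, not values from our analysis, and the function names are just mine.

```python
import math


def events_needed(cross_section_pb: float, luminosity_inv_pb: float) -> float:
    """Events to simulate in order to match the data luminosity: N = sigma * L."""
    return cross_section_pb * luminosity_inv_pb


def relative_error_per_bin(n_generated: float, efficiency: float, n_bins: int) -> float:
    """Poisson relative uncertainty per bin, assuming the selected
    events spread evenly over the bins (a crude simplification)."""
    per_bin = n_generated * efficiency / n_bins
    return 1.0 / math.sqrt(per_bin)


# Hypothetical inputs, only meant to show the scale of the problem.
sigma_pb = 1.0e8      # assumed QCD cross section, in picobarns
lumi_inv_pb = 1.0e4   # assumed integrated luminosity, in inverse picobarns

print("events to simulate:", events_needed(sigma_pb, lumi_inv_pb))
print("relative error per bin:", relative_error_per_bin(1.0e8, 1.0e-4, 400))
```

With these invented inputs you would need a million million simulated events to match the data, while a sample of a hundred million events leaves a few tens of events per bin of a 20 x 20 histogram, hence an uncertainty of about 20% in each bin.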

“What is this modeling you’re talking about, anyway?” could be your thought at this point. Okay, let me explain. Imagine you do some data selection and then you reconstruct the masses of the two Higgs boson candidates, using the measured energies of the hadronic jets you think originated from the Higgs decays.

At that point you create a 2-dimensional graph of mass_1 versus mass_2. Then you want to interpret that distribution by “fitting” it to the sum of background and signal. For the signal, you can rely on a Monte Carlo simulation, but for the background you usually cannot, for the reasons stated above.
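What “fitting to the sum of background and signal” means can be shown with a toy. The sketch below assumes we have a 2D histogram for the data and one template each for signal and background, and extracts the signal fraction with a closed-form least-squares fit. This is a simplification for illustration: a real analysis would typically use a binned likelihood fit, and all the numbers and function names here are invented.

```python
def flatten(hist2d):
    """Turn a 2D histogram (list of rows) into a flat list of bin contents."""
    return [value for row in hist2d for value in row]


def normalize(values):
    """Scale bin contents so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def fit_signal_fraction(data2d, signal2d, background2d):
    """Least-squares estimate of f in: data = (1 - f) * background + f * signal,
    with all three histograms normalized to unit area."""
    d = normalize(flatten(data2d))
    s = normalize(flatten(signal2d))
    b = normalize(flatten(background2d))
    numerator = sum((di - bi) * (si - bi) for di, si, bi in zip(d, s, b))
    denominator = sum((si - bi) ** 2 for si, bi in zip(s, b))
    return numerator / denominator


# Toy 2 x 2 histograms of (mass_1, mass_2); the contents are invented.
background = [[40, 30], [20, 10]]
signal = [[5, 10], [35, 50]]
data = [[33, 26], [23, 18]]  # built as 80% background plus 20% signal

print("fitted signal fraction:", fit_signal_fraction(data, signal, background))
```

The fit can only be as good as the background template fed to it, which is exactly why a poorly populated QCD Monte Carlo sample is a problem.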

The picture on the right shows the 2D distribution of a dataset made up of generator-level QCD plus generator-level HH signal (left) and HH signal alone (right). They are distinguishable, but what is the QCD shape alone? You can use the QCD Monte Carlo to estimate it, but you will have large statistical uncertainties because no QCD Monte Carlo dataset is large enough…

The technique we devised works by cutting each data event into two hemispheres, creating a library of hemispheres. Then any given event can be modeled by mixing and matching random hemispheres from the library that fulfil some constraints: together, they must reproduce the overall kinematical characteristics of the event to be modeled.
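To make the general idea concrete without giving anything away, here is a toy sketch in Python. It is emphatically not our actual procedure: the splitting axis (the direction of the leading jet), the variables used to compare hemispheres (number of jets and summed transverse momentum) and the nearest-match rule are all assumptions made up for this example, as are the three events.

```python
import math


def split_into_hemispheres(jets):
    """Split jets, given as (pt, phi), about the leading jet's direction."""
    axis_phi = max(jets, key=lambda jet: jet[0])[1]
    same = [j for j in jets if math.cos(j[1] - axis_phi) >= 0.0]
    opposite = [j for j in jets if math.cos(j[1] - axis_phi) < 0.0]
    return same, opposite


def summary(hemisphere):
    """Toy description of a hemisphere: (number of jets, summed pt)."""
    return len(hemisphere), sum(pt for pt, _ in hemisphere)


def build_library(events):
    """Collect every hemisphere, remembering which event it came from."""
    library = []
    for event_id, jets in enumerate(events):
        for hemisphere in split_into_hemispheres(jets):
            library.append((event_id, hemisphere))
    return library


def closest_hemisphere(target, library, exclude_event):
    """Pick the library hemisphere from another event with the same
    jet count and the most similar summed pt (None if there is none)."""
    n_target, pt_target = summary(target)
    candidates = [h for eid, h in library
                  if eid != exclude_event and len(h) == n_target]
    if not candidates:
        return None
    return min(candidates, key=lambda h: abs(summary(h)[1] - pt_target))


def model_event(event_id, events, library):
    """Build an artificial event by replacing both hemispheres of a real one."""
    replaced = []
    for hemisphere in split_into_hemispheres(events[event_id]):
        match = closest_hemisphere(hemisphere, library, event_id)
        if match is None:
            return None
        replaced.append(match)
    return replaced[0] + replaced[1]


# Three invented four-jet events, each jet given as (pt, phi).
events = [
    [(100, 0.0), (60, 0.3), (90, 3.0), (50, 2.8)],
    [(110, 1.0), (55, 1.2), (80, -2.0), (58, -2.3)],
    [(95, -1.0), (62, -0.8), (85, 2.2), (50, 2.5)],
]
library = build_library(events)
print(model_event(0, events, library))
```

In this toy the artificial version of the first event is assembled from one hemisphere of the third event and one of the second, so it resembles the original in jet multiplicity and momentum without reusing any of its jets.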

I cannot disclose the details of the procedure here, as I want to first publish it; but it will suffice to show a graph with a nice result. You can see below two distributions. These are the masses of the leading and trailing Higgs, mass_1 and mass_2, for events with 4 b-tags (a requirement that selects a signal-enriched region).

In the graph the blue crosses show the original distributions (the data one wants to model). The blue histogram is instead the fraction of the data due to QCD, and the black histogram is the fraction of the data due to HH production. (In this mock dataset, all made of generator-level Monte Carlo, we have blown up the HH fraction by a large factor, to see it more clearly.) The red histogram is the model made by mixing and matching hemispheres, renormalized to the integral of the QCD distribution alone. It matches it very well!
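For the curious, “renormalized to the integral of the QCD distribution” just means scaling the model histogram so that its total content equals that of the QCD histogram, after which the two shapes can be compared bin by bin. A minimal Python sketch, with invented bin contents and function names of my own choosing:

```python
def scale_to_integral(hist, target_integral):
    """Rescale a histogram so that its bin contents sum to target_integral."""
    total = sum(hist)
    return [value * target_integral / total for value in hist]


# Invented bin contents, for illustration only.
model = [10, 20, 40, 20, 10]   # hemisphere-mixing model
qcd = [21, 39, 82, 40, 18]     # true QCD component

scaled = scale_to_integral(model, sum(qcd))
ratios = [m / q for m, q in zip(scaled, qcd)]

print("scaled model:", scaled)
print("model / QCD per bin:", ratios)
```

Ratios close to one in every bin are what a good shape agreement looks like; the normalization itself carries no information, since it is fixed by construction.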

Being unable to disclose the details, I cannot let this graph awe you as much as it deserves. But suffice it to say that we have solved a very important problem in our analysis!