The search for non-resonant pair production of Higgs bosons, with both Higgs bosons decaying to b-quark pairs, is underway at INFN-PD. We plan to produce a result for the summer conferences.

Perhaps the central task of this study is the understanding and reduction of the QCD background, which is due to multijet production with many b-quark jets in the final state. This background dominates whatever event selection one deploys, as soon as three or four b-tagged jets are required.

Unfortunately, the QCD background is very hard to model effectively: its huge cross section makes it almost impossible to generate enough Monte Carlo events to match the statistics of the data. But we need that simulation, especially in order to train a multivariate (MVA) classifier to discriminate the tiny HH signal from QCD events.

CMS does centrally produce some samples of QCD multijet events, which are made available to analysts. However, the integrated luminosity of the resulting datasets is small if one is interested in low-energy phenomena with b-quarks, as generic QCD production does not single out events with b-jets. And yes, HH production is not such a high-energy phenomenon – competing backgrounds have total transverse energies extending down to 200-300 GeV, where cross sections are huge.

In order to have enough Monte Carlo available for a precise training of an MVA, at INFN-PD we have decided to try and privately produce some QCD events by running only processes that have at least one b-quark in the final state. This makes sense, as those are the processes that contribute to our data after we select three or more b-tagged jets. But the task looks daunting: we estimate that we need more than 100 million events for a meaningful use case, and this means several hundred million seconds of CPU time, as one event takes several seconds to be fully produced.

If you submitted, say, 1000 jobs in parallel to CRAB, the system handling a world-wide grid of computers we use for these tasks, and if all jobs ran smoothly, you’d be looking at a 10-day job. But there are of course dead times – the job of producing a reconstruction-level event requires multiple steps. And then there are hiccups of the system. And of course, the CPU resources are shared with thousands of users around the world, so your 1000 jobs are not always running at high speed.
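The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. This is only an illustration: the post says "several seconds" per event, so the 8.64 s used here is an assumed value, picked because it reproduces the 10-day figure, and the `efficiency` parameter is a hypothetical knob standing in for dead times, hiccups and shared resources.

```python
SECONDS_PER_DAY = 86_400


def cpu_seconds(n_events, sec_per_event):
    """Total CPU time needed to produce n_events, in seconds."""
    return n_events * sec_per_event


def wall_time_days(n_events, sec_per_event, n_jobs, efficiency=1.0):
    """Wall-clock days for n_jobs running in parallel.

    efficiency is a hypothetical fraction (0 < efficiency <= 1) of the
    time the jobs actually spend computing; 1.0 is the ideal case.
    """
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total = cpu_seconds(n_events, sec_per_event)
    return total / (n_jobs * efficiency) / SECONDS_PER_DAY


if __name__ == "__main__":
    n_events = 100_000_000   # the ">100 million events" quoted above
    sec_per_event = 8.64     # ASSUMPTION: one choice of "several seconds"
    n_jobs = 1000

    print(f"CPU time: {cpu_seconds(n_events, sec_per_event):.3g} s")
    print(f"Ideal wall time: {wall_time_days(n_events, sec_per_event, n_jobs):.1f} days")
    # ASSUMPTION: jobs are busy only half of the time
    print(f"At 50% efficiency: {wall_time_days(n_events, sec_per_event, n_jobs, 0.5):.1f} days")
```

Under these assumptions the ideal case comes out at about ten days, and any realistic efficiency below one stretches that figure proportionally.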

Just for fun, below you can see a screenshot of the interface that displays the progress of the submitted batches of jobs. It’s a complex system and I no longer remember how to work with it – the last time I ran Monte Carlo simulations was in 2009! But although I am an old dog, I am still willing to (re-)learn…


All in all, it remains to be seen whether we will be able to pull this off. If we do not manage to get those data, we have other plans (one entails using the new event mixing technique I described in the previous post), but I would be really happy if we got those large datasets as soon as possible…

For the time being, the colleague who is doing all the work is Mia Tosi, my former PhD student, now a research fellow at CERN. She will hopefully teach the rest of us how to submit jobs in the proper way, so that we can parallelize the task of babysitting the jobs and make the production process more effective.