by Fabricio Jiménez

This time I’m going to write about the ways of approaching general searches for new phenomena. I invite you to read my previous post to get some context, if you haven’t already. The essence of such searches is to explore as many signatures as possible, without assuming any model of new physics. But how does one do that?

During my last visit to CERN, I had the chance to meet in person, for the first time, a couple of colleagues from the ATLAS general search group. We discussed, among many other things, the search algorithms used for this kind of analysis. (Simone Amoroso, one of the colleagues I met, presents a concise historical review of general searches in the seventh chapter of his PhD thesis.) In this and the next post, I’ll try to describe two seminal works on the topic, one carried out at the HERA collider (DESY) and the other at the Tevatron (Fermilab).

Before moving on, it’s worth mentioning that the first known attempt to perform a general search happened in 1998, using data from L3, one of the experiments at the Large Electron-Positron collider (LEP) at CERN. There, distributions of several angular and kinematic variables were explored, and data and simulation were compared by visual inspection or with simple tests such as Kolmogorov-Smirnov or χ². (It’s not my intention to expound on this example, but you can take a look at the note here if you’re interested.)

The common ground

The first feature common to all general searches is the separation of events into exclusive categories. Each category is defined by a signature; for example, the events classified as “one electron and three jets” contain exactly one electron and exactly three jets, no more and no fewer.
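To make "exclusive" concrete, here is a minimal sketch of how events could be sorted into such categories. The object names, the toy events and the `event_class` helper are all made up for illustration; a real analysis would apply identification and kinematic cuts before counting anything.

```python
from collections import Counter

def event_class(objects):
    """Build an exclusive category label from the exact object counts."""
    counts = Counter(objects)
    # Sort by object name so the label does not depend on reconstruction order.
    return " + ".join(f"{n} {name}" for name, n in sorted(counts.items()))

# Hypothetical events, each given as the list of its identified objects.
events = [
    ["electron", "jet", "jet", "jet"],
    ["jet", "electron", "jet", "jet"],
    ["electron", "jet", "jet"],
    ["muon", "jet"],
]

# Each event lands in exactly one category.
classes = Counter(event_class(ev) for ev in events)
```

Because the label is built from exact counts, an event with one electron and two jets can never end up in the "one electron and three jets" category, which is what keeps the categories statistically disjoint.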

An algorithm then searches for discrepancies between the data and the simulation. The search runs automatically over the spectra of chosen variables sensitive to new physics, in every category. But searching in many distributions (remember that we’re talking about hundreds of categories) comes at a price: the chance of a spurious discovery rises, and one has to correct for this effect.
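To get a feeling for the size of this effect, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that all the distributions are statistically independent, which is not true in a real analysis; the function name and the numbers are mine.

```python
def chance_of_spurious_signal(p_local, n_distributions):
    """Probability that at least one of n independent distributions
    shows a fluctuation with local p-value <= p_local."""
    return 1.0 - (1.0 - p_local) ** n_distributions

# A fluctuation with local p-value 0.001 looks rare in a single distribution...
single = chance_of_spurious_signal(0.001, 1)
# ...but is far more likely to show up somewhere among 400 distributions.
many = chance_of_spurious_signal(0.001, 400)
```

With these toy numbers the chance grows from one in a thousand to roughly one in three, which is why a small local p-value alone says little in a general search.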

The H1@HERA way

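H1 was one of the multi-purpose detectors at the HERA electron-proton and positron-proton collider. The algorithm developed by the H1 collaboration searches for deviations between data and simulations in each binned distribution (four per category, in their case). The following estimator,

$$
p = \begin{cases}
A \displaystyle\int_0^\infty db\, G(b; N_{SM}, \delta N_{SM}) \sum_{i=N_{obs}}^{\infty} \frac{e^{-b} b^{i}}{i!} & \text{if } N_{obs} \ge N_{SM} \\[2ex]
A \displaystyle\int_0^\infty db\, G(b; N_{SM}, \delta N_{SM}) \sum_{i=0}^{N_{obs}} \frac{e^{-b} b^{i}}{i!} & \text{if } N_{obs} < N_{SM}
\end{cases}
$$

is used to identify the regions of most interest in the distribution.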

The definition includes the two cases where the number of events predicted in a region (N_SM) is greater or smaller than the number observed (N_obs), that is, a deficit or an excess in the data. The factor A ensures normalization to one. The estimator p is then a convolution of the Poisson probability distribution with a Gaussian pdf (G) with mean N_SM and width δN_SM (i.e. the total systematic uncertainty on the expected number of events). The Poisson and Gaussian functions account for the statistical and systematic uncertainties in p, respectively.
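For the curious, here is a rough numerical sketch of such an estimator, using only the Python standard library. It is my own toy version, not the H1 code: the function names are invented, the integral over the expectation b is done with a simple trapezoid rule, and the Gaussian is cut off at six widths from its mean (and at b = 0). Dividing by the integral of the Gaussian over the same range plays the role of the normalization factor A.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of observing k events, computed in log space."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_tail(n_obs, mean, excess):
    """P(N >= n_obs) for an excess, P(N <= n_obs) for a deficit."""
    if excess:
        below = sum(poisson_pmf(k, mean) for k in range(n_obs))
        return max(0.0, 1.0 - below)
    return min(1.0, sum(poisson_pmf(k, mean) for k in range(n_obs + 1)))

def region_p_value(n_obs, n_sm, delta_n_sm, steps=400):
    """Poisson tail probability averaged over a Gaussian-smeared expectation."""
    excess = n_obs >= n_sm
    if delta_n_sm <= 0.0:
        # No systematic uncertainty: plain Poisson p-value.
        return poisson_tail(n_obs, n_sm, excess)
    # ASSUMPTION: truncate the Gaussian at +-6 widths and at b = 0.
    lo = max(0.0, n_sm - 6.0 * delta_n_sm)
    hi = n_sm + 6.0 * delta_n_sm
    h = (hi - lo) / steps
    numerator = 0.0
    normalization = 0.0
    for i in range(steps + 1):
        b = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end points
        gauss = weight * math.exp(-0.5 * ((b - n_sm) / delta_n_sm) ** 2)
        numerator += gauss * poisson_tail(n_obs, b, excess)
        normalization += gauss
    return numerator / normalization

# Toy region: 12 events observed where 5 are expected.
p_stat_only = region_p_value(12, 5.0, 0.0)
p_with_syst = region_p_value(12, 5.0, 1.0)
```

In this toy example the systematic uncertainty makes the excess less significant: the p-value with the Gaussian smearing comes out larger than the purely statistical one.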

The estimator is interpreted as the p-value which is, as Tommaso has written in a comment on my previous post, the probability of “getting (…) data at least as discrepant with your model as the ones actually obtained.” Therefore, the most interesting region in a distribution is the one having the smallest p-value, p_min. Here is an example:

Distribution of the sum of transverse momentum of the objects, in three categories. Taken from here.

There are two remaining steps to achieve a sensible result. The first is to calculate the probability P that a deviation with a p-value at least as small as p_min could have happened anywhere in the distribution. Again, the lower P, the more interesting the region selected.
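Here is a toy sketch of these two ingredients: the scan for the region with the smallest p-value, and the estimate of P. Several things in it are my own simplifications and not taken from the H1 analysis: a region is any contiguous run of bins, systematic uncertainties are ignored (pure Poisson p-values), P is estimated by fluctuating the expectation bin by bin, and the histogram contents are invented.

```python
import math
import random

def poisson_p(n_obs, mean):
    """Pure-Poisson p-value: upper tail for an excess, lower tail for a deficit."""
    term = math.exp(-mean)   # Poisson probability of 0 events
    below = 0.0              # accumulates P(N < n_obs)
    for k in range(n_obs):
        below += term
        term *= mean / (k + 1)
    # After the loop, term is the probability of exactly n_obs events.
    if n_obs >= mean:
        return max(0.0, 1.0 - below)
    return min(1.0, below + term)

def smallest_region_p(observed, expected):
    """Scan all contiguous bin ranges and return the smallest p-value."""
    best = 1.0
    for start in range(len(observed)):
        n_obs, n_sm = 0, 0.0
        for stop in range(start, len(observed)):
            n_obs += observed[stop]
            n_sm += expected[stop]
            best = min(best, poisson_p(n_obs, n_sm))
    return best

def sample_poisson(rng, mean):
    """Knuth's multiplication method; fine for the small means used here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def global_probability(observed, expected, n_pseudo=500, seed=1):
    """Fraction of fluctuated histograms whose p_min is at least as small."""
    rng = random.Random(seed)
    p_min_data = smallest_region_p(observed, expected)
    hits = 0
    for _ in range(n_pseudo):
        pseudo = [sample_poisson(rng, mu) for mu in expected]
        if smallest_region_p(pseudo, expected) <= p_min_data:
            hits += 1
    return p_min_data, hits / n_pseudo

# Invented histogram: expectation and "data" per bin.
expected = [20.0, 12.0, 7.0, 4.0, 2.0, 1.0]
observed = [22, 10, 8, 9, 5, 1]
p_min_data, P_hat = global_probability(observed, expected)
```

The point of the exercise is that `P_hat` is in general noticeably larger than `p_min_data`: the scan had many regions to choose from, so a small local p-value is easier to come by than it looks.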

The last step is the correction for searching in many categories, which is taken into account by generating pseudo experiments and comparing their values of P with those from the analysis with real data. The figure below shows the comparison in terms of the number of event classes with different values of P.

Comparison between the number of categories obtained with different values of P in real H1 data and in pseudo experiments. Taken from here.

Side note:
A pseudo experiment is a set of hypothetical Monte Carlo-generated histograms following the Standard Model expectation distribution with the same luminosity as the data recorded.
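As a rough illustration of this last comparison, the sketch below counts event classes in intervals of -log10(P), once for a set of invented "data" values and once averaged over many pseudo experiments. Big caveat: instead of running a full scan on Monte Carlo histograms, I simply assume that under the Standard Model each class has a P value drawn uniformly between 0 and 1, independently of the others. All names and numbers are hypothetical.

```python
import math
import random

# Interval edges in -log10(P); the last interval is open-ended.
EDGES = [0.0, 0.5, 1.0, 1.5, 2.0, math.inf]

def count_classes(p_values, edges=EDGES):
    """Count how many event classes fall in each -log10(P) interval."""
    counts = [0] * (len(edges) - 1)
    for p in p_values:
        x = -math.log10(p)
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def pseudo_experiment_P(rng, n_classes):
    # ASSUMPTION: uniform, independent P values under the SM-only hypothesis.
    return [1.0 - rng.random() for _ in range(n_classes)]  # values in (0, 1]

# Invented P values, one per event class, standing in for the real data.
data_P = [0.81, 0.64, 0.47, 0.33, 0.29, 0.18, 0.12, 0.07, 0.05, 0.02]
data_counts = count_classes(data_P)

rng = random.Random(7)
n_pseudo = 5000
mean_counts = [0.0] * (len(EDGES) - 1)
n_as_extreme = 0
for _ in range(n_pseudo):
    pseudo_P = pseudo_experiment_P(rng, len(data_P))
    for i, c in enumerate(count_classes(pseudo_P)):
        mean_counts[i] += c / n_pseudo
    # Does this pseudo experiment contain a class at least as extreme
    # as the most extreme class in the data?
    if min(pseudo_P) <= min(data_P):
        n_as_extreme += 1

fraction_as_extreme = n_as_extreme / n_pseudo
```

Comparing `data_counts` with `mean_counts` interval by interval is the toy analogue of the figure above, and `fraction_as_extreme` tells how often pseudo experiments alone produce a class as striking as the best one in the data.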

This general search was the first to explore all the final states accessible to the experiment, and no significant excesses with respect to the Standard Model were observed. This work has influenced other general searches outside H1, such as the one at CDF and, most notably, recent CMS and ATLAS studies.

That’s all for now. In the post for the second part, I’ll tell you about the other very interesting approach to general searches.