In high-energy physics problems, one usually needs to discriminate the signal of interest – typically a new physical process yet to be discovered – from a background constituted by known processes.


This is a classification problem, where you may use the known features of the signal to distinguish it from backgrounds. A large number of multivariate analysis tools can do that for you – neural networks, boosted decision trees, nearest-neighbor algorithms – collectively called “MVAs”.

An MVA usually returns a single number whose value differs markedly between signal and background. By selecting data with a high value of this output, one can enrich the selected sample in the sought signal.

Classification MVA tools stop there: you obtain a higher signal-to-noise ratio in the selected sample, but you still must apply some other statistical technique – say a likelihood fit – to demonstrate the presence of a signal in the data.
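
To make the fit step concrete, here is a minimal toy sketch of such a likelihood fit (all numbers are made up for illustration: a Gaussian “Higgs-like” peak at 125 on an exponential background, with the signal fraction extracted by maximizing the unbinned likelihood):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, expon

rng = np.random.default_rng(42)

# Toy "invariant mass" sample: 200 signal events (Gaussian peak at 125)
# mixed with 800 background events (falling exponential above 100).
sig = rng.normal(125.0, 2.0, 200)
bkg = 100.0 + rng.exponential(30.0, 800)
data = np.concatenate([sig, bkg])

def nll(f):
    """Negative log-likelihood for signal fraction f of a
    Gaussian(125, 2) plus shifted-Exponential(scale 30) mixture."""
    p_sig = norm.pdf(data, 125.0, 2.0)
    p_bkg = expon.pdf(data, loc=100.0, scale=30.0)
    return -np.sum(np.log(f * p_sig + (1.0 - f) * p_bkg))

res = minimize_scalar(nll, bounds=(1e-3, 1.0 - 1e-3), method="bounded")
print(f"fitted signal fraction: {res.x:.3f}")  # true value is 0.2
```

In a real analysis the fit would of course include systematic uncertainties and a proper background model; the point here is only the shape of the procedure.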

A nagging issue arises when the features used to “train” the MVA, while being good handles for discriminating the signal, make it harder to later distinguish signal from background in your likelihood fit.

A typical example of this embarrassing issue arises when you are looking for a resonance – say, the Higgs boson. You decide that you will extract the signal by a likelihood fit to the invariant mass distribution of the data, after having selected a signal-enriched sample with a cut on the MVA output. If the variables you feed the MVA include the momenta of the Higgs decay products (something you would want to do, as they carry significant discriminating power), the data at high MVA score will “peak” in the invariant mass distribution, even if they are pure background!
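
A small background-only toy can illustrate this sculpting effect. Here the “MVA score” is a hypothetical stand-in for a classifier trained on decay-product momenta: it simply correlates with the invariant mass, which is exactly what such a classifier tends to learn implicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background-only sample: the invariant mass falls smoothly, with no
# signal anywhere.  We mimic an MVA that has implicitly learned the
# mass by giving it a score that rises for masses near the signal
# hypothesis (125), plus some noise.
mass = 100.0 + rng.exponential(40.0, 100_000)
score = np.exp(-0.5 * ((mass - 125.0) / 10.0) ** 2) \
        + 0.1 * rng.normal(size=mass.size)

selected = mass[score > 0.8]

# Before the cut the background falls monotonically; after the cut it
# peaks near 125 even though there is no signal at all.
counts_all, edges = np.histogram(mass, bins=40, range=(100, 200))
counts_sel, _ = np.histogram(selected, bins=40, range=(100, 200))
peak_all = edges[np.argmax(counts_all)]
peak_sel = edges[np.argmax(counts_sel)]
print(f"background peaks at {peak_all:.0f} before the cut, "
      f"near {peak_sel:.0f} after")
```

The selected background now looks resonance-like in exactly the variable you wanted to fit – the worst possible place for a bias.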

A poor man’s solution to this problem is to restrict the set of variables fed to the MVA. However, you clearly see this is sub-optimal: you are not using all the available information in the discrimination step of your analysis.

A better solution is to build a meta-MVA: a tool which does not just try to correctly classify signal and background, providing a discriminating variable. This meta-MVA must know how you intend to fit the selected data, and find the optimal use of the features of the data by maximizing some appropriate score. If this score were the inverse of the relative uncertainty in the signal fraction extracted by the likelihood fit, you would be done: the machine does all the dirty work at once, and you are left with the best possible selected dataset.
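
As a crude sketch of what optimizing such a score could look like: below, the relative uncertainty on the signal yield is approximated by the counting-experiment formula sqrt(S+B)/S – a stand-in for the full likelihood-fit uncertainty – and the selection cut on a toy discriminant is chosen to minimize it. All distributions and yields are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discriminant: signal scores cluster near 1, background near 0.
sig_score = rng.beta(5, 2, 10_000)   # stand-in for an MVA output on signal
bkg_score = rng.beta(2, 5, 100_000)  # ... and on background
n_sig_exp, n_bkg_exp = 100.0, 10_000.0  # expected yields before any cut

def rel_uncertainty(cut):
    """Expected relative uncertainty on the signal yield for a counting
    experiment, sqrt(S + B) / S - a crude proxy for the uncertainty a
    full likelihood fit would return."""
    S = n_sig_exp * np.mean(sig_score > cut)
    B = n_bkg_exp * np.mean(bkg_score > cut)
    return np.sqrt(S + B) / S if S > 0 else np.inf

# Pick the cut that minimizes the (proxy) fit uncertainty rather than
# one that merely maximizes classification accuracy.
cuts = np.linspace(0.0, 0.95, 96)
best = min(cuts, key=rel_uncertainty)
print(f"best cut: {best:.2f}, "
      f"relative uncertainty: {rel_uncertainty(best):.2f}")
```

The real meta-MVA idea goes further: the selection (or the full event weighting) is learned by backpropagating through this figure of merit, rather than scanned by hand as here.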

There are ways to do this. The architecture of the multivariate algorithm becomes significantly more complex, but it is tractable. In particular, one can simplify the “emulation” of the likelihood-fit step in the architecture by finding analytical forms for the relative uncertainty on the fitted signal, using e.g. the Cramér-Rao-Fréchet bound. But I realize this is getting rather technical and maybe also the subject of a different post…
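
For a simple signal-fraction fit the bound is easy to evaluate numerically: the Fisher information of the mixture density gives the minimum achievable relative uncertainty, with no toy fits needed. A sketch, using a made-up Gaussian-plus-exponential mixture:

```python
import numpy as np
from scipy.stats import norm, expon

# Cramer-Rao bound on the signal fraction f of a Gaussian(125, 2)
# plus shifted-Exponential(scale 30) mixture, evaluated numerically
# on a grid (the tail above 300 is negligible).
x = np.linspace(100.0, 300.0, 20_001)
dx = x[1] - x[0]
p_sig = norm.pdf(x, 125.0, 2.0)
p_bkg = expon.pdf(x, loc=100.0, scale=30.0)

def rel_uncertainty_bound(f, n_events):
    """Minimum achievable relative uncertainty on f for n_events,
    from the per-event Fisher information of the mixture density:
    I(f) = integral of (p_sig - p_bkg)^2 / mixture."""
    mix = f * p_sig + (1.0 - f) * p_bkg
    fisher = np.sum((p_sig - p_bkg) ** 2 / mix) * dx
    return 1.0 / (f * np.sqrt(n_events * fisher))

print(f"bound on sigma(f)/f: {rel_uncertainty_bound(0.2, 1000):.3f}")
```

An analytical (or cheaply computable) form like this is what makes the “emulated fit” differentiable, and hence usable as a training objective.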

Still, I thought I would write here about this idea today as I believe this is the way to go if we want to improve our statistical learning tools for high-energy physics. The more we think about this issue, the better tools we can eventually put together!