As I am traveling around Europe this week, giving seminars in several places (Hamburg yesterday, Berlin today, and Clermont-Ferrand on Friday), my connectivity is erratic and my ability to follow the development of data analyses and new publications is much reduced. My connection to the world of LHC research continues through email exchanges, though.
One thread I found in my mailbox this morning developed from a question on the use and meaning of the so-called “Brazil Bands”, a feature of commonly used plots that summarize the searches for new phenomena. They have become ubiquitous in the results that particle physicists produce. Not surprisingly, the initiator of the thread is a statistician, not a physicist. Indeed, the two worlds remain miles apart, despite efforts to bridge the communication gap that exists between the two disciplines. We physicists do use techniques and tools borrowed from statisticians, but over the years and through decades of (mis-)practice we have adapted them to our liking, and it now looks as if we speak a different language. Anyway, I should leave to a different post the discussion of the fuzzy and entertaining map that one may draw between the words used in statisticians’ and physicists’ jargon, so let me rather go back to the topic I want to discuss here today: Brazil Bands, indeed.
Testing the null hypothesis
The general setup is that of a search for a new particle or effect in the data, which we can characterize by a “signal strength”. This is zero if the phenomenon does not exist (your “null hypothesis”, according to which the data conform to the physics model you trust and take as a reference in the absence of the new particle), and it is non-zero if the phenomenon does exist (what we may call the “alternate hypothesis”). The usual example is the cross section for a particle production process, in other words the “rate” of signal events, or quantities related to it.
In our search we want to determine if the data are compatible with the absence of a new physics signal (i.e., if the estimated production rate of that signal is compatible with zero) or if instead they suggest the presence of the signal. What we do in the two cases is different: in the former case our task becomes that of putting an “upper limit” on the signal strength; in the latter, more exciting instance we instead need to quantify it, by placing a confidence interval around our measured value. Here we concentrate on the case of setting an upper limit on signal strength, as the technology to do that is a bit different, although Brazil Bands are also used in the other case.
Usually our estimate of the effect we are trying to measure depends on a nuisance parameter – an unknown quantity that the considered new physics model does not specify a priori. The standard example of this is the mass of the new particle: our search results – the upper limits we obtain on the signal rate, given some analysis and a set of data – depend on the hypothetical value of the particle mass, so we need to determine them as a function of it.
To keep the abstraction level of this post from diverging, it is a good idea to look at one such result. This will hopefully clarify the ingredients I have been discussing so far. The graph below shows the upper limit on the rate of a particular decay of a hypothetical new particle (it is irrelevant what that is here, so please do not ask) as a function of its mass. The black line shows the calculated limit, which as you can see does depend on the assumed particle mass; in fact it decreases with it (this is entirely accidental and might be different in other searches). And then there is a dashed line, and a green and yellow band around it. That, you guessed correctly, is the Brazil Band I wish to discuss today.
First let us just look at the graph: there is a huge amount of information here. You can see how the calculated upper limit (I will also call it “observed limit” from now on) fluctuates a little, while the other line, which I will call “expected limit”, is smoother. It looks as if the observed limit dances around the expected one as the mass goes from low to high values. This is good, as I will explain later. And also, it looks as if the observed limit stays within the yellow band, although it does depart from the green band here and there. We will call the yellow band “two-sigma expected band” and the green one “one-sigma expected band” in the following.
What’s an expected limit anyway?
What is an expected limit? It is the result of assuming that there is no new particle in the data, AND that the data behave as expected by our baseline model (the null hypothesis). When I say “behave as expected”, I have a precise meaning in mind, but spelling it out requires some more groundwork. What we can note already is that the full and dashed black curves being more or less overlaid is a “feel good” message from the graph – it tells the user that indeed, not only do the data conform to the null hypothesis, but also that they bear no surprise in any way.
Since we are physicists, we like to resort to qualitative reasoning for starters, but when we do business we always need to turn to quantitative mode. So rather than a “feel good” statement about how well the observed data conform to the expected ones, we need to quantify – if only visually – how much they do so. Enter the bands. They are obtained by considering separately each value of the nuisance parameter (the particle mass), so we can describe how their boundaries are determined by considering one particular mass value.
Imagine, for instance, that we expect to observe 100 events of the considered kind in the data we selected when we searched for a particle of mass M (our selection may be different for different mass values). In a real-life experiment, statistical fluctuations may make that number become 90, or 110, or anything thereabouts. The distribution of the possible number of events obtained in a real experiment determines how strong or weak our upper limit on the particle decay rate will be: if we see more than 100 events our limit is weak (we have an excess over the prediction of the “null hypothesis”, so our best guess of the number of events originating from the new process is larger than zero!). If we see fewer than 100 events, conversely, the limit is stronger. How do we go from that to the green and yellow bands?
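To make the link between the observed count and the strength of the limit concrete, here is a minimal sketch of a single-bin counting experiment with 100 expected background events. It is only an illustration under stated assumptions: the post does not say which limit-setting recipe is used, so the sketch adopts the CLs ratio (one common choice in LHC searches), ignores systematic uncertainties, and the function names `poisson_cdf` and `upper_limit` are made up for the example.

```python
import math


def poisson_cdf(n, mu):
    """P(N <= n) for a Poisson variable with mean mu > 0."""
    return sum(
        math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
        for k in range(n + 1)
    )


def upper_limit(n_obs, b, alpha=0.05):
    """95% CL upper limit on the signal yield s (CLs recipe, an assumption).

    Solves P(N <= n_obs | b + s) / P(N <= n_obs | b) = alpha by bisection.
    """
    denom = poisson_cdf(n_obs, b)

    def cls(s):
        return poisson_cdf(n_obs, b + s) / denom

    lo, hi = 0.0, 10.0
    while cls(hi) > alpha:      # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(50):         # cls(s) decreases with s, so bisection works
        mid = 0.5 * (lo + hi)
        if cls(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    background = 100.0          # events expected under the null hypothesis
    for n_obs in (90, 100, 110):
        limit = upper_limit(n_obs, background)
        print(f"observed {n_obs:3d} events -> s < {limit:.1f} at 95% CL")
```

Running it shows the limit on the signal yield growing with the observed count: a deficit gives a tighter limit and an excess a looser one.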
The bands are computed by performing “pseudo-experiments” with a computer simulation. The simulation draws many datasets generated from the null hypothesis, allowing for statistical fluctuations as well as fluctuations of the systematic uncertainties. This produces a large set of virtual upper limits for the quantity of interest: a distribution which informs us about the relative probability of the different outcomes of our experiment. By considering the 2.5%, the 16%, the 84%, and the 97.5% quantiles of that distribution we thus obtain the -2sigma, -1sigma, +1sigma, and +2sigma boundaries of the “expected limit”, for the considered mass value M. Of course, the median (the 50% quantile) determines the value of the expected limit itself – the dashed curve in the graph above. By repeating the calculation for all the relevant mass values, we get curves which are then pictured as green and yellow colored bands. The fact that the black curve is more or less contained within the yellow band is a more quantitative, if still imperfect, way of conveying the message: there is no departure of the data from the expectation for any hypothesized mass of the sought particle.
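The construction can be sketched in a few lines for one mass point. This is a toy, not the procedure of any actual experiment: it assumes a single-bin counting search with 100 expected background events, a hypothetical 5% Gaussian uncertainty on that background as the only systematic, and the CLs recipe for each limit (the post does not specify one). All function names are invented for the illustration.

```python
import math
import random
from functools import lru_cache


def poisson_cdf(n, mu):
    """P(N <= n) for a Poisson variable with mean mu > 0."""
    return sum(
        math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
        for k in range(n + 1)
    )


@lru_cache(maxsize=None)        # many toys share the same count: compute once
def upper_limit(n_obs, b, alpha=0.05):
    """CLs upper limit on the signal yield (the recipe is an assumption)."""
    denom = poisson_cdf(n_obs, b)
    lo, hi = 0.0, 10.0
    while poisson_cdf(n_obs, b + hi) / denom > alpha:
        hi *= 2.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_obs, b + mid) / denom > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_poisson(mu, rng):
    """Knuth's multiplication method; adequate for means of order 100."""
    threshold = math.exp(-mu)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def expected_band(b, rel_syst=0.05, n_toys=2000, seed=1):
    """Quantiles of the limit distribution under the null hypothesis."""
    rng = random.Random(seed)
    limits = []
    for _ in range(n_toys):
        # Fluctuate the systematic: the true background differs per toy.
        b_true = max(rng.gauss(b, rel_syst * b), 1e-9)
        # Fluctuate the statistics: draw the observed count for this toy.
        n_toy = sample_poisson(b_true, rng)
        # Each toy is analysed with the nominal background prediction.
        limits.append(upper_limit(n_toy, b))
    limits.sort()

    def quantile(q):
        return limits[int(round(q * (len(limits) - 1)))]

    names = ("-2sigma", "-1sigma", "median", "+1sigma", "+2sigma")
    probs = (0.025, 0.16, 0.50, 0.84, 0.975)
    return {name: quantile(q) for name, q in zip(names, probs)}


if __name__ == "__main__":
    for name, value in expected_band(100.0).items():
        print(f"{name:>8}: s < {value:.1f}")
```

Repeating `expected_band` for each hypothesized mass, with the background appropriate to that mass point, would give the five curves that delimit the green and yellow bands and the dashed median.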
Notes for the Advanced User
There are a number of considerations one can make by considering the details of the construction sketched above. I will briefly mention them below.
1) The searches done at different mass values are usually highly correlated with one another (e.g., you have only one dataset and you draw inferences on the presence of the particle at different mass values without changing your selection), but they might also be uncorrelated (if you consider entirely different data for different mass points). This has an effect on how much you should expect the observed limit to move in and out of the 1- and 2-sigma bands. In the limit of full correlation, the observed limit will more or less stay at the same quantile throughout the mass spectrum.
2) The vertical extension of the expected bands tells the user some detail on the search, i.e. how much the result is dependent on statistical fluctuations, for instance. But there is a fine print here: sometimes the pseudoexperiments that determine the quantiles of the expected limit are obtained by considering background predictions obtained a priori (i.e. before looking at the data), other times they are obtained a posteriori (i.e. by “conditioning” to the actual estimate of the expected background, or other relevant systematic uncertainty that the data does constrain). The issue of “pre-fit versus post-fit expectations” is not for this blog, however, as there is a sea of subtleties I cannot tackle in this space.
3) Can the agreement between observed limit and expected bands be taken as a goodness-of-fit measure? This was the original question of the thread I mentioned above. Indeed, it can, to some extent, but with the caveat that the user very seldom has a full grasp of the amount of correlation between the different determinations of the limit at different mass values. Only in the case of completely uncorrelated determinations can one eyeball how often the observed limit stays within the 1-sigma band, and compare that fraction to 68% (the expected fraction of agreements!).
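For the idealized case of note 3, the eyeball test can be given a rough scale. The sketch below assumes fully independent mass points, each with exactly a 68% chance of landing inside the 1-sigma band, so that the number of points inside follows a binomial distribution. The choice of 20 mass points is purely hypothetical.

```python
import math


def prob_inside_count(k, n_points, p_inside=0.68):
    """Binomial probability that exactly k of n_points fall inside the band."""
    return (
        math.comb(n_points, k)
        * p_inside**k
        * (1.0 - p_inside) ** (n_points - k)
    )


def inside_summary(n_points, p_inside=0.68):
    """Mean and standard deviation of the number of points inside the band."""
    mean = n_points * p_inside
    spread = math.sqrt(n_points * p_inside * (1.0 - p_inside))
    return mean, spread


if __name__ == "__main__":
    n_points = 20               # hypothetical number of independent mass points
    mean, spread = inside_summary(n_points)
    print(f"expect {mean:.1f} +- {spread:.1f} of {n_points} points inside")
    for k in range(n_points + 1):
        print(f"{k:2d} inside: probability {prob_inside_count(k, n_points):.4f}")
```

Even in this best case the count fluctuates by a couple of points around its mean, so a modest departure from 68% says little by itself. With correlated mass points the binomial picture does not apply at all.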
There would be more to say about this topic (one hint: there are subtleties in how you generate your pseudo-experiments when calculating the distribution of expected limits, as you need, for example, to vary the value of each systematic source every time), but I think I will stop here for today. You can however ask questions on this topic in the thread below…