by Tommaso Dorigo

Yesterday, October 20, was the international day of Statistics. I took inspiration from it to select a clip from chapter 7 of my book, "Anomaly! Collider Physics and the Quest for New Phenomena at Fermilab", which attempts to explain how physicists use the concept of statistical significance to give a quantitative meaning to their measurements of new effects. I hope you will enjoy it.

---

As we near the discussion of the discovery of the top quark, we need to make a digression to explain an important concept used by particle physicists to measure the level of surprise of an observation, i.e., how much the data are at odds with a hypothesis. In a nutshell, the statistical significance of an observed effect (usually expressed as a “number of sigma”) is a function of the estimated probability of seeing an effect at least as surprising in the data. Z-values (as significances are sometimes called) and p-values (as statisticians name those probabilities) are thus connected by a mathematical map, a one-to-one correspondence. As you are more familiar with the concept of probability, the way that map is constructed can be better explained by starting with an example involving probability estimates.

You and your friend Peter go to a Casino one evening, and quickly lose track of one another, as you join a table of Craps, while he moves to a Roulette table. Your game that evening turns out to be a quite extraordinary one, as at some point you throw 10 naturals in a row. At the game of Craps, a natural occurs when the two dice show a combination among the set (1,6), (2,5), (3,4), or (5,6). A streak of 10 naturals is really uncommon, and as you later start looking for Peter, you cannot wait to report your feat. When you finally find him, Peter looks even more excited than you, and he is the first to explain what happened to him: at his Roulette table, he observed a sequence of 20 rouges (omitting the times when the neutral zero came out). Once you also report your experience, a nagging question unavoidably arises: should the sequence he witnessed be regarded as more, equally, or rather less surprising than yours?

Admittedly, the surprise caused by an empirical observation can often be quite subjective. If, on the next day, you and Peter join a group of friends and tell them your Casino adventures, their answers to the question on the relative surprise level of the two observations might come in roughly equal proportions. Yet, there is of course only one right answer; the problem is mathematically well defined and can be solved by numerically comparing the probabilities of the two sequences of events. Since the throwing of the dice at the Craps table and the landing of the ball on the Roulette wheel may be supposed to be random processes, and each observed outcome should be independent of all others, then the probability of sequences of events is obtained in both cases as a simple product of relative frequencies. A natural occurs on average twice every nine trials, so the probability of the sequence of 10 naturals must be (2/9) to the 10th power, which is three-tenths of a millionth. For the Roulette sequence, the probability is instead (1/2) to the 20th power, which is just a bit less than one-millionth. Both series are thus quite unlikely, but in quantitative terms the sequence of dice throws is three times more unlikely than the sequence of rouges. If we were Casino managers in search of indicia of fraud, we would have to consider the dice throws a stronger notitia criminis of something awry than the Roulette results.
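The arithmetic of the two streaks can be checked with a few lines of Python. This is only a sketch under the same assumptions made in the text: fair dice, independent trials, and a Roulette wheel where zeros are ignored so that rouge comes out half of the time. The variable names are my own choice for the illustration.

```python
from fractions import Fraction

# A natural is one of the four combinations (1,6), (2,5), (3,4), (5,6).
# Each can occur in two orders, so 8 of the 36 ordered dice pairs qualify.
p_natural = Fraction(8, 36)          # reduces to 2/9

# Probability of rouge on one spin, with zeros omitted as in the text.
p_rouge = Fraction(1, 2)

# Independent trials: the probability of a streak is a simple product.
p_craps = p_natural ** 10            # ten naturals in a row
p_roulette = p_rouge ** 20           # twenty rouges in a row

print(f"10 naturals: {float(p_craps):.2e}")
print(f"20 rouges:   {float(p_roulette):.2e}")
print(f"ratio:       {float(p_roulette / p_craps):.2f}")
```

The first number comes out at roughly 2.9 x 10^-7 and the second at roughly 9.5 x 10^-7, so the ratio is a bit above three, in line with the statement that the dice streak is about three times more unlikely.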

The above example indicates that very uncommon sequences of events force us to deal with very small p-value estimates. Now, there is no problem when one writes a small number in a formula: scientific notation comes to the rescue, and powers of 10 elegantly substitute cluttering strings of zeros. So we succinctly write 5 x 10^-9 in place of 0.000000005, for instance. Still, when one talks of a small number, one prefers to avoid spelling out those exponents. When they have to deal with very large or very small numbers, scientists in fact introduce multiples or submultiples: this usually calls for the use of the prefixes kilo-, Mega-, Giga-, Tera- or milli-, micro-, nano-, pico-, and so on. For p-values, however, the standard conversion does not involve powers of 10, but a slightly more complicated recipe based on the size of the area under the tail of a Gaussian function.

The Gaussian function, also called bell curve, is a cornerstone in statistics theory: it describes the typical probability distribution of measurement uncertainties. If you measure a physical quantity with an accurate instrument, the values you obtain by repeating the measurement usually scatter around the true value by following a Gaussian probability distribution, which characteristically predicts that larger and larger deviations are increasingly less frequent. The width of that Gaussian curve is proportional to the parameter called sigma, indicated by the Greek letter σ. For a Gaussian measurement, the sigma parameter coincides with the standard deviation; a smaller sigma corresponds to a more precise instrument, i.e., one which returns an answer within a smaller interval. In fact, the standard deviation of a set of data is a measure of their typical spread around their mean.
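To see what "scatter around the true value" means in practice, one can simulate repeated measurements. The sketch below is purely illustrative: the true value, the sigma of the instrument, the number of readings and the random seed are all invented for the example.

```python
import random

random.seed(12345)                   # arbitrary seed, only for repeatability
true_value, sigma = 10.0, 0.5        # hypothetical quantity and instrument precision

# Repeat the "measurement" many times with Gaussian uncertainty.
readings = [random.gauss(true_value, sigma) for _ in range(100_000)]

def fraction_within(k):
    """Fraction of readings closer than k sigma to the true value."""
    near = sum(abs(x - true_value) < k * sigma for x in readings)
    return near / len(readings)

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.3f}")
```

With this many readings the three fractions should land close to 0.68, 0.95 and 0.997: larger and larger deviations do become increasingly rare.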

The conversion of a probability into the more manageable number of standard deviations – i.e., sigma units, or Z-values if you prefer – is performed by finding how far out in the tail of a Gaussian distribution, in units of sigma, one has to go in order that the area of the distribution from that point to infinity be equal to the stated probability. As a result of this procedure, a 15.9% probability can be seen to be equivalent to a one-standard deviation effect (boh-ring!); a 0.135% probability is a three-standard deviation effect (hmm, exciting!); and a 0.000029% probability corresponds to five standard deviations, five sigma (woo-hoo! a discovery!). Those numbers no doubt appear meaningless to you at first sight; yet, once you get familiar with the above correspondence, you will find it easy to assess statistical effects using Z-values. To see how, let me give one further example.
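The map between p-values and Z-values can be written down in a few lines using only the Python standard library. This is a minimal sketch of the one-tailed convention described above (area of a unit Gaussian from Z to infinity); the function names `p_to_z` and `z_to_p` are my own.

```python
from math import erfc, sqrt
from statistics import NormalDist

def z_to_p(z):
    """One-tailed area of the unit Gaussian from z to +infinity."""
    return 0.5 * erfc(z / sqrt(2))

def p_to_z(p):
    """Z such that the Gaussian area from Z to +infinity equals p (0 < p < 1)."""
    # By symmetry, the upper-tail point is minus the lower-tail quantile.
    return -NormalDist().inv_cdf(p)

for z in (1, 3, 5):
    p = z_to_p(z)
    print(f"Z = {z}: p = {p:.3g}, converted back: Z = {p_to_z(p):.3f}")
```

For Z = 1, 3 and 5 the tail areas are about 0.159, 0.00135 and 2.87 x 10^-7 respectively, and converting each of them back returns the Z-value one started from, which is what "one-to-one correspondence" means here.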

When you report a deviation of an estimate from the reference value of a quantity you are measuring, you are accustomed to employ a unit appropriate to the measurement you are performing. So, for instance, for your body weight, you could use the kilogram. Upon stepping on a scale you might utter: “I expected to weigh 80 kg, but according to this goddamn thing it’s 82!” Note how this sentence is not very informative from a scientific standpoint, as it says nothing about the intrinsic precision of the instrument. Because of that, we cannot conclude that you have gained weight, as that conclusion depends on whether your scale measures weights with a standard deviation of, say, 2 kg, or rather 0.5 kg. In the former case, there is a sizable chance that your weight is actually 80 kg and the scale happened to report a measurement one standard deviation higher than the true value; while in the latter case, you have almost certainly been delusional about your body weight.

A different way to report your measurement, one immediately providing information on the significance of the discrepancy, is to express the deviation as a Z-value. This is counted in units of sigma of the Gaussian distribution which corresponds to the precision of the instrument. Again, if sigma were 2 kg, which is exactly the difference between 82 and 80 kg, then Z would equal 1 and you would coolly say “My weight measurement today returned a value one sigma higher than I expected,” which a statistics-savvy listener would not take as evidence for a weight gain. In fact, a statistical fluctuation of the measurement is a reasonable explanation of the discrepancy: a one-sigma or larger departure, in either direction, happens by mere chance 31.7% of the time, without the need for the true value of the measured quantity to have varied. If sigma equaled instead 0.5 kg, then 82 and 80 kg would be four sigma apart, so Z would equal 4; this could now be taken as strong indication that you have gained weight, as a four-sigma or larger fluctuation in either direction happens only 0.0063% of the time for a Gaussian measurement. I hope the above example allows you to appreciate the usefulness of relating deviations of a measurement from an expected value in sigma units.
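The bathroom-scale example can be reproduced numerically. The sketch below assumes a Gaussian measurement and counts fluctuations in either direction, i.e., it uses the two-sided tail area; the helper names `z_value` and `two_sided_p` are invented for the illustration.

```python
from math import erfc, sqrt

def z_value(measured, expected, sigma):
    """Deviation of a measurement from the expectation, in units of sigma."""
    return (measured - expected) / sigma

def two_sided_p(z):
    """Chance of a Gaussian fluctuation of |z| sigma or more, up or down."""
    return erfc(abs(z) / sqrt(2))

# Same reading (82 kg against an expected 80 kg), two hypothetical scales.
for sigma in (2.0, 0.5):
    z = z_value(82.0, 80.0, sigma)
    print(f"sigma = {sigma} kg: Z = {z:.0f}, chance = {100 * two_sided_p(z):.4f}%")
```

The same 2 kg discrepancy thus turns into Z = 1 (a fluctuation expected about 32% of the time) on the imprecise scale, and into Z = 4 (about 0.006% of the time) on the precise one.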