by Andrea Giammanco

In the first episode of this series of posts I tried to convey the simplicity of the unfolding problem; in the second I warned against the dire consequences of treating it a bit too simply.

This third post is about a few of the methods most popular in experimental HEP (although they usually originated outside of HEP) to address the problem.

First of all, a reminder of the problem that we are trying to solve:

\displaystyle M\times\vec{t}=\vec{x}

where \vec{t} is The Truth that we seek to infer, while \vec{x} is the histogram of actual data that we observe. The latter is affected by all kinds of biases and several stochastic processes (statistical noise, finite resolution, uncertainty on the knowledge of the background that we need to subtract from the data). M is the “migration matrix” that connects one to the other, i.e., the element M_{ij} of this matrix is the probability to observe an event in bin number i if its true value belongs to bin number j. The migration matrix is obtained from simulated events, based on some model of the true distribution.

(Here for simplicity, differently from the formulas used in the previous two posts, I am calling M what was called A \times M there, and I am calling \vec{x} what used to be \vec{x}-\vec{b}. I hope that this will make the reading a bit less heavy for you and carpal tunnel syndrome a bit less likely for me.)
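To make this concrete, here is a minimal numpy sketch of how such a migration matrix can be built from simulated events: you histogram the (reconstructed, true) pairs in two dimensions and normalize per truth bin. The toy model, the resolution, and the binning below are of course purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy simulation: true values from some model, reconstructed values obtained by
# smearing with a Gaussian "detector resolution" (all numbers are illustrative).
n_events = 100_000
true_vals = rng.exponential(scale=2.0, size=n_events)
reco_vals = true_vals + rng.normal(0.0, 0.8, size=n_events)

edges = np.linspace(0.0, 10.0, 11)  # same 10 bins in truth and smeared space

# counts[i, j] = number of events reconstructed in bin i whose true value is in bin j
counts, _, _ = np.histogram2d(reco_vals, true_vals, bins=[edges, edges])

# Normalize by the number of generated events in each truth bin, so that
# M[i, j] = P(observed in bin i | true value in bin j). Columns can sum to less
# than one: that is the acceptance which, in the notation of this post, is
# absorbed into M.
n_true, _ = np.histogram(true_vals, bins=edges)
M = counts / np.where(n_true > 0, n_true, 1)
```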

Given that in many problems the migration matrix is not a square matrix, and therefore inversion is not possible (no matter how ill- or well-behaved that matrix is), usually you don’t really perform a matrix inversion but rather a fit. It can be a least squares fit or a maximum likelihood fit (usually the latter); for the purpose of this post it doesn’t matter.

The point is that you want to minimize the distance between M\times\vec{t} and \vec{x}, in some metric.
In a compact and generic form, you therefore want to find the vector \vec{t}_{unfold} that minimizes this quantity:

\displaystyle \parallel M\times\vec{t}_{unfold}-\vec{x} \parallel^2

where the symbol \parallel \bullet \parallel indicates the “distance” in the metric of your choice.

It sounds quite different from inverting a matrix, but in fact it is not. Matrix inversion is guaranteed to give you the maximum likelihood estimate when the matrix is indeed invertible and all elements of the resulting unfolded vector are positive; see slide 17 of Mikael’s slides. The same conditions that make a matrix well- or ill-behaved make the fit results stable or unstable.
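To make the equivalence concrete, here is a small numpy sketch with a made-up, rather non-diagonal migration matrix and made-up counts: the least-squares solution and the plain matrix inverse coincide when M is square and invertible, and both can fluctuate wildly when M is ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small, rather non-diagonal migration matrix (illustrative numbers only);
# each column gives P(observed in bin i | true in bin j) and sums to at most one.
M = np.array([
    [0.40, 0.30, 0.05, 0.00],
    [0.30, 0.35, 0.30, 0.05],
    [0.05, 0.30, 0.35, 0.30],
    [0.00, 0.05, 0.30, 0.40],
])

t_true = np.array([1000.0, 2000.0, 2000.0, 1000.0])
x_observed = rng.poisson(M @ t_true).astype(float)  # smearing + statistical noise

# "Unfolding" as a fit: find t that minimizes ||M t - x||^2 ...
t_fit, *_ = np.linalg.lstsq(M, x_observed, rcond=None)

# ... which, for a square invertible M, is the same as inverting the matrix.
t_inv = np.linalg.solve(M, x_observed)

print(t_fit)  # the two agree, and both can oscillate around t_true
print(t_inv)  # when M is ill-conditioned
```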

But by approaching the problem as a maximum likelihood fit, we can see what happens when we introduce some additional term (a “regularization term”) into the likelihood. This is what is done by the methods based on Tikhonov regularization.

[Image: Andrey Nikolayevich Tikhonov]

Side note:
As often happens in science, the same method has been discovered several times by different people in different fields.
In this post, and in all my future posts of this blog, I will follow the conventions used in HEP. And in HEP, this methodology is named “Tikhonov regularization”. Wikipedia informs me that to give due credit it should be named Tikhonov-Miller, or Phillips-Twomey, and I am sure that some readers may want to jump at the occasion to smart-ass a bit by informing me of the name the technique has in some other field.
End of side note.

Let’s then write:

\displaystyle \parallel M\times\vec{t}_{unfold}-\vec{x} \parallel^2 + \tau\parallel \Gamma\times\vec{t}_{unfold} \parallel^2

where \tau is a small constant term (small because the extra term is not supposed to dominate the solution) and \Gamma is a matrix that we will try to construct so that it comes to the rescue in cases like the one we saw in my previous post:

[Figure “cowan”: the unfolding example from the previous post; the third plot shows the result of naive matrix inversion.]

In this scheme of things, the name of the game is choosing a reasonable “regularization matrix”.

Let’s look again at the figure above: what is really striking about the third plot?

Striking fact #1: in the third plot the variance of each bin is huge (orders of magnitude larger than the bin contents in the first and second plot), despite the fact that the author applied a maximum likelihood estimator, which is guaranteed to give the smallest possible variance for an unbiased estimator.

That’s an indication that we should give up on the usual wish to avoid (or at least minimize) bias. Maybe, after all, bias is not so bad, if in the end the quadratic sum of bias and standard error is a reasonable number (and if you have ways to estimate the bias and account for that as an additional error component).

We want to trade off bias against variance. This can be done in many ways. Which ways do most people recommend?

One of the popular regularization methods takes \Gamma as the unit matrix (a diagonal matrix whose diagonal is filled with 1’s), so that the penalty term is just proportional to the squared norm of the unfolded vector (this penalizes large absolute values of the bin contents), but this would not help with the specific problem of this particular figure.
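In the simple unweighted least-squares metric, this regularized minimization even has a closed-form solution via the normal equations. Here is a generic sketch (function and variable names are mine) that works for any choice of \Gamma, shown here with the unit matrix.

```python
import numpy as np

def tikhonov_unfold(M, x, Gamma, tau):
    """Minimize ||M t - x||^2 + tau * ||Gamma t||^2 (unweighted least squares).

    The minimum satisfies the normal equations (M^T M + tau Gamma^T Gamma) t = M^T x.
    """
    A = M.T @ M + tau * (Gamma.T @ Gamma)
    return np.linalg.solve(A, M.T @ x)

# With Gamma equal to the unit matrix the penalty is tau * ||t||^2: it tames the
# overall size of the unfolded bin contents, but does nothing specific against the
# bin-to-bin oscillations discussed below.
# Example (M and x_observed as in the earlier toy):
# t_reg = tikhonov_unfold(M, x_observed, np.eye(M.shape[1]), tau=0.1)
```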

Striking fact #2: that histogram is sort of “oscillating”; each bin seems to be anti-correlated with its neighbours. No matter how little you know about statistics, I am sure that the first time you looked at the third plot you understood that “it was wrong”, just because of this funny feature.

The reason for these anti-correlations, in the example at hand, is that the resolution is poorer than the bin width. This means large non-diagonal elements of the migration matrix. We have seen in the previous post that this leads to large numerical corrections from the smeared space to the unfolded space, hence large sensitivity of the unfolding to small variations, hence numerical instability.

Another way to see that, thinking about detector resolution versus bin width, is the following idealized example: if your resolution is such that two smeared-space bins are equally populated by the same unfolded-space bin, any over-correction of the first bin will get compensated by an under-correction of the second bin, and vice versa.

A trivial “solution” is to play with the binning so as to ensure that the migration matrix is well-behaved. But binning is, intrinsically, a loss of information, and the larger the bins the less information you get. Remember (from the incipit of my first post) that the entire point of going differential is that we seek to squeeze more information from the data than just measuring inclusively.

So in practice we always look for a trade-off between power of discrimination between models (the more bins the better) and stability of the unfolded result (the more diagonal the migration matrix the better, hence fewer and wider bins are better).

(Note that there is at least one paper advocating the opposite solution, i.e., making the bin size very small and smoothing only after unfolding. The price to pay is that one has to use more advanced non-parametric density estimation techniques for constructing the transfer function. I am not aware of LHC analyses following this approach yet.)

To avoid striking fact #2, one can regularize based on the “derivatives” of the unfolded distribution. (Well, derivatives of a discrete distribution do not exist, but they can be approximated by finite differences.)

Regularizing on the first derivatives of \vec{t}_{unfold} means minimizing the difference between adjacent bins. That would damp the oscillations of the third plot of the figure above.
But the inversion may have also induced oscillations with lower frequency (i.e., not a wild jump from one bin to the next, but a more subtle oscillating trend), so sometimes one wants to regularize on higher-order derivatives. Regularization on the second derivative means damping the curvature (any local concavity or convexity) of the unfolded distribution, which is usually sufficient.

If you are familiar with electronics, the following analogy may strike a chord with you: most smearing effects act as low-pass filters (they turn narrow features of the signal into broader ones), and therefore the trivial matrix inversion is effectively a high-pass filter, which is known in electronics to amplify noise. Regularization is the damping of the highest-frequency modes.

To implement that in matrix form, you may verify that the following matrix:

\displaystyle \Gamma = \left( \begin{array}{cccccc} +1 & -1 & 0 & 0 & \cdots & \\ 0 & +1 & -1 & 0 & \cdots & \\ 0 & 0 & +1 & -1 & \cdots & \\ & \vdots & & & \ddots & \\ & & & 0 & +1 & -1 \end{array}\right) ,

when inserted in the “penalty term” \parallel \Gamma\times\vec{t}_{unfold} \parallel^2 yields the desired property of damping the differences between neighboring bins.
And “curvature” (related to the difference between a bin and both its neighbors, and correlations between next-to-neighboring bins) is minimized by using:

\displaystyle \Gamma = \left( \begin{array}{cccccc} -1 & +1 & 0 & 0 & \cdots & \\ +1 & -2 & +1 & 0 & \cdots & \\ 0 & +1 & -2 & +1 & \cdots & \\ & \vdots & & & \ddots & \\ & & & +1 & -2 & +1 \\ & & & 0 & +1 & -1 \end{array}\right) .
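In code, these two regularization matrices can be built as follows (a sketch consistent with the matrices above, with my own function names); plugging either of them into the closed-form solver sketched earlier gives the first-derivative or curvature-damped unfolding.

```python
import numpy as np

def first_difference_matrix(n):
    """(n-1) x n matrix whose rows compute t[i] - t[i+1]:
    the penalty ||Gamma t||^2 damps jumps between neighbouring bins."""
    G = np.zeros((n - 1, n))
    for i in range(n - 1):
        G[i, i], G[i, i + 1] = 1.0, -1.0
    return G

def second_difference_matrix(n):
    """n x n matrix approximating the discrete second derivative (with one-sided
    differences at the boundaries, as in the matrix above): the penalty
    ||Gamma t||^2 damps the curvature of the unfolded distribution."""
    G = np.zeros((n, n))
    G[0, 0], G[0, 1] = -1.0, 1.0
    for i in range(1, n - 1):
        G[i, i - 1], G[i, i], G[i, i + 1] = 1.0, -2.0, 1.0
    G[-1, -2], G[-1, -1] = 1.0, -1.0
    return G

# e.g. t_reg = tikhonov_unfold(M, x_observed, second_difference_matrix(M.shape[1]), tau=1e-3)
```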

It all sounds very nice, but it comes with a cost. You are imposing a very reasonable condition by assuming that there are no high-frequency features in the Truth vector, but what if there are high-frequency features that you did not anticipate? They would be just washed away by the unfolding procedure.

One such example is a bump over a smooth distribution, which may originate from an unexpected resonance. The same resonance may also give a peak-dip structure (the distribution rises above the background and then undershoots it, before going back to the baseline; this tends to happen when the process with an intermediate resonance has quantum interference with the Standard Model background process), i.e., exactly the kind of structure that Tikhonov regularization has been designed to wash away.

Note that this is not a dramatic issue, in general: before unfolding, you are not blind to what the smeared-space data are telling you. If a bump or any other striking feature is apparent in the smeared space, you quantify its significance in the smeared space and study the properties of the events in the anomalous region without bothering about unfolding. You unfold when it makes sense to unfold, which will be the topic of the next post of this series.

However, it is true that one always has to carefully tune the value of \tau : if it is too small the regularization is useless, and if it is too large it biases \vec{t}_{unfold} towards the model \vec{t}_{model} , making us blind to the unexpected. Several objective criteria are used to choose the optimal value of \tau , but I will not go into these details unless somebody asks.

The most remarkable feature of Mikael’s regularization method, as explained in his post, is that physics prejudice is explicitly and transparently applied: assumptions may be formulated like “the distribution of process X must, from first principles, be positive, monotonic and convex”, or other similar shape constraints. This may turn out to be a tighter constraint than just avoiding high-frequency features in the distribution, but the bias may be easier to identify, and the statistical coverage properties seem to be better behaved (i.e., your error bars mean what you usually intend them to mean), as Mikael’s work shows.
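Just to illustrate what imposing shape constraints can look like in practice, here is a generic toy with scipy (this is not Mikael’s actual method; the function name, starting point and settings are all mine): a least-squares fit in which the unfolded spectrum is forced to be positive and monotonically falling.

```python
import numpy as np
from scipy.optimize import minimize

def shape_constrained_unfold(M, x):
    """Toy least-squares unfolding with positivity and monotonicity constraints.

    A generic illustration of shape constraints, not Mikael's method: the fitted
    spectrum is forced to be non-negative and non-increasing (convexity could be
    imposed in the same way, via second differences).
    """
    n = M.shape[1]

    def chi2(t):
        r = M @ t - x
        return r @ r

    monotonic = {"type": "ineq", "fun": lambda t: t[:-1] - t[1:]}  # t[i] >= t[i+1]
    t0 = np.full(n, x.sum() / n)  # flat, positive starting point

    res = minimize(chi2, t0, method="SLSQP",
                   bounds=[(0.0, None)] * n,  # positivity
                   constraints=[monotonic])
    return res.x
```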

So far I have elaborated at length on Tikhonov regularization, but that’s just one of the methods in use in HEP. Another popular method is the D’Agostini Iterative method, as proposed here. Like other iterative methods, it requires starting from a reasonable assumption (\vec{t}_{model} ) to get a first estimate \vec{t}_{unfold,~step~0} , then using this as the next reasonable assumption to get a new estimate, and repeating the procedure until (hopefully) convergence is achieved, i.e., until executing further steps has no visible effect on the result.

[Image: Thomas Bayes]

What is specific to the D’Agostini method, compared to other iterative methods, is the use of Bayes’ theorem:

\displaystyle P(t_i | x_j) = \frac{P(x_j | t_i)P(t_i)}{\sum_k P(x_j|t_k)P(t_k)} .

In this formula, P(x_j | t_i) is given by the migration matrix M defined above (it is the probability to observe an event in bin j when its true value belongs to bin i), the prior P(t_i) comes from the current model, and the denominator normalizes the result (the acceptance matrix A also enters the game, through the efficiency of each true bin); the resulting P(t_i | x_j) is what gets applied to the observed data to go back to the truth.

In practice, each step proceeds as follows (a minimal code sketch is given after the list):

  1. Choose the initial distribution \vec{t}_{model} ; this gives the prior probability for each bin i of this vector, P^{step~0}(t_i).
  2. From that, and from the A and M matrices estimated from simulation, build P(t_i|x_j) with the formula above, apply it to the observed data to obtain a first unfolded estimate \vec{t}_{unfold,~step~0} , and fold this estimate back to get the expected event counts bin by bin, \vec{n}^{step~0} .
  3. Use this estimate as the new model: calculate the updated prior P^{step~1}(t_i) and redo the previous step.
  4. Calculate a χ² (or any other measure of distance) between \vec{n}^{step~m} and \vec{n}^{step~m+1} .
  5. Continue until this distance is “small enough”.
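Here is the promised minimal sketch of this kind of iteration with numpy: a simplified toy version of a D’Agostini-style update (without uncertainty propagation or intermediate smoothing), with illustrative names and a fixed number of iterations instead of a proper stopping criterion.

```python
import numpy as np

def iterative_bayes_unfold(M, x, t_prior, n_iter=4):
    """Toy D'Agostini-style iterative unfolding.

    M[i, j]  : P(observed in reco bin i | true value in bin j); columns summing to
               less than one encode the acceptance.
    x        : observed counts per reco bin.
    t_prior  : initial truth model (e.g. the simulated spectrum).
    """
    t = np.asarray(t_prior, dtype=float).copy()
    eff = M.sum(axis=0)  # efficiency (acceptance) per truth bin
    for _ in range(n_iter):
        # Bayes' theorem with the current prior: P(true bin j | reco bin i).
        joint = M * t                                # proportional to P(x_i | t_j) P(t_j)
        denom = joint.sum(axis=1, keepdims=True)     # proportional to P(x_i)
        unfolding_matrix = joint / np.where(denom > 0, denom, 1.0)
        # Unfold the observed counts, correct for efficiency, and use the result
        # as the prior of the next iteration.
        t = (unfolding_matrix.T @ x) / np.where(eff > 0, eff, 1.0)
    return t

# In a real analysis one would stop when the change between successive iterations
# (e.g. a chi2 between folded expectations) is "small enough", as described above.
```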

The exact objective criterion chosen to stop the iterations acts as an implicit “regularization” of the problem. In some variants, intermediate smoothing may also be applied during the iteration.

I will not elaborate here on the relative pros and cons of iterative methods versus Tikhonov regularization, but it is important to point out that both approaches share the same conceptual weakness that makes several people critical of unfolding in general: no matter what you do, your optimal choice (for stopping the iterations, or for regularizing the problem) biases your results towards your initial expectations.

Saying “D’Agostini Iterative” is, again, mere HEP jargon. In astronomy and optics it may be better known as “Lucy-Richardson deconvolution”. It is also common, in HEP papers, to call it “Bayesian unfolding” (because it is based on Bayes’ theorem), but Mikael explains nicely (slide 19 here) that there is nothing Bayesian, in the common sense of the term, about it.

In fact, a truly Bayesian method has been proposed in the literature, in this paper. The author is a HEP experimentalist, a member of the ATLAS collaboration, and in fact some ATLAS results have already been produced with this method (while it doesn’t seem to have been tried in my own collaboration yet).

This method requires no iteration. The role of the regularization is played by the Bayesian prior on the true distribution (a few reasonable choices for it are discussed in the paper). The real outcome of the unfolding is the full posterior probability density, differently from the previously discussed methods, whose outcome is an estimator plus a covariance matrix (whose diagonal elements are the variances of the unfolded bins, while the off-diagonal ones allow one to account for the bin-to-bin correlations).

If more than one answer is equally likely, this method reveals all of them (as multiple maxima in the posterior density), while iterative unfolding converges towards one of the possible solutions (ideally the strongest maximum, although even that is not guaranteed).
Needless to say, with this method too any unexpected features of the data are damped in the unfolded space with respect to the smeared space (there is just no way to avoid that).
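For concreteness, here is a minimal sketch of what “the posterior is the outcome” can mean in practice: a plain Metropolis sampler for a Poisson likelihood with a flat prior truncated at zero. This is a generic toy, not the specific algorithm or priors of the paper, and all names and settings are illustrative.

```python
import numpy as np

def sample_posterior(M, x, n_samples=20_000, step=50.0, seed=0):
    """Toy Metropolis sampling of P(t | x) proportional to Poisson(x | M t),
    with a flat prior on t >= 0.

    Returns an array of posterior samples of the truth vector; marginals, credible
    intervals, and possible multiple modes can be read off from these samples.
    """
    rng = np.random.default_rng(seed)
    n = M.shape[1]

    def log_likelihood(t):
        mu = M @ t
        if np.any(mu <= 0):
            return -np.inf
        return float(np.sum(x * np.log(mu) - mu))  # Poisson log-likelihood, up to constants

    t = np.full(n, x.sum() / n)  # flat, positive starting point
    logp = log_likelihood(t)
    samples = np.empty((n_samples, n))
    for k in range(n_samples):
        proposal = t + rng.normal(0.0, step, size=n)
        if np.all(proposal >= 0):  # the flat prior is zero for negative bin contents
            logp_new = log_likelihood(proposal)
            if np.log(rng.uniform()) < logp_new - logp:
                t, logp = proposal, logp_new
        samples[k] = t
    return samples
```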

In the next episode of this series, where I will use much less mathematics, I will elaborate on when it makes sense, or not, to unfold.

Image credits: