What if you have some data you want to model, but do not know anything about its parent distribution, so you have to make as few assumptions as possible? In this post, I will go through the concept of density estimation and I will play with some interesting non-parametric methods.

As I have already said a few times, the main goal of most experimental high energy physics analyses is statistical inference (e.g. testing hypotheses or setting confidence intervals on model parameters). However, in order to do so, we sometimes have to deal with a different, intermediate statistical problem, which is **estimating probability density functions given some data**.

For example, you might need to model an event selection efficiency as a function of several event observables (e.g. data-driven trigger efficiency estimation) or to reweight a set of events based on a few variables to represent a slightly different model (e.g. corresponding to an anomalous theoretical model). It is very likely, especially when working with dependencies on several variables, that you will not be able to decide on a particular parametrization for the probability density function (e.g. a Gaussian or an exponential), so you have to resort to non-parametric density estimation methods, such as the ones presented in this post.

In general terms, the problem of density estimation given a set of independent and identically distributed (*i.i.d.*) observations consists in estimating the parent probability density function f(x) from which those samples were drawn. The density estimate f̂(x) can then be used for subsequent inference, modelling or visualization purposes.

There are several non-parametric methods that can be used to obtain f̂(x) given the data, and each of them can be tuned with a set of parameters which balance the variance and the bias of the density estimate. Therefore, to decide among all the possible density estimation alternatives and parameters, we have to define quantitatively what a good density estimate is. This is done by defining a *loss function*, for example the mean integrated squared error (MISE), MISE = E[ ∫ (f̂(x) − f(x))² dx ], but other alternatives exist (e.g. the average log-likelihood ratio).

I initially stated that non-parametric density estimation methods are generally used when f(x) is not known, so how can we decide on a method and its parameters if the MISE loss cannot even be computed? A possibility is to trust some asymptotic approximations that exist for some of the available methods, but a more general option is to estimate it from the data itself by cross-validation (check the slides and code at the end of the post for a more detailed example of this).

Another important concept in density estimation is that of *confidence bands* (i.e. assessing the uncertainty on the estimate). This is again quite a challenging problem when f(x) is not known. The variance component of the uncertainty can generally be estimated by bootstrapping techniques, but biases are usually neglected, so coverage is not assured. When the quantification of the uncertainty on the density estimate is relevant for the use case, a wise approach is to select the smoothing parameters so that the errors of f̂(x) come mainly from variance (i.e. under-smoothing), if we are to trust the uncertainty estimation.

So far I have been talking about density estimation methods as abstract entities that allow obtaining f̂(x) given some data. Nevertheless, you have most likely used at some point in your life the simplest of the density estimation techniques: the histogram.

By binning the variable space and counting the number of observations that fall in each bin, we can get a piecewise estimate of the probability density function. In practice, for a fixed bin setting, the expected value of f̂(x) for a histogram is not f(x), but its average over the whole bin. Therefore, f̂(x) will be a biased estimator whenever f(x) varies within the bin.

While the bin width is the main smoothing parameter, f̂(x) will also depend on the bin origin, which is not convenient. In addition, the fact that f̂(x) is not smooth and its derivatives are either zero or infinite renders histograms inadequate for some applications. Last but not least, they scale extremely badly to multidimensional problems, because data sparsity might lead to many empty bins.
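As a minimal sketch of this idea (the sample, the number of bins and the function name are illustrative), a normalized histogram can be evaluated as a piecewise-constant density estimate with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # i.i.d. samples from an "unknown" parent f(x)

# density=True normalizes the counts so that the histogram integrates to one
counts, edges = np.histogram(data, bins=30, density=True)

def hist_density(x):
    """Piecewise-constant histogram estimate f̂(x); zero outside the binned range."""
    idx = np.searchsorted(edges, x, side="right") - 1
    inside = (idx >= 0) & (idx < len(counts))
    out = np.zeros_like(np.asarray(x, dtype=float))
    out[inside] = counts[idx[inside]]
    return out

x = np.array([0.0, 10.0])
print(hist_density(x))  # non-zero near the bulk, zero far in the tails
```

Note that within each bin the estimate is flat, which is exactly the source of the bias discussed above.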

Another conceptually simple method for density estimation is *nearest neighbor density estimation*. Instead of fixing a region of the space and counting the number of observations that fall in it, the density estimate is built from the distance d_k(x) to the *k*-th nearest data point, that is:

f̂(x) = k / (2 n d_k(x)),

where *k* controls the bias-variance trade-off. Drawbacks of this technique include that it is also non-smooth and very sensitive to noise. In addition, the density estimate has to be explicitly normalized by dividing by its integral over the whole domain before use.
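A one-dimensional version of this estimator can be sketched as follows (sample size and *k* are illustrative; the explicit normalization step mentioned above is included via a simple Riemann sum):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=500)
n, k = len(data), 30  # k sets the bias-variance trade-off

def knn_density(x):
    """Unnormalized 1-D k-NN estimate f̂(x) = k / (2 n d_k(x))."""
    d = np.abs(data[None, :] - np.asarray(x, dtype=float)[:, None])
    d_k = np.sort(d, axis=1)[:, k - 1]  # distance to the k-th nearest point
    return k / (2.0 * n * d_k)

# Explicit normalization by the numerical integral over the (truncated) domain
grid = np.linspace(-5, 5, 2001)
norm = knn_density(grid).sum() * (grid[1] - grid[0])

x = np.linspace(-3, 3, 7)
f_hat = knn_density(x)
print(f_hat / norm)  # normalized density estimate at a few points
```

The jaggedness of the raw estimate (it follows every fluctuation of d_k) is visible even at this sample size, which illustrates the noise sensitivity mentioned above.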

Another family of methods are the so-called *kernel density estimation (KDE)* techniques, which consist in building a density estimate by placing a simple function (i.e. a kernel) over each data point, so f̂(x) is constructed as:

f̂(x) = (1 / n h) ∑_{i=1}^{n} K((x − x_i) / h),

where *h* is the smoothing parameter or bandwidth. The only requirements on the kernel function K are that it be non-negative, symmetric around zero, and integrate to one over its whole domain. Naively, a rectangular box of width *2h* can be placed over each point, which is referred to as the *tophat kernel*, but the resulting estimate is still non-smooth and its derivatives remain ill-behaved. If a unit Gaussian or another smooth function is used as the kernel, a smooth and continuous density estimate is obtained. Sometimes it is useful to use a kernel with compact support to obtain finite distribution tails.
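The formula above can be implemented directly in a few lines of numpy; this is only a sketch with an illustrative bandwidth (libraries such as scipy.stats.gaussian_kde or scikit-learn's KernelDensity provide equivalent, more featureful implementations):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=500)
n, h = len(data), 0.3  # h is the bandwidth (smoothing parameter)

def kde(x):
    """Gaussian KDE: f̂(x) = (1/nh) Σ_i K((x − x_i)/h), with K the unit Gaussian."""
    u = (np.asarray(x, dtype=float)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 801)
f_hat = kde(grid)
print(f_hat.max())  # the peak should be roughly 0.4 for a unit Gaussian parent
```

Unlike the histogram, this estimate is smooth, has well-defined derivatives, and does not depend on any bin origin.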

Before finishing this introduction to density estimation techniques, I would like to show an example of how the smoothing parameter can be tuned for any technique by directly using cross-validation on the data. Removing constant terms, the MISE score can be approximated by:

J = ∫ f̂(x)² dx − (2/n) ∑_{i=1}^{n} f̂_{−i}(x_i),

where the first term can be obtained by numerical integration, and the second is the mean of the density estimates f̂_{−i}(x_i) evaluated at the points of each validation split, using the density estimate built from the remaining folds. The optimal smoothing parameters (e.g. histogram bin widths, number of neighbors, or KDE bandwidth) can be chosen by finding the minimum of this score.
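The procedure can be sketched as follows for a Gaussian KDE, assuming a simple grid-based numerical integration and an illustrative 5-fold split:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=400)
grid = np.linspace(-5, 5, 1001)
dx = grid[1] - grid[0]

def kde(points, x, h):
    """Gaussian KDE built from `points`, evaluated at `x`, with bandwidth h."""
    u = (np.asarray(x)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))

def cv_score(h, n_folds=5):
    """Approximate MISE (up to a constant): ∫ f̂² dx − 2·mean of f̂ at held-out points."""
    folds = np.array_split(rng.permutation(data), n_folds)
    score = 0.0
    for i in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        f_train = kde(train, grid, h)
        score += np.sum(f_train**2) * dx - 2.0 * kde(train, folds[i], h).mean()
    return score / n_folds

bandwidths = np.linspace(0.05, 1.0, 20)
scores = [cv_score(h) for h in bandwidths]
h_opt = bandwidths[int(np.argmin(scores))]
print(h_opt)
```

Too small a bandwidth inflates the ∫ f̂² term while failing to predict the held-out points; too large a bandwidth washes out the structure, so the score has a minimum at an intermediate value.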

By using the bootstrap, the point-wise confidence bands of f̂(x) can also be estimated for any density estimation method. However, it is worth remarking again that possible biases are neglected, so coverage is not assured when they are not negligible. Under-smoothing is the way to go if we want to ensure it. In the example above, 500 bootstrap replicas have been generated (in light grey in the left image) and the confidence bands have been obtained from the quantiles at the different values of the variable.
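This bootstrap procedure can be sketched as follows for a Gaussian KDE (sample size, bandwidth and band quantiles are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=300)
grid = np.linspace(-4, 4, 201)
h = 0.4  # fixed bandwidth; note that the bias of f̂ is still neglected

def kde(points, x):
    u = (np.asarray(x)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))

# Resample the data with replacement and re-estimate the density each time
n_boot = 500
replicas = np.empty((n_boot, len(grid)))
for b in range(n_boot):
    resample = rng.choice(data, size=len(data), replace=True)
    replicas[b] = kde(resample, grid)

# Point-wise 68% band from the bootstrap quantiles at each grid value
lo, hi = np.quantile(replicas, [0.16, 0.84], axis=0)
center = kde(data, grid)
print(np.mean(hi - lo))  # average band width
```

The band only reflects the variance of the estimator; any systematic offset of f̂(x) from f(x) is invisible to this procedure, which is why coverage is not guaranteed.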

This review of density estimation is a subset of a presentation I had to do for a Statistics and Data Analysis exam at the University of Padova. For a more detailed review of density estimation methods and examples, you can check out the presentation slides or the corresponding Jupyter Notebook in its GitHub repository, which includes all the required code and examples. Feel free to comment, correct and fork material as you please.

20 October 2016 at 15:45

Nice post Pablo, finally something really statistical! I seize the opportunity to provoke you – or, more generally, the physicist community – with a question: besides some minor aspects, histograms are nothing but peculiar kernel estimators with uniform kernel functions. In fact, kernel estimators were historically introduced precisely to overcome some limitations related to the use of histograms (information lost by using the center as a representative of the bin, unsmoothness, choice of the bin width). Of course kernel estimation does not solve all these problems (as you also pointed out, there is no unarguable way to determine the bandwidth). However, you usually gain much by using kernels instead of histograms. For these reasons, in the statistical community histograms are used only for simple descriptive purposes and dissemination of results among non-technical readers, while they are never used for density estimation (kernel estimators are the most common practice in that context). Then the question is: why are they so pervasively used within the community of physicists?


20 October 2016 at 17:06

Hey Giovanna, thank you for your feedback! Funnily enough, I have a draft blog post polemically titled “Stop histogram abuse now!” in which I was planning to discuss and explain why histograms are so pervasively used in high energy physics and what is wrong with that. I will give a sneak peek of the main explanations I have come up with for this phenomenon, possibly to be complemented by that future post.

Histograms are awful density estimators, especially when using too-large bins. Indeed, they are not estimators of f(x) but of its average within each bin, so any behaviour within a bin is smoothed out and they can be extremely biased. In high energy physics we use them for three purposes: visualization, creating compressed dataset observables for fitting/tests, and density estimation. In my opinion there is a hugely dangerous misuse of histograms as density estimators in HEP, while the other two applications are not so fundamentally wrong.

First, I would like to point out why I think histograms are so popular in high energy physics. The main factor from my point of view is the design of the data processing frameworks and tools used in this field (i.e. the ROOT I/O and TTree libraries). Most HEP data analyses are based on processing large ROOT files stored on disk piece by piece to obtain simple/compressed summaries (i.e. histograms). This model was extremely good when the (relevant) data was very large compared with the computer memory, and it was trivially parallelizable just by adding up the results (also files) of processing several files. This paradigm does not play well with kernel density estimation, nearest neighbors or many other methods that need fast access to all the data for every pointwise density evaluation. Nowadays, because we have plenty of RAM, most modern data analysis in statistics and other data-driven fields is done in memory (e.g. numpy or R), or using fast parallel processing frameworks when the data is big; that is why it was so easy and fast for me to set up the examples I used in my blog post, but somehow in HEP we are stuck with the old tools. The conceptual simplicity of histograms and the easy evaluation of their statistical uncertainty also contribute to their popularity.

Now regarding the 3 uses I listed, I think for visualization and simple descriptive purposes histograms are fine, especially given how well they fit in our “old school computer” analysis frameworks.

The second use case is doing tests and fits based directly on the histogram bin observables for the dataset and the mixture model components, instead of using the data samples individually. This is not conceptually wrong and in principle greatly simplifies the treatment in the usual use case where we have an expected histogram model for signal and background and an observed histogram for data. However, in almost any LHC analysis we have to add more parameters to the model (i.e. other model uncertainties, which we call systematic uncertainties) and then model how the expected histograms change as a function of these parameters; the usual approach is a 3-point interpolation called template/histogram morphing, which is both wrong and inconvenient. Statistical inference technologies are also a bit outdated in this field, again because of the disk-based data processing model.

The third use case is density estimation itself, which is the one dealt with in this blog post. Estimating density ratios is very common in the field as an intermediate step to reweight simulations given data, and it is almost always done by taking n-dimensional histogram ratios. This leads to poor and biased estimators, because we have to use large bins given the data sparsity, and their uncertainty coverage is not assured, as discussed in the blog post.

So basically those are my thoughts on the topic for the time being; something else might come up before I finish writing the aforementioned blog post. I would really like to know your comments on them, as well as the views of other physicists on these matters.


23 October 2016 at 17:15

If one wants to avoid smoothing, estimating an unknown probability density function can also be done by simulating data with maximum-likelihood parameters estimated from the data. Using the Kolmogorov-Smirnov test, one can then repeatedly test the fit.

For example, Clauset, Shalizi and Newman (2009) apply this procedure to a power law. Code can be found here: http://tuvalu.santafe.edu/~aaronc/powerlaws/
