by Pablo de Castro

What if you have some data you want to model, but do not know anything about its parent distribution, so you have to make as little assumptions as possible? In this post, I will go through the concept of density estimation and I will play with some interesting non-parametric methods.

As I have already said a few times, the main goal of most experimental high energy physics analyses is doing statistical inference (e.g. testing hypotheses or setting model parameter confidence intervals).  However, in order to do so, sometimes we intermediately deal with a different kind of statistical problem, which is estimating probability density functions given some data.

For example, you might need to model an event selection efficiency as a function of several event observables (e.g. data-driven trigger efficiency estimation) or to reweigh a set of events based on a few variables to represent a slightly different model (e.g. corresponding to an anomalous theoretical model). It is very likely, especially when working with dependences on several variables, that you are not able to decide on a certain parametrization for the probability density function (e.g. a gaussian or an exponential), so you have to resort to non-parametric density estimation methods, as the one that will be presented in this post.

In general terms, the problem of density estimation given a set of independent and identically distributed (i.i.d.) observations $\{x_0,...,x_n\}$ consists in estimating the parent probability density function $f(x)$ from where those samples were taken. The density estimate $\hat{f}(x)$ can be used for subsequent inference, modelling or visualization purposes.

There are several non-parametric methods that can be used to obtain $\hat{f}(x)$ given the data, each of them can be tuned with a set of parameters which allow to balance between variance and bias of the density estimation. Therefore, to decide among all the possible density estimation alternatives and parameters, we have to define quantitatively what is a good density estimate. This is done by defining a loss function, for example the mean integrated squared error (MISE) $L(f,\hat{f}) = \int [\hat{f}(x)-f(x)]^2dx$, but other alternatives exist (e.g. average log-likelihood ratio).

I initially stated that non-parametric density estimation methods are generally used when $f(x)$ is not known, so how can we decide on a method and parameters if the MISE loss cannot even be computed? A possibility is to trust some asymptotic approximations that exist for some of the available methods, but a more general option is to estimate it from the data itself by cross-validation (check the slides and code at the end of the post for a more detailed example of this).

Another important concept in density estimation is that of confidence bands (i.e. assessing uncertainty on the estimate). This is again quite a challenging problem when $f(x)$ is not known. The variance component of the uncertainty can be generally estimated by bootstrapping techniques, but in general biases are neglected, so coverage is not assured. When the quantification of the uncertainty on  the density estimation is relevant for the use case, if we are to trust the uncertainty estimation a wise approach is to select smoothing parameters, so errors of $\hat{f}(x)$ are mainly from variance (i.e. over-smoothing). Histograms are the simplest density estimation method, the binwidth (or the equivalent number of bins m for a fixed range) is the smoothing parameter that balances between variance and bias.

So far I have been talking about density estimation methods as abstract entities that allow to obtain $\hat{f}(x)$ given some data. Nevertheless, you have most likely used at some point of your life the simplest of the density estimation techniques: the histogram.

By binning the variable space and counting the number of observations that fall in that subspace, we can get a piecewise estimation of the probability density function. In practise, for a fixed bin setting the expected value of $\hat{f}(x)$ for a histogram is not $f(x)$, but its average over the the whole bin. Therefore, $\hat{f}(x)$ will be a biased estimator if $f(x)$ varies within the bin.

While the binwidth is the main smoothing parameter, $\hat{f}(x)$ will also depend on the bin origin, which is not convenient. In addition, the fact that $\hat{f}(x)$ is not smooth and its derivatives are null or infinity render them inadequate for some applications. Last but not least, they scale extremely bad to multidimensional problems, because data sparsity might lead to many empty bins. Nearest neighbor method applied to a 2D toy dataset, which is a mixture of 3 multivariate gaussians. The number of neighbors kNN is the smoothing parameter  and balances between variance and bias. Estimation is nevertheless noisy and non-smooth.

Another conceptually simple method for density estimation is the nearest neighbor density estimation. Instead of fixing a region of the space and counting the number of observations that fall in it, the density estimate is built from the distance to the k nearest data point $d_k(x)$, that is: $\hat{f}(x) = \frac{k}{2nd_k(x)}$,

where k controls the bias-variance trade-off. Drawbacks of this technique include that it is also non-smooth and very sensitive to noise. In addition, the density estimation has to be explicitly normalized by dividing by its integral in the whole domain before its use. Kernel density estimation for a simple 1D benchmark problem using gaussian and epanechnikov kernels and different bandwidths. Example adapted from the scikit-learn documentation.

Another method belongs to the so called kernel density estimation (KDE) techniques, which consist in building a density estimate by placing a simple function (i.e. kernel) over each data point so $\hat{f}(x)$ is constructed as: $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K \left ( \frac{x_i - x}{h} \right )$,

where h is the smoothing parameter or bandwidth. The only requirements on the kernel function are that it is non-negative, symmetric around zero and the integral over its whole domain is one. Naively, a rectangular box of width 2h can be placed over each point, which is referred to as tophat kernel, but the resulting estimate is still non-smooth and gradients are still not nice. If a unit gaussian or other smooth function is used as a kernel, a smooth and continuous density estimation can be obtained. Sometimes it is useful to use a compact kernel to have finite distribution tails. Data-driven gaussian kernel bandwidth optimization using CV MISE for a toy benchmark dataset of 1000 observations. The optimal bandwidth found is about h=0.45.

Before finishing this introduction to density estimation techniques, I would like to show an example of how the smoothing parameter can be tuned for any technique by directly using cross-validation on data. Removing constant terms, MISE score can be approximated by: $CV \ MISE \approx \int \hat{f}^2_{\mathrm{CV}}(x)dx- \frac{2}{n} \sum_{i=1}^{n_{\mathrm{CV}}} \hat{f}_{\mathrm{CV}}(x)$,

where the first term can be obtained by numerical integration and the second is the mean density estimated for each of the validation split set points evaluated using the density estimate from each of the n-fold density estimates. The optimal smoothing parameters (e.g. histogram  bin widths, number of neighbors, or KDE bandwidth) can be chosen by finding the minimum of  this score. Bootstrap density estimates (left) and pointwise 1 sigma and 2 sigma confidence bands built from the bootstrap sample corresponding quantiles (right). The blue line is the density estimation using the original data, while the red dashed line is the true density.

By using bootstrap, the point-wise confidence bands of $\hat{f}(x)$ can also be estimated for any density estimation method. However, it is worthy to remark again that possible biases are neglected, so coverage is not assured when they are not negligible. Over-smoothing is the way to go if we want to ensure it. In the example above, 500 bootstrap replicas have been generated (in light grey in the left image) and the confidence bands have been obtained from the quantiles at the different values of the variable.

This review of density estimation is a subset of a presentation I had to do for a Statistics and Data Analysis exam at the University of Padova. For a more detailed review of density estimation methods and examples, you can check out the presentation slides or the corresponding Jupyter Notebook in its GitHub repository, which includes all the required code and examples. Feel free to comment, correct and fork material as you please.