by Annalisa Balata
How is it possible to isolate the signal of the double Higgs production using the decay channel in which each boson decays into a pair of b-quarks? Which is the best classification model that can help us in separating the signal from background?
In my master thesis I propose a Bayesian model that can be used instead of the common classification techniques, like Random Forest (RF), Boosting Decision Tree (BDT) or Gradient Boosting (GB). In particular, I chose a Bayesian nonparametric approach.
I started out by estimating a Bayesian model in which the prior distributions of the parameters are given by a stochastic process known as Dirichlet Process. Intuitively, a Dirichlet process is a probability distribution and every extraction from it is itself a probability distribution. More formally, it is possible to define the Dirichlet Process as an infinite mixture of random variables extracted from a base distribution, named :
then , where is a point of mass concentrated in and is the beta distribution with parameters and .
That definition is known as stick-breaking and it has a simple interpretation: consider a stick of length 1 and generate a random variable . Then, break off the stick at and define as the length of the stick on the left. Subsequently, take the stick to the right, and generate , break off the stick and obtain . And so on.
In order to improve the performance of the model, I introduced interactions between variables. In a more flexible way, I also introduced nonlinear relationships between the response variable and the explanatory available variables as a new component to the model. So, I combined the Dirichlet process with a Bayesian tree (BART) and Bayesian penalized splines (P-splines) and I fitted the following model:
where is the probability of signal, is fitted by a Dirichlet Process, with atoms distributed as sum of Bayesian trees, and the functions are fitted using Bayesian -splines.
It turned out that the most performing model was the one which exclusively combines BART and -splines:
where the functions are fitted using Bayesian -splines and the ‘s are fitted using Bayesian trees.
In Fig. 1 ROC curves of the most important models are represented and in Table 1 you can see some performance indicators of the fitted model obtained on a test set: they show that the performance of the Bayesian nonparametric approach can be considered as good as the performance of the Random Forest.