by Annalisa Balata

How can we isolate the signal of double Higgs production in the decay channel where each Higgs boson decays into a pair of b-quarks? Which classification model is best at separating the signal from the background?

In my master's thesis I propose a Bayesian model that can be used instead of common classification techniques such as Random Forest (RF), Boosted Decision Trees (BDT) or Gradient Boosting (GB). In particular, I chose a Bayesian nonparametric approach.

I started out by estimating a Bayesian model in which the prior distribution of the parameters is given by a stochastic process known as the Dirichlet Process. Intuitively, a Dirichlet Process is a probability distribution over probability distributions: every draw from it is itself a probability distribution. More formally, a draw from the Dirichlet Process can be written as an infinite mixture of point masses located at values drawn from a base distribution G_0. If


\displaystyle V_h \overset{iid}{\sim} Beta(1,\alpha) \qquad \theta_h \overset{iid}{\sim} G_0

\displaystyle\pi_h = V_h \prod_{l<h}(1-V_l) \qquad G= \sum_{h=1}^{\infty} \pi_h\delta_{\theta_h}

then G \sim DP(\alpha,G_0), where \alpha>0 is the concentration parameter, \delta_{\theta} is a point mass concentrated at \theta and Beta(a,b) is the beta distribution with parameters a and b.

This construction is known as stick-breaking and has a simple interpretation: consider a stick of length 1 and generate a random variable V_1\sim Beta(1, \alpha). Break the stick at V_1 and define \pi_1 as the length of the left-hand piece. Then take the remaining right-hand piece, of length 1-V_1, generate V_2\sim Beta(1,\alpha), and break off a fraction V_2 of it to obtain \pi_2 = V_2(1-V_1). And so on: the weights \pi_h are positive and sum to 1.
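The stick-breaking recipe can be sketched in a few lines of Python. This is only an illustration, not the code used in the thesis: the infinite sum is truncated at a finite number of atoms, and the function names, the truncation level, the value of \alpha and the standard normal base distribution G_0 are all assumptions made for the example.

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking weights pi_1, ..., pi_H (H = n_atoms)."""
    v = rng.beta(1.0, alpha, size=n_atoms)  # V_h ~ Beta(1, alpha)
    # Stick length left before the h-th break: prod_{l<h} (1 - V_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_dp(alpha, n_atoms, base_sampler, rng):
    """Approximate draw G = sum_h pi_h * delta_{theta_h} from DP(alpha, G_0)."""
    weights = stick_breaking_weights(alpha, n_atoms, rng)
    atoms = base_sampler(rng, n_atoms)  # theta_h ~ G_0
    return weights, atoms

# Assumed for illustration: alpha = 2, G_0 = N(0, 1), truncation at 200 atoms
rng = np.random.default_rng(0)
weights, atoms = sample_dp(
    alpha=2.0,
    n_atoms=200,
    base_sampler=lambda r, n: r.normal(0.0, 1.0, size=n),
    rng=rng,
)
```

The pair (weights, atoms) describes a discrete distribution. With a truncation this deep the leftover stick is negligible, so the weights sum to 1 up to numerical error. Smaller values of \alpha concentrate the mass on fewer atoms.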

In order to improve the performance of the model, I introduced interactions between variables. To make the model more flexible, I also added nonlinear relationships between the response variable and the available explanatory variables as a new component. I therefore combined the Dirichlet Process with Bayesian additive regression trees (BART) and Bayesian penalized splines (P-splines), and fitted the following model:

\displaystyle \mathrm{logit}(\pi_i)=\mu_i+f_1(x_{i1})+\dots+f_p(x_{ip})+\epsilon_i

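where \pi_i is the probability that observation i is signal (not to be confused with the stick-breaking weights \pi_h above), \mu_i is modelled by a Dirichlet Process whose atoms are distributed as sums of Bayesian trees, and the functions f_1,\dots,f_p are fitted using Bayesian P-splines.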

It turned out that the best-performing model was the one that combines only BART and P-splines:

\displaystyle \mathrm{logit}(\pi_i)=f_1(x_{i1})+\dots+f_p(x_{ip})+\sum_{j=1}^m g(x_i;T_j,M_j)+\epsilon_i

where the functions f_1,\dots,f_p are fitted using Bayesian P-splines and the g's are Bayesian trees, each with tree structure T_j and leaf values M_j.
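To make the structure of this model concrete, here is a toy sketch of how the additive predictor turns into a signal probability. Everything in it is hypothetical: the smooth functions f1 and f2 stand in for fitted P-splines, the single-split trees stand in for the BART component, the numbers are made up, and the error term \epsilon_i is left out.

```python
import numpy as np

def inv_logit(eta):
    """Map the linear predictor back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical smooth terms, standing in for fitted P-splines
def f1(x):
    return 0.8 * np.sin(x)

def f2(x):
    return -0.5 * x ** 2

def tree(X, feature, threshold, left_value, right_value):
    """Hypothetical single-split tree: one leaf value on each side of the split."""
    return np.where(X[:, feature] <= threshold, left_value, right_value)

# Three toy observations with p = 2 explanatory variables
X = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]])

# logit(pi_i) = f_1(x_i1) + f_2(x_i2) + sum of m = 2 trees
eta = (
    f1(X[:, 0])
    + f2(X[:, 1])
    + tree(X, 0, 0.5, -0.3, 0.4)
    + tree(X, 1, 0.0, 0.2, -0.1)
)
prob_signal = inv_logit(eta)
```

The smooth terms each act on a single variable, while the trees can split on any variable, which is how the tree component can pick up interactions.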

Fig. 1 shows the ROC curves of the most important models, and Table 1 reports some performance indicators of the fitted model computed on a test set: they show that the Bayesian nonparametric approach performs about as well as the Random Forest.

Fig. 1: ROC curves of the most important models.
Table 1: Performance indicators of the fitted model
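For readers unfamiliar with ROC curves, the sketch below shows one way such a curve and the area under it (AUC) can be computed from predicted signal probabilities. The labels and scores are invented toy values, not the thesis data, and the helper names are assumptions; the code also assumes that all scores are distinct.

```python
import numpy as np

def roc_curve(labels, scores):
    """False and true positive rates as the threshold sweeps from high to low.

    labels: 1 for signal, 0 for background. Scores are assumed distinct.
    """
    order = np.argsort(-scores)          # sort events by decreasing score
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels)        # signal events accepted so far
    false_pos = np.cumsum(1 - sorted_labels)   # background events accepted so far
    tpr = np.concatenate(([0.0], true_pos / true_pos[-1]))
    fpr = np.concatenate(([0.0], false_pos / false_pos[-1]))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: 3 signal and 3 background events
labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
fpr, tpr = roc_curve(labels, scores)
area = auc(fpr, tpr)
```

An AUC of 1 means perfect separation of signal from background, while 0.5 corresponds to random guessing.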