by Annalisa Balata

How can the signal of double Higgs production be isolated in the decay channel where each Higgs boson decays into a pair of b-quarks? And which classification model is best at separating that signal from the background?

In my master's thesis I propose a Bayesian model as an alternative to common classification techniques such as Random Forest (RF), Boosted Decision Trees (BDT) or Gradient Boosting (GB). In particular, I chose a Bayesian nonparametric approach.

I started out by estimating a Bayesian model in which the prior distribution of the parameters is given by a stochastic process known as the Dirichlet Process. Intuitively, a Dirichlet Process is a probability distribution over probability distributions: every draw from it is itself a probability distribution. More formally, a draw from a Dirichlet Process can be written as an infinite mixture of point masses located at values drawn from a base distribution, named $G_0$:

Given

$\displaystyle V_h \overset{iid}{\sim} Beta(1,\alpha) \qquad \theta_h \overset{iid}{\sim} G_0$

$\displaystyle \pi_h = V_h \prod_{l<h}(1-V_l) \qquad G = \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h}$

then $G \sim DP(\alpha,G_0)$, where $\delta_{\theta}$ is a point mass concentrated at $\theta$ and $Beta(a,b)$ is the beta distribution with parameters $a$ and $b$.

This construction is known as stick-breaking and it has a simple interpretation: consider a stick of length 1 and generate a random variable $V_1\sim Beta(1, \alpha)$. Break the stick at $V_1$ and define $\pi_1$ as the length of the piece on the left. Then take the piece on the right, generate $V_2\sim Beta(1,\alpha)$, break off a fraction $V_2$ of it and call the length of that new piece $\pi_2$. Repeat indefinitely: the lengths $\pi_1, \pi_2, \dots$ are the mixture weights, and they sum to 1.
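The stick-breaking recipe is easy to simulate. The snippet below is a minimal sketch, not the code used in the thesis: it truncates the infinite sequence after a fixed number of pieces, and the function names, the truncation level of 50, the value $\alpha=2$ and the standard normal base distribution are all illustrative assumptions.

```python
import random


def stick_breaking_weights(alpha, n_atoms, rng):
    """Return the first n_atoms stick-breaking weights and the unbroken remainder."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)  # V_h ~ Beta(1, alpha)
        weights.append(v * remaining)    # pi_h = V_h * (stick left before this break)
        remaining *= 1.0 - v             # keep only the piece on the right
    return weights, remaining


def sample_truncated_dp(alpha, n_atoms, base_sampler, rng):
    """Approximate draw G = sum_h pi_h * delta_{theta_h}, truncated at n_atoms."""
    weights, leftover = stick_breaking_weights(alpha, n_atoms, rng)
    atoms = [base_sampler(rng) for _ in range(n_atoms)]  # theta_h ~ G_0
    return atoms, weights, leftover


# Illustrative choices: alpha = 2 and a standard normal base distribution G_0.
rng = random.Random(0)
atoms, weights, leftover = sample_truncated_dp(
    alpha=2.0,
    n_atoms=50,
    base_sampler=lambda r: r.gauss(0.0, 1.0),
    rng=rng,
)
print(f"mass covered by 50 atoms: {sum(weights):.6f}, leftover: {leftover:.2e}")
```

A small $\alpha$ tends to put most of the mass on the first few pieces, while a large $\alpha$ spreads it over many small ones, so the leftover mass is a quick check on whether the truncation level is adequate.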

To improve the performance of the model, I introduced interactions between variables. To make it more flexible, I also added a component that captures nonlinear relationships between the response variable and the available explanatory variables. I therefore combined the Dirichlet Process with Bayesian additive regression trees (BART) and Bayesian penalized splines (P-splines), and fitted the following model:

$\displaystyle \mathrm{logit}(\pi_i)=\mu_i+f_1(x_{i1})+\dots+f_p(x_{ip})+\epsilon_i$

where $\pi_i$ is the probability that observation $i$ is signal, $\mu_i$ is modelled by a Dirichlet Process whose atoms are distributed as sums of Bayesian trees, and the functions $f_1,\dots,f_p$ are fitted using Bayesian $P$-splines.
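
For readers unfamiliar with $P$-splines, one common Bayesian formulation writes each smooth term as a combination of B-spline basis functions whose coefficients follow a random walk. This is only a sketch: the basis size $K$, the second-order random walk and the symbols $B_k$, $\beta_{jk}$, $u_{jk}$ and $\tau_j^2$ are assumptions made for illustration, not details taken from the thesis.

$\displaystyle f_j(x)=\sum_{k=1}^{K} B_k(x)\,\beta_{jk} \qquad \beta_{jk}=2\beta_{j,k-1}-\beta_{j,k-2}+u_{jk} \qquad u_{jk}\sim N(0,\tau_j^2)$

Here the $B_k$ are B-spline basis functions, and the random-walk prior on neighbouring coefficients plays the role of the smoothness penalty: the smaller $\tau_j^2$, the smoother the fitted $f_j$.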

It turned out that the best-performing model was the one that combines only BART and $P$-splines:

$\displaystyle \mathrm{logit}(\pi_i)=f_1(x_{i1})+\dots+f_p(x_{ip})+\sum_{j=1}^mg(x;T_j,M_j)+\epsilon_i$

where the functions $f_1,\dots,f_p$ are fitted using Bayesian $P$-splines and the $g$'s are Bayesian trees, each with structure $T_j$ and leaf values $M_j$.
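
To make the structure of this predictor concrete, here is a minimal sketch of how a signal probability would be computed once the smooth terms and the trees are known. It is not the fitting procedure (no priors, no posterior sampling) and it omits the $\epsilon_i$ term. The nested-dict tree format, the function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def tree_predict(tree, x):
    """Follow the splitting rules of one tree down to a leaf and return its value."""
    node = tree
    while "leaf" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def signal_probability(x, smooth_terms, trees):
    """logit(pi) = sum_k f_k(x_k) + sum_j g(x; T_j, M_j), error term omitted."""
    eta = sum(f(x[k]) for k, f in enumerate(smooth_terms))  # additive smooth part
    eta += sum(tree_predict(t, x) for t in trees)           # sum-of-trees part
    return sigmoid(eta)


# Hypothetical fitted components for two explanatory variables.
smooth_terms = [lambda v: 0.5 * v, lambda v: -0.2 * v * v]
trees = [
    {"feature": 0, "threshold": 0.0, "left": {"leaf": -0.3}, "right": {"leaf": 0.3}},
    {"feature": 1, "threshold": 1.5, "left": {"leaf": 0.1}, "right": {"leaf": -0.1}},
]

p = signal_probability([1.0, 2.0], smooth_terms, trees)
print(f"estimated signal probability: {p:.4f}")
```

The smooth terms each see a single variable, while a tree may split on several variables along one path, which is how the sum of trees can pick up interactions that the additive part cannot.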

Fig. 1 shows the ROC curves of the most important models, and Table 1 reports some performance indicators of the fitted model, computed on a test set. They show that the Bayesian nonparametric approach performs about as well as the Random Forest.
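
As a reminder of how a ROC curve is usually summarised in a single number, the sketch below computes the area under the curve (AUC) from test-set labels and predicted scores, using the rank-sum identity. The function name and the four toy labels and scores are illustrative assumptions; they are not the indicators or the data behind Table 1.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks start at 1; ties share the mean
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_signal = sum(1 for y in labels if y == 1)
    n_background = len(labels) - n_signal
    if n_signal == 0 or n_background == 0:
        raise ValueError("both classes are needed to compute the AUC")

    signal_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = signal_rank_sum - n_signal * (n_signal + 1) / 2.0
    return u_statistic / (n_signal * n_background)


# Toy example: 1 = signal, 0 = background.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of signal from background.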