by Juan Rojo
Juan Rojo is a member of the Oxford University node of the AMVA4NewPhysics network. In this post he presents a recent study carried out by a group of theoretical and experimental physicists at Oxford University.
Given the advanced nature of the topic, all readers are warmly invited to ask for clarifications and explanations in the comments section. Juan will be happy to answer all your questions!
The measurement of double Higgs production will be one of the central physics goals of the LHC program in its recently started high-energy phase, as well as for its future high-luminosity upgrade (HL-LHC).
Higgs pair production is directly sensitive to the Higgs trilinear coupling and provides crucial information on the electroweak symmetry breaking mechanism. It also probes the underlying strength of the Higgs interactions at high energies, and can be used to test the composite nature of the Higgs boson.
Analogously to single Higgs production, in the Standard Model (SM) the dominant mechanism for the production of a pair of Higgs bosons at the LHC is gluon fusion; see Fig. 1 for representative leading-order Feynman diagrams. For a center-of-mass energy of √s = 14 TeV, the next-to-next-to-leading order (NNLO) total cross section is approximately 40 fb, so around 12,000 events with Higgs pairs are expected per LHC experiment by the end of Run II, and a factor of 10 more by the end of the HL-LHC.
Feasibility studies in the case of a SM Higgs boson in the gluon-fusion channel at the LHC have been performed for different final states, including bbγγ, bbτ+τ–, bbW+W–, and bbbb. The main advantage of the bbbb final state is the enhancement of the signal yield from the large branching fraction of Higgs bosons into bb-pairs, BR(H→bb)≅0.57. On the other hand, a measurement in this channel needs to deal with an overwhelming QCD multi-jet background. Previous studies of this final state estimate that, even at the HL-LHC, it will be very difficult to observe Higgs pair production.
Very recently, within a collaboration of theorists and experimentalists from Oxford, we have revisited the feasibility of measuring SM Higgs pair production by gluon fusion in the bbbb final state at the LHC. The authors include members of the Oxford node of the AMVA4NP network, Daniela Bortoletto, Cigdem Issever, and myself, as well as three postdocs: Katharina Behr (who has just moved to DESY), James Frost, and Nathan Hartland.
In our analysis, the selection is divided into three categories, depending on the event topology: resolved (with four well separated b-jets), boosted (with two fat jets, each containing the decay products of a Higgs boson), and intermediate. These three categories are optimized separately and then combined.
There are several improvements compared to previous works, including a detailed simulation of the background contamination from light jets mis-identified as bottom-quark jets, and the assessment of how the high pileup (number of collisions per bunch crossing) conditions expected at the HL-LHC degrade the results.
From the methodological point of view, the main difference is that our analysis is based upon a combination of traditional cut-based methods and multivariate analysis (MVA), in particular Artificial Neural Networks (ANN). Multivariate techniques are by now a mature tool in high-energy physics data analysis, opening new avenues to improve the performance of many measurements and searches at high energy colliders. In particular, the classification of events into signal and background processes by means of MVAs is commonly used in LHC applications.
The specific type of MVA that we used in our work is a multi-layer feed-forward artificial neural network, known as a multi-layer perceptron and sometimes also as a deep neural network. In Fig. 2 you can see an illustrative example of one of the ANNs used, with a total of Nvar = 21 input variables. This type of ANN is the same as those used to parametrize the parton distribution functions (PDFs) of the proton in the NNPDF global analyses, of which Nathan and I are also members.
The MVA inputs are a set of kinematic variables describing the signal and background events that satisfy the requirements of the cut-based analysis. The output of the trained ANNs also allows for the identification, in a fully automated way, of the most relevant variables in the discrimination between signal and background.
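As a rough illustration of what such a feed-forward network computes, here is a minimal sketch of the forward pass in Python. The sigmoid activation, the 5-node hidden layer, and the random weights are illustrative assumptions for this sketch, not the configuration used in the paper:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes each node's response into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def ann_forward(x, weights, biases):
    """Forward pass of a feed-forward perceptron.

    x       -- input variables for one event (here Nvar = 21 kinematic features)
    weights -- list of weight matrices, one per layer
    biases  -- list of bias vectors, one per layer
    Returns a single output y in (0, 1), read as a 'signal-likeness' score.
    """
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return float(a[0])

# Hypothetical architecture: 21 inputs -> 5 hidden nodes -> 1 output
rng = np.random.default_rng(0)
sizes = [21, 5, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

y = ann_forward(rng.normal(size=21), weights, biases)
assert 0.0 < y < 1.0
```

The weights here are untrained random numbers; in the actual analysis they are the parameters adjusted during the minimization described below.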
The training of the neural networks consists of minimizing a suitable figure of merit, in this case the so-called cross-entropy error function, to maximize the discrimination between signal and background events. This training is performed using Genetic Algorithms (GAs), non-deterministic minimization strategies suitable for solving complex optimization problems, for instance when a very large number of quasi-equivalent minima are present.
GAs are inspired by the natural selection processes that drive biological evolution. To avoid over-fitting, we used a cross-validation stopping criterion. This cross-validation proceeds by dividing the input Monte Carlo (MC) dataset into two disjoint sets, using one of them to train the ANN and the other for validation: the optimal stopping point is then given by the minimum of the error function on the validation sub-sample. This minimum indicates the point where the ANN begins to train upon statistical fluctuations in the input MC samples, rather than learning the underlying (smooth) physical distributions.
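The interplay between the GA and the cross-validation stopping can be sketched as follows. This is a deliberately simplified toy, not our actual minimisation code; the mutation scheme, the population size, and the logistic-regression demo at the bottom are all assumptions made for illustration:

```python
import numpy as np

def cross_entropy(y_pred, y_true):
    # Figure of merit: cross-entropy between model outputs and 0/1 labels
    eps = 1e-12
    return -np.mean(y_true * np.log(y_pred + eps)
                    + (1.0 - y_true) * np.log(1.0 - y_pred + eps))

def ga_train(predict, x_tr, y_tr, x_val, y_val, n_par, n_gen=100, pop=40, seed=0):
    """Toy genetic-algorithm minimisation with cross-validation stopping.

    predict(params, x) -- model outputs in (0, 1) for parameter vector params
    Each generation keeps the fittest half of the population (lowest training
    cross-entropy) and refills it with mutated copies; the returned parameters
    are those with the lowest *validation* error seen during the run.
    """
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop, n_par))
    best_params, best_val = population[0].copy(), np.inf
    for _ in range(n_gen):
        fitness = [cross_entropy(predict(p, x_tr), y_tr) for p in population]
        parents = population[np.argsort(fitness)[: pop // 2]]
        # Cross-validation stopping: remember the best member on the validation set
        val = cross_entropy(predict(parents[0], x_val), y_val)
        if val < best_val:
            best_val, best_params = val, parents[0].copy()
        # Mutation: Gaussian noise on copies of the surviving half
        children = parents + rng.normal(scale=0.2, size=parents.shape)
        population = np.vstack([parents, children])
    return best_params, best_val

# Demo on toy separable data with a one-parameter-layer "network" (logistic regression)
def predict(params, x):
    return 1.0 / (1.0 + np.exp(-(x @ params)))

rng = np.random.default_rng(1)
x = rng.normal(size=(400, 3))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
params, val_err = ga_train(predict, x[:200], y[:200], x[200:], y[200:], n_par=3)
assert val_err < 0.69  # better than an uninformative classifier (ln 2 ~ 0.693)
```

In the real analysis the parameter vector would contain all the ANN weights and biases, and the two disjoint samples are the MC training and validation sets described above.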
In Fig. 3 we show the distribution of the ANN output at the end of the GA minimization, in the case of the boosted selection. The separation between signal and background is achieved by introducing a cut, ycut, on the ANN output, so that MC events with yi ≥ ycut are classified as signal events, and those with yi < ycut as background events. Therefore, the more differentiated the distribution of the ANN output is for signal and background events, the more efficient the MVA discrimination will be.
As we see, the algorithm achieves a very good separation between the two types of events. The main results of our study for the case of the HL-LHC are collected in Table 1, where we show the signal significance, S/√B, and signal-over-background ratio, S/B, before the MVA is applied (ycut = 0) and after the optimal MVA cut is applied in each category.
The most remarkable result is the substantial improvement in signal significance when going from the purely traditional cut-based analysis to the final results that also include the MVA: for example, in the resolved category the significance increases from 0.4 to 2.0. Also very importantly, the signal-over-background ratio increases by two orders of magnitude.
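The cut optimisation on the ANN output described above can be sketched as follows. This is a toy illustration with invented event weights and beta-distributed outputs, not the analysis code; in the paper the outputs come from the trained ANNs and the weights from the MC normalisation:

```python
import numpy as np

def best_cut(y_sig, y_bkg, w_sig=1.0, w_bkg=1.0):
    """Scan the cut y_cut on the ANN output and keep the one maximising S/sqrt(B).

    y_sig, y_bkg -- ANN outputs for signal and background MC events
    w_sig, w_bkg -- per-event weights (e.g. cross section x luminosity / N_MC)
    """
    best_cut_val, best_signif = 0.0, 0.0
    for y_cut in np.linspace(0.0, 0.99, 100):
        S = w_sig * np.sum(y_sig >= y_cut)  # MC events classified as signal
        B = w_bkg * np.sum(y_bkg >= y_cut)  # surviving background
        if B > 0 and S / np.sqrt(B) > best_signif:
            best_cut_val, best_signif = y_cut, S / np.sqrt(B)
    return best_cut_val, best_signif

# Toy example: signal output peaking towards 1, background towards 0
rng = np.random.default_rng(2)
y_sig = rng.beta(5, 2, size=1000)
y_bkg = rng.beta(2, 5, size=100000)
cut, signif = best_cut(y_sig, y_bkg, w_sig=0.01, w_bkg=0.1)
assert signif > 0.1  # beats the significance at y_cut = 0
```

The better the ANN separates the two output distributions, the larger the gain in S/sqrt(B) from such a cut, which is exactly the behaviour seen in Table 1.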
To summarize, multivariate techniques have the potential to improve the signal significance for processes with complicated final states, such as hh→4b, as compared to traditional cut-based analyses. Our study not only indicates that the 4b final state should be enough to observe Higgs pair production at the HL-LHC but, even more remarkably, demonstrates that, provided the signal selection efficiency and background rejection can be further improved, there might even be some hope for Run II.
However, ours is only a phenomenological feasibility study: the real challenge, the actual measurement of the hh→4b process by ATLAS and CMS, will take several years. But at least we have shown that we have many reasons to be optimistic!
13 January 2016 at 19:44
Dear Juan,
Congratulations on a very interesting paper, which I have just finished reading. I was wondering if you would be so kind as to answer a few questions regarding the MVA part of the analysis. I’ll try to pose my questions using non-technical language, for the benefit of other visitors to this blog.
* What software did you use for the ANN? TMVA, or something else?
* Regarding section 5.2, how did you choose the (21) input variables for the ANN? Did you make this choice before, say, inspecting histograms of the variables in the data used for the analysis? Or, did you choose the best variables after exploring the data?
* Did you apply any transformations to the input variables before they were input to the ANN?
* I understand that N_gen, the number of generations, is one of the hyperparameters of the ANN. What other hyperparameters did you tune? Did you experiment with different ANN architectures? Activation functions?
* I understand that the optimal stopping point was tuned by splitting the data into two disjoint data sets; a training data set and a hold-out cross-validation data set; during training using the training data you tested the performance of the ANN by evaluating the cross-entropy on the validation data. I’m assuming that you took all the data you had and partitioned it only once into these two disjoint sets — correct? Once you have determined the optimal stopping point (which minimised the cross-entropy), was the corresponding ANN that was trained on the training data and resulted in the minimum cross-entropy the final model selected? Or did you re-train on training+validation data to yield the final model? And the final results were obtained with this ANN?
* I understand that you tuned y_cut to maximise S/sqrt(B). On which data exactly?
* Once you have trained the ANN and tuned y_cut, what data was used to obtain the final results, such as the ROC in figures 17 and 20?
* Once you had the final performance results, did you revisit the data and try anything else to improve on these results?
Thanks!
Best regards,
Andrew.
13 January 2016 at 21:49
Hi Andrew,
Thanks for your interest in our paper. I'll try to reply to your questions, keeping the technicalities to a minimum.
* The ANN software that we use is our own code; we don't use any external packages. This code was extensively developed and tested in the context of the NNPDF fits, so it was relatively straightforward to adapt it to the di-Higgs study. The Genetic Algorithm minimisation code was also written by ourselves. The code itself is written in C++ and Python.
* The choice of variables was made basically by considering all possible kinematical variables that might contain some discriminating information. We already knew that some variables had more discriminating power than others, but one of the advantages of ANNs is their redundancy, so it does not hurt to include as many variables as possible.
* No special preprocessing was applied to the input variables. As usual when training ANNs, each input variable was linearly rescaled to lie between 0.1 and 0.9. Actually, this is a good suggestion: we might have tried dedicated preprocessing, such as using log(pt) instead of pt or something similar. But we did not find the need for such sophistication.
* Yes, we tried different architectures, for example different numbers of hidden layers, but the results were essentially unchanged. Another hyperparameter is given by the settings of the cross-validation stopping; again, in this case the results show very good stability.
* Yes, the final ANN is the one trained only on the training subset. It is not possible to use the validation set for anything to do with the training, since then the cross-validation stopping method could not be used. In any case this causes only a minor loss of information: since the training samples come from Monte Carlo event generators, one can simply increase the size of the MC sample by a factor of 2. Of course, when training an ANN on real data this is much more delicate, because there the information loss is real. What can be done in that case is to select different random training/validation partitions, train an ANN on each partition, and then take the average of the results.
* Concerning the last three questions: once we have the trained ANN, we can use it on our entire Monte Carlo sample to determine S/sqrt(B) and so on; there is no need to separate training and validation anymore. And as usual with MVAs, there is always room for improvement: in fact, we are already working on a follow-up publication in this respect.
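For readers curious about the linear rescaling of the inputs mentioned above, here is a minimal sketch. The pt values are invented for illustration, and the convention of freezing the minima/maxima on the training sample is an assumption of this sketch, not a statement about the NNPDF code:

```python
import numpy as np

def rescale(x, lo=0.1, hi=0.9):
    """Linearly map each input variable (column of x) into [lo, hi].

    The minima and maxima would typically be computed once on the training
    sample and then re-used unchanged for validation or test events.
    """
    xmin = x.min(axis=0)
    xmax = x.max(axis=0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

# Toy transverse momenta in GeV (hypothetical values)
pt = np.array([[35.0], [80.0], [170.0], [420.0]])
scaled = rescale(pt)
assert np.isclose(scaled.min(), 0.1) and np.isclose(scaled.max(), 0.9)
```

A log(pt) variant would simply apply np.log to the column before this map, compressing the long high-pt tail.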
Hope this helps!
Best,
Juan
14 January 2016 at 19:31
Hi Juan,
Thanks for the prompt reply! I appreciate you taking the time to answer my questions. If I may, I’d like to ask for clarification on some of the points in your last message. I’ll number them for easy reference later, and I’ll try to be brief.
1. This concerns my last three questions in my original message. I’d like to be sure that I understand exactly the procedure that you used in the MVA part of the analysis. I understand that you took the entire MC sample and split it into two disjoint sets; a training set and a validation set. You trained the ANN on the training set and monitored the performance using the validation set. When the cross-entropy is at a minimum, this is the stopping point. You stop training, and the resultant ANN is retained and used to obtain the rest of the results in the paper. Once you have the trained ANN, you merge the two subsets of data, thereby using the entire MC sample to determine the value of y_cut that maximises S/sqrt(B). (Please let me know if I misunderstood anything; all OK up to this point?)
Question: do you also use the entire MC sample to get the results that are reported in the paper? (And how did you partition the set into a training and validation sample: 50/50, or some other split?)
2. Is the protocol you used, as described above, the same as was used in the NNPDF work?
3. You wrote that you centered and scaled the input variables prior to ANN training. I assume you did something like X’ = (X – mean(X)) / DeltaX, where DeltaX = xmax – xmin. (This is probably a gross simplification.) Did you apply the transformation before or after splitting the data into training and validation samples?
4. Is your code publicly available? If not, when do you plan to make it available to the HEP community? (It would be interesting to see how it compares, in terms of speed and performance, with other ANN implementations out there, like TMVA and so on.)
I look forward to reading your next paper on this topic!
With many thanks,
Andrew.
14 January 2016 at 19:50
P.S. Question 5: Did you train on the Grid, using a cluster, or on a single machine? Just curious about what hardware you used. How long did training take?
1 February 2016 at 14:55
Dear Juan,
I was wondering whether you have a moment to consider the questions that I posted on 14 January. If you could spare the time to answer, I would be very grateful, as I am eager to read your response.
With many thanks,
Andrew.