A few days before I returned from CERN at the beginning of the month, I attended a talk on the upcoming TrackML challenge. This is a competition beginning this month in which members of the public will be invited to try and find a solution to the quite tricky problem of accurate reconstruction of particle trajectories in the collisions at the LHC. The various detectors simply record the hits where particles pass by, however to make use of this data, the hits in surrounding detector layers must be combined into a single flight path, called a track.
Currently this is done by a rather slow algorithm, and after upgrades in the near future, the amount of tracks to reconstruct will be vastly increased. Physicists have tried many approaches and are now looking for experts outside the field to attempt to find a solution.
This challenge follows the HiggsML challenge on Kaggle; a similar idea from back in 2014 when simulated data for particle collisions was made public and entrants competed to use machine learning to provide classification between collisions in which a Higgs boson was produced, and ones where it was not.
The competition proved to be very popular with about 2000 people competing in total. There was $13,000 in cash prizes and invitations to speak at CERN for the top places. The famous XGBoost BDT implementation was also written to help solve the problem.
I’ve done a few Kaggle competitions in the past, and always find something new, or solve some problem. Since I’d yet to attempt the HiggsML challenge, I thought it would be good practise for getting ready for the tracking competition.
The competition involves training a classifier on a provided training set for which labels of ‘signal’ and ‘background’ are provided. The classifier is then applied to an unlabeled testing set, which is then scored by Kaggle. The performance is ranked by the Approximate Median Significance (AMS) – an approximation of the power of the classifier to discover the Higgs boson.
During the competition, only one’s public score was visible, which was scored on 20% of the test data. Once the competition was over, the private rankings were shown (scored on the remaining 80% of the data). Since I was competing after the competition, I was able to see both my public and private scores instantly, however I would only be taking inference from the public score.
Day 1 – Initial investigations
My default approach for classification problems is to assign 20% of the labelled data for validation (testing my model to avoid overfitting to the testing set). Since the computation of the AMS requires a cut on the classifier prediction, I would also be optimising the cut on the validation data, too.
Next I train in 10-fold stratified cross-validation on the training data. I then save each classifier, weight its predictions according to its performance on the testing fold, and combine all 10 into an ensemble.
This time round, however, I was trying out two new techniques to deal with the learning rate of my neural networks, both of which I picked up from the FastAI deep-learning course: a learning-rate finder, and a way of decaying the learning rate during training called cosine annealing.
To begin with, I wrote a quick script to import the data, plot some distributions, and normalise and standardise the features (with the option of PCA decomposition). I then spun up some virtual machines and quickly tested out a few different batch size, learning-rate, and activation function combinations, all using the same basic architecture of 4 layers of 100 nodes, optimised via ADAM.
Eventually, I decided on a batch size of 256, a learning rate of 10-3, a cosine multiplicity of 2, and a swish activation function. I then selected the cut on the validation data which maximised the AMS, processed the testing data, and submitted the score coming in at 816th on the private leadership board and 909th on the public one, out of 1785 teams, which I thought was pretty respectable.
Day 2 – Weight and rotate
The training data also includes the ‘weights’ of the events – their likelihood of occurring in a real collision. Ideally one wants to train such that higher weighted events are more carefully classified than lower weighted events. In a similar problem in my real work, I found that training on these weights spoiled the classifier completely, so for Day 1 I had ignored them.
When I finally tried training on them, I got very similar results to what I had already seen in the other problem. Thinking this was odd, I dug a bit further in and found that Keras (the library I use to implement the networks) doesn’t treat the sample weights in the same way a physicist would: I ask the network to normalise the classes in the training data automatically, and I presumed that this would mean that the individual weights were normalised. However, they actually get treated separately as a kind of ‘degree of trust in the data point values’.
This meant that the sample weights weren’t being renormalised and the classifier was completely ignoring one class, whose sum of weights was much lower than the other’s. When I manually rescaled the weights, I suddenly got a much better performance, and solved a problem with my other work, too! This brought me up to 553:436 on the private:public LBs.
So far I hadn’t done too much pre-processing of the data beyond normalisation, standardisation, and PCA (although PCA was found to actually reduce performance). One of the peculiarities of high-energy physics data is the rotational symmetry in one of the directions. By rotating all the data to have a common alignment, I was able to reduce the feature-space by one, and avoid the NN having to learn this symmetry. I also transformed the coordinate system from the default, non-linear one, to a more simple Cartesian system.
With just these changes I ended the day at 313:334.
Day 3 – Jets on a plane
Day 3 technically covered several days, since I was flying back from CERN and attended a school, but the main focus of the work done was the same, so I’ll just consider it one day in total.
Within the data there are features regarding jets that happen to be in the event. They report the three-vectors of the two hardest jets, and a few high-level features based on angles and invariant masses of the jet pairs. Some events, however, only contain one jet, or none at all, and so these features are assigned a default value of -999.0 in the data.
I was wondering how exactly these are interpreted by the NN during training and whether it was making optimal use of the information there. I decided to try splitting the data in categories of number of jets: zero, one, and two or more. I could then train a separate classifier on each category, only feeding in the required information.
I did this, and chose separate cut values for each category on the validation data. This approach showed the overall highest AMS on the validation data I’d seen, however when I applied it to the testing data and submitted the score to Kaggle, the result was the worst of my submissions.
I found this to be quite strange, and rechecked my code, compared the distributions of validation and training data, even running 2-sample Kolmogorov-Smirnov tests and found them to be compatible.
I tried a few more times and still was getting bad results and eventually gave up on the approach. Perhaps the splitting of the training data reduced it to the point where I was unable to train decent classifiers, however I would have thought that this would be reflected in the validation result, too. Or maybe the validation and test data were generated with slightly modified settings, and the splitting into jets allowed the method to be too reliant on the training settings. I’m not sure, but I’d be interested to find out.
Day 4 – Considering scale
Reverting back to my initial approach of training on the full data set, I tried to remove the validation sample to see whether my classifier was being limited by the size of the training data; if this were the case, then the increased training sample should yield a higher ROC AUC than I’d otherwise seen on the training data. The new training, however, showed no improvement over the previous training, so I was able to conclude that the 80:20 split of the training data was fine. The use of a validation set also made it easier to pick a decent cut for classifying events. Although another approach would be to pick the cut on the testing-fold during cross-validation, and then take the mean, this would be sub-optimal once the ensemble was constructed.
Whilst I’d found that splitting by jets was bad, I was still concerned about the default -999.0s in the data. In particular, how they would be affecting the normalisation and standardisation. I instead replaced most of them with values of -1 for the high-level features, which would be of similar scale to the rest of the feature.
This proved to be useful and brought my ranking to silver medal position at 79:200 on the private:public leaderboard.
Day 5 – Surgical cuts and coarse examinations
One of my remaining concerns was the fact that I was optimising the cut value for the AMS on a single data-sample, which might be subject to statistical fluctuations. Indeed, by moving the cut value by slight amounts I was able to affect the score on the testing data by quite a bit. Since I wanted to avoid overfitting to the public leaderboard (if I were competing at the time, the score on the private board would not be available to me) I needed to develop some way of more precisely choosing the optimum cut value.
I initially tried using a pseudo-bootstrap approach around the optimum AMS region; calculating the AMS value if the cut were placed at each data-point, selecting the data which had the highest AMS, and then sampling without replacement a 10% subsample 1000 times, computing the optimum cut on each, and then taking the mean. Sadly this didn’t work as planned and returned worse cuts than using the entire sample.
The next approach was to split the sample evenly into folds, recompute the AMSs for each fold sample, find the optimum cuts, and then take the mean. This proved better, but ended up overestimating the cut slightly. Looking at the signal and background weights in each fold, they appeared to fluctuate quite a bit. Rescaling them to match the weights of the whole sample helped a bit. Computing the mean by weighting according to the inverse of the maximal AMS of each fold proved to be the final step.
Using this approach for the best Day-4 classifier I reached 26:103, which I was quite pleased with. Spoiler alert: this is the best I get, but I’ll describe a few of the other approaches I tried. The private:public AMSs were 3.73439 : 3.69002.
Aside from some quick tests of activation functions on Day-1, I so far hadn’t tuned the network architecture, using a default one of 4 layers of 100 nodes. To end the day I ran a few variations with different number of layers and widths, but none were able to do any better. Admittedly, this was just a coarse test and sometimes I wouldn’t even let them finish training, so it’s possible that with a more thorough hyperparameter optimisation, a better architecture might be found, however it didn’t seem to be the limiting factor at the moment.
Day 6 – Regularisation is futile
I also had not added any regularisation aside from early-stopping; in the past I’ve not found dropout, L1, or L2 penalties to be particularly useful. The ROC AUC on my training and validation sample showed a difference of 0.5%, so I didn’t think that it would be necessary to add further regularisation. However, this is the performance of the ensemble, not the individual classifiers (where the difference would be greater), and I hadn’t yet tested whether my new learning-rate approaches interact well with the dropout. Similarly, I hadn’t tried using batch normalisation, so this was another thing to test.
There’s some talk about whether batch normalisation should go before or after the activation function. The paper places it before, but lots of people report better performance placing it after. For me, neither positions gave any improvement. Testing three rates of dropout also resulted in no improvements.
Day 7 – Accomplish nothing with this one weird trick
By this point I was beginning to run out of ideas and starting to suspect that improvements were going to require fine-tuning of the hyperparameters (time-consuming and not something I’d learn much from), or some problem-specific feature engineering (might be a source of experience, but would have limited transferability).
I thought I’d try one last possibility: The signal process is Higgs to di-tau and one difference between the signal process and the background processes is that the mass of the di-tau should be a peak at 125 GeV, whereas for background it should be a peak at 91 GeV, or flat. The difficulty is that the taus decay to produce neutrinos, which cannot be reconstructed at ATLAS. This means that the mass cannot be precisely reconstructed.
ATLAS and CMS both have algorithms which fit the mass under the hypothesis of di-tau, in order to improve the resolution of the di-tau mass, and this improved mass was included in the challenge data. I am currently involved in work to replace the CPU-intensive fitting approaches with much quicker neural-network-based regressors, and results are already providing superior resolution. I was wondering, then, whether or not I could apply a regressor to the data in order to achieve a more discriminant mass feature.
There were a few problems: the data is simulated according to the ATLAS detector, and I work in CMS (so don’t have access to detector-specific training samples (and even if I did, it wouldn’t be in the spirit of the competition to use them)); the energy and masses of the final-states are missing from the challenge data, which would limit the performance of a regressor. Nevertheless, I do have large samples using a publicly-accessible detector simulation of CMS, for Higgs to di-tau for a range of Higgs masses.
I quickly ran a new training of the regressor, and was pleased to see that even without using the energy and mass features I was able to converge. In fact, using the new learning-rate methods I was able beat my previous performance. I then ran the regressor over the challenge data, and found that it kind of worked, with signal definitely peaking at 125 GeV. Unfortunately, the background also peaked pretty close to the signal, and the mass-fit feature still provided better discrimination (and had better signal resolution).
Still, checking correlation between the two features I found they were only slightly correlated, and so I might be able to eke out some last drop of performance from my setup.
The fates were not so kind.
Overall the exercise was fun, and as I mentioned at the beginning, I always pick up useful tricks or solve problems whenever I try these kinds of challenges. In this case, I sorted a problem with non-uniform sample weights, found a more reliable way of cutting on discriminants, and got more used to adjusting learning rates. I was also pretty happy with my final ranking, too.
I understand that the winner of the competition used deep learning, but I think that a lot of the high-ranked competitors still used BDTs, so I’m going to be interested to read through their approaches and find out how they did so well.
If anyone’s interested in taking a look at the code I used, it’s available on Github (though is a bit messy). Note that it relies on a collection of functions collected in this repository, and some of the functions changed names, so it’s not guaranteed that the older notebooks will run without a few changes. Also note that you’ll need to add the swish activation function to [keras install]/activations.py: def swish(x): return x*K.sigmoid(x).