Hej! I am ashamed to say that this is the only Swedish I managed to learn at the 2nd Machine Learning in High Energy Physics Summer School, which you already heard about from Giles. ML-wise the school was quite instructive, though, especially thanks to the competitions organized during it. I also have a challenge for you!
The school, which lasted a week, included lectures explaining different techniques and seminars where the techniques were actually applied to example datasets, all of them accessible at the official MLHEP2016 repository, so check them out!
While we started simple, reviewing basic supervised machine learning concepts and following a NumPy/pandas/matplotlib tutorial, we rapidly got up to speed and moved on to more advanced techniques (e.g. boosting, regularization and deep neural networks). In parallel with these not-so-basic lectures, advanced lectures were offered on the use of ML algorithms for trigger selection, on ensuring research reproducibility when using ML tools, and on track reconstruction.
Data Science competitions
But that was not all: two ML competitions were also organized! I am quite fond of this kind of challenge. I participated in the Higgs Boson Machine Learning Challenge during the summer of 2014, the first global HEP-based ML competition ever organized.
In case you have never heard of Data Science competitions before, here is how they work. Typically you are provided with a dataset which you have to use to accomplish a certain machine learning task with a well-defined rating score. In a classification competition, you usually have access to a labelled dataset, which can be used to train your classifier, and an unlabelled dataset used for scoring. Your predictions on the unlabelled dataset are sent to the challenge organizers, who use the actual labels to score your submission according to their metric of choice (e.g. AUC).
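As a concrete sketch of that scoring step, here is how the organizers might compute the AUC of a submission against the hidden test labels. The arrays are hypothetical toy data; only the scikit-learn call is the real mechanism.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hidden ground-truth labels kept by the organizers (toy example)
true_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# A participant's submitted probabilities for the unlabelled test set
submission = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.7, 0.6])

# The leaderboard score: area under the ROC curve
score = roc_auc_score(true_labels, submission)
print(round(score, 4))  # -> 0.875
```

Note that the score depends only on the ranking of the events, not on the absolute probability values.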
The most popular platform for Data Science competitions is Kaggle, where you can find interesting machine learning problems to work on all year long, some of them with prizes of the order of $10,000 or recruitment offers for the winning team. If you are into machine learning and have some spare time, I highly recommend joining one of the challenges. Friendly competition is quite a nice way to learn and improve, both in machine learning techniques and in the knowledge domain of the particular competition.
At the end of a challenge, the best teams commonly explain in detail how they achieved their scores, which is usually very instructive and gives you insights for future competitions. The school challenges were organized with Kaggle in Class, where anyone can set up challenges for academic and educational purposes free of charge.
The aim of the first challenge was to develop a machine learning-based trigger system, able to classify an event either as interesting enough to be kept or as non-interesting and thus to be thrown away. Trigger selection is a well-defined classification problem, and complex machine learning rules based on several features can improve the fraction of useful data that is stored.
Indeed, the CMS experiment uses ML classification to evaluate muon isolation and b-tagging in the HLT (High Level Trigger), while some LHCb trigger paths are fully based on a ML classifier (i.e. the topological trigger). When ML algorithms are used for triggering, they have to run online (i.e. within the data-taking processing chain), so the time required for prediction has to be kept small. This might limit the spectrum and complexity of the classifiers that can be used; however, this limitation was purposely neglected to simplify the challenge.
This competition was part of the school’s advanced track (i.e. aimed at people with previous experience in ML); it therefore had some particularities which made it extremely interesting, but less straightforward.
The dataset provided came from LHCb MC simulations of interesting (class 1 – B hadron decays) and boring processes (class 0 – generic inelastic pp collisions). For each event a set of reconstructed secondary vertices (SV) was given, with each SV characterized by 13 features. The number of SVs for each event was not fixed, but the final classification score had to be event-based, not SV-based.
A way to use the SV-based classifier to trigger on full events was therefore required. The baseline solution suggested computing the mean classifier output probability over all the SVs of an event. Physically this does not make much sense: we would like to keep an event even if only a few of its many reconstructed SVs are interesting, yet for such events the few high classifier outputs would be washed out by the average.
The first improvement that occurred to me was to use the maximum SV probability for each event instead of the mean, which allowed me to beat the baseline solution right away. The winning solution (spoiler alert: it wasn’t me!) went a bit further and trained an additional event-based classifier using the mean, maximum and standard deviation of these SV probabilities.
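Both aggregation strategies, plus the summary-statistics features of the winning solution, can be sketched with a pandas groupby. The DataFrame and its column names are hypothetical; the real dataset had 13 features per SV feeding the classifier that produces these probabilities.

```python
import pandas as pd

# Hypothetical per-SV classifier outputs; 'event_id' groups SVs into events
sv_preds = pd.DataFrame({
    "event_id": [0, 0, 0, 1, 1, 2, 2],
    "sv_prob":  [0.1, 0.2, 0.9, 0.3, 0.4, 0.8, 0.7],
})

# Baseline: average SV probability per event (washes out single good SVs)
event_mean = sv_preds.groupby("event_id")["sv_prob"].mean()

# Improvement: take the maximum SV probability per event instead
event_max = sv_preds.groupby("event_id")["sv_prob"].max()

# Winning solution: feed per-event summary statistics (mean, max, std)
# into a second, event-level classifier
event_features = sv_preds.groupby("event_id")["sv_prob"].agg(["mean", "max", "std"])
print(event_max.tolist())  # -> [0.9, 0.4, 0.8]
```

Event 0 illustrates the problem with the mean: one SV at 0.9 is diluted to an event score of 0.4, while the max keeps it at 0.9.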
The score metric chosen by the organizers of the competition was a weighted AUC (i.e. area under the ROC curve), with weights proportional to the frequency of each type of physical process. Technique-wise, the baseline solution provided was based on scikit-learn Gradient Boosting. After checking that the baseline hyper-parameters were well optimized, I moved to the awesome xgboost library. It also uses gradient boosting, but it is faster overall and includes regularization, which usually helps to avoid overfitting. The improvement was quite small, but the running time felt shorter, and sometimes challenges are lost by even tinier margins.
Another important check when you are given a dataset for a classification task is to see whether it is well balanced (i.e. if there is approximately the same number of samples for each class). As I will explain in another post soon, an unbalanced dataset can cause a lot of frustration and effectively ruin your classifier training.
There are resampling alternatives to handle these issues, but for most classifiers (e.g. GB, NN) the easiest approach is to use sample weights in the cost function to account for the class imbalance. The Trigger challenge dataset was one of the most unbalanced I have ever played with: it contained about 300,000 desirable events but only 10,000 boring ones, so I used the ratio between those quantities to weight the minority class. This further improved my score on the validation dataset.
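The weighting trick can be sketched as follows. This is a toy dataset, not the challenge data, and the minority class here is arbitrarily class 1; the point is only how the majority/minority ratio becomes a per-sample weight in the fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy imbalanced dataset (the real one had ~300,000 vs ~10,000 events)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Weight each minority-class sample by the majority/minority ratio,
# so both classes contribute equally to the cost function
n_major, n_minor = np.sum(y == 0), np.sum(y == 1)
ratio = n_major / n_minor
sample_weight = np.where(y == 1, ratio, 1.0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
```

xgboost offers the same idea more directly through its `scale_pos_weight` parameter.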
Combining the techniques mentioned above and optimizing the hyper-parameters, I was able to beat the baseline AUC on both the public and private leaderboards and to hold first place in the competition for a while. You can see some simplified code for obtaining a better-than-baseline solution in this Jupyter Notebook. However, I went to this school to learn and try new stuff that I had never tried before, so I teamed up with another participant of the school to try a different approach.
For his PhD he applies deep learning techniques to identify different molecules from spectral data in proteomics. After some discussion, we decided that, given the characteristics of this problem (many-to-one classification), we would like to see whether recurrent neural networks (RNNs) could be of some use. I am planning a post on RNNs applied to HEP data at some point, but for the moment I will point you to this fantastic blog post by Andrej Karpathy.
The main point is that RNNs operate on input sequences, such as a time series in stock trading or a sentence in natural language processing. In HEP, however, an RNN could potentially be applied to a set of tracks or jets.
Nowadays, thanks to powerful open source libraries, arbitrary deep neural network architectures can be declared with a few lines of code. We opted for the Keras library, which is minimalist and modular and allowed us to set up a basic LSTM RNN very easily. We played around with different layers, regularizations and architectures, but we were not able to beat the GB-based baseline with these classifiers.
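A minimal many-to-one LSTM along these lines looks as follows. This is a sketch in the current Keras API (the school used an older Keras version, so names differ slightly), and the maximum sequence length and layer sizes are hypothetical; only the 13 features per SV come from the challenge.

```python
import numpy as np
from tensorflow import keras

# Each event has a variable number of SVs, each described by 13 features.
# Pad/truncate to a fixed maximum length so events can be batched.
max_svs, n_features = 20, 13

model = keras.Sequential([
    keras.Input(shape=(max_svs, n_features)),
    # Masking lets the LSTM ignore zero-padded timesteps
    keras.layers.Masking(mask_value=0.0),
    keras.layers.LSTM(64),                        # many-to-one: one vector per event
    keras.layers.Dense(1, activation="sigmoid"),  # event-level probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy batch of 8 events, just to check shapes
preds = model.predict(np.zeros((8, max_svs, n_features)), verbose=0)
```

The LSTM consumes the SVs one at a time and only its final hidden state is passed on, which is exactly the many-to-one pattern the problem calls for.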
I think our main limitation was that we did not have enough data to properly train this kind of classifier so that it captures the conditional pdfs of the sequence of SVs. It could be interesting to apply an RNN to similar many-to-one classification problems in HEP (e.g. a b-tagging discriminator built directly from a set of tracks, or an event classifier built from a set of jets).
The second challenge was the main competition of the school and has already been described by Giles in his post. The aim was to distinguish signal events (exotic Higgs production) from background events (top quark pair production). The dataset provided was a subset of the UCI Higgs dataset, with some unknown modifications made by the school organizers to prevent cheating (e.g. matching to the published dataset and checking the truth labels of the test set).
The leaderboard score of this challenge was also AUC, but this time using unweighted events. The dataset was created and used for the first published paper on the use of Deep Learning on HEP data, co-authored by AMVA4NewPhysics network participant Daniel Whiteson. This study showed that, for the chosen benchmarks, DNNs using only low-level variables were able to outperform BDTs and shallow NNs, even when powerful human-made features were added, and performed similarly to DNNs trained with the additional high-level features.
While both low-level (jet and lepton momenta, jet b-tagging) and high-level features (invariant masses) were provided, further feature engineering was possible: computing some angles between objects, reverse-engineering object energies based on invariant masses (only momenta, phi and eta were provided) or other custom physics based features.
Nevertheless, in view of the paper results, I decided a priori that I would try using only the features that were provided, in combination with Deep Learning techniques. For simplicity, I opted again for the Keras library to train a DNN with 5 hidden layers of 1024 neurons each, using rectified linear unit (ReLU) activation functions. The final layer was a single neuron with a sigmoid activation function, to provide a real-valued output in the [0,1] range for the loss function (i.e. binary cross-entropy). I experimented with regularization and dropout techniques to reduce over-fitting, but I could not beat my unregularized DNN within the training time I had.
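The architecture just described can be declared in a few lines. This sketch uses the current Keras API rather than the version available at the school, and the optimizer choice is an assumption; the layer sizes, activations and loss are the ones described above, and the 28 input features (21 low-level plus 7 high-level) match the UCI Higgs dataset.

```python
from tensorflow import keras

n_features = 28  # the UCI Higgs dataset: 21 low-level + 7 high-level features

model = keras.Sequential([keras.Input(shape=(n_features,))])
# five hidden layers of 1024 ReLU neurons each
for _ in range(5):
    model.add(keras.layers.Dense(1024, activation="relu"))
# a single sigmoid neuron gives an output in [0, 1], as required
# by the binary cross-entropy loss
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
```

With roughly four million parameters, a network like this is only practical to train on a dataset of this size with GPU acceleration.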
It was a large dataset, with a total of 10,000,000 events for training, so it was a good match for deep learning. In fact, we were given access to a huge GPU cluster at the Finnish National Supercomputing Centre, so it was clear that deep learning was the way to go. The difference achievable with a DNN with respect to other classifiers (e.g. Random Forest or Gradient Boosting) was quite impressive.
The best GB classifier I tried achieved an AUC of 0.81547, while the very first DNN I tried got 0.85145. By using a larger network, training for more epochs with different batch sizes, and randomly rotating and z-inverting jets within each training batch, I obtained an AUC of 0.86752 on the private leaderboard, which placed me 4th in the challenge.
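The jet augmentation works because pp collisions are symmetric under a global rotation in the azimuthal angle phi and under flipping the beam (z) axis, which maps eta to -eta, so the transformed events carry the same labels. A numpy sketch of such batch augmentation, with a hypothetical array layout of one row per event and one column per jet:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(pt, eta, phi):
    """Randomly rotate all jets in phi and flip eta, per event.

    pt, eta, phi: arrays of shape (n_events, n_jets). Both transformations
    are symmetries of the collision, so the labels are unchanged.
    """
    n_events = pt.shape[0]
    # one random azimuthal rotation per event, wrapped back into [-pi, pi)
    dphi = rng.uniform(0, 2 * np.pi, size=(n_events, 1))
    new_phi = np.mod(phi + dphi + np.pi, 2 * np.pi) - np.pi
    # flip the z axis (eta -> -eta) for a random half of the events
    flip = rng.choice([-1.0, 1.0], size=(n_events, 1))
    return pt, eta * flip, new_phi

# toy batch of 4 events with 3 jets each
pt = np.ones((4, 3))
eta = rng.normal(size=(4, 3))
phi = rng.uniform(-np.pi, np.pi, size=(4, 3))
pt2, eta2, phi2 = augment(pt, eta, phi)
```

Applying this inside each training batch means the network never sees exactly the same events twice, which acts as a cheap regularizer.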
Finally, a challenge for you!
Given the length of this post, you deserve a prize if you are still reading it, but I will make you earn it. On the first day of MLHEP2016 we were given some Yandex-branded pens at the registration desk, but they had a four-digit combination lock and a mathematical riddle to obtain the right combination:
Suppose a=b/2, c=b-a, d=3b.
Funnily enough, and for the convenience of the school attendees, the pens were handed out with the right combination already set, but I did not know that, so I started playing with the combination lock right away and then had to crack the combination before I could use the pen. I could also have asked for another pen and checked the right solution (all of them had the same riddle), but that would have been way less fun.
So my challenge for you, in case you like mathematical riddles, is to solve the riddle and obtain the combination abcd that unlocks the pen. You can use whatever you want to solve it, but I need an explanation of how you did it (e.g. text, code, a photo of a blackboard).
I have created a form you can use to send your solution, and there will be rewards for the best ones. I recommend putting the explanation (and/or code, in whatever format you please) in a private GitHub gist and pasting the URL into the form. Since I cannot use any quantitative score to rate the solutions (apart from checking whether you got it right), I will ask the other two ESRs who attended the MLHEP school (Giles and Cecilia) to help me select the best ones.