Hej! It’s been about a week now since I returned from Sweden, where I’d attended an excellent school on machine learning at Lund University. The course consisted of a series of lectures and seminars which started from the very basics of machine learning, and finished with us training convolutional neural-networks on GPU clusters kindly lent to us by the Finnish National Supercomputing Centre!
Lund is a small university-town in south-west Sweden, and is quite similar to Durham in England, where I did my undergraduate masters. I’d visited it a few years prior whilst on a camping trip around Fennoscandia, and had really liked it, so I was pleased to be back.
The school itself had been organised by Yandex, a Russian search-engine provider, who have both expertise in machine-learning and ties to high-energy particle physics, having worked closely with the LHCb experiment at CERN. Yandex is also one of the partners of the AMVA4NewPhysics ITN, and I will be spending some time next year working with them on secondment.
One of the centrepieces of the school was a challenge to use our newly acquired knowledge to develop and refine a multivariate classification algorithm (MVA) to separate signal from background in some simulated particle-collisions, with the signal being the production of an exotic Higgs-boson, and the background being top-quark pair-production.
We were provided with a training set of data consisting of 21 low-level features, such as jet momenta, and seven high-level features, which were various invariant masses, and a target value: 0 for background, and 1 for signal. We were then asked to run our MVA over a test sample, and submit our predictions for the target value. Our submissions were then compared to the real target values, and the performance of our MVAs ranked on a live leader-board. The comparison metric was the area under the ROC (receiver operating characteristic) curve, which characterises signal acceptance as a function of background acceptance, and should be as close to 1 as possible.
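To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from a list of targets and predicted scores. It uses the rank-sum identity: the area equals the probability that a randomly chosen signal event scores higher than a randomly chosen background event. The function name `roc_auc` and the toy numbers are made up for illustration, and this is not the leader-board's actual scoring code; in practice scikit-learn's `sklearn.metrics.roc_auc_score` does the same job.

```python
def roc_auc(targets, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a random signal event (target 1) scores
    higher than a random background event (target 0); ties count one half.
    """
    pairs = sorted(zip(scores, targets))
    n_sig = sum(1 for _, target in pairs if target == 1)
    n_bkg = len(pairs) - n_sig
    if n_sig == 0 or n_bkg == 0:
        raise ValueError("need at least one signal and one background event")

    # Give every event its 1-based rank; tied scores share the average rank.
    rank_sum_sig = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_sig += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    u_stat = rank_sum_sig - n_sig * (n_sig + 1) / 2.0
    return u_stat / (n_sig * n_bkg)


# Toy predictions (made up for illustration only).
toy_targets = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_targets, toy_scores)  # 3 of the 4 signal/background pairs are ordered correctly
```

A perfect classifier gives 1, while random guessing gives about 0.5, which is why scores in the 0.7 to 0.8 range leave plenty of room for improvement.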
I had initially started with an MVA based on TMVA, a machine-learning package included in ROOT (a data-analysis package commonly used for HEP work). However, a mistake in my Makefile resulted in it deleting all my source code… DOH! Difficulties with Dropbox recovery meant that, rather than being able to recover the files manually, I instead had to arrange for an account roll-back (which eventually happened over a week later).
This, however, turned out to be good in the long-run; we were being taught using Python-based modules during the seminars, so I switched over to use MVAs from the scikit-learn package, which turned out to be much easier to use, and more adaptable than the TMVA ones. It also meant that I was able to follow more closely what we were learning during the lectures.
I began with a boosted decision-tree (BDT): a classifier which implements a set of decision trees (which apply a series of cuts to variables), and takes the average of their outputs. The ‘boosted’ part comes from the fact that, during training, it focuses more on events which in previous training epochs had been misclassified. With a ROC-curve integral of ~0.76, this offered a nice improvement over the 0.71 baseline we’d been provided with, which had used a nearest-neighbours algorithm. Both had used just the high-level variables for classification. Adding in the low-level features improved my score to 0.79.
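The ‘boosting’ idea can be sketched in a few lines of plain Python. This is a toy AdaBoost-style illustration under my own assumptions, not the scikit-learn BDT used for the challenge: each ‘tree’ is a single cut (a decision stump), targets are written as -1 and +1 rather than 0 and 1, and all names (`stump_predict`, `best_stump`, `train_boosted_stumps`, `classify`) are hypothetical.

```python
import math


def stump_predict(feature, threshold, polarity, row):
    """A one-cut 'tree': +1 (signal) on one side of the cut, -1 (background) on the other."""
    return 1 if polarity * (row[feature] - threshold) >= 0 else -1


def best_stump(X, y, weights):
    """Scan every feature, cut value and direction for the lowest weighted error."""
    best_err, best_cut = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(w for row, target, w in zip(X, y, weights)
                          if stump_predict(feature, threshold, polarity, row) != target)
                if best_err is None or err < best_err:
                    best_err, best_cut = err, (feature, threshold, polarity)
    return best_err, best_cut


def train_boosted_stumps(X, y, n_rounds=3):
    """AdaBoost-style training; targets are -1 (background) or +1 (signal)."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_rounds):
        err, cut = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # better stumps get a bigger say
        ensemble.append((alpha, cut))
        # The 'boosting' step: misclassified events gain weight, the rest lose it.
        weights = [w * math.exp(-alpha * target * stump_predict(*cut, row))
                   for row, target, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def classify(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(*cut, row) for alpha, cut in ensemble)
    return 1 if score >= 0 else -1


# Toy one-feature sample (made up): signal at both ends, background in the middle,
# so no single cut can separate it.
toy_X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
toy_y = [1, 1, -1, -1, 1, 1]
single = train_boosted_stumps(toy_X, toy_y, n_rounds=1)
boosted = train_boosted_stumps(toy_X, toy_y, n_rounds=3)
```

On this toy sample a single cut always gets some events wrong, whereas the three-round ensemble classifies all six correctly, because each new stump concentrates on the events its predecessors missed. Note that the vote here is weighted by each stump's accuracy, which is one common way of combining the trees' outputs.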
During the lectures we had been told that, in machine learning, the majority of our time should be spent on feature engineering, the process of creating and refining new high-level variables, in order to get better performance, rather than adjusting the parameters of our MVAs. Indeed, I found that altering my BDT’s settings from their defaults often resulted in worse performance, and so concentrated on coming up with new variables. Two of the strengths of decision trees are that they don’t care how many features you feed them, or what the ‘scale’ (relative size) of the variables is. So I was free to throw as much as I wanted at my MVA.
I calculated all the angles between the final-states, and added in the sum of |pt|. This offered some improvement. Ordering the jets by pt gave a relatively large improvement. My room-mate at the hostel had recommended ‘lepton pt+MET pt’; again some improvement. Then I tried some experimental variables: multiplying things by b-tag values, adding angles to masses. Sadly, my abominations gave only slight improvements.
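As a rough sketch of what this kind of feature engineering looks like in code, the function below builds the variables described above for a single event. The event layout (a dictionary with `lepton`, `met` and `jets` entries, each carrying `pt` and `phi`) and all names are assumptions for illustration and not the format of the challenge data. For simplicity only azimuthal angles are computed here, which is also an assumption about what ‘angles’ means.

```python
import math
from itertools import combinations


def delta_phi(phi_a, phi_b):
    """Azimuthal separation between two objects, wrapped into [0, pi]."""
    d = abs(phi_a - phi_b) % (2.0 * math.pi)
    return 2.0 * math.pi - d if d > math.pi else d


def engineer_features(event):
    """Build extra high-level variables from one event (hypothetical layout)."""
    # Order the jets by descending pt so that jet0 is always the hardest one.
    jets = sorted(event["jets"], key=lambda jet: jet["pt"], reverse=True)
    objects = [("lepton", event["lepton"]), ("met", event["met"])]
    objects += [(f"jet{i}", jet) for i, jet in enumerate(jets)]

    features = {}
    # Scalar sum of |pt| over all final-state objects.
    features["sum_pt"] = sum(abs(obj["pt"]) for _, obj in objects)
    # The room-mate's suggestion: lepton pt + MET pt.
    features["lep_plus_met_pt"] = event["lepton"]["pt"] + event["met"]["pt"]
    # Azimuthal angle between every pair of final-state objects.
    for (name_a, a), (name_b, b) in combinations(objects, 2):
        features[f"dphi_{name_a}_{name_b}"] = delta_phi(a["phi"], b["phi"])
    return features


# Toy event (numbers made up for illustration only).
toy_event = {
    "lepton": {"pt": 40.0, "phi": 0.0},
    "met": {"pt": 25.0, "phi": math.pi},
    "jets": [{"pt": 30.0, "phi": 3.0}, {"pt": 80.0, "phi": -3.0}],
}
toy_features = engineer_features(toy_event)
```

Ordering the jets before building the features means that a name like `dphi_lepton_jet0` always refers to the highest-pt jet, so the classifier sees a consistent meaning in each column from event to event.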
By this point I had broken the 0.8 mark, and the seminars had progressed on to neural-networks. An example in the exercises reached ~0.81 just using the default high-level variables! At this point I should have switched over to neural-networks, but instead I chose to concentrate on the seminar exercises and learn more about the new modules, whilst the lecturers were still present to answer questions.
With the neural-network exercises came a larger training set (ten million events!), so I trained my BDT once more and submitted my final score of 0.81721. At the end of the school, the top three people presented their solutions. Needless to say they were all using neural-networks, but it was interesting to see what features they’d engineered, and how they’d occasionally combined MVAs to produce a final value.
Overall, I felt like I’d learnt a lot over the week, and will most probably switch over to Python for implementing MVAs in my research. As easy as BDTs are to use, I have a bias towards neural-networks, and it was pleasing to see them perform so well! I also got to spend some time back in lovely Lund, and experience Swedish culture at the Midsommar celebrations. My thanks to Yandex and Lund for taking the time to organise and run the school.