by Giles Strong

Both the CMS and ATLAS collaborations are vast, with around 5,000 qualified scientists between them, and even more members working towards qualification. Everyone with ‘qualified’ status appears as an author on any publication the collaboration produces, regardless of who actually did the major work for the analysis.

This might seem like ‘cheating’. However, in order to reach, and remain at, qualified status, members must actively contribute to the experiment. These ‘service tasks’ are measurements, software developments, and other work necessary for the continued running and growth of the experiment, to the benefit of all members of the collaboration. I recently began two such tasks for the CMS experiment. I’ll describe the first here and the second in a follow-up post.

In my main analysis work I develop machine-learning techniques to search for particle collisions in which two Higgs bosons are produced and subsequently decay to a pair of bottom quarks and a pair of tau leptons. It was a good fit, then, to be asked to work on upgrading the CMS algorithms used for identifying tau-jets by replacing them with deep neural networks.

Tau-jets are formed in certain decays of the tau lepton and are detected in CMS as deposits of energy in the calorimeters. Unfortunately, many other processes also deposit energy in the calorimeters, and the challenge is to differentiate the deposits which really come from tau decays from those which merely fake tau-jets; potential fakes include electrons, muons, and QCD jets.

Tau-jet reconstruction and identification takes place in two steps at CMS. First, tau-jet candidates are reconstructed via the hadrons-plus-strips algorithm, which takes the jets already reconstructed by CMS and searches within each jet for charged particles and neutral pions. The tau candidate is then accepted into one of four categories, depending on the number of charged particles and neutral pions found.
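To make the categorisation step concrete, here’s a toy sketch in Python. The four categories below (one charged hadron with zero, one, or two neutral pions, or three charged hadrons) are my reading of the standard tau decay modes, not the actual CMS code:

```python
# Toy sketch of the hadrons-plus-strips categorisation step;
# NOT the real CMS implementation.
def hps_decay_mode(n_charged, n_pi0):
    """Assign a tau candidate to a decay-mode category, or reject it."""
    if n_charged == 1 and n_pi0 == 0:
        return "1-prong"          # h+-
    if n_charged == 1 and n_pi0 == 1:
        return "1-prong + 1 pi0"  # h+- pi0
    if n_charged == 1 and n_pi0 == 2:
        return "1-prong + 2 pi0"  # h+- pi0 pi0
    if n_charged == 3 and n_pi0 == 0:
        return "3-prong"          # h+- h-+ h+-
    return None                   # candidate rejected
```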

Other processes can also produce acceptable tau candidates, so the identification step examines the candidates more closely and weeds out those which are likely to be fakes. Currently in CMS this step involves feeding the candidates into three boosted decision trees (BDTs), each designed to reject fakes from a different source: electrons, muons, or QCD jets.
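The structure looks roughly like the sketch below, with scikit-learn BDTs standing in for whichever library CMS actually uses, and synthetic data standing in for the real tau candidates:

```python
# Rough sketch of the three-discriminator structure; everything here is
# a placeholder (synthetic data, guessed hyperparameters), not CMS code.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

discriminators = {}
for fake_source in ("electron", "muon", "jet"):
    # In reality each BDT is trained on true taus vs fakes from its own
    # source; one synthetic dataset stands in for each here.
    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    bdt = GradientBoostingClassifier(n_estimators=300, max_depth=3)
    bdt.fit(X, y)  # y: 1 = true tau, 0 = fake from this source
    discriminators["anti_" + fake_source] = bdt
```

Each trained discriminator then gives a continuous score per candidate (via `predict_proba`), and it’s on these scores that the selections below are defined.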

These discriminators occasionally make mistakes, rejecting true taus or accepting fakes as true. The trade-off between efficiency (the fraction of true taus accepted) and fake rate is balanced by defining working points on the discriminator scores, allowing physicists to move from a loose selection (many accepted taus, but a large fraction of fakes) to a tight selection (a purer but smaller sample, since some true taus are cut away too).
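A working point can be derived directly from the score distributions: pick the cut that keeps a target fraction of true taus, then read off the fake rate at that cut. A minimal sketch, with made-up score distributions in place of real discriminator outputs:

```python
# Derive working-point thresholds from discriminator scores.
# The beta distributions below are stand-ins for real score distributions.
import numpy as np

def working_points(sig_scores, bkg_scores, target_effs=(0.9, 0.7, 0.5)):
    """Return (threshold, efficiency, fake rate) for each target efficiency."""
    points = []
    for eff in target_effs:
        # The cut keeping a fraction `eff` of true taus is the (1 - eff) quantile.
        cut = np.quantile(sig_scores, 1.0 - eff)
        fake_rate = np.mean(bkg_scores > cut)
        points.append((cut, eff, fake_rate))
    return points

sig = np.random.beta(5, 2, 100000)  # true taus tend to score high
bkg = np.random.beta(2, 5, 100000)  # fakes tend to score low
for cut, eff, fr in working_points(sig, bkg):
    print("cut=%.3f  efficiency=%.0f%%  fake rate=%.2f%%" % (cut, 100 * eff, 100 * fr))
```

Loosening the target efficiency lowers the cut and lets in more fakes; tightening it does the opposite, which is exactly the loose-to-tight spectrum described above.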

I’d performed a similar task for my undergraduate Master’s thesis, where I was optimising the parameters in top-tagging algorithms. This time, though, I’m getting to write the tagger, not just optimise it!

As an initial step I focused on the jet→tau category and built a baseline classifier with an ensemble of BDTs. As input features I took those already being used in the current classifier. By eye, the performance appears to be similar to the existing results, despite my using only a fraction of the available training data.
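For a flavour of what I mean by such a baseline, here is a sketch in scikit-learn, with each BDT trained on a random subsample to mimic using only a fraction of the data; the dataset and hyperparameters are illustrative, not my actual setup:

```python
# Illustrative baseline: an ensemble of BDTs whose scores are averaged.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def train_bdt_ensemble(X, y, n_models=5, sample_frac=0.1, seed=0):
    """Train several BDTs, each on a random subsample of the data."""
    rng = np.random.default_rng(seed)
    models = []
    for i in range(n_models):
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
        bdt = GradientBoostingClassifier(n_estimators=300, max_depth=3, random_state=i)
        bdt.fit(X[idx], y[idx])
        models.append(bdt)
    return models

def ensemble_score(models, X):
    """Average the per-model probabilities of being a true tau."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)  # stand-in data
scores = ensemble_score(train_bdt_ensemble(X, y), X)
```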

The current features are transformed combinations (log, abs, etc.) of the raw features of the tau candidates. Moving to the raw features and adding a few new ones improved my performance. The next step was to try a neural network as the classifier. Unfortunately, plugging in a three-layer ReLU network produced disappointingly similar results (and took three times as long to train).
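For reference, the kind of ‘plug-in’ network I mean looks something like this in PyTorch; the layer widths, learning rate, and other details are illustrative, not my actual configuration:

```python
# Illustrative three-hidden-layer ReLU network for binary tau-vs-fake
# classification; all sizes and hyperparameters here are assumptions.
import torch
import torch.nn as nn

n_features = 20  # placeholder for the actual number of input features

model = nn.Sequential(
    nn.Linear(n_features, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1), nn.Sigmoid(),  # output: probability of being a true tau
)

loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a random batch.
x = torch.randn(256, n_features)
y = torch.randint(0, 2, (256,)).float()
optimiser.zero_grad()
loss = loss_fn(model(x).squeeze(1), y)
loss.backward()
optimiser.step()
```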

There are still optimisations to be made, parameters to be tweaked, and additions to be tried, but it’s possible that this might be an opportunity to design ‘something cool’ in order to really boost performance. It’s quite an exciting project, actually, and for once the amount of data available isn’t a limiting factor. Hopefully I’ll post again later with some improved solutions. We’ll see!