by Giles Strong

Continuing the series of 101 things to do in the cramped confines of a budget airliner:

Last Saturday evening I flew back from the mid-term meeting of my research network. The trip from Brussels to Lisbon takes about three hours, and since my current work requires an internet connection, I’d planned to relax (as best I could). Idle thoughts, however, during a pre-flight Duvel had got me thinking about autoencoders.

These are special cases of neural networks (NNs) which simply aim to reproduce the information which is sent into them. The catch is that at least one layer within the network has fewer neurons than there are inputs, so not every input can have its own neuron in that layer. This means that the information carried by the inputs must be encoded into a smaller amount of information, and then decoded to reproduce the original information later. This might seem a strange thing to do, since all you get at the end is a degraded version of what you already had, due to imperfect compression/decompression. However, the benefit comes when information from a different source is used as an input.

When trying to compress the information, the NN must cut corners and make approximations, and to get the best results it will focus on and optimise for the particular data it sees during training. These approximations generally won’t be transferable to other data, meaning that the other data will be incorrectly encoded and decoded, causing the outputs of the autoencoder to be noticeably different from the inputs.
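As a rough illustration of the bottleneck idea, here is a minimal sketch using a linear autoencoder, whose best solution can be found in closed form (it is equivalent to principal component analysis). This is a stand-in for a real, nonlinear NN autoencoder, and the toy data, the function names, and the choice of four inputs with a two-neuron bottleneck are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_autoencoder(x, n_bottleneck):
    """Fit a linear autoencoder in closed form (equivalent to PCA)."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_bottleneck]

def encode(x, mean, components):
    """Compress each row of x down to n_bottleneck numbers."""
    return (x - mean) @ components.T

def decode(z, mean, components):
    """Expand the compressed representation back to the input space."""
    return z @ components + mean

def reconstruction_error(x, mean, components):
    """Per-row sum of squared differences between inputs and outputs."""
    x_hat = decode(encode(x, mean, components), mean, components)
    return ((x - x_hat) ** 2).sum(axis=1)

# Toy training data: 4 inputs which really only carry 2 degrees of freedom
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 4))
x_train = latent @ mixing + 0.05 * rng.normal(size=(1000, 4))

mean, components = fit_linear_autoencoder(x_train, n_bottleneck=2)

# Data from a different source: no such structure among the 4 inputs
x_other = rng.normal(size=(1000, 4))

err_train = reconstruction_error(x_train, mean, components).mean()
err_other = reconstruction_error(x_other, mean, components).mean()
print(f"mean error, training-like data: {err_train:.4f}")
print(f"mean error, other data:         {err_other:.4f}")
```

The data the encoder was fitted to should come back almost unchanged, whereas the other sample loses whatever does not fit through the two-number bottleneck.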

This reliance on the difference in response is similar to my current work on the use of regression for feature optimisation. Autoencoders, however, are generally used for anomaly detection, where one wants to search for something out of the ordinary, but doesn’t know what that something will look like, or doesn’t want to assume its signature.

Fellow ESR Alessia had been working on them during her secondment at a consultancy company, and afterwards described in a seminar how she’d found them to be useful for detecting credit-card fraud. Effectively, an autoencoder can be trained on a dataset of verified transactions and then be shown a set of transactions which may contain fraudulent ones, whose information will be poorly reconstructed by the autoencoder. The quality of the reconstruction can be measured by summing up the squares of the differences between the inputs and outputs of the autoencoder, i.e. the squared error. Genuine transactions should be clustered at low values of this loss function, and fraudulent ones away from zero. A cut can then be placed on the loss, allowing the autoencoder to flag fraudulent transactions in real time.
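The loss-and-cut step might look something like the sketch below. The function names are hypothetical, the numbers are toy values rather than real transaction data, and placing the cut at a quantile of the genuine-transaction losses is just one possible choice that I am assuming here.

```python
import numpy as np

def squared_error(inputs, outputs):
    """Per-row sum of squared differences between inputs and reconstructions."""
    inputs = np.asarray(inputs, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return ((inputs - outputs) ** 2).sum(axis=1)

def choose_cut(genuine_losses, keep_fraction=0.99):
    """Place the cut so that keep_fraction of genuine losses fall below it
    (assumed strategy; the cut could equally be tuned by hand)."""
    return float(np.quantile(genuine_losses, keep_fraction))

def flag(losses, cut):
    """True where the reconstruction is poor enough to look anomalous."""
    return np.asarray(losses) > cut

# Toy inputs and autoencoder outputs for two transactions
losses = squared_error([[1.0, 2.0], [0.0, 1.0]],
                       [[1.0, 1.0], [0.5, 1.0]])

# Toy losses for verified transactions, used only to place the cut
genuine_losses = np.array([0.10, 0.20, 0.15, 0.30, 0.25])
cut = choose_cut(genuine_losses, keep_fraction=0.8)

# One well-reconstructed and one badly reconstructed new transaction
flags = flag([0.12, 2.5], cut)
```

In a live system only `squared_error` and `flag` would need to run per transaction, since the cut is fixed in advance.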

My work revolves around supervised classification, where I have training data for both classes (signal and background), so this unsupervised learning should not be necessary. However, as mentioned in my last post, I’ve recently been working on using the Matrix Element Method (MEM). This calculates a weight for each datapoint according to how likely that point is to belong to a specified class. Discrimination can then be made between classes by computing the weights under each class hypothesis and then comparing their ratio. These weight distributions were found to be useful inputs to a dedicated classifier.

I wondered whether autoencoders could be used in the same way: by training a separate autoencoder for each class (i.e. one for signal and one for background), could a similar hypothesis ratio be constructed, with the encoder-loss distributions replacing the MEM weights?
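The idea can be sketched as follows, again with closed-form linear autoencoders standing in for real NNs. The toy signal and background samples, the `LinearAutoencoder` class, and the particular ratio used (background-encoder loss divided by the sum of both losses, so that it lies between 0 and 1) are all assumptions for the sake of the example, not necessarily the form used in my prototype.

```python
import numpy as np

rng = np.random.default_rng(1)

class LinearAutoencoder:
    """Closed-form linear autoencoder (equivalent to PCA)."""

    def __init__(self, n_bottleneck):
        self.n_bottleneck = n_bottleneck

    def fit(self, x):
        self.mean = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean, full_matrices=False)
        self.components = vt[:self.n_bottleneck]
        return self

    def loss(self, x):
        """Per-event squared reconstruction error."""
        centred = x - self.mean
        reconstructed = centred @ self.components.T @ self.components
        return ((centred - reconstructed) ** 2).sum(axis=1)

def make_sample(mixing, n):
    """Toy events: 2 hidden degrees of freedom spread over 4 features."""
    return rng.normal(size=(n, 2)) @ mixing + 0.1 * rng.normal(size=(n, 4))

# Toy classes which differ only in how their features are correlated
sig_mixing = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
bkg_mixing = np.array([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]])

# One autoencoder per class hypothesis
sig_ae = LinearAutoencoder(2).fit(make_sample(sig_mixing, 2000))
bkg_ae = LinearAutoencoder(2).fit(make_sample(bkg_mixing, 2000))

def discriminator(x):
    """Assumed loss ratio: near 1 is signal-like, near 0 background-like."""
    loss_sig, loss_bkg = sig_ae.loss(x), bkg_ae.loss(x)
    return loss_bkg / (loss_sig + loss_bkg)

# Evaluate on fresh events which neither autoencoder saw during fitting
lam_sig = discriminator(make_sample(sig_mixing, 1000))
lam_bkg = discriminator(make_sample(bkg_mixing, 1000))
```

Each event is passed through both autoencoders, so the discriminator compares how well the two class hypotheses reconstruct it, in the same spirit as comparing MEM weights.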

I was still thinking about this when I boarded the plane, and after takeoff began working on a prototype to see whether it was possible. Over the course of two hours I hacked up bits of existing code and found that indeed, the autoencoder response was different for signal and background, and that a discriminator could be built by comparing ratios.

Since the data I’d used was CMS-restricted, today I reran the test on some shareable data. Below are the resulting distributions and a Jupyter Notebook is available here in case you want to run it yourselves. Indeed, the encoders reconstruct the sample they were trained on better than the other (trained sample closer to zero).

Figure (lambda): Discriminator based on the ratio of losses.

I found that the classification performance was worse than what I could get with a dedicated classifier, but it still provided some good discrimination, and could potentially be used to build high-level features to then feed into a classifier. This is something I’ll need to test.

High-energy physics is plagued by systematic uncertainties, which account for poor modelling of samples, detector effects, and other nuisances. An interesting use case could be finding some data features which are unaffected by these systematic uncertainties, but offer little discrimination on their own. These features could then be used as the inputs to the autoencoders to build a discriminator with a much lower systematic uncertainty than a traditional classifier might have. The same approach could also be used, perhaps, for a slightly targeted general search for new physics, by computing the losses for several different new-physics models.

All in all, I think I managed to make good use of the flight. Next Saturday I’ll be returning from a workshop in Padova on statistical learning, given by renowned author Trevor Hastie; hopefully I’ll have gained some similar inspiration to make use of the flight back.