A few months ago I wrote about some work I was doing on improving the way a certain kind of particle is detected at CMS, by replacing the existing algorithm with a neural network. I recently resumed this work and have now got to the point where I show significant improvement over the existing method. The design of the neural network, however, was one that I imported from some other work, and what I want to do is to adjust it to better suit my problem.
A neural network contains many free parameters, which can be adjusted. The majority of these, the weights and biases, are learnt by exposing the network to many examples of data-points. However, this requires the network architecture to first be defined, and the handles which do this are referred to as hyper-parameters.
These define: the layout of the network, e.g. number of layers, number of neurons in a given layer, and the activation function connecting a given layer to a subsequent one; the way the network learns, e.g. what optimisation algorithm is used and its hyper-parameters, and what mini-batch size to use (in the case of stochastic gradient-descent methods); how the free parameters are initialised; and what methods of regularisation are present, e.g. how long to train for, penalty terms to weight updates, and dropout probability.
The number of hyperparameters can easily become quite large. However, my case is relatively simple and requires a basic feed-forward classifier, where I take some input information and predict whether the given data-point is of one class or another. To reduce the number of hyperparameters I made some simplifications:
- Use a box-layout, in which the number of neurons in each hidden layer is the same: 2 parameters, depth and width.
- Use the same internal activation between layers and the recommended weight initialisation from the respective paper: 1 parameter.
- ADAM appears to be the good default optimisation algorithm, so use that and just adjust the learning rate: 1 parameter.
- The size of each minibatch: 1 parameter.
- Regularisation terms are L1, L2, and dropout. I’ve already performed feature selection, so L2 is expected to be better than L1 and I’ll use the same value for each weight in the network. Similarly I’ll use dropout on every hidden layer, with the same probability: 2 parameters.
This effectively means that I’m trying to find the optimal point of an expensive (1-10 hours per evaluation) and unknown function in a seven-dimensional space. Which isn’t easy.
With regards to the choice of activation function, there are three I wanted to try: the default baseline, ReLU; and two new ones that were published this year, SELU and Swish. Unfortunately, I don’t have any GPU access at the moment, but I do have a laptop and 2 virtual machines in the cloud, each with the same computational power, so I can reduce the search space by one dimension by getting each to focus on one activation function.
Six dimensions is still quite a lot. The naïve approach is to try out each combination at regular intervals (grid-search), which would take ages and potentially miss out optimal points. Another method is to try a number of different, randomly selected, parameter sets (random sampling), and perhaps then interpolate between the evaluations to predict the locations of minima.
Other methods make smart use of previous evaluations to converge on the minima, like the Metropolis algorithm. I’d used the Metropolis algorithm a few years ago and found it to be very slow and requiring a lot of adjustment. Last year, though, I learnt about a process called Bayesian optimisation.
This method is seeded by some random evaluations of the cost function, which are then used to build up a prior understanding of the shape of the function. Each subsequent evaluation is used to update this understanding into a posterior distribution, which is then used to predict where the next point should be placed, via an acquisition function.
The acquisition function takes into account where good points have previously been found, and where the uncertainty of the posterior is large, i.e. where there is a lack of evaluations. Effectively it trades off exploiting well-known minima to find the global minimum, and exploring areas of uncertainty, which might contain even better points.
The main limitation is the training time for a given configuration. Over the past few weeks I’ve tried to streamline the optimisation to only train potentially good architectures, and ones that can converge in a reasonable time. I found that L2 penalty and dropout probability were almost mutually exclusive, and that dropout performed better than L2, so I removed it.
I later found that the optimisation generally preferred very low dropout, so I ended up removing that completely as well. Lowering the learning rate (I search in log-space), increased the number of training epochs before early stopping kicked in. However, beyond a certain point, the improvements became minute, so I ended up fixing it at what appeared to be a decent value.
Another issue seems to be noise in the evaluations; the fact that even if the same configuration were run twice, the performance would be different due to changes in the weight initialisation and ordering of the training data. Currently, I evaluate once per configuration, but perhaps I should move to running several times and reporting the mean performance. Indeed this is what scikit-optimise does in one of its examples.
Luckily, I should be getting some time on a GPU cluster next year, so this might be possible. Scikit-optimise does have the potential to account for noise in the cost-function, but I guess it needs more evaluations to get a better grasp. Another possibility might be to move to a less noise-prone metric: currently I use the loss on a testing sample, but maybe the ROC AUC might be more robust.
At the moment my search space is reduced to just three dimensions: depth, width, and mini-batch size. Optimisation, even over this, seems difficult, however. From the diagnostic plots of the partial dependence of each dimension on the cost-function, the expected performance of points does not match the reality. Comparing to an example of the optimisation of a BDT (a simpler kind of machine learning algorithm), my partial dependencies don’t display any kind of bounded minima like those of the BDT.
Reading a post by Nvidia on hyper-parameter optimisation, they manage to work in a 12-dimensional parameter space using a closed-source ‘optimisation as a service’ product called Sig-opt. This supposedly outperforms the open-source solutions by taking an ensemble of Bayesian methods. They offer a few free account options. However, these are strictly aimed at education and couldn’t be used for the work I’m doing here. The search continues.