Well folks, it’s been quite a while since my last post; apologies for that, it’s been a busy few months.
Towards the end of last year I wrote a post on optimising the hyperparameters (depth, width, learning rate, et cetera) of neural networks. In that post I described how I was trying to use Bayesian methods to ‘quickly’ find useful sets of parameters.
The problem lies in the fact that the parameters are closely coupled: altering one changes the optimum values of the others, so they cannot be optimised separately. On top of that, evaluating a single set of parameters normally requires training a network to convergence (time consuming), and more often than not averaging over several trainings.
The traditional methods for optimisation are to use a grid-search (sample sets of parameters evenly to fully populate the parameter space) or a random search (randomly sample parameter sets from the parameter space). The Bayesian optimisation method instead relies on trying to build a model of performance as a function of the parameters and continually update it as more points are sampled.
The model is initialised using a few (~10) random points, and then an acquisition function is used to choose which point to sample next. This function trades off exploring the parameter space (sampling regions where the model has high uncertainty) against exploiting previously found regions (sampling regions of high performance to pin down the optimum location).
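To make the explore/exploit trade-off concrete, here is a minimal sketch of one common acquisition function, expected improvement. The function names, the `xi` exploration weight, and the maximisation convention are my own illustrative choices, not something from the post above; it assumes the surrogate model supplies a predictive mean and standard deviation at each candidate point.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected improvement acquisition (maximisation convention).

    mu, sigma: surrogate model's predicted mean and std at candidate points.
    best: best objective value observed so far.
    xi: exploration weight -- larger values favour uncertain regions.
    """
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - best - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improve / sigma
        ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0  # no uncertainty -> no expected gain
    return ei
```

Note how the second term rewards uncertainty: of two candidates with the same predicted mean, the one with larger `sigma` scores higher, which is exactly the exploration behaviour described above.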
This sounds fantastic in principle, but in practice I couldn’t get it to work. It was still too slow to run, had hyperparameters of its own to tune (such as the choice of acquisition function and the weight it gives to exploration versus exploitation), and didn’t appear to capture the mutual dependencies of the parameters. Eventually I gave up and went with the best settings I had.
Optimal learning rates
A few weeks later I began following the fast.ai deep-learning course and in the first lesson was shown a way of being able to quickly find the optimum learning rate for a given architecture and dataset.
The LR range test (Smith, 2015) involves training a model for a single epoch, starting with a very small learning rate. After each minibatch, the learning rate is increased. Eventually the learning rate becomes large enough for the model to start training and the loss decreases. Gradually the learning rate becomes so large that the model cannot learn, and eventually it diverges.
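The mechanics of the test can be sketched in a few lines. This is a toy stand-in, not the fast.ai implementation: the "model" is a one-parameter quadratic, and `grad_fn`/`loss_fn` play the role of a minibatch backward/forward pass, but the LR schedule (exponential growth from `lr_min` to `lr_max`) is the real idea.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-7, lr_max=10.0, steps=100):
    """Exponentially grow the LR each step and record (lr, loss) pairs."""
    gamma = (lr_max / lr_min) ** (1.0 / (steps - 1))  # per-step LR multiplier
    w, lr, history = w0, lr_min, []
    for _ in range(steps):
        history.append((lr, loss_fn(w)))
        w = w - lr * grad_fn(w)  # one (mini-batch) SGD update
        lr *= gamma
    return history

# Toy convex objective: loss(w) = w**2, gradient 2*w.
hist = lr_range_test(lambda w: 2 * w, lambda w: w * w, w0=5.0)
```

Plotting `hist` shows the characteristic shape: flat at tiny LRs, a falling loss in the usable range, then divergence once the LR passes the stability limit (here 1.0 for this objective).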
By plotting the loss as a function of the learning rate, one can see the range of usable learning rates. Since higher learning rates help avoid overfitting and provide quicker convergence, the recommendation is to pick the highest LR at which the loss is still decreasing (NB not the point where the loss is lowest).
This is great since in just a few seconds I can know what LR to use with the rest of my architecture, effectively reducing my parameter search-space by one dimension. Earlier this year Smith published the first part of a more complete guide to tuning hyperparameters (Smith, 2018).
The meat of the paper begins with a summary of underfitting and overfitting, and how these may be diagnosed by examining the validation loss during training: underfitting = slowly decreasing loss, overfitting = rising loss. Ideally one wants to be in the flat plateau between under- and overfitting.
The next section deals with optimisation of the hyperparameters. It begins by reintroducing the LR range-test and then discusses how the learning rate should evolve during training in order to benefit from the reduced train time of large learning rates without letting the network diverge. Effectively, one should cycle the learning rate from a low value up to a higher value and back down again. Pairing this with a momentum evolution in the opposite direction (i.e. out of phase), he shows that a phenomenon called superconvergence can occur, in which complex models may be trained about eight times quicker than normal using a special policy called 1cycle.
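The shape of the schedule can be sketched as below. This is a simplified linear version under my own illustrative values: the paper's full 1cycle policy additionally anneals the LR well below `lr_min` in a short final phase, which I omit here. The point is the out-of-phase pairing: LR ramps low → high → low while momentum mirrors it high → low → high.

```python
def one_cycle(step, total_steps, lr_min=1e-4, lr_max=1e-3,
              mom_min=0.85, mom_max=0.95):
    """Linear 1cycle-style schedule (simplified): the LR ramps up and
    back down over one cycle while momentum does the exact opposite."""
    half = total_steps / 2
    # Fraction of the way towards the peak (rises then falls).
    frac = step / half if step < half else (total_steps - step) / half
    lr = lr_min + frac * (lr_max - lr_min)
    mom = mom_max - frac * (mom_max - mom_min)  # out of phase with the LR
    return lr, mom
```

Querying the schedule at the start, middle, and end of a run shows the LR peaking exactly where the momentum bottoms out.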
Optimisation of batch size is discussed next. This can be tricky since the optimal value is also a function of computer hardware. Traditionally, tests either keep the number of epochs the same, or the number of minibatches. The paper argues that keeping the epochs constant penalises large batch sizes, since fewer parameter updates are made. On the other hand, keeping the number of minibatches the same favours larger batch sizes too much. The suggested approach is to keep the train time constant and compare the relative performance of each size after this fixed training.
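The asymmetry between the three protocols is easiest to see with a little arithmetic. The sketch below uses invented timing numbers (a per-sample cost plus a fixed per-update overhead for kernel launches, data loading, and so on), so only the relative behaviour matters: fixing the epoch count gives large batches fewer updates, fixing the update count gives them more compute, while fixing the wall-clock budget sits in between.

```python
def updates_in_budget(batch_size, budget_s,
                      time_per_sample_s=1e-4, overhead_s=5e-3):
    """Number of optimiser updates a fixed wall-clock budget buys.

    Assumes each minibatch costs batch_size * per-sample time plus a
    fixed per-update overhead -- illustrative numbers, not measured.
    """
    step_time = batch_size * time_per_sample_s + overhead_s
    return int(budget_s / step_time)
```

Under a fixed budget a smaller batch still gets more updates, but the per-update overhead stops the advantage from scaling linearly, which is the compromise the paper is after.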
The third part of the section looks into scheduling the momentum of the optimiser. It finds that a momentum range-test is not useful for finding the optimum value, and instead several values must be tested. It does, however, find that cycling the momentum can be useful when using cyclical learning rates. The evolution should be opposite (i.e. when the LR is increasing, the momentum should be decreasing); this way they can act to stabilise one another.
The end of the section deals with weight decay (a form of penalty-based regularisation, similar to L2 regularisation and equivalent to it for vanilla SGD). It argues that the amount of weight decay should be kept constant during training and that it should be tuned last, in order to work well with the regularisation already present in the system due to the learning rate, momentum, batch size, and data. Unfortunately, it doesn’t offer any smarter way to optimise than a grid search, but it does give some hints on which values to try.
Applying this to physics data
The context of the paper was training convolutional neural networks for image classification. I’m always curious to see whether new deep learning methods can be applied outside of the research area in which they are developed and presented, so I tried to reproduce the methods (and results) in the paper on a classifier acting on some open access CERN data (the same dataset that was used for the 2014 HiggsML Kaggle challenge).
In my attempts at the challenge (part 1, part 2), I had already made great use of the LR range-test, and was able to reproduce the effect of learning rate on under- and overfitting of the model: Section 3.
Examining the effect of LR cycles, I was able to confirm that cycling the LR provided a better training than a fixed LR; however, I wasn’t able to reproduce the superconvergence using the 1cycle schedule. It is possible, however, that the cyclical learning rate on its own was already exhibiting superconvergence. I also tested the linear cycling against cosine annealing with restarts, and found the latter to provide slightly better performance and a more stable evolution: Section 4.1.
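For reference, the cosine-annealing-with-restarts schedule can be sketched as below (in the SGDR style: decay along a half-cosine, then jump back to the maximum when the cycle ends). The parameter names and default values are illustrative, not taken from my actual runs.

```python
import math

def cosine_annealing(step, cycle_len, lr_min=1e-4, lr_max=1e-3):
    """Cosine annealing with warm restarts: the LR decays along a
    half-cosine and jumps back to lr_max at the start of each cycle."""
    t = step % cycle_len  # restart: position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```

Compared with the symmetric linear cycle, the LR here spends more time near the extremes and the return to `lr_max` is an abrupt restart rather than a ramp.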
Attempting to optimise the batch size, I found that above a certain size, all sizes reached about the same level of performance; however, as the size increased, the loss evolution became unstable and fluctuated a bit: Section 4.2.
Trying out cyclical momentum, I was able to confirm the paper’s statement that large constant momentum should be used for constant learning rates. However, I had some difficulty in getting the cyclical LR + cyclical momentum setup to work, and in the end found that cyclical LR + constant momentum worked better: Section 4.3.
Weight decay (or actually L2) was something I’d always struggled with; whenever I tried adding it, my models became untrainable until I decreased it to very low values, far below what I typically read in the literature. Reading this post, and looking at the equations, I finally worked out why:
```python
final_loss = loss + wd * all_weights.pow(2).sum() / 2
```
The penalty coefficient (wd) needs to be scaled relative to the magnitude of the loss (NB the formula is for weight decay; however, this is equivalent to L2 for vanilla SGD and more easily shows the relationship between loss and penalty). Since I apply sample weights to the loss function to account for class imbalance, production cross-section, and acceptance efficiency of events, my typical loss values are around 1e-5, whereas the typical literature examples expect losses around unity.
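The rescaling itself is one line, but it is easy to get wrong by a few orders of magnitude, so here is a minimal sketch. The function name and the `loss_scale` argument are my own; the idea is simply to shrink a literature-typical coefficient by the same factor that separates your loss from ~1, so the penalty-to-loss ratio is preserved.

```python
def scaled_l2_penalty(weights, wd_literature, loss_scale):
    """L2 penalty with the coefficient rescaled to the loss magnitude.

    loss_scale: typical size of your loss relative to ~1, e.g. 1e-5
    for a heavily sample-weighted loss.
    """
    wd = wd_literature * loss_scale  # keep the penalty/loss ratio unchanged
    return wd * sum(w * w for w in weights) / 2
```

With `loss_scale=1e-5`, a coefficient of 1e-4 from the literature becomes 1e-9, which matches the "far below the literature" values I had been finding by trial and error.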
Once I accounted for this scaling I was able to start converging again. However, scanning across a range of L2 values, I eventually found that the models performed best when L2 was not used at all. Possibly this indicates that I should try to reduce other forms of regularisation in the system as well: Section 4.4.
Overall, it was a fun exercise, and I was able to get some more experience adjusting hyperparameters, as well as a better appreciation of the sources of regularisation within an architecture and dataset. Since I found that cosine annealing provided a better schedule for LR, I’m curious to see whether I can match it with an equally suitable momentum (beta1) schedule. I’m also looking forward to part two of the paper, which promises to focus on architecture, data, and other regularisation sources.
The code I used is available here, and I’ve also created a Binder instance and a Docker image to run it.
Image credit: Stephen Drake (originally posted to Flickr as modsynth) [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons