Last time we looked at how we could fix some of the problems that limited the size of the networks we could train. Here we will cover some additions we can make to the models in order to further increase their power. Having learnt how to build powerful networks, we will also look into why exactly neural-networks can be so much more powerful than other methods.
The initialisation methods introduced last time (Xavier and He) were calculated on the assumption that the inputs are unit-Gaussian. Sometimes this isn’t the case: perhaps the data is not preprocessed, or the signals in the network get thrown out of their unit-Gaussianness. This can mean that the initialisation of the weights isn’t always optimal.
The simple solution: normalise the signals after every layer. Introduced by Ioffe and Szegedy in 2015, batch-normalisation applies a transformation to each batch of data such that every feature has zero mean and unit variance across the batch. This keeps the signals close to the conditions the initialisations assume, which leads to quicker convergence in the training.
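As a rough illustration of the transformation, here is a minimal NumPy sketch of the training-time forward pass. The function name `batch_norm` and the example data are my own choices for illustration; `gamma` and `beta` stand for the learnable scale and shift used in the original method, and the running statistics needed at application time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardise each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta

# Hypothetical batch: 256 samples, 4 features, far from unit-Gaussian
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
normed = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` at one and `beta` at zero the output is simply the standardised batch; during training the network is free to learn other values if a different scale suits the next layer better.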
A modification, batch renormalisation (Ioffe 2017), keeps a running average of the transformation, making it applicable to smaller batches of data, or data which is not i.i.d.
Why have just one network when you could have 3, 5, 10 even!? Training the same network multiple times will result in slightly different networks, but a single trained network is unlikely to be optimal for the whole range of inputs. By combining the predictions of several networks, the overall prediction is likely to be more accurate.
You could try different weighting schemes, or even add other ML methods, such as BDTs, to the ensemble. Personally, I train my networks multiple times, select the best n, and weight their responses according to the performance of each one. This allows the most performant network to have the greatest influence, but still be supported by lesser networks in sub-optimal regions of input space.
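One way this weighting could look in code is sketched below. It assumes each network has a single validation score where higher is better, and that weights are simply proportional to that score; the function name `ensemble_predict` and the numbers are made up for illustration, and other weighting schemes are equally valid.

```python
import numpy as np

def ensemble_predict(predictions, scores, n_best=3):
    """Combine the n_best models, weighting each by its (higher-is-better) score."""
    predictions = np.asarray(predictions, dtype=float)  # shape (n_models, n_samples)
    scores = np.asarray(scores, dtype=float)            # shape (n_models,)
    best = np.argsort(scores)[::-1][:n_best]            # indices of the top models
    weights = scores[best] / scores[best].sum()         # normalise weights to sum to 1
    return weights @ predictions[best]                  # weighted average per sample

# Hypothetical outputs of four trained networks on two data points
preds = [[0.9, 0.2], [0.8, 0.4], [0.6, 0.5], [0.1, 0.9]]
scores = [0.90, 0.85, 0.80, 0.50]
combined = ensemble_predict(preds, scores, n_best=3)
```

The worst-scoring network is dropped entirely, and the remaining three contribute in proportion to their scores.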
Dropout, introduced by Srivastava, Hinton et al. in 2014, is slightly counter-intuitive to begin with. Rather than using the full network during training, neurons are randomly masked out at each training iteration according to a defined probability. During that iteration, the masked neurons are treated as though they are not there.
By masking off neurons, the training forces the network to generalise to the data and not become overly reliant on certain inputs or signal paths.
Although one can expect each training iteration to be quicker, since there are fewer parameters to evaluate, a network which uses dropout can actually take longer to converge, since each iteration trains a different sub-network. This effectively gives ensembling for free, since these smaller networks eventually get combined into the full network during application.
One subtlety is that during training the inputs of the neurons must be scaled to account for the fact that not all of the neurons in the previous layer are active. This ensures that when the whole network is used in application, the activation levels are similar to what they were during training.
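The masking and rescaling can be sketched as follows. This is the variant usually called inverted dropout, where the surviving activations are divided by the keep probability during training so that nothing needs to change at application time; the function name `dropout` and its arguments are illustrative rather than taken from any particular library.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly mask activations and rescale the survivors (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations                             # full network at application time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True where the neuron survives
    return activations * mask / keep_prob              # rescale to preserve the mean

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))                            # hypothetical layer outputs
train_out = dropout(acts, drop_prob=0.5, rng=rng)
test_out = dropout(acts, drop_prob=0.5, rng=rng, training=False)
```

With a drop probability of 0.5, roughly half of the training-time outputs are zero and the rest are doubled, so the average signal reaching the next layer matches the unmasked network.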
Advantages of neural networks
Many ML methods have linear responses and have difficulty fitting to non-linear data distributions. Ensembling methods, such as random forests (ensembles of decision trees), can effectively provide non-linear responses by combining the predictions of many linear classifiers. Neural-networks, however, provide direct access to non-linear fitting, due to the output being the combination of many non-linear activation functions.
Revisiting Part 1’s example of point classification:
There’s no way to place a straight-line decision boundary that separates the orange and blue points well. A neural-network, however, can directly fit boundaries to the class distributions:
A BDT can get close, by combining many linear classifiers with different weights, but the decision boundaries form corners which clearly don’t reflect the data:
Linear classifiers really rely on something called the kernel trick. This is the application of a kernel function to the data which warps the feature-space such that classes in the data become linearly separable:
Here in two dimensions the classes are not linearly separable, but when the radii of the points are calculated, the data can be projected into a third dimension in which the classes can be separated by a simple linear boundary (a flat plane).
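The radius projection can be sketched in a few lines of NumPy. The two-ring dataset and the threshold of 1.5 are invented for illustration, and strictly speaking this computes the new feature explicitly, whereas kernel methods obtain the same effect implicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: class 0 is an inner disc, class 1 a surrounding ring
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
x = radii * np.cos(angles)
y = radii * np.sin(angles)

# No straight line in (x, y) separates the classes, but adding the radius
# as a third feature makes a single flat cut sufficient
r = np.sqrt(x**2 + y**2)
predicted = (r > 1.5).astype(int)
accuracy = (predicted == labels).mean()
```

In the (x, y, r) space the cut `r > 1.5` is just a horizontal plane, which is exactly the kind of boundary a linear classifier can find.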
In the point-spiral example, a function of radius and azimuthal angle could be used to separate the classes in 1D. In high-energy physics, the mass of a decaying particle is often used to separate event processes. These ‘high-level’ features are often non-linear combinations of basic features, meaning that for linear classifiers it is often necessary to calculate them beforehand and then feed them into the classifier. This method of feature engineering requires a high level of domain knowledge; one has to realise a priori that certain combinations of features are likely to improve discrimination, and it is likely that other highly discriminant features will be missed.
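As an example of such a hand-built feature, the invariant mass of a two-particle system can be computed from the particles' four-momenta using m² = E² − |p|² (in natural units, with E and p summed over both particles). The function name `invariant_mass` and the example momenta below are illustrative only.

```python
import numpy as np

def invariant_mass(p1, p2):
    """Invariant mass of two particles given four-momenta (E, px, py, pz), natural units."""
    e, px, py, pz = np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float)
    m2 = e**2 - (px**2 + py**2 + pz**2)   # non-linear combination of the basic features
    return np.sqrt(max(m2, 0.0))          # guard against tiny negative rounding errors

# Two back-to-back massless particles, each with an energy of 45 (e.g. in GeV)
mass = invariant_mass([45.0, 0.0, 0.0, 45.0], [45.0, 0.0, 0.0, -45.0])
```

A linear classifier given only the eight momentum components cannot build this quantity itself, which is why it has to be calculated beforehand and supplied as an extra input.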
The neural-network, on the other hand, is able to learn these high-level features automatically. Effectively, by training the network to predict the class of a point the network learns how to project the data into a new dimension in which the points are most linearly separable.
Well, we finally got to the end. Hopefully this has been useful (and comprehensible) to you. If you are eager to try out some neural-networks then I can recommend this playground, and for your own work, Keras. Indeed if you want to see more recommendations for tools then check out my earlier post on what I use. I’d also like to recommend, and credit, Andrej Karpathy’s lecture series; well worth watching if you want to learn more.