by Giles Strong

Welcome to the third part of my introduction to understanding neural networks; previous parts here and here in case you missed them.

So it’s 1986, we’ve got a mathematically sensible way of optimising our networks, but they’re still not as performant as other methods… Well, we know that adding more layers will increase their power, let’s just keep making them larger. Oh no! Now the network no longer trains! It just sits there refusing to optimise.

Eventually Hinton and Salakhutdinov come along in 2006. They’ve developed a method of pre-training each layer in the network. Now when we sandwich the pre-trained layers together suddenly back-propagation works again and we can build larger models! This method of pre-training is quite time-consuming though. Is it really necessary?  Let’s take a look at our networks and try and find out why they wouldn’t train.

## Problem 1 – Activation function

During the development of neural networks, they were often likened to neurons in the brain, which transfer signal via a firing rate. This connectionist idea was one of the inspirations for the common choice of activation function being the sigmoid. Bounded in output between 0 and 1, it effectively squashes incoming information whilst providing a smooth, symmetric turn-on, meaning that it’s continuously differentiable. It seems like a nice choice, but let’s take a closer look at it.

One thing we can see is that the gradient converges asymptotically to zero at large absolute values of x. This means that if the activation of the neuron is not in the central region, the local gradient is zero. During back-propagation, the incoming gradient gets multiplied by this local gradient, meaning that sigmoid neurons can easily ‘kill’ gradients. This prevents the neurons from being able update their weights since the updates depend on the gradient; if this is zero, then the weights stay the same for every iteration.

The second problem is that the outputs are not zero-centred. This means that the gradient propagated to the weights is either always positive or always negative. If the optimum set of weights happens to be a mixture of positive and negative weights, then this can only be reached by a series of inefficient zigzag updates.

The third problem, although only minor, is that the exponential function in the sigmoid can be expensive to compute.

An attempt to fix one of these problems is found in using the hyperbolic tangent as an activation function; the shape is effectively the same, except that the output is now between -1 and 1, meaning that it is zero-centred. It still features saturation and an exponential function, though.

A better solution is to use a rectified linear-unit (ReLU).

Here the gradient never saturates in the positive region, and its very quick to evaluate. It still has gradient saturation in the negative region, features non zero-centred outputs, and depending on weight initialisation can never activate (or get thrown out of activation). Still, its use was shown in 2012 by Hinton, Krizhevsky, and Sutskever to provide six times quicker convergence over tanh.

There are various adaptations which attempt to address these problems, such as ELU, PReLU, and LeakyReLU, which might be worth a look if you’re developing your own networks. Still, by reducing the amount of dead gradient in the network we’ve managed to solve one of the problems in getting large networks to train. Next!

## Problem 2 – Initialisation

In order to optimise the weights in the network, we need to have some initial values to start from. We can’t set them all to the same value, because then the network would all respond in the same way, the gradients would be equal, and nothing would train.

The default approach is to sample the weights from a Gaussian distribution and times them by some factor. If this factor is set too high then the tanh neurons will saturate for even small inputs, causing gradients to drop to zero. If the factor is very low, the output of the neurons is tiny and the gradient propagated to the weights (which is proportional to the input from the previous layer) vanishes. The factor would have to be set carefully by hand in order to allow the network to even begin training.

That was until 2010, when Bengio and Glorot derived a mathematically motivated initialisation scheme (Xavier initialisation). Their derivation boils down to setting the factor equal to one over the square root of the number of inputs to a neuron. Neurons with lots of inputs have lower weights, whereas ones with fewer inputs get larger weights. Activation levels are consistent across the network and at a level where, for unit-Gaussian inputs, do not lead to vanishing gradients.

A minor modification is required for ReLU-based networks in order to avoid lots of dead ReLUs. This is to divide the number of inputs by two when calculating the factor, (He et al 2015).

## Problem 3 – convergence time

As seen in part one, the gradient descent algorithm works fine for finding the local minima, however one side effect of moving in the direction of steepest slope is that it becomes very easy to get stuck in gently sloping valleys.

The updates cause oscillations between the sides of the valley, whereas the optimal path would be to moving along the centre of the valley floor.

The standard updates to the parameters are as follows:

An improvement would be to allow the updates to accumulate momentum. This momentum would cancel out in the inter-hill oscillations but would accumulate along the valley floor, reducing the time spent in the valley.

This addition of momentum leads to quicker convergence, but can cause the updates to overshoot the optimum point, causing the updates to have to backtrack. An improvement is to realise that the steps in parameter space are the vectorial sum of a momentum update and a gradient update.

Regardless of what the gradient is, we are going to make the same momentum step. Let’s make the momentum step first and then evaluate the gradient. This effectively provides a one-step look-ahead and leads to quicker convergence. This type of momentum update is referred to as Nestorov-assisted gradient descent.

An alternative way of reducing convergence time is to allow the learning rate to adapt to the gradients; for steep slopes we want to take small steps, but for shallow slopes we want to take larger steps. The ADAGRAD algorithm (Duchi, Hazan, and Singha 2011) assigns each parameter a learning rate and divides it by the square root of the square sum of past gradients.

One problem with this is that over time the learning rate drops to zero. By allowing the store of past gradients to decay (RMSProp, Hinton & Tieleman 2012) the scaling instead becomes a moving average of recent gradients.

Both scaling of the learning rate and accumulation of momentum results in reduced convergence time, so why not combine them? ADAM (Ba & Kingma 2014) and NADAM (Dozat 2015) do just this!

Comparing the convergence of the optimisers, we see that: the normal gradient-descent in red moves very slowly as the gradient reduces; the addition of momentum (green) allows for a much faster traversal of the shallow regions, but overshoots the minimum; the one-step look-ahead offered by adding Nestorov assistance (magenta) causes the updates to curve in earlier, reducing the overshoot; the adaptive optimisers (ada* & rmsprop), follow the same general trajectory as sgd, but again move much faster through the shallow slopes.

Well now we’ve finally fixed the problems with our networks and they’re outperforming other machine learning tools, but we can go even further than simply fixing the problems. In the fourth part we’ll going through some of the additions we can make to neural networks in order to make them even better!