Welcome back to the second part of my introduction to how neural networks function! If you missed the first part, you can read it here.
When we left off, we’d understood that a neural network aims to form a predictive model by building a mathematical map from features in the data to a desired output. This map takes the form of layers of neurons, each applying a basic function. The map is built by altering the weights each neuron applies to the inputs. By aiming to minimise the loss function, which characterises the performance of the network, the optimal values of these weights may be learnt. We found that this can be a difficult task due to the large number of free parameters, but luckily the loss function is populated by many equally optimal minima. We simply need to reach one, and can therefore employ the gradient descent algorithm.
Gradient descent involves evaluating the gradient of the loss function at the current point in parameter space, and descending in the direction of steepest slope. One way to evaluate the gradient would be to alter each weight by a small amount and check how the output changes: the numerical method. This is easy to do, but potentially time-consuming, since we then need to evaluate the loss function once per free parameter.
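As a rough sketch of the numerical method described above (the function names and the toy loss are my own assumptions, purely for illustration), note how the loop needs one extra loss evaluation for every parameter:

```python
def numerical_gradient(loss, params, eps=1e-6):
    """Estimate d(loss)/d(param) for every parameter by nudging one at a time."""
    base = loss(params)  # one evaluation at the current point
    grads = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += eps  # alter a single parameter by a small amount
        # one extra loss evaluation per free parameter
        grads.append((loss(nudged) - base) / eps)
    return grads


def toy_loss(p):
    # Hypothetical loss with its minimum at (1, -2); not taken from any real network
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


grads = numerical_gradient(toy_loss, [3.0, 0.0])
print(grads)  # both entries come out close to 4
```

With only two parameters this is cheap, but the cost grows linearly with the number of free parameters, which is what makes it impractical for a real network.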
The activation functions applied by the neurons, however, are generally chosen to be continuously differentiable; this means that the whole network from start to finish is differentiable and so we are able to analytically derive the gradient from just one evaluation.
Updating the network becomes a two-step process. First, we take a data point and do a forward pass through the network, then evaluate the loss function from the output. Next, we do a backwards pass, propagating the gradient of the loss function back through the network. This process of back-propagation allows us to analytically derive the effect of each free parameter on the output of the network, and hence on the loss. We can then adjust each parameter by taking a step down the gradient.
A simple example is given below.
Here we have a small network with three inputs: x, y, and z. These pass through two neurons and produce the output, g. Say we want to decrease the value of g: how should we alter the inputs of the network?
Let’s take a test point of x=3, y=-4, z=2 and evaluate the output. We have: f = x × y = -12, and g = f + z = -10.
Now let’s back-propagate the gradient through the network. The gradient of the output with respect to itself is simply one: ∂g/∂g = 1.
Moving through the “plus” neuron, we want to know the effect of input z on g. Since g = f + z, we have ∂g/∂z = 1.
Similarly, for f we have ∂g/∂f = 1.
Next we want to propagate the gradient through the “times” neuron to evaluate the effect of x and y on g.
From the chain rule we know: ∂g/∂x = (∂g/∂f) × (∂f/∂x).
We’ve already calculated the incoming gradient, ∂g/∂f = 1; we just need to multiply it by the local gradient, which for f = x × y is ∂f/∂x = y = -4.
So the effect of x on g is: ∂g/∂x = 1 × (-4) = -4.
Similarly for y: ∂g/∂y = (∂g/∂f) × (∂f/∂y) = 1 × x = 3.
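The forward and backward passes for this small network can be written out in a few lines of Python. This is just a sketch of the worked example; the variable names (dg_dx and so on) are my own:

```python
# Forward pass at the test point
x, y, z = 3.0, -4.0, 2.0
f = x * y  # "times" neuron: -12
g = f + z  # "plus" neuron: -10

# Backward pass, starting from the output
dg_dg = 1.0          # gradient of the output with respect to itself
dg_dz = dg_dg * 1.0  # a "plus" neuron passes the gradient through unchanged
dg_df = dg_dg * 1.0
dg_dx = dg_df * y    # chain rule: incoming gradient times local gradient (df/dx = y)
dg_dy = dg_df * x    # local gradient df/dy = x

print(dg_dx, dg_dy, dg_dz)  # -4.0 3.0 1.0
```

Notice that each gradient is built only from the incoming gradient and a value already known from the forward pass.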
Now that we know how each input affects the output (which plays the role of the loss function here), we can optimise the inputs by taking one step down the gradient. The size of this step is set by the learning rate. Let’s use a learning rate, 𝝁, of 0.1.
We update each input according to: x → x − 𝝁 × ∂g/∂x (and likewise for y and z).
The minus sign indicates we are moving down the gradient from our starting value by an amount proportional to the slope of the loss function at that value.
Updating all our inputs gives x=3.4, y=-4.3, and z=1.9. Evaluating g, we find its value is now -12.72: it has decreased by 2.72!
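To check the arithmetic, here is a small sketch of the update step, using the gradients derived above. The helper names forward and gradients are hypothetical, chosen just for this example:

```python
def forward(x, y, z):
    # g = (x * y) + z, the two-neuron network from the example
    return x * y + z


def gradients(x, y, z):
    # Analytic gradients from back-propagation: dg/dx = y, dg/dy = x, dg/dz = 1
    return y, x, 1.0


learning_rate = 0.1
x, y, z = 3.0, -4.0, 2.0

g_before = forward(x, y, z)
dx, dy, dz = gradients(x, y, z)

# Step down the gradient: new value = old value - learning rate * gradient
x, y, z = x - learning_rate * dx, y - learning_rate * dy, z - learning_rate * dz

g_after = forward(x, y, z)
print(round(g_before - g_after, 2))  # a decrease of roughly 2.72
```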
Of course, in a real network it doesn’t make sense to alter the data; instead, we multiply each input by a weight. This multiplication is effectively a sub-neuron, meaning that we can propagate the gradient into it in order to learn the weight’s ideal value.
Let’s generalise this: We have a neuron which weights its inputs, applies some function to them, and then produces an output value.
These inputs come from neurons in the previous layer, and the outputs are passed to neurons in the next layer.
While calculating its output, the neuron can also calculate the local gradient for each of its free parameters, since these are analytic functions independent of the global gradient.
Eventually the forward pass finishes, the network produces its output, and we can evaluate the loss function. We then begin the backwards pass and propagate the gradient of the loss function backwards through the network. Eventually it reaches our neuron; all that needs to happen is that we multiply the incoming gradient by the local gradient and pass the result on to the next layer in the backwards direction (the chain rule).
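A minimal sketch of such a neuron might look like the following. I'm assuming a sigmoid activation and no bias term purely for illustration (the discussion above doesn't fix a particular function), and the class and attribute names are hypothetical:

```python
import math


class SigmoidNeuron:
    """Hypothetical single neuron: output = sigmoid(sum of w_i * x_i)."""

    def __init__(self, weights):
        self.weights = list(weights)

    def forward(self, inputs):
        s = sum(w * x for w, x in zip(self.weights, inputs))
        out = 1.0 / (1.0 + math.exp(-s))
        d_out_ds = out * (1.0 - out)  # derivative of the sigmoid
        # Local gradients, computed and cached during the forward pass
        self.local_grad_w = [d_out_ds * x for x in inputs]        # d(out)/d(w_i)
        self.local_grad_x = [d_out_ds * w for w in self.weights]  # d(out)/d(x_i)
        return out

    def backward(self, incoming_grad):
        # Chain rule: incoming gradient times local gradient
        self.grad_w = [incoming_grad * g for g in self.local_grad_w]
        # Gradient with respect to the inputs, handed back to the previous layer
        return [incoming_grad * g for g in self.local_grad_x]


neuron = SigmoidNeuron([0.5, -0.25])
out = neuron.forward([2.0, 4.0])    # the weighted sum is 0 here, so out = 0.5
grad_inputs = neuron.backward(1.0)  # pretend the incoming gradient is 1
```

The neuron never needs to know anything about the rest of the network: it only needs its cached local gradients and whatever gradient arrives from the layer after it.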
Once the backwards pass finishes, we take one step down the gradient by updating each free parameter in the network according to the slope of the loss function in that dimension in parameter space. By iterating this procedure, we should eventually arrive at an optimum set of parameters.
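As a toy illustration of iterating this procedure, the sketch below learns a single weight. The data point, the target, the squared-error loss, and the number of steps are all assumptions of mine, chosen so that the ideal weight is 3:

```python
# Hypothetical setup: one input, one weight, and a squared-error loss
x, target = 2.0, 6.0  # the ideal weight is target / x = 3
w = 0.0               # initial guess for the weight
learning_rate = 0.1

for step in range(50):
    # Forward pass
    prediction = w * x
    loss = (prediction - target) ** 2

    # Backward pass (chain rule)
    dloss_dpred = 2.0 * (prediction - target)  # gradient of the loss w.r.t. the output
    dpred_dw = x                               # local gradient of the "times" sub-neuron
    dloss_dw = dloss_dpred * dpred_dw

    # One step down the gradient
    w -= learning_rate * dloss_dw

print(w)  # very close to 3
```

A real network repeats exactly this loop, just with many more parameters and many data points.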
The fundamentals of this method of learning optimum weights by back-propagation were first proposed back in 1960 (Kelley), developed over the following 22 years, first applied to neural networks in 1982 (Werbos), and shown to be useful in training them in 1986 (Rumelhart, Hinton, & Williams). However, neural networks only started outperforming other methods in 2010; that’s a whole 28 years later! What else was missing? Find out next time in the third and final (penultimate) instalment of this gripping trilogy (quadrology)!
P.S. I should mention that having at least a basic grasp of how back-prop works really is essential for understanding some of the problems early NNs faced. I’d encourage readers to try making up their own arbitrary nets and repeating the forwards-backwards pass exercise. If you have the time, this lecture by Andrej Karpathy is very useful (as indeed are the rest in the series), and was how I learnt about back-prop. You’ll also get a sneak peek at the topics of the next post.