Backprop in Neural Networks

This article serves as a good exercise to see how forward propagation works and how the gradients are then computed to implement the backpropagation algorithm. It should also make the reader comfortable with computing vector and tensor derivatives and with vector/matrix calculus. A useful document can be found here for the interested reader to get familiar with tensor operations.

We will compute the gradients of the loss function of the neural network shown in Fig 1. with respect to the parameters $W_1$, $W_2$, $b_1$ and $b_2$, where $W_1$, $W_2$ are the weight matrices and $b_1$, $b_2$ are the bias vectors. Let $x \in \mathbb{R}^2$, $W_1 \in \mathbb{R}^{2 \times 500}$, $b_1 \in \mathbb{R}^{500}$, $W_2 \in \mathbb{R}^{500 \times 2}$ and $b_2 \in \mathbb{R}^2$. We will also show how the forward and backpropagation algorithms work.

Fig 1. Simple Feedforward Neural Network
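
To make these shapes concrete, here is a minimal NumPy sketch that allocates parameters of exactly the sizes stated above. The variable names, random seed and initialization scale are illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes as stated above: x in R^2, 500 hidden units, 2 output units.
W1 = rng.normal(scale=0.01, size=(2, 500))   # W1 in R^{2 x 500}
b1 = np.zeros(500)                           # b1 in R^{500}
W2 = rng.normal(scale=0.01, size=(500, 2))   # W2 in R^{500 x 2}
b2 = np.zeros(2)                             # b2 in R^{2}

x = rng.normal(size=(1, 2))                  # a single input, kept as a row vector
```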

Let us first compute the forward propagation. Let $x$ be the input. The first hidden layer is computed as follows:

$$z_1 = xW_1 + b_1$$

We then apply a non-linear activation function to obtain

$$a_1 = \tanh(z_1)$$

The output layer's activations are obtained using the following transformation:

$$z_2 = a_1 W_2 + b_2$$

Finally, a softmax is applied to get:

$$a_2 = \hat{y} = \mathrm{softmax}(z_2)$$

where $\hat{y}$ is the output predicted by the feedforward network.
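
The forward pass above translates almost line by line into NumPy. Below is a minimal sketch that continues the parameter setup above; the helper names `softmax` and `forward` are my own, chosen for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating, for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1          # z1 = x W1 + b1
    a1 = np.tanh(z1)          # a1 = tanh(z1)
    z2 = a1 @ W2 + b2         # z2 = a1 W2 + b2
    y_hat = softmax(z2)       # y_hat = softmax(z2)
    # The intermediate values are cached because the backward pass reuses them.
    return y_hat, (x, z1, a1, z2)
```

Caching $z_1$, $a_1$ and $z_2$ during the forward pass is what lets backpropagation avoid recomputing them.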

Let's see this feedforward network through a circuit diagram as illustrated in Fig 2. Now, let's see how the derivatives are computed with respect to the hidden nodes and bias vectors (refer to Fig 3).

Fig 2. Feedforward Circuit Diagram
Fig 3. Backpropagation Circuit Diagram

We have to compute the derivative of the loss function with respect to $W_1$, $W_2$, $b_1$ and $b_2$, that is, see the effect of these parameters on the loss function (which we actually want to minimize).

Below are the steps to compute the various gradients, as shown in Fig 3:

$$Loss = -\left[\, y \ln(\sigma(z_2)) + (1-y)\ln(1-\sigma(z_2)) \,\right]$$

Also, note that:

$$\frac{d\sigma(z_2)}{dz_2} = \sigma(z_2)\left(1 - \sigma(z_2)\right), \qquad \frac{d\ln(\sigma(z_2))}{dz_2} = 1 - \sigma(z_2), \qquad \frac{d\ln(1-\sigma(z_2))}{dz_2} = -\sigma(z_2)$$

Therefore, since $\sigma(z_2) = \hat{y}$:

$$\frac{\partial Loss}{\partial z_2} = \sigma(z_2) - y = \hat{y} - y$$
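
For completeness, this result follows by the chain rule through $\sigma$, using the derivatives noted above:

$$\frac{\partial Loss}{\partial z_2} = \frac{\partial Loss}{\partial \sigma(z_2)} \cdot \frac{d\sigma(z_2)}{dz_2} = \left(-\frac{y}{\sigma(z_2)} + \frac{1-y}{1-\sigma(z_2)}\right)\sigma(z_2)\left(1-\sigma(z_2)\right) = \sigma(z_2) - y$$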

$$\frac{\partial z_2}{\partial W_2} = \frac{\partial (a_1 W_2 + b_2)}{\partial W_2} = a_1$$

$$\frac{\partial z_2}{\partial b_2} = 1$$

$$\frac{\partial z_2}{\partial \tanh(z_1)} = \frac{\partial (a_1 W_2 + b_2)}{\partial \tanh(z_1)} = \frac{\partial (\tanh(z_1)\, W_2 + b_2)}{\partial \tanh(z_1)} = W_2$$

$$\frac{\partial \tanh(z_1)}{\partial z_1} = 1 - \tanh^2(z_1)$$

$$\frac{\partial z_1}{\partial W_1} = \frac{\partial (xW_1 + b_1)}{\partial W_1} = x$$

$$\frac{\partial z_1}{\partial b_1} = \frac{\partial (xW_1 + b_1)}{\partial b_1} = 1$$

Finally, we can now use the chain rule to compute the effect of the four parameters, namely $W_1$, $W_2$, $b_1$ and $b_2$, on the loss function.

In what follows, $P^T$ indicates the transpose of some matrix or vector $P$.

$$\frac{\partial Loss}{\partial W_2} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial W_2} = a_1^T\,(\hat{y} - y)$$

$$\frac{\partial Loss}{\partial b_2} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial b_2} = \hat{y} - y$$

$$\frac{\partial Loss}{\partial W_1} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial z_1}\,\frac{\partial z_1}{\partial W_1} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial \tanh(z_1)}\,\frac{\partial \tanh(z_1)}{\partial z_1}\,\frac{\partial z_1}{\partial W_1} = x^T\left[(\hat{y} - y)\,W_2^T \circ \left(1 - \tanh^2(z_1)\right)\right]$$

where $\circ$ denotes the element-wise product.

$$\frac{\partial Loss}{\partial b_1} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial z_1}\,\frac{\partial z_1}{\partial b_1} = \frac{\partial Loss}{\partial z_2}\,\frac{\partial z_2}{\partial \tanh(z_1)}\,\frac{\partial \tanh(z_1)}{\partial z_1}\,\frac{\partial z_1}{\partial b_1} = (\hat{y} - y)\,W_2^T \circ \left(1 - \tanh^2(z_1)\right)$$
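
Putting the four gradients together, here is a minimal NumPy sketch of the backward pass for a single example, continuing the forward sketch above. The function name `backward` and the cache layout are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def backward(y, y_hat, cache, W2):
    x, z1, a1, z2 = cache
    delta2 = y_hat - y                                   # dLoss/dz2 = y_hat - y
    dW2 = a1.T @ delta2                                  # dLoss/dW2 = a1^T (y_hat - y)
    db2 = delta2.sum(axis=0)                             # dLoss/db2 = y_hat - y
    delta1 = (delta2 @ W2.T) * (1.0 - np.tanh(z1) ** 2)  # (y_hat - y) W2^T, elementwise-times (1 - tanh^2(z1))
    dW1 = x.T @ delta1                                   # dLoss/dW1 = x^T [ ... ]
    db1 = delta1.sum(axis=0)                             # dLoss/db1
    return dW1, db1, dW2, db2
```

The multiplication `delta2 @ W2.T` applies $W_2^T$, and the element-wise factor `1 - np.tanh(z1) ** 2` implements the $\circ$ term in the expressions above.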

Hence, we have computed both the forward propagation and backpropagation for the given multi-layer neural network.
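
As a sanity check on the hand-derived gradients, we can compare one analytic entry against a centred finite-difference estimate, reusing the `forward` and `backward` sketches above. The one-hot target `y`, the loss helper and the perturbed entry are arbitrary choices for illustration; for a one-hot $y$ the softmax cross-entropy used here has the same gradient $\hat{y} - y$ with respect to $z_2$ as derived above.

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Softmax cross-entropy: -sum_k y_k ln(y_hat_k) for a one-hot target y.
    return -np.sum(y * np.log(y_hat))

y = np.array([[1.0, 0.0]])                    # arbitrary one-hot target

y_hat, cache = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(y, y_hat, cache, W2)

# Centred finite-difference estimate for a single entry of W1.
eps = 1e-5
W1_pert = W1.copy()
W1_pert[0, 0] += eps
loss_plus = cross_entropy(y, forward(x, W1_pert, b1, W2, b2)[0])
W1_pert[0, 0] -= 2 * eps
loss_minus = cross_entropy(y, forward(x, W1_pert, b1, W2, b2)[0])

print((loss_plus - loss_minus) / (2 * eps), dW1[0, 0])  # the two numbers should match closely
```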