Backprop in Neural Networks
This article serves as a good exercise in working through forward propagation and then computing the gradients needed to implement the backpropagation algorithm. Along the way, the reader will get comfortable with vector and tensor derivatives and with vector/matrix calculus. A useful document can be found here for the interested reader to get familiar with tensor operations.
We will compute the gradients and derivatives of the loss function of the neural network shown in Fig. 1 with respect to the parameters $W_1$, $W_2$, $b_1$ and $b_2$, where $W_1$, $W_2$ are the weight matrices and $b_1$, $b_2$ are the bias vectors. Let $x \in \mathbb{R}^2$, $W_1 \in \mathbb{R}^{2 \times 500}$, $b_1 \in \mathbb{R}^{500}$, $W_2 \in \mathbb{R}^{500 \times 2}$ and $b_2 \in \mathbb{R}^2$. We will also show how the forward and backpropagation algorithms work.

Let us first compute the forward propagation. Let $x$ be the input. The first hidden layer is computed as follows:
$$z_1 = xW_1 + b_1$$
We then apply a non-linear activation function to obtain
$$a_1 = \tanh(z_1)$$
The output layer's activations are obtained using the following transformation:
$$z_2 = a_1 W_2 + b_2$$
Finally, a softmax is applied to get:
$$a_2 = \hat{y} = \mathrm{softmax}(z_2)$$
where $\hat{y}$ is the output predicted by the feedforward network.
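To make the forward pass concrete, below is a minimal NumPy sketch using the layer sizes stated above (2 → 500 → 2). The random initialization, the example input, and the variable names are illustrative assumptions rather than part of the original network.

```python
import numpy as np

# Minimal sketch of the forward pass; shapes follow the text (2 -> 500 -> 2).
# The initialization scale and the example input are illustrative assumptions.
rng = np.random.default_rng(0)

x  = rng.normal(size=(1, 2))            # input row vector, x in R^2
W1 = 0.01 * rng.normal(size=(2, 500))   # W1 in R^(2 x 500)
b1 = np.zeros(500)                      # b1 in R^500
W2 = 0.01 * rng.normal(size=(500, 2))   # W2 in R^(500 x 2)
b2 = np.zeros(2)                        # b2 in R^2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

z1 = x @ W1 + b1          # z1 = xW1 + b1
a1 = np.tanh(z1)          # a1 = tanh(z1)
z2 = a1 @ W2 + b2         # z2 = a1W2 + b2
y_hat = softmax(z2)       # a2 = y_hat = softmax(z2)
```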
Let's look at this feedforward network as a circuit diagram, as illustrated in Fig. 2. Now, let's see how the derivatives are computed with respect to the hidden nodes and bias vectors. Refer to Fig. 3.


We have to compute the derivative of the loss function with respect to $W_1$, $W_2$, $b_1$ and $b_2$, that is, the effect of these parameters on the loss function (which we actually want to minimize).
Below are the steps to compute the various gradients shown in Fig. 3:
$$\mathrm{Loss} = -\left[\, y \ln(\sigma(z_2)) + (1 - y)\ln(1 - \sigma(z_2)) \,\right]$$
Also, note that:
$$\frac{d\sigma(z_2)}{dz_2} = \sigma(z_2)\,(1 - \sigma(z_2))$$
Therefore,
$$\frac{\partial \mathrm{Loss}}{\partial z_2} = \sigma(z_2) - y = \hat{y} - y$$
since $\sigma(z_2) = \hat{y}$.
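For completeness, the sigmoid derivative used above follows from a one-line calculation:
$$\frac{d}{dz}\,\sigma(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\,(1-\sigma(z))$$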
$$\frac{\partial z_2}{\partial W_2} = \frac{\partial (a_1 W_2 + b_2)}{\partial W_2} = a_1$$
$$\frac{\partial z_2}{\partial b_2} = 1$$
$$\frac{\partial z_2}{\partial \tanh(z_1)} = \frac{\partial (a_1 W_2 + b_2)}{\partial \tanh(z_1)} = \frac{\partial (\tanh(z_1) W_2 + b_2)}{\partial \tanh(z_1)} = W_2$$
$$\frac{\partial \tanh(z_1)}{\partial z_1} = 1 - \tanh^2(z_1)$$
$$\frac{\partial z_1}{\partial W_1} = \frac{\partial (xW_1 + b_1)}{\partial W_1} = x$$
$$\frac{\partial z_1}{\partial b_1} = \frac{\partial (xW_1 + b_1)}{\partial b_1} = 1$$
Finally, we can now use the chain rule to compute the effect of the four parameters, namely $W_1$, $W_2$, $b_1$ and $b_2$, on the loss function.
In what follows, $P^T$ indicates the transpose of some matrix or vector $P$, and $\circ$ denotes element-wise multiplication.
$$\frac{\partial \mathrm{Loss}}{\partial W_2} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2} = a_1^T (\hat{y} - y)$$
$$\frac{\partial \mathrm{Loss}}{\partial b_2} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} = (\hat{y} - y)$$
$$\frac{\partial \mathrm{Loss}}{\partial W_1} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial \tanh(z_1)} \cdot \frac{\partial \tanh(z_1)}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1} = x^T \left[ (\hat{y} - y) W_2^T \circ \bigl(1 - \tanh^2(z_1)\bigr) \right]$$
$$\frac{\partial \mathrm{Loss}}{\partial b_1} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = \frac{\partial \mathrm{Loss}}{\partial z_2} \cdot \frac{\partial z_2}{\partial \tanh(z_1)} \cdot \frac{\partial \tanh(z_1)}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = (\hat{y} - y) W_2^T \circ \bigl(1 - \tanh^2(z_1)\bigr)$$
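Putting these results together, here is a minimal NumPy sketch of the backward pass, continuing from the forward-pass snippet above; the one-hot label `y` is an illustrative assumption.

```python
# Minimal sketch of the backward pass, reusing x, z1, a1, W2, y_hat from the forward pass.
y = np.array([[1.0, 0.0]])                  # illustrative one-hot target

delta2 = y_hat - y                          # dLoss/dz2 = (y_hat - y)
dW2 = a1.T @ delta2                         # dLoss/dW2 = a1^T (y_hat - y)
db2 = delta2.sum(axis=0)                    # dLoss/db2 = (y_hat - y)

delta1 = (delta2 @ W2.T) * (1.0 - np.tanh(z1) ** 2)   # backprop through tanh
dW1 = x.T @ delta1                          # dLoss/dW1 = x^T delta1
db1 = delta1.sum(axis=0)                    # dLoss/db1 = delta1
```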
Hence, we have worked through both forward propagation and backpropagation for the given multi-layer neural network.
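As an optional sanity check, not part of the derivation itself, the analytical gradients can be compared against a centered finite-difference estimate. The snippet below reuses the variables from the sketches above and checks a single, arbitrarily chosen entry of $W_2$.

```python
# Finite-difference check of one entry of dW2 against the analytical gradient.
def loss_for(W2_candidate):
    p = softmax(a1 @ W2_candidate + b2)
    return -np.sum(y * np.log(p))            # cross-entropy for the one-hot target

eps = 1e-5
i, j = 0, 1                                  # arbitrary entry of W2 to check
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[i, j] += eps
W2_minus[i, j] -= eps

numerical = (loss_for(W2_plus) - loss_for(W2_minus)) / (2 * eps)
print(numerical, dW2[i, j])                  # the two values should agree closely
```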