A detailed, mathematical illustration of backpropagation in deep neural networks.
Setup and notation
Consider an L-layer Feedforward Neural Network (FNN/MLP). For layer $l=1,\dots,L$:
- $n_{l}$ = number of units in layer $l$.
- Input: $a^{(0)} = x \in \mathbb{R}^{n_0}$.
- Linear pre-activation: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, where $W^{(l)}\in\mathbb{R}^{n_l\times n_{l-1}}$, $b^{(l)}\in\mathbb{R}^{n_l}$.
- Activation: $a^{(l)} = \phi^{(l)}(z^{(l)})$ (applied elementwise).
- Output $a^{(L)}$. Loss for one example: $\mathcal{L} = \mathcal{L}(a^{(L)}, y)$.
We want gradients:
$$ \frac{\partial \mathcal{L}}{\partial W^{(l)}},\quad \frac{\partial \mathcal{L}}{\partial b^{(l)}},\quad \frac{\partial \mathcal{L}}{\partial a^{(l)}}. $$

Key object (error at layer $l$):

$$ \delta^{(l)} \equiv \frac{\partial \mathcal{L}}{\partial z^{(l)}} \in \mathbb{R}^{n_l}. $$

This is the derivative of the loss with respect to the pre-activation.
The core chain-rule recurrence
By chain rule, for the output layer:
$$ \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \odot \phi^{(L)\prime}(z^{(L)}), $$

where $\odot$ denotes the elementwise (Hadamard) product and $\phi'$ is the elementwise derivative.
For a hidden layer $l$ (moving backward):
$$ \delta^{(l)} = \big(W^{(l+1)\top}\, \delta^{(l+1)}\big) \odot \phi^{(l)\prime}(z^{(l)}). $$

This is the standard backprop recurrence: propagate the next layer's error through the transpose of the weights, then multiply elementwise by the local activation derivative.
Gradients for parameters follow directly:
$$ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)}\, a^{(l-1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)} \quad(\text{or sum over mini-batch}). $$

(If using a minibatch of size $m$, let $\Delta^{(l)}\in\mathbb{R}^{n_l\times m}$ be the matrix of $\delta^{(l)}$ for each example; then $\nabla_{W^{(l)}} = \frac{1}{m}\Delta^{(l)}A^{(l-1)\top}$.)
Dimensions sanity check
- $W^{(l)}$ is $n_l\times n_{l-1}$.
- $a^{(l-1)}$ is $n_{l-1}\times 1$ for a single example.
- $\delta^{(l)}$ is $n_l\times 1$.
- $\delta^{(l)}\,a^{(l-1)\top}$ is $(n_l\times1)(1\times n_{l-1}) = n_l\times n_{l-1}$, matches $W^{(l)}$.
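As a concrete reference, here is a minimal NumPy sketch of the recurrence and parameter gradients above for a single example. It assumes 1-D vectors per sample; the argument names (`Ws`, `bs`, `phis`, `dphis`, `dL_da_fn`) are my own, not from any particular library, and the `assert` mirrors the dimension check above.

```python
import numpy as np

def backprop_single(x, Ws, bs, phis, dphis, dL_da_fn, y):
    """Ws[l], bs[l]: weights/biases of layer l+1; phis/dphis: activations and
    their elementwise derivatives; dL_da_fn(a_L, y): gradient of the loss
    w.r.t. the network output."""
    # Forward pass: cache every pre-activation z and activation a.
    a, zs, activations = x, [], [x]
    for W, b, phi in zip(Ws, bs, phis):
        z = W @ a + b
        a = phi(z)
        zs.append(z)
        activations.append(a)

    # Output-layer error: delta^(L) = dL/da^(L) ⊙ phi'(z^(L)).
    delta = dL_da_fn(activations[-1], y) * dphis[-1](zs[-1])

    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads_W[l] = np.outer(delta, activations[l])  # delta^(l) a^(l-1)^T
        grads_b[l] = delta
        assert grads_W[l].shape == Ws[l].shape        # n_l x n_{l-1}
        if l > 0:
            # Hidden-layer recurrence: delta^(l) = (W^(l+1)^T delta^(l+1)) ⊙ phi'(z^(l)).
            delta = (Ws[l].T @ delta) * dphis[l - 1](zs[l - 1])
    return grads_W, grads_b
```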
Important activation derivatives (elementwise)
- Sigmoid: $\sigma(z)=\frac{1}{1+e^{-z}}$. $\sigma'(z)=\sigma(z)(1-\sigma(z))$.
- Tanh: $\tanh'(z)=1-\tanh^2(z)$.
- ReLU: $\mathrm{ReLU}(z)=\max(0,z)$. $\mathrm{ReLU}'(z)=\mathbf{1}_{z>0}$ (subgradient at 0).
- Linear: $\phi(z)=z \Rightarrow \phi'(z)=1$.
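For completeness, the same derivatives as NumPy one-liners (a small illustrative sketch; the function names are mine):

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))   # sigma(z)(1 - sigma(z))

def dtanh(z):    return 1.0 - np.tanh(z) ** 2             # 1 - tanh^2(z)

def relu(z):     return np.maximum(0.0, z)
def drelu(z):    return (z > 0).astype(float)             # subgradient: 0 at z = 0

def dlinear(z):  return np.ones_like(z)                   # phi(z) = z
```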
Softmax + cross-entropy (common simplification)
For a final layer pre-activation $z\in\mathbb{R}^K$, softmax
$$ p_i=\mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}}. $$

If the loss is categorical cross-entropy with one-hot $y$,

$$ \mathcal{L} = -\sum_{i} y_i \log p_i, $$

then the gradient of the loss w.r.t. $z$ simplifies to

$$ \delta = p - y. $$

This avoids computing the $K\times K$ Jacobian of the softmax explicitly and is the standard numerically stable shortcut used in practice.
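A hedged sketch of this shortcut for one example, using the log-sum-exp trick for stability (the function name is illustrative, not a library API):

```python
import numpy as np

def softmax_ce_grad(z, y_onehot):
    """Return (loss, dL/dz) for softmax + categorical cross-entropy."""
    z_shift = z - np.max(z)                       # shift so exponentials cannot overflow
    log_probs = z_shift - np.log(np.sum(np.exp(z_shift)))
    loss = -np.sum(y_onehot * log_probs)          # -sum_i y_i log p_i
    delta = np.exp(log_probs) - y_onehot          # gradient w.r.t. z: p - y
    return loss, delta
```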
Worked numeric example (one sample)
We’ll do a tiny network: 2 inputs $\to$ 2 hidden units (sigmoid) $\to$ 1 output (linear). Use mean-squared error $\mathcal{L}=\tfrac12 (a^{(2)}-y)^2$. I’ll show the forward pass and full backprop numbers.
Parameters & input
$$ x=\begin{bmatrix}1\\[2pt]2\end{bmatrix}, \quad W^{(1)}=\begin{bmatrix}0.1 & -0.2\\[3pt] 0.4 & 0.3\end{bmatrix}, \quad b^{(1)}=\begin{bmatrix}0.0\\[2pt]0.1\end{bmatrix}, $$

$$ W^{(2)}=\begin{bmatrix}0.2 & -0.5\end{bmatrix}, \quad b^{(2)}=0.0, \quad y=1.0. $$

Forward pass
- $z^{(1)} = W^{(1)}x + b^{(1)}$.
- $z^{(1)}_1 = 0.1\cdot1 + (-0.2)\cdot2 + 0.0 = 0.1 -0.4 = -0.3.$
- $z^{(1)}_2 = 0.4\cdot1 + 0.3\cdot2 + 0.1 = 0.4 +0.6 +0.1 = 1.1.$
- $a^{(1)} = \sigma(z^{(1)})$ (sigmoid elementwise).
- $\sigma(-0.3) \approx 0.425557483188341$ (since $e^{0.3}\approx1.349858$, $\sigma(-0.3)=1/(1+e^{0.3})$).
- $\sigma(1.1) \approx 0.7502601055951177.$
So $a^{(1)}\approx [0.4255575,\;0.7502601]^\top.$
- Output pre-activation $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} = 0.2\cdot 0.4255575 + (-0.5)\cdot 0.7502601 + 0.0 \approx -0.2900186.$
- Output $a^{(2)} = z^{(2)} \approx -0.2900186$ (linear).
- Loss $\mathcal{L} = \tfrac12 (a^{(2)}-y)^2 = \tfrac12(-0.29001856 - 1)^2 \approx 0.8320739376.$
Backward pass
- Output layer error (linear output, so $\phi^{(2)\prime}(z^{(2)})=1$): $\delta^{(2)} = (a^{(2)} - y)\cdot 1 \approx -0.2900186 - 1 = -1.2900186.$
Gradients for layer 2:
$$ \nabla_{W^{(2)}} = \delta^{(2)}\,a^{(1)\top} \approx [-1.29001856\cdot 0.4255575,\; -1.29001856\cdot 0.7502601] $$

$$ \Rightarrow \nabla_{W^{(2)}} \approx [-0.54897705,\; -0.96784946]. $$

$$ \nabla_{b^{(2)}} = \delta^{(2)} \approx -1.2900185562. $$

- Backpropagate to the hidden layer. First compute the sigmoid derivatives: $\sigma'(z^{(1)}) = a^{(1)}\odot(1-a^{(1)}) \approx [0.2444583,\; 0.1873699]^\top.$
Use recurrence $\delta^{(1)} = (W^{(2)\top}\delta^{(2)})\odot \sigma'(z^{(1)})$.
- $W^{(2)\top}\delta^{(2)} = \begin{bmatrix}0.2\\-0.5\end{bmatrix}(-1.2900185562) \approx \begin{bmatrix}-0.25800371\\ 0.645009278\end{bmatrix}.$
- Elementwise multiplication by $\sigma'(z^{(1)})$ gives $\delta^{(1)} \approx [-0.0630712,\; 0.1208553]^\top.$
Gradients for layer 1:
$$ \begin{align} \nabla_{W^{(1)}} &= \delta^{(1)} \, x^\top\\ &= \begin{bmatrix} -0.0630711517 \cdot 1 & -0.0630711517 \cdot 2\\[4pt] 0.1208553107 \cdot 1 & 0.1208553107 \cdot 2 \end{bmatrix}\\ &\approx \begin{bmatrix} -0.06307115 & -0.12614230\\[4pt] 0.12085531 & 0.24171062 \end{bmatrix}.\\ \nabla_{b^{(1)}} &= \delta^{(1)}\\ &\approx [-0.06307115,\; 0.12085531]^\top. \end{align} $$

So we have exact parameter gradients ready to pass to an optimizer.
(Those numbers are from the step-by-step arithmetic above — they match the chain-rule recurrence and dimensional checks.)
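The whole example fits in a few lines of NumPy; the script below is a sketch that reproduces the numbers above (comments show the rounded values).

```python
import numpy as np

x  = np.array([1.0, 2.0])
W1 = np.array([[0.1, -0.2], [0.4, 0.3]]); b1 = np.array([0.0, 0.1])
W2 = np.array([[0.2, -0.5]]);             b2 = np.array([0.0])
y  = 1.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x + b1              # [-0.3, 1.1]
a1 = sigmoid(z1)              # [0.4255575, 0.7502601]
z2 = W2 @ a1 + b2             # [-0.2900186]
a2 = z2                       # linear output
loss = 0.5 * (a2 - y) ** 2    # [0.8320739]

# Backward pass
delta2  = a2 - y                            # [-1.2900186] (phi' = 1 for linear output)
grad_W2 = np.outer(delta2, a1)              # [[-0.5489771, -0.9678495]]
grad_b2 = delta2
delta1  = (W2.T @ delta2) * a1 * (1 - a1)   # [-0.0630712, 0.1208553]
grad_W1 = np.outer(delta1, x)               # [[-0.0630712, -0.1261423], [0.1208553, 0.2417106]]
grad_b1 = delta1

print(loss, grad_W2, grad_b2, grad_W1, grad_b1)
```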
Computational perspective & implementation notes
Caching activations: during the forward pass, store $z^{(l)}$ and $a^{(l)}$ so that $\phi'(z^{(l)})$ can be computed cheaply during backprop.
Vectorization: do backprop for a batch of $m$ examples as matrix operations:
- If $A^{(l-1)}\in\mathbb{R}^{n_{l-1}\times m}$ and $\Delta^{(l)}\in\mathbb{R}^{n_l\times m}$ (each column is the $\delta$ of one sample), then $\nabla_{W^{(l)}} = \tfrac{1}{m}\Delta^{(l)}A^{(l-1)\top}$, $\nabla_{b^{(l)}} = \tfrac{1}{m}\Delta^{(l)}\mathbf{1}_m$ (the row-wise mean), and the recurrence becomes $\Delta^{(l)} = \big(W^{(l+1)\top}\Delta^{(l+1)}\big)\odot\phi^{(l)\prime}(Z^{(l)})$.
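A sketch of those batched operations, assuming the column convention above (each column of `A_prev` and `Delta` is one example); the function names are my own:

```python
import numpy as np

def layer_grads(Delta, A_prev):
    """Delta: (n_l, m) errors; A_prev: (n_{l-1}, m) previous-layer activations."""
    m = Delta.shape[1]
    grad_W = (Delta @ A_prev.T) / m      # (n_l, n_{l-1}), averaged over the batch
    grad_b = Delta.sum(axis=1) / m       # (n_l,)
    return grad_W, grad_b

def prev_delta(W_next, Delta_next, dphi_Z):
    """Recurrence: Delta^(l) = (W^(l+1)^T Delta^(l+1)) ⊙ phi'(Z^(l))."""
    return (W_next.T @ Delta_next) * dphi_Z
```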
Automatic differentiation (autodiff) builds the computational graph and applies the same chain rule automatically; backprop is simply efficient accumulation of the chain rule.
Why can gradients vanish or explode?
Backprop involves repeated multiplications by weight matrices and activation derivatives:
$$ \delta^{(l)} = \mathrm{diag}\big(\phi^{(l)\prime}(z^{(l)})\big)\bigg(\prod_{k=l+1}^{L} W^{(k)\top} \,\mathrm{diag}\big(\phi^{(k)\prime}(z^{(k)})\big)\bigg) \frac{\partial\mathcal{L}}{\partial a^{(L)}}. $$

Roughly, repeated multiplication can shrink the signal (if the relevant singular values are below 1) or grow it (if they are above 1). Activations like sigmoid/tanh have derivatives in $(0,1)$, which shrink the signal → vanishing gradients; large weight singular values lead to exploding gradients. Remedies: careful initialization (Xavier, He), batch normalization, skip connections (ResNets), activations that keep derivatives near 1 (ReLU variants), and gradient clipping.
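A small, assumption-laden experiment illustrating this: propagate a unit-norm error vector backward through many random layers and watch its norm shrink or blow up depending on the weight scale (the layer width, depth, and stand-in derivative distribution are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50

for scale in (0.5, 1.0, 2.0):
    delta = np.ones(n) / np.sqrt(n)               # unit-norm starting error
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))
        dphi = rng.uniform(0.0, 1.0, size=n)      # stand-in for a sigmoid-like derivative in (0, 1)
        delta = (W.T @ delta) * dphi              # the backprop recurrence
    print(f"scale={scale}: ||delta|| after {depth} layers = {np.linalg.norm(delta):.3e}")
```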
Complexity & memory
- Time: a forward + backward pass costs $O(\sum_l n_{l} n_{l-1})$ per example, dominated by the matrix multiplies.
- Memory: store the activations $a^{(l)}$ and pre-activations $z^{(l)}$ for every layer during the forward pass if gradients are computed afterwards; this is roughly $O(\sum_l n_l)$ per example (times the batch size).
Summary checklist (for coding/backprop)
- Forward compute and cache $z^{(l)}, a^{(l)}$.
- Compute $\delta^{(L)}$ from loss and output activation derivative (use softmax+CE shortcut if applicable).
- For $l=L$ down to $1$: compute gradients $\nabla_{W^{(l)}}, \nabla_{b^{(l)}}$ and compute $\delta^{(l-1)}$ via the recurrence.
- Update weights using optimizer (SGD, Adam, etc.).
- Watch numerical stability (use log-sum-exp for softmax, stable loss functions).
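For the weight-update step of the checklist, a minimal plain-SGD update, assuming gradients have been computed as above (a sketch, not a full training loop):

```python
def sgd_update(params, grads, lr=0.1):
    """params/grads: lists of (W, b) and (dW, db) pairs, matched layer by layer."""
    return [(W - lr * dW, b - lr * db) for (W, b), (dW, db) in zip(params, grads)]
```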