A detailed, mathematical illustration of backpropagation in deep neural networks.
Setup and notation
Consider an L-layer Feedforward Neural Network (FNN/MLP). For layer $l=1,\dots,L$:
- $n_{l}$ = number of units in layer $l$.
- Input: $a^{(0)} = x \in \mathbb{R}^{n_0}$.
- Linear pre-activation: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, where $W^{(l)}\in\mathbb{R}^{n_l\times n_{l-1}}$, $b^{(l)}\in\mathbb{R}^{n_l}$.
- Activation: $a^{(l)} = \phi^{(l)}(z^{(l)})$ (applied elementwise).
- Output $a^{(L)}$. Loss for one example: $\mathcal{L} = \mathcal{L}(a^{(L)}, y)$.
We want gradients:
$$ \frac{\partial \mathcal{L}}{\partial W^{(l)}},\quad \frac{\partial \mathcal{L}}{\partial b^{(l)}},\quad \frac{\partial \mathcal{L}}{\partial a^{(l)}}. $$

Key object (error at layer $l$):

$$ \delta^{(l)} \equiv \frac{\partial \mathcal{L}}{\partial z^{(l)}} \in \mathbb{R}^{n_l}. $$

This is the derivative of the loss with respect to the pre-activation.
The core chain-rule recurrence
By chain rule, for the output layer:
$$ \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \odot \phi^{(L)\prime}(z^{(L)}), $$

where $\odot$ denotes the elementwise (Hadamard) product and $\phi'$ is the elementwise derivative.
For a hidden layer $l$ (moving backward):
$$ \delta^{(l)} = \big(W^{(l+1)\top}\, \delta^{(l+1)}\big) \odot \phi^{(l)\prime}(z^{(l)}). $$

This is the standard backprop recurrence: propagate the next layer's error through the transpose of the weights, then multiply elementwise by the local activation derivative.
Gradients for parameters follow directly:
$$ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)}\, a^{(l-1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)} \quad(\text{or sum over mini-batch}). $$

(If using a minibatch of size $m$, let $\Delta^{(l)}\in\mathbb{R}^{n_l\times m}$ be the matrix of $\delta^{(l)}$ for each example; then $\nabla_{W^{(l)}} = \frac{1}{m}\Delta^{(l)}A^{(l-1)\top}$.)
Dimensions sanity check
- $W^{(l)}$ is $n_l\times n_{l-1}$.
- $a^{(l-1)}$ is $n_{l-1}\times 1$ for a single example.
- $\delta^{(l)}$ is $n_l\times 1$.
- $\delta^{(l)}\,a^{(l-1)\top}$ is $(n_l\times1)(1\times n_{l-1}) = n_l\times n_{l-1}$, matches $W^{(l)}$.
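As a concrete reference, here is a minimal NumPy sketch of the recurrence and parameter gradients above for a single example. It assumes 1-D vectors per sample; the argument names (`Ws`, `bs`, `phis`, `dphis`, `dL_da_fn`) are my own, not from any particular library, and the `assert` mirrors the dimension check above.

```python
import numpy as np

def backprop_single(x, Ws, bs, phis, dphis, dL_da_fn, y):
    """Ws[l], bs[l]: weights/biases of layer l+1; phis/dphis: activations and
    their elementwise derivatives; dL_da_fn(a_L, y): gradient of the loss
    w.r.t. the network output."""
    # Forward pass: cache every pre-activation z and activation a.
    a, zs, activations = x, [], [x]
    for W, b, phi in zip(Ws, bs, phis):
        z = W @ a + b
        a = phi(z)
        zs.append(z)
        activations.append(a)

    # Output-layer error: delta^(L) = dL/da^(L) ⊙ phi'(z^(L)).
    delta = dL_da_fn(activations[-1], y) * dphis[-1](zs[-1])

    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads_W[l] = np.outer(delta, activations[l])  # delta^(l) a^(l-1)^T
        grads_b[l] = delta
        assert grads_W[l].shape == Ws[l].shape        # n_l x n_{l-1}
        if l > 0:
            # Hidden-layer recurrence: delta^(l) = (W^(l+1)^T delta^(l+1)) ⊙ phi'(z^(l)).
            delta = (Ws[l].T @ delta) * dphis[l - 1](zs[l - 1])
    return grads_W, grads_b
```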
Important activation derivatives (elementwise)
- Sigmoid: $\sigma(z)=\frac{1}{1+e^{-z}}$. $\sigma'(z)=\sigma(z)(1-\sigma(z))$.
- Tanh: $\tanh'(z)=1-\tanh^2(z)$.
- ReLU: $\mathrm{ReLU}(z)=\max(0,z)$. $\mathrm{ReLU}'(z)=\mathbf{1}_{z>0}$ (subgradient at 0).
- Linear: $\phi(z)=z \Rightarrow \phi'(z)=1$.
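For completeness, the same derivatives as NumPy one-liners (a small illustrative sketch; the function names are mine):

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))   # sigma(z)(1 - sigma(z))

def dtanh(z):    return 1.0 - np.tanh(z) ** 2             # 1 - tanh^2(z)

def relu(z):     return np.maximum(0.0, z)
def drelu(z):    return (z > 0).astype(float)             # subgradient: 0 at z = 0

def dlinear(z):  return np.ones_like(z)                   # phi(z) = z
```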
Softmax + cross-entropy (common simplification)
For a final layer pre-activation $z\in\mathbb{R}^K$, softmax
$$ p_i=\mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}}. $$

If the loss is categorical cross-entropy with one-hot $y$,

$$ \mathcal{L} = -\sum_{i} y_i \log p_i, $$

then the gradient of the loss w.r.t. $z$ simplifies to

$$ \delta = p - y. $$

This avoids computing the $K\times K$ Jacobian of the softmax explicitly and is the standard numerically stable shortcut used in practice.
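A hedged sketch of this shortcut for one example, using the log-sum-exp trick for stability (the function name is illustrative, not a library API):

```python
import numpy as np

def softmax_ce_grad(z, y_onehot):
    """Return (loss, dL/dz) for softmax + categorical cross-entropy."""
    z_shift = z - np.max(z)                       # shift so exponentials cannot overflow
    log_probs = z_shift - np.log(np.sum(np.exp(z_shift)))
    loss = -np.sum(y_onehot * log_probs)          # -sum_i y_i log p_i
    delta = np.exp(log_probs) - y_onehot          # gradient w.r.t. z: p - y
    return loss, delta
```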
Worked numeric example (one sample)
We’ll do a tiny network: 2 inputs $\to$ 2 hidden units (sigmoid) $\to$ 1 output (linear). Use mean-squared error $\mathcal{L}=\tfrac12 (a^{(2)}-y)^2$. I’ll show the forward pass and full backprop numbers.
Parameters & input
$$ x=\begin{bmatrix}1\\[2pt]2\end{bmatrix}, \quad W^{(1)}=\begin{bmatrix}0.1 & -0.2\\[3pt] 0.4 & 0.3\end{bmatrix}, \quad b^{(1)}=\begin{bmatrix}0.0\\[2pt]0.1\end{bmatrix}, $$

$$ W^{(2)}=\begin{bmatrix}0.2 & -0.5\end{bmatrix}, \quad b^{(2)}=0.0, \quad y=1.0. $$

Forward pass
- $z^{(1)} = W^{(1)}x + b^{(1)}$.
- $z^{(1)}_1 = 0.1\cdot1 + (-0.2)\cdot2 + 0.0 = 0.1 -0.4 = -0.3.$
- $z^{(1)}_2 = 0.4\cdot1 + 0.3\cdot2 + 0.1 = 0.4 +0.6 +0.1 = 1.1.$
- $a^{(1)} = \sigma(z^{(1)})$ (sigmoid elementwise).
- $\sigma(-0.3) \approx 0.425557483188341$ (since $e^{0.3}\approx1.349858$, $\sigma(-0.3)=1/(1+e^{0.3})$).
- $\sigma(1.1) \approx 0.7502601055951177.$
So $a^{(1)}\approx [0.4255575,\;0.7502601]^\top.$
- Output pre-activation $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} = 0.2\cdot 0.4255575 + (-0.5)\cdot 0.7502601 + 0.0 \approx -0.2900186.$
- Output $a^{(2)} = z^{(2)} \approx -0.2900186$ (linear).
- Loss $\mathcal{L} = \tfrac12 (a^{(2)}-y)^2 = \tfrac12(-0.29001856 - 1)^2 \approx 0.8320739376.$
Backward pass
- Output layer error (linear output, so $\phi^{(2)\prime}(z^{(2)})=1$): $\delta^{(2)} = (a^{(2)} - y)\cdot 1 \approx -0.2900186 - 1 = -1.2900186.$
Gradients for layer 2:
$$ \nabla_{W^{(2)}} = \delta^{(2)}\,a^{(1)\top} \approx [-1.29001856\cdot 0.4255575,\; -1.29001856\cdot 0.7502601] $$

$$ \Rightarrow \nabla_{W^{(2)}} \approx [-0.54897705,\; -0.96784946]. $$

$$ \nabla_{b^{(2)}} = \delta^{(2)} \approx -1.2900185562. $$

- Backpropagate to the hidden layer. First compute the sigmoid derivatives: $\sigma'(z^{(1)}) = a^{(1)}\odot(1-a^{(1)}) \approx [0.2444583,\; 0.1873699]^\top.$
Use recurrence $\delta^{(1)} = (W^{(2)\top}\delta^{(2)})\odot \sigma'(z^{(1)})$.
- $W^{(2)\top}\delta^{(2)} = \begin{bmatrix}0.2\\-0.5\end{bmatrix}(-1.2900185562) \approx \begin{bmatrix}-0.25800371\\ 0.645009278\end{bmatrix}.$
- Elementwise multiplication by $\sigma'(z^{(1)})$ gives $\delta^{(1)} \approx [-0.0630712,\; 0.1208553]^\top.$
Gradients for layer 1:
$$ \begin{align} \nabla_{W^{(1)}} &= \delta^{(1)} \, x^\top\\ &= \begin{bmatrix} -0.0630711517 \cdot 1 & -0.0630711517 \cdot 2\\[4pt] 0.1208553107 \cdot 1 & 0.1208553107 \cdot 2 \end{bmatrix}\\ &\approx \begin{bmatrix} -0.06307115 & -0.12614230\\[4pt] 0.12085531 & 0.24171062 \end{bmatrix}.\\ \nabla_{b^{(1)}} &= \delta^{(1)}\\ &\approx [-0.06307115,\; 0.12085531]^\top. \end{align} $$

So we have exact parameter gradients ready to pass to an optimizer.
(Those numbers are from the step-by-step arithmetic above — they match the chain-rule recurrence and dimensional checks.)
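The whole example fits in a few lines of NumPy; the script below is a sketch that reproduces the numbers above (comments show the rounded values).

```python
import numpy as np

x  = np.array([1.0, 2.0])
W1 = np.array([[0.1, -0.2], [0.4, 0.3]]); b1 = np.array([0.0, 0.1])
W2 = np.array([[0.2, -0.5]]);             b2 = np.array([0.0])
y  = 1.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x + b1              # [-0.3, 1.1]
a1 = sigmoid(z1)              # [0.4255575, 0.7502601]
z2 = W2 @ a1 + b2             # [-0.2900186]
a2 = z2                       # linear output
loss = 0.5 * (a2 - y) ** 2    # [0.8320739]

# Backward pass
delta2  = a2 - y                            # [-1.2900186] (phi' = 1 for linear output)
grad_W2 = np.outer(delta2, a1)              # [[-0.5489771, -0.9678495]]
grad_b2 = delta2
delta1  = (W2.T @ delta2) * a1 * (1 - a1)   # [-0.0630712, 0.1208553]
grad_W1 = np.outer(delta1, x)               # [[-0.0630712, -0.1261423], [0.1208553, 0.2417106]]
grad_b1 = delta1

print(loss, grad_W2, grad_b2, grad_W1, grad_b1)
```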
Computational perspective & implementation notes
Caching activations: during the forward pass, store $z^{(l)}$ and $a^{(l)}$ so that $\phi'(z^{(l)})$ can be computed cheaply during backprop.
Vectorization: do backprop for a batch of $m$ examples as matrix operations:
- If $A^{(l-1)}\in\mathbb{R}^{n_{l-1}\times m}$ and $\Delta^{(l)}\in\mathbb{R}^{n_l\times m}$ (each column is the $\delta$ of one sample), then $\nabla_{W^{(l)}} = \tfrac{1}{m}\Delta^{(l)}A^{(l-1)\top}$, $\nabla_{b^{(l)}} = \tfrac{1}{m}\Delta^{(l)}\mathbf{1}_m$ (the row-wise mean), and the recurrence becomes $\Delta^{(l)} = \big(W^{(l+1)\top}\Delta^{(l+1)}\big)\odot\phi^{(l)\prime}(Z^{(l)})$.
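A sketch of those batched operations, assuming the column convention above (each column of `A_prev` and `Delta` is one example); the function names are my own:

```python
import numpy as np

def layer_grads(Delta, A_prev):
    """Delta: (n_l, m) errors; A_prev: (n_{l-1}, m) previous-layer activations."""
    m = Delta.shape[1]
    grad_W = (Delta @ A_prev.T) / m      # (n_l, n_{l-1}), averaged over the batch
    grad_b = Delta.sum(axis=1) / m       # (n_l,)
    return grad_W, grad_b

def prev_delta(W_next, Delta_next, dphi_Z):
    """Recurrence: Delta^(l) = (W^(l+1)^T Delta^(l+1)) ⊙ phi'(Z^(l))."""
    return (W_next.T @ Delta_next) * dphi_Z
```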
Automatic differentiation (autodiff) builds the computational graph and applies the same chain rule automatically; backprop is simply efficient accumulation of the chain rule.
Why can gradients vanish or explode?
Backprop involves repeated multiplications by weight matrices and activation derivatives:
$$ \delta^{(l)} = \mathrm{diag}\big(\phi^{(l)\prime}(z^{(l)})\big)\bigg(\prod_{k=l+1}^{L} W^{(k)\top} \,\mathrm{diag}\big(\phi^{(k)\prime}(z^{(k)})\big)\bigg) \frac{\partial\mathcal{L}}{\partial a^{(L)}}. $$

Roughly, repeated multiplication can shrink the signal (if the relevant singular values are below 1) or grow it (if they are above 1). Activations like sigmoid/tanh have derivatives in $(0,1)$, which shrink the signal → vanishing gradients; large weight singular values lead to exploding gradients. Remedies: careful initialization (Xavier, He), batch normalization, skip connections (ResNets), activations that keep derivatives near 1 (ReLU variants), and gradient clipping.
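A small, assumption-laden experiment illustrating this: propagate a unit-norm error vector backward through many random layers and watch its norm shrink or blow up depending on the weight scale (the layer width, depth, and stand-in derivative distribution are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50

for scale in (0.5, 1.0, 2.0):
    delta = np.ones(n) / np.sqrt(n)               # unit-norm starting error
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))
        dphi = rng.uniform(0.0, 1.0, size=n)      # stand-in for a sigmoid-like derivative in (0, 1)
        delta = (W.T @ delta) * dphi              # the backprop recurrence
    print(f"scale={scale}: ||delta|| after {depth} layers = {np.linalg.norm(delta):.3e}")
```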
Complexity & memory
- Time: a forward + backward pass costs $O(\sum_l n_{l} n_{l-1})$ per example, dominated by the matrix multiplies.
- Memory: store the activations $a^{(l)}$ and pre-activations $z^{(l)}$ for every layer during the forward pass if gradients are computed afterwards; this is roughly $O(\sum_l n_l)$ per example (times the batch size).
Summary checklist (for coding/backprop)
- Forward compute and cache $z^{(l)}, a^{(l)}$.
- Compute $\delta^{(L)}$ from loss and output activation derivative (use softmax+CE shortcut if applicable).
- For $l=L$ down to $1$: compute gradients $\nabla_{W^{(l)}}, \nabla_{b^{(l)}}$ and compute $\delta^{(l-1)}$ via the recurrence.
- Update weights using optimizer (SGD, Adam, etc.).
- Watch numerical stability (use log-sum-exp for softmax, stable loss functions).
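For the weight-update step of the checklist, a minimal plain-SGD update, assuming gradients have been computed as above (a sketch, not a full training loop):

```python
def sgd_update(params, grads, lr=0.1):
    """params/grads: lists of (W, b) and (dW, db) pairs, matched layer by layer."""
    return [(W - lr * dW, b - lr * db) for (W, b), (dW, db) in zip(params, grads)]
```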