Skip-Connection
A deep dive into the skip connection, from a mathematical point of view.
1. Basic Formulation of a Residual Block
Without skip connections, a block is just:
$$ x_{l+1} = \mathcal{F}(x_l; W_l) $$
With skip connections (ResNets):
$$ x_{l+1} = x_l + \mathcal{F}(x_l; W_l) $$
where:
- $x_l \in \mathbb{R}^d$ is the input at layer $l$,
- $\mathcal{F}(x_l; W_l)$ is the residual function (typically a small stack of convolution, normalization, nonlinearity),
- the skip connection is the identity mapping $I(x) = x$.
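For concreteness, here is a minimal sketch of such a block, assuming PyTorch; the class name `ResidualBlock`, the channel count, and the exact conv/BN/ReLU stacking are illustrative choices, not a prescription from any particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """x_{l+1} = x_l + F(x_l; W_l), with F = conv -> BN -> ReLU -> conv -> BN."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual branch F(x_l; W_l)
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Identity skip: x_{l+1} = x_l + F(x_l; W_l)
        return x + out


block = ResidualBlock(channels=16)
y = block(torch.randn(1, 16, 8, 8))
print(y.shape)  # torch.Size([1, 16, 8, 8])
```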
2. Gradient Flow: Chain Rule Analysis
Consider a loss $\mathcal{L}$. The gradient w.r.t. input $x_l$ is:
$$ \frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \cdot \frac{\partial x_{l+1}}{\partial x_l} $$
For plain networks (no skip):
$$ \frac{\partial x_{l+1}}{\partial x_l} = J_{\mathcal{F}}(x_l) \quad \text{(Jacobian of }\mathcal{F}\text{)} $$
For residual networks:
$$ \frac{\partial x_{l+1}}{\partial x_l} = I + J_{\mathcal{F}}(x_l) $$
So backpropagation becomes:
$$ \frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_{l+1}} (I + J_{\mathcal{F}}(x_l)) $$
🔑 Effect: Instead of multiplying many Jacobians (which can shrink toward 0 → vanishing gradients, or blow up → exploding gradients), you always retain the identity term. This guarantees at least one “direct path” for the gradient.
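A small NumPy sketch makes the identity term visible in the block’s Jacobian. The toy residual function $\mathcal{F}(x) = \tanh(Wx)$ and the small weight scale are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 0.1 * rng.standard_normal((d, d))   # small weights -> small Jacobian of F
x = rng.standard_normal(d)

def jacobian_F(x):
    """Jacobian of F(x) = tanh(W x): diag(1 - tanh(W x)^2) @ W."""
    return np.diag(1.0 - np.tanh(W @ x) ** 2) @ W

J_plain = jacobian_F(x)              # d x_{l+1} / d x_l, no skip
J_res = np.eye(d) + jacobian_F(x)    # d x_{l+1} / d x_l, with skip

# Push an upstream gradient dL/dx_{l+1} backwards through one block.
g_up = rng.standard_normal(d)
print(np.linalg.norm(g_up @ J_plain))  # can be much smaller than ||g_up||
print(np.linalg.norm(g_up @ J_res))    # stays close to ||g_up||: the identity term survives
```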
3. Spectral Norm Perspective
Let’s look at the eigenvalues of the Jacobian product. In a deep network:
- Plain net: the product $\prod_{l=1}^L J_{\mathcal{F}_l}(x_l)$ has eigenvalues that shrink (vanish) or blow up (explode) exponentially with depth.
- Residual net: the product becomes $\prod_{l=1}^L (I + J_{\mathcal{F}_l}(x_l))$. Each factor has eigenvalues $1 + \lambda_i(J_{\mathcal{F}_l})$, so when $J_{\mathcal{F}_l}$ is small, every layer perturbs the spectrum additively around 1 instead of compounding it multiplicatively. This stabilizes training.
Formally, if $\|J_{\mathcal{F}}\|_2 \ll 1$, then:
$$ \log \det \prod_{l=1}^L (I + J_{\mathcal{F}_l}) \approx \sum_{l=1}^L \operatorname{tr}(J_{\mathcal{F}_l}) $$
so the log-determinant of the end-to-end Jacobian grows only additively with depth instead of blowing up (or collapsing) exponentially.
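The same effect can be checked numerically. In the NumPy sketch below, the per-layer “Jacobians” are just small random matrices (an assumption made for illustration), and we compare the end-to-end products with and without the identity term:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 8, 50
# Stand-ins for per-layer Jacobians J_F with small spectral norm.
Js = [0.05 * rng.standard_normal((d, d)) for _ in range(L)]

plain, res = np.eye(d), np.eye(d)
for J in Js:
    plain = J @ plain            # product of the J_F's (no skip)
    res = (np.eye(d) + J) @ res  # product of the (I + J_F)'s (with skip)

print(np.linalg.norm(plain, 2))  # spectral norm collapses toward 0 exponentially in L
print(np.linalg.norm(res, 2))    # stays O(1), close to the identity

# log det prod(I + J_l) vs. sum_l tr(J_l): approximately equal when ||J_l||_2 << 1,
# and the agreement tightens as the J_l get smaller.
sign, logdet = np.linalg.slogdet(res)
print(logdet, sum(np.trace(J) for J in Js))
```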
4. Optimization Landscape Flattening
Consider the effective mapping over $L$ residual layers:
$$ x_{L} = x_0 + \sum_{l=0}^{L-1} \mathcal{F}(x_l; W_l) $$
This is a telescoping sum. Thus the network approximates the function:
$$ \mathcal{H}(x) = x + \Delta(x), \quad \Delta(x) = \sum_{l=0}^{L-1} \mathcal{F}(x_l; W_l) $$
That means the function class is biased towards perturbations of the identity. So optimization starts near a simple solution ($\mathcal{H}(x) = x$) and explores local perturbations, avoiding the difficulty of learning entirely new mappings from scratch.
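A quick NumPy check of the telescoping identity, again using the toy choice $\mathcal{F}(x) = \tanh(Wx)$ (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 4, 10
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]

x0 = rng.standard_normal(d)
x, residuals = x0, []
for W in Ws:
    r = np.tanh(W @ x)   # F(x_l; W_l)
    residuals.append(r)
    x = x + r            # x_{l+1} = x_l + F(x_l; W_l)

# Telescoping sum: the output is the input plus the accumulated residual updates.
print(np.allclose(x, x0 + np.sum(residuals, axis=0)))  # True
```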
5. Differential Equation View
If we let depth $L \to \infty$, with small step size $h$, then:
$$ x_{l+1} = x_l + h \cdot \mathcal{F}(x_l; W_l) $$
In the limit:
$$ \frac{dx}{dt} = \mathcal{F}(x(t), W(t)) $$
So residual networks approximate ODE flows. This continuous view explains the stability: the skip connection makes each layer a small update of the current state, exactly one step of forward-Euler numerical integration. Without the skip, each layer applies a brand-new map rather than a small update, so there is no underlying continuous process whose stability carries over.
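The sketch below makes the Euler correspondence explicit: a residual stack with step size $h$ is literally forward-Euler integration of $\dot{x} = \mathcal{F}(x)$. The shared vector field $\tanh(Wx)$ and the reference grid are assumptions made for this toy demo:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W = 0.1 * rng.standard_normal((d, d))

def F(x):
    """A fixed toy vector field F(x) = tanh(W x), shared across layers / time."""
    return np.tanh(W @ x)

x0 = rng.standard_normal(d)
T, L = 1.0, 100
h = T / L  # step size; more layers = a finer discretization of [0, T]

# A residual stack with step size h is forward-Euler integration of dx/dt = F(x).
x = x0.copy()
for _ in range(L):
    x = x + h * F(x)  # x_{l+1} = x_l + h * F(x_l)

# Reference: the same ODE integrated on a 100x finer Euler grid.
x_ref = x0.copy()
for _ in range(100 * L):
    x_ref = x_ref + (h / 100) * F(x_ref)

print(np.linalg.norm(x - x_ref))  # small, and shrinks further as L grows (h -> 0)
```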
6. Summary of Causes of Effectiveness
Mathematically, skip connections:
- Stabilize gradients: $\nabla \mathcal{L}$ has an identity term, preventing vanishing/exploding.
- Control eigenvalue growth: Multiplicative Jacobians → exponential instability. Residual Jacobians → additive drift.
- Bias optimization towards near-identity maps: Makes training behave like perturbing the identity rather than learning full transformations.
- Relate to ODE solvers: They discretize continuous dynamics, leading to more stable and interpretable behavior.
✨ So in short, P-chan: skip connections are effective because they mathematically guarantee stable gradient propagation, keep the spectrum of transformations near the identity, and let the network grow like a controlled perturbation of the input function.
Do you want me to also derive a toy example numerically (say, a 2-layer linear model with and without skips) so you can see concretely how the eigenvalues and gradients behave?