Skip-Connection Theory

A deep dive into skip connections, presented mathematically.

1. Basic Formulation of a Residual Block

Without skip connections, a block is just:

$$ x_{l+1} = \mathcal{F}(x_l; W_l) $$

With skip connections (ResNets):

$$ x_{l+1} = x_l + \mathcal{F}(x_l; W_l) $$

where:

- $x_l \in \mathbb{R}^d$ is the input at layer $l$,
- $\mathcal{F}(x_l; W_l)$ is the residual function (typically a small stack of convolution, normalization, and nonlinearity),
- the skip connection is the identity mapping $I(x) = x$.

2. Gradient Flow: Chain-Rule Analysis

Consider a loss $\mathcal{L}$. The gradient w.r.t. the input $x_l$ is: ...
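The residual formulation $x_{l+1} = x_l + \mathcal{F}(x_l; W_l)$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the post's implementation: the residual function here is an assumed two-layer linear-ReLU-linear stack standing in for the convolution/normalization/nonlinearity stack, and all weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # feature dimension (assumed for illustration)

# Hypothetical residual-function weights: F(x; W) = W2 @ relu(W1 @ x)
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))

def residual_block(x):
    """x_{l+1} = x_l + F(x_l; W_l): identity skip plus the residual."""
    f = W2 @ np.maximum(W1 @ x, 0.0)  # small stack: linear -> ReLU -> linear
    return x + f  # the skip connection is the identity mapping I(x) = x

x = rng.normal(size=d)
y = residual_block(x)
```

Note that if $\mathcal{F}$ collapses to zero (e.g. zero weights), the block degenerates to the identity, which is exactly what makes very deep stacks trainable.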

September 22, 2025 · 3 min · 532 words · xxraincandyxx

Auxiliary-Loss-Free Load Balancing

Implementation Guide

Preliminary: the original DeepSeek-V3 paper's Auxiliary-Loss-Free Load Balancing.

For MoE models, an unbalanced expert load leads to routing collapse (Shazeer et al., 2017) and diminishes computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on an auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid an unbalanced load. However, too large an auxiliary loss impairs model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. Specifically, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i, t}$ to determine the top-K routing: ...
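The bias-term idea can be sketched as follows. This is a hedged, minimal NumPy sketch of the mechanism described above, not DeepSeek-V3's actual code: the function names `route` and `update_bias`, the expert count, and the bias update speed `gamma` are all assumed for illustration. The key points it captures are that the bias $b_i$ influences only which experts are *selected*, while the gating weights still come from the raw affinities $s_{i,t}$, and that $b_i$ is adjusted outside the gradient path based on observed load.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k = 8, 2

# Per-expert bias b_i, maintained outside the gradient path (starts at zero).
b = np.zeros(n_experts)

def route(s):
    """Top-K routing on biased affinities s_{i,t} + b_i.

    The bias only determines expert selection; the gating weights are
    still computed from the unbiased affinity scores s.
    """
    chosen = np.argsort(s + b)[-top_k:]  # top-K by biased score
    gates = np.zeros_like(s)
    gates[chosen] = s[chosen]            # gate values use the raw scores
    return chosen, gates / gates.sum()

def update_bias(load, gamma=0.001):
    """After each step: decrease b_i for overloaded experts, increase it
    for underloaded ones, nudging the router back toward balance."""
    global b
    b -= gamma * np.sign(load - load.mean())

s = rng.random(n_experts)  # token-to-expert affinity scores s_{i,t} for one token
chosen, gates = route(s)
```

Because the correction happens through a slowly adapted bias rather than a loss term, no gradient from a balancing objective interferes with the main language-modeling objective.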

June 9, 2025 · 12 min · 2473 words · xxraincandyxx