The Gradient Descent within the Router of MoE

The process of gradient descent for training the router layer (gating network) of a Mixture-of-Experts (MoE) model involves computing the gradient of the total loss with respect to the router's parameters and using that gradient to update them. The primary challenge in MoE training is the non-differentiability introduced by the Top-$K$ routing mechanism, which discretely selects experts; standard backpropagation struggles with this non-smooth operation.

1. MoE Layer Output and Loss Function

MoE Layer Output ($y$): For a given input vector $\mathbf{x}$, the router layer, typically a linear projection followed by a Softmax, produces unnormalized scores (logits) $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, where $\mathbf{W}_g$ are the router's parameters. These logits are often passed through a Softmax function to get the expert weights (or “gates”) $\mathbf{G}(\mathbf{x})$: ...
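To ground the description above, here is a minimal PyTorch sketch of such a router: a linear projection producing logits $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, Softmax gates $\mathbf{G}(\mathbf{x})$, and Top-$K$ selection. The class name, dimensions, and the renormalization of the selected gates are illustrative assumptions, not the post's exact implementation.

```python
# Minimal sketch (PyTorch, hypothetical names): linear router, softmax gates,
# and Top-K selection; gradients reach W_g only through the selected gates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # W_g
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.w_g(x)                 # h(x) = W_g x
        gates = F.softmax(logits, dim=-1)    # G(x)
        topk_vals, topk_idx = gates.topk(self.k, dim=-1)
        # Renormalize the selected gates so they sum to 1 per token (a common
        # choice; not claimed to be the post's exact scheme).
        topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        return topk_vals, topk_idx

router = TopKRouter(d_model=16, num_experts=8, k=2)
x = torch.randn(4, 16)
vals, idx = router(x)
loss = vals.sum()   # stand-in for the task loss
loss.backward()     # gradient w.r.t. W_g flows only via the selected gates
print(router.w_g.weight.grad.shape)
```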

November 6, 2025 · 10 min · 1929 words · xxraincandyxx

Injective Transformers Reasoning

Reasoning about the under-the-hood theory behind injective transformers. Deprecated for redundant mathematics.

1 — Notation and setup

Vocabulary: $\mathcal{V}$ with $|\mathcal{V}| = V$. Tokens: a true input sequence $s = (s_1,\dots,s_T)$, $s_t \in \mathcal{V}$. Prefix at step $t$: $p_{t} = (s_1,\dots,s_{t-1})$. Transformer (deterministic) forward mapping from a token sequence to layer-$\ell$ hidden states: $$ \Phi^\ell : \mathcal{V}^T \to \mathbb{R}^{T\times d},\qquad \Phi^\ell(s) = H^\ell = [h^\ell_1,\dots,h^\ell_T]^\top $$ where $h^\ell_t \in \mathbb{R}^{d}$ is the hidden state at position $t$ and layer $\ell$. Observed hidden states (from system/leak): $\widetilde H^\ell = \Phi^\ell(s)$ (assume exact for theory; later we add noise/quantization). For brevity, drop the layer superscript when fixed: $\Phi,\; h_t$. Two contrasts we will study for decoding token $s_t$ given prefix $p_t$ and observed hidden $h_t$: ...
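As a concrete companion to this notation, here is a minimal sketch of how the observed hidden states $\widetilde H^\ell = \Phi^\ell(s)$ could be materialized in practice with Hugging Face Transformers; the model choice (`gpt2`), the layer index, and the variable names are illustrative assumptions rather than anything specified in the post.

```python
# Minimal sketch (illustrative model and layer): materialize Phi^ell(s),
# the layer-ell hidden states H^ell, for a token sequence s.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

s = "an example token sequence"
inputs = tok(s, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

ell = 6                            # fixed layer of interest
H_ell = out.hidden_states[ell][0]  # shape (T, d): rows are h^ell_t
print(H_ell.shape)
```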

October 29, 2025 · 14 min · 2827 words · xxraincandyxx

Skip-Connection Theory

Skip-Connection: a deep, mathematical illustration of the skip connection.

1. Basic Formulation of a Residual Block

Without skip connections, a block is just: $$ x_{l+1} = \mathcal{F}(x_l; W_l) $$ With skip connections (ResNets): $$ x_{l+1} = x_l + \mathcal{F}(x_l; W_l) $$ where $x_l \in \mathbb{R}^d$ is the input at layer $l$, $\mathcal{F}(x_l; W_l)$ is the residual function (typically a small stack of convolution, normalization, nonlinearity), and the skip connection is the identity mapping $I(x) = x$.

2. Gradient Flow: Chain Rule Analysis

Consider a loss $\mathcal{L}$. The gradient w.r.t. the input $x_l$ is: ...
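To make the formulation concrete, here is a minimal PyTorch sketch of a residual block $x_{l+1} = x_l + \mathcal{F}(x_l; W_l)$, with a small check that the identity path carries gradient even when the residual branch contributes nothing; the layer sizes and module names are illustrative.

```python
# Minimal sketch (PyTorch, illustrative shapes): a residual block and a check
# that the identity skip passes gradients straight through.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity skip + residual function F

d = 8
block = ResidualBlock(d)
# Zero the residual branch so F(x) == 0; the skip path alone should
# still pass gradients through unchanged.
for p in block.f.parameters():
    nn.init.zeros_(p)

x = torch.randn(3, d, requires_grad=True)
y = block(x).sum()
y.backward()
print(torch.allclose(x.grad, torch.ones_like(x)))  # True: dy/dx = I here
```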

September 22, 2025 · 3 min · 532 words · xxraincandyxx

Auxiliary-Loss-Free Load Balancing

Implementation Guide

Preliminary: the original DeepSeekV3 paper's Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_i$ for each expert and add it to the corresponding affinity scores $s_{i, t}$ to determine the top-K routing: ...
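As an illustration of the mechanism described above, here is a minimal sketch of bias-adjusted Top-K selection, where the per-expert bias $b_i$ influences only which experts are selected while the gate values still come from the original affinity scores $s_{i,t}$. The update rule, the speed `gamma`, and all names are assumptions for illustration, not the reference implementation.

```python
# Minimal sketch (PyTorch, hypothetical hyper-parameters): bias-adjusted
# Top-K routing with a simple load-based nudge to each expert bias b_i.
import torch

num_experts, k, gamma = 8, 2, 0.001   # gamma: assumed bias update speed
bias = torch.zeros(num_experts)        # b_i, one per expert

def route(scores: torch.Tensor):
    """scores: (tokens, num_experts) affinity scores s_{i,t}."""
    _, idx = (scores + bias).topk(k, dim=-1)   # selection uses s + b
    gates = torch.gather(scores, -1, idx)       # gate values use s only
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, idx

def update_bias(idx: torch.Tensor):
    """Nudge b_i down for overloaded experts and up for underloaded ones."""
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    bias.add_(gamma * torch.sign(load.mean() - load))

scores = torch.rand(16, num_experts)
gates, idx = route(scores)
update_bias(idx)
```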

June 9, 2025 · 12 min · 2473 words · xxraincandyxx

Normalization

Q: I am building a ViT-like, transformer-based image classification model. Where do you recommend I place the normalization layer, and which type do you suggest would be the most efficient?

A: Excellent question! This is a critical design choice that directly impacts training stability, convergence speed, and final model performance. Let's break down the placement and the type of normalization layer for your ViT-like model.

Summary Recommendation

For the best combination of training stability and performance, the standard and highly recommended approach is: ...
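The recommendation itself is truncated in this excerpt; as a hedged, minimal sketch, the block below shows the pre-normalization (Pre-LN) layout used by standard ViT encoders, with LayerNorm applied before self-attention and before the MLP. All sizes and names are illustrative and not taken from the post.

```python
# Minimal sketch (PyTorch, illustrative sizes): a Pre-LN transformer encoder
# block -- LayerNorm before self-attention and before the MLP.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d: int, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.mlp(self.norm2(x))                     # residual 2
        return x

tokens = torch.randn(2, 197, 64)  # (batch, patches + CLS, dim), illustrative
print(PreLNBlock(64)(tokens).shape)
```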

June 7, 2025 · 6 min · 1189 words · xxraincandyxx