The Gradient Descent within the Router of MoE

The process of Gradient Descent for training the router layer (gating network) of a Mixture-of-Experts (MoE) model involves calculating the gradient of the total loss with respect to the router’s parameters and using this to update the parameters. The primary challenge in MoE training is the non-differentiability introduced by the Top-$K$ routing mechanism, which discretely selects experts. Standard backpropagation struggles with this non-smooth operation.

1. MoE Layer Output and Loss Function

MoE Layer Output ($y$)

For a given input vector $\mathbf{x}$, the router layer, typically a linear projection followed by a Softmax, produces unnormalized scores (logits) $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, where $\mathbf{W}_g$ are the router’s parameters. These logits are often passed through a Softmax function to get the expert weights (or “gates”) $\mathbf{G}(\mathbf{x})$: ...
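A minimal sketch of the routing step described above: a linear projection to logits, Softmax gates, discrete Top-$K$ selection, and a backward pass that reaches $\mathbf{W}_g$ only through the selected gates. It assumes a PyTorch-style setup; the sizes, the stand-in loss, and names such as `W_g`, `num_experts`, and `top_k` are illustrative, not taken from the post.

```python
import torch
import torch.nn.functional as F

# Illustrative Top-K router sketch (assumed setup, not the post's exact code).
torch.manual_seed(0)
d_model, num_experts, top_k = 16, 8, 2              # assumed sizes
x = torch.randn(4, d_model)                          # 4 token representations
W_g = torch.randn(d_model, num_experts, requires_grad=True)  # router parameters

h = x @ W_g                                          # logits h(x) = W_g x
G = F.softmax(h, dim=-1)                             # gates G(x)

topk_vals, topk_idx = torch.topk(G, top_k, dim=-1)   # discrete Top-K expert selection
gates = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize the kept gates

# The gradient flows to W_g through the kept gate values; the hard selection
# itself (topk_idx) is non-differentiable, which is the challenge noted above.
loss = gates.sum()                                   # stand-in for the real training loss
loss.backward()
print(W_g.grad.shape)                                # torch.Size([16, 8])
```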

November 6, 2025 · 10 min · 1929 words · xxraincandyxx

Injective Transformers Reasoning

Reasoning about the under-the-hood theory behind injective transformers. Deprecated for Redundant Mathematics.

1 — Notation and setup

Vocabulary: $ \mathcal{V} $ with $ |\mathcal{V}| = V $. Tokens: a true input sequence $ s = (s_1,\dots,s_T) $, $ s_t \in \mathcal{V} $. Prefix at step $ t $: $ p_t = (s_1,\dots,s_{t-1}) $. Transformer (deterministic) forward mapping from a token sequence to layer-$ \ell $ hidden states: $$ \Phi^\ell : \mathcal{V}^T \to \mathbb{R}^{T\times d},\qquad \Phi^\ell(s) = H^\ell = [h^\ell_1,\dots,h^\ell_T]^\top $$ where $ h^\ell_t \in \mathbb{R}^{d} $ is the hidden state at position $ t $ and layer $ \ell $. Observed hidden states (from system/leak): $ \widetilde H^\ell = \Phi^\ell(s) $ (assume exact for theory; later we add noise/quantization). For brevity, drop the layer superscript when it is fixed: $ \Phi,\; h_t $. Two contrasts we will study for decoding token $ s_t $ given prefix $ p_t $ and observed hidden $ h_t $: ...
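As a concrete companion to the notation above, here is a self-contained toy sketch of a deterministic map $ \Phi^\ell : \mathcal{V}^T \to \mathbb{R}^{T\times d} $ and of decoding $ s_t $ from the prefix $ p_t $ and an exactly observed $ h_t $ by brute-force search over the vocabulary. The "transformer" is a stand-in with fixed random weights and causal cumulative mixing; all sizes and names are assumptions for illustration, not the post's construction.

```python
import numpy as np

# Toy stand-in for Phi^ell : V^T -> R^{T x d}: fixed random weights with
# causal (prefix-only) mixing, not a real transformer. Sizes are illustrative.
rng = np.random.default_rng(0)
V, d, num_layers = 100, 8, 2
E = rng.standard_normal((V, d))                        # token embedding table
W = [0.3 * rng.standard_normal((d, d)) for _ in range(num_layers)]

def phi(s, ell):
    """Map a token sequence s = (s_1, ..., s_T) to layer-ell hidden states H^ell (T x d)."""
    H = E[np.array(s)]
    for l in range(ell):
        H = np.tanh(np.cumsum(H, axis=0) @ W[l])       # position t only sees tokens <= t
    return H

# Exact-observation setting: recover the token at position t from the prefix p_t
# and the leaked hidden state h_t by checking which candidate reproduces h_t.
s = [3, 17, 42, 7]                                     # true sequence s_1..s_T
t = 3                                                  # 0-based index (the post's t = 4)
h_t = phi(s, ell=2)[t]                                 # observed hidden state at that position
p_t = s[:t]                                            # prefix (s_1, ..., s_{t-1})
candidates = [v for v in range(V)
              if np.allclose(phi(p_t + [v], 2)[t], h_t)]
print(candidates)  # contains the true token 7; extra matches are not expected for generic weights
```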

October 29, 2025 · 14 min · 2827 words · xxraincandyxx