Gradient Descent within the Router of an MoE Model
The process of gradient descent for training the router layer (gating network) of a Mixture-of-Experts (MoE) model involves computing the gradient of the total loss with respect to the router’s parameters and using it to update those parameters. The primary challenge in MoE training is the non-differentiability introduced by the Top-$K$ routing mechanism, which selects experts discretely. Standard backpropagation struggles with this non-smooth operation.

1. MoE Layer Output and Loss Function

MoE Layer Output ($y$)

For a given input vector $\mathbf{x}$, the router layer, typically a linear projection followed by a Softmax, first produces unnormalized scores (logits) $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, where $\mathbf{W}_g$ are the router’s parameters. These logits are then passed through a Softmax function to obtain the expert weights (or “gates”) $\mathbf{G}(\mathbf{x})$:

$$
G_i(\mathbf{x}) = \frac{\exp\big(h_i(\mathbf{x})\big)}{\sum_{j} \exp\big(h_j(\mathbf{x})\big)}.
$$
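To make the routing setup concrete, here is a minimal sketch in PyTorch; the framework choice and the names `TopKRouter`, `SimpleMoE`, `num_experts`, and `top_k` are illustrative assumptions, not taken from the original. It shows the logits $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, the softmax gates $\mathbf{G}(\mathbf{x})$, and a Top-$K$ selection, and it makes visible where gradients can reach $\mathbf{W}_g$: through the selected gate values, never through the discrete indices.

```python
# Minimal sketch of a Top-K MoE router (PyTorch is an assumption; all names
# below are illustrative, not from the original text).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Router: logits h(x) = W_g x, softmax gates G(x), discrete Top-K selection."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # router parameters W_g
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        logits = self.w_g(x)               # h(x), shape (batch, num_experts)
        gates = F.softmax(logits, dim=-1)  # G(x), differentiable in W_g
        # topk returns values (differentiable) and indices (discrete, no gradient).
        topk_gates, topk_idx = gates.topk(self.top_k, dim=-1)
        return gates, topk_gates, topk_idx


class SimpleMoE(nn.Module):
    """MoE layer output y = sum_i G_i(x) * E_i(x) over the selected experts."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = TopKRouter(d_model, num_experts, top_k)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates, topk_gates, topk_idx = self.router(x)
        # Zero out all but the Top-K gates. Every expert is evaluated densely
        # here for clarity; real implementations dispatch tokens only to the
        # selected experts, but the gradient path through the gates is the same.
        sparse_gates = torch.zeros_like(gates).scatter(-1, topk_idx, topk_gates)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, d)
        return (sparse_gates.unsqueeze(-1) * expert_out).sum(dim=1)


# Gradients from the loss reach W_g only through the softmax gate values of the
# selected experts; the discrete selection (topk_idx) itself has no gradient.
moe = SimpleMoE(d_model=16, num_experts=4, top_k=2)
x = torch.randn(8, 16)
loss = moe(x).pow(2).mean()
loss.backward()
print(moe.router.w_g.weight.grad.shape)  # torch.Size([4, 16])
```

Many implementations additionally renormalize the selected Top-$K$ gates so they sum to one before combining expert outputs; that variant changes the values of the gate gradients but not which parameters receive them.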