Pseudo Inverse

Reprinted and translated from this post. All rights reserved. Many readers, like myself, may find themselves with a feeling of both familiarity and strangeness when it comes to the low-rank approximation of matrices. The familiarity stems from the fact that the concept and significance of low-rank approximation are straightforward to grasp; moreover, with the proliferation of fine-tuning techniques built upon it, such as LoRA, the idea has become deeply ingrained through constant exposure. Nevertheless, the sheer breadth of material that low-rank approximation covers, together with the unfamiliar yet astonishing new techniques that keep appearing in related papers, leaves behind that sense of strangeness. ...

November 8, 2025 · 1 min · 146 words · xxraincandyxx

The Gradient Descent within the Router of MoE

The process of gradient descent for training the router layer (gating network) of a Mixture-of-Experts (MoE) model involves calculating the gradient of the total loss with respect to the router’s parameters and using this to update the parameters. The primary challenge in MoE training is the non-differentiability introduced by the Top-$K$ routing mechanism, which discretely selects experts; standard backpropagation struggles with this non-smooth operation. 1. MoE Layer Output and Loss Function. MoE Layer Output ($y$). For a given input vector $\mathbf{x}$, the router layer, typically a linear projection, produces unnormalized scores (logits) $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, where $\mathbf{W}_g$ are the router’s parameters. These logits are then passed through a Softmax function to get the expert weights (or “gates”) $\mathbf{G}(\mathbf{x})$: ...
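
As a concrete companion to this excerpt, here is a minimal NumPy sketch of the gating step it describes: logits $\mathbf{h}(\mathbf{x}) = \mathbf{W}_g \mathbf{x}$, Softmax gates, and a Top-$K$ selection. The renormalization over the selected experts is one common convention and an assumption here, not necessarily the exact scheme the post analyzes.

```python
import numpy as np

def softmax(h):
    # Numerically stable softmax over the expert dimension.
    h = h - h.max(axis=-1, keepdims=True)
    e = np.exp(h)
    return e / e.sum(axis=-1, keepdims=True)

def route(x, W_g, k=2):
    """Toy Top-K router: logits -> softmax gates -> keep the k largest, renormalize."""
    h = W_g @ x                          # unnormalized scores (logits), shape (E,)
    g = softmax(h)                       # dense gate weights G(x)
    top = np.argsort(g)[-k:]             # indices of the k selected experts
    gates = np.zeros_like(g)
    gates[top] = g[top] / g[top].sum()   # renormalize over the selected experts (assumption)
    return gates, top

# Example: 4 experts, 8-dimensional input, Top-2 routing.
rng = np.random.default_rng(0)
W_g = rng.normal(size=(4, 8))
x = rng.normal(size=8)
gates, top = route(x, W_g, k=2)
print(gates, top)
```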

November 6, 2025 · 10 min · 1929 words · xxraincandyxx

Injective Transformers Reasoning

Reasoning about the under-the-hood theory behind injective transformers. Deprecated for redundant mathematics. 1 — Notation and setup. Vocabulary: $\mathcal{V}$ with $|\mathcal{V}| = V$. Tokens: a true input sequence $s = (s_1,\dots,s_T)$, $s_t \in \mathcal{V}$. Prefix at step $t$: $p_t = (s_1,\dots,s_{t-1})$. Transformer (deterministic) forward mapping from a token sequence to layer-$\ell$ hidden states: $$ \Phi^\ell : \mathcal{V}^T \to \mathbb{R}^{T\times d},\qquad \Phi^\ell(s) = H^\ell = [h^\ell_1,\dots,h^\ell_T]^\top $$ where $h^\ell_t \in \mathbb{R}^{d}$ is the hidden state at position $t$ and layer $\ell$. Observed hidden states (from system/leak): $\widetilde H^\ell = \Phi^\ell(s)$ (assumed exact for the theory; noise/quantization are added later). For brevity, drop the layer superscript when it is fixed: $\Phi,\; h_t$. Two contrasts we will study for decoding token $s_t$ given prefix $p_t$ and observed hidden $h_t$: ...
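
To make the setup tangible, here is a toy NumPy sketch under stated assumptions: `phi` is a hypothetical deterministic stand-in for the forward map $\Phi$ (not a real transformer), and `decode_step` recovers $s_t$ by exhaustive search over the vocabulary for the candidate whose hidden state matches the observed $h_t$. This is only a naive baseline, not necessarily either of the two contrasts the post studies.

```python
import numpy as np

def phi(tokens, E, layer_mix):
    """Toy deterministic 'forward map' Phi: token ids -> hidden states, shape (T, d).
    A causal, position-dependent mixing stands in for a transformer layer."""
    H = E[tokens]                                      # embed each token, shape (T, d)
    return np.tanh(np.cumsum(H, axis=0) @ layer_mix)   # state at t depends only on tokens <= t

def decode_step(prefix, h_t_obs, V, E, layer_mix):
    """Recover s_t by exhaustive search: the candidate token whose hidden state at
    position t best matches the observed h_t (an exact match when there is no noise)."""
    best_v, best_d = None, np.inf
    for v in range(V):
        H = phi(np.array(list(prefix) + [v]), E, layer_mix)
        d = np.linalg.norm(H[-1] - h_t_obs)
        if d < best_d:
            best_v, best_d = v, d
    return best_v

V, d = 50, 16
rng = np.random.default_rng(1)
E = rng.normal(size=(V, d))
layer_mix = rng.normal(size=(d, d)) / np.sqrt(d)
s = rng.integers(0, V, size=6)            # true sequence
H_obs = phi(s, E, layer_mix)              # "leaked" hidden states, assumed exact
t = 4
print(decode_step(s[:t], H_obs[t], V, E, layer_mix), s[t])   # decoded vs. true token
```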

October 29, 2025 · 14 min · 2827 words · xxraincandyxx

Backward Propagation Theory

A deep, mathematical illustration of backward propagation in deep neural networks. Setup and notation. Consider an $L$-layer feedforward neural network (FNN/MLP). For layer $l=1,\dots,L$: $n_{l}$ = number of units in layer $l$. Input: $a^{(0)} = x \in \mathbb{R}^{n_0}$. Linear pre-activation: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, where $W^{(l)}\in\mathbb{R}^{n_l\times n_{l-1}}$, $b^{(l)}\in\mathbb{R}^{n_l}$. Activation: $a^{(l)} = \phi^{(l)}(z^{(l)})$ (applied elementwise). Output: $a^{(L)}$. Loss for one example: $\mathcal{L} = \mathcal{L}(a^{(L)}, y)$. We want gradients: ...
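
To ground the notation, here is a minimal NumPy sketch of one forward and one backward pass for a two-layer network, assuming sigmoid activations and a squared-error loss purely for illustration (the post's choice of $\phi^{(l)}$ and $\mathcal{L}$ may differ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-layer MLP in the excerpt's notation: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = phi(z^(l)).
rng = np.random.default_rng(0)
n0, n1, n2 = 3, 4, 2
W = {1: rng.normal(size=(n1, n0)), 2: rng.normal(size=(n2, n1))}
b = {1: np.zeros(n1), 2: np.zeros(n2)}
x, y = rng.normal(size=n0), rng.normal(size=n2)

# Forward pass.
a, z = {0: x}, {}
for l in (1, 2):
    z[l] = W[l] @ a[l - 1] + b[l]
    a[l] = sigmoid(z[l])

# Backward pass for the illustrative loss L = 0.5 * ||a^(2) - y||^2.
# delta^(l) = dL/dz^(l); dL/dW^(l) = delta^(l) a^(l-1)^T; dL/db^(l) = delta^(l).
delta = {2: (a[2] - y) * a[2] * (1 - a[2])}             # sigmoid' = a * (1 - a)
delta[1] = (W[2].T @ delta[2]) * a[1] * (1 - a[1])      # chain rule through W^(2)
dW = {l: np.outer(delta[l], a[l - 1]) for l in (1, 2)}
db = {l: delta[l] for l in (1, 2)}
print(dW[1].shape, dW[2].shape)   # gradients match the shapes of W^(1), W^(2)
```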

September 23, 2025 · 5 min · 904 words · xxraincandyxx

Skip-Connection Theory

Skip-Connection. A deep, mathematical illustration of the skip connection. 1. Basic Formulation of a Residual Block. Without skip connections, a block is just: $$ x_{l+1} = \mathcal{F}(x_l; W_l) $$ With skip connections (ResNets): $$ x_{l+1} = x_l + \mathcal{F}(x_l; W_l) $$ where: $x_l \in \mathbb{R}^d$ is the input at layer $l$, $\mathcal{F}(x_l; W_l)$ is the residual function (typically a small stack of convolution, normalization, and nonlinearity), and the skip connection is the identity mapping $I(x) = x$. 2. Gradient Flow: Chain-Rule Analysis. Consider a loss $\mathcal{L}$. The gradient w.r.t. the input $x_l$ is: ...
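
A minimal NumPy sketch of this formulation, assuming a single tanh layer as the residual function $\mathcal{F}$ (an illustrative choice, not the post's exact block): it exposes the $I + \partial\mathcal{F}/\partial x_l$ structure of the block's Jacobian that the chain-rule analysis relies on.

```python
import numpy as np

def residual_block(x, W):
    """x_{l+1} = x_l + F(x_l; W_l), with F a single tanh layer for illustration."""
    return x + np.tanh(W @ x)

def residual_jacobian(x, W):
    """d x_{l+1} / d x_l = I + dF/dx: the identity term passes gradients straight
    through, no matter how small dF/dx becomes."""
    d = x.shape[0]
    z = W @ x
    dF = (1 - np.tanh(z) ** 2)[:, None] * W   # Jacobian of tanh(Wx) w.r.t. x
    return np.eye(d) + dF

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)
W = 0.01 * rng.normal(size=(d, d))            # a deliberately "weak" residual branch
J = residual_jacobian(x, W)
print(np.linalg.norm(J - np.eye(d)))          # small: the block stays close to the identity
```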

September 22, 2025 · 3 min · 532 words · xxraincandyxx