Reasoning about the under-the-hood theory behind injective transformers.


1 — Notation and setup

  • Vocabulary: $ \mathcal{V} $ with $ |\mathcal{V}|=V $.
  • Tokens: a true input sequence $ s = (s_1,\dots,s_T) $, $ s_t\in\mathcal{V} $.
  • Prefix at step $ t $: $ p_{t} = (s_1,\dots,s_{t-1}) $.
  • Transformer: a deterministic forward mapping from a token sequence to layer-$ \ell $ hidden states: $$ \Phi^\ell : \mathcal{V}^T \to \mathbb{R}^{T\times d},\qquad \Phi^\ell(s) = H^\ell = [h^\ell_1,\dots,h^\ell_T]^\top $$ where $ h^\ell_t\in\mathbb{R}^{d} $ is the hidden state at position $ t $ and layer $ \ell $.
  • Observed hidden states (from system/leak): $ \widetilde H^\ell = \Phi^\ell(s) $ (assume exact for the theory; later we add noise/quantization).
  • For brevity, drop the layer superscript when it is fixed: $ \Phi,\; h_t $.

Two contrasts we will study for decoding token $ s_t $ given prefix $ p_t $ and observed hidden $ h_t $:

**(A) Normal classification head (softmax).** A parametric head $ g $ maps the hidden state to logits over the vocabulary:

$$ g: \mathbb{R}^d \to \mathbb{R}^{V},\qquad g(h) = W^\top h + b $$

with $ W\in\mathbb{R}^{d\times V}, b\in\mathbb{R}^{V} $. The predicted token is

$$ \hat s_t^{\text{soft}} = \arg\max_{v\in\mathcal{V}} \;\; \big[ g(h_t)\big]_v \;=\; \arg\max_{v} \; e_v^\top (W^\top h_t + b), $$

or, probabilistically, $ \hat s_t = \arg\max_v \Pr(s_t=v \mid h_t) $ where $ \Pr(v\mid h)=\mathrm{softmax}(g(h))_v $.

**(B) SIPIT decoding (inversion-by-consistency).** SIPIT assumes access to the forward mapping $ \Phi $ (or a white-box model enabling computation of $ \Phi(p_t \oplus v) $ for candidate $ v $). It recovers $ s_t $ by *matching* the observed hidden state: $$ \hat s_t^{\text{SIPIT}} = \arg\min_{v\in\mathcal{V}} \;\; \big\|\, \Phi\big(p_t \oplus v\big)_t \;-\; h_t \,\big\|_2 . $$ (Optionally, use a threshold $ \varepsilon $ and stop when $ \min_v \|\cdot\| \le \varepsilon $.)
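To make the contrast concrete, here is a minimal Python sketch of both decision rules. The helper `hidden_at(seq, pos)` is a hypothetical white-box wrapper (not part of SIPIT itself) that returns the layer-$ \ell $ hidden state at 0-based position `pos` for the token sequence `seq`; `W`, `b` are the head parameters from (A).

```python
import torch

def softmax_decode(h_t, W, b):
    """(A) Discriminative head: argmax over logits W^T h_t + b."""
    logits = W.T @ h_t + b                              # shape (V,)
    return int(torch.argmax(logits))

def sipit_decode(h_t, prefix, hidden_at, vocab_size, eps=None):
    """(B) SIPIT: pick the candidate whose predicted activation matches h_t.

    hidden_at(seq, pos) is a hypothetical white-box helper returning the
    layer-ell hidden state at position pos for sequence seq.
    """
    best_v, best_d = None, float("inf")
    for v in range(vocab_size):
        h_pred = hidden_at(prefix + [v], len(prefix))   # Phi(p_t ⊕ v)_t
        d = torch.linalg.vector_norm(h_pred - h_t).item()
        if eps is not None and d <= eps:                # optional early stop
            return v
        if d < best_d:
            best_v, best_d = v, d
    return best_v
```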


2 — Intuition in probabilistic terms

Consider a deterministic model, so $ \Pr(h\mid p_t, v)=\delta\big(h-\Phi(p_t\oplus v)_t\big) $. If we imagine observational noise $ n $ so that $ \widetilde h_t = h_t + n $ with density $ q(n) $, then the posterior score of candidate $ v $ given $ \widetilde h_t $ is proportional to $$ \Pr(v \mid \widetilde h_t, p_t) \propto \Pr(\widetilde h_t \mid p_t, v)\,\Pr(v\mid p_t), $$ and since $ \Pr(\widetilde h_t \mid p_t, v)=q\big(\widetilde h_t - \Phi(p_t\oplus v)_t\big) $, the MAP choice is $$ \hat v = \arg\max_v \; q\big(\widetilde h_t - \Phi(p_t\oplus v)_t\big)\,\Pr(v\mid p_t). $$ If $ q $ is Gaussian ($ \propto \exp(-\|\cdot\|^2/2\sigma^2) $) and the prior $ \Pr(v\mid p_t) $ is uniform, then SIPIT’s argmin of squared distance is exactly the MAP estimator. So SIPIT = nearest-neighbor/MAP in activation space under that noise model.
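Plugging the Gaussian density into the MAP rule (and dropping the uniform prior as a constant) makes the equivalence explicit:

$$ \arg\max_v \; q\big(\widetilde h_t - \Phi(p_t\oplus v)_t\big) \;=\; \arg\max_v \; \exp\!\Big(-\tfrac{1}{2\sigma^2}\big\|\widetilde h_t - \Phi(p_t\oplus v)_t\big\|_2^2\Big) \;=\; \arg\min_v \; \big\|\widetilde h_t - \Phi(p_t\oplus v)_t\big\|_2^2 , $$

which is exactly the SIPIT decision rule from §1(B).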

In contrast, a softmax classification head models $ \Pr(v\mid h_t) $ directly via a learned parametric map $ g $ from $ h_t $ to logits; it does not consult $ \Phi(p_t\oplus v) $ or the forward dynamics of the model — it learns a discriminative boundary.


3 — Formal property used by SIPIT: injectivity

Definition (injectivity): $ \Phi $ is injective (on sequences of length $ T $) if $$ \Phi(s)=\Phi(s') \quad\Rightarrow\quad s=s'. $$

Key fact used: if $ \Phi $ is injective and we know the prefix $ p_t $, then the mapping $ v\mapsto \Phi(p_t\oplus v)_t $ is injective over $ v\in\mathcal{V} $. Thus there is a unique $ v $ whose predicted $ t $-th hidden state equals $ h_t $, so exact matching recovers the token.

Sketch proof for recovery (ideal, noiseless):

  • Observed: $ h_t = \Phi(s)_t = \Phi(p_t\oplus s_t)_t $.
  • For any $ v\neq s_t $: if $ \Phi(p_t\oplus v)_t = \Phi(p_t\oplus s_t)_t $, then the sequences $ p_t\oplus v $ and $ p_t\oplus s_t $ produce identical full hidden matrices (by causal masking, the first $ t-1 $ states depend only on the shared prefix), and injectivity of $ \Phi $ implies $ v = s_t $, a contradiction. So equality cannot hold for distinct $ v $. Thus $ \|\Phi(p_t\oplus v)_t - h_t\| > 0 $ for $ v\neq s_t $ and zero for $ v=s_t $; the minimizer is uniquely $ s_t $.

Thus, in the ideal setting SIPIT recovers tokens exactly.


4 — Mathematical difference vs. classification head

We can express both methods as an argmax over scores $ S(v) $:

  • Softmax classifier:

    $$ S_{\text{soft}}(v) = g(h_t)_v \quad\Rightarrow\quad \hat v = \arg\max_v S_{\text{soft}}(v). $$

    Here $ S_{\text{soft}} $ is a learned discriminative score that depends only on the observed hidden state $ h_t $.

  • SIPIT:

    $$ S_{\text{SIPIT}}(v) = -\big\| \Phi(p_t\oplus v)_t - h_t \big\|_2^2 \quad\Rightarrow\quad \hat v = \arg\max_v S_{\text{SIPIT}}(v). $$

    Here $ S_{\text{SIPIT}} $ uses the **generative forward model** $ \Phi $ to compute the expected activation for each candidate and compares it to the observed activation.

Key contrasts:

  1. Generative vs discriminative: SIPIT is generative (models how tokens produce activations); softmax is discriminative (learns boundaries in activation space).

  2. Access needs:

    • Softmax: only needs the observed $ h_t $ and the learned head $ g $.
    • SIPIT: needs the ability to compute $ \Phi(p_t\oplus v)_t $ for many $ v $ (i.e., white-box or forward access to the model).
  3. Robustness:

    • Softmax is robust and cheap at inference (one forward pass → logits).
    • SIPIT needs up to $ V $ forward computations per token unless pruned — expensive.
  4. Optimality:

    • Under the deterministic forward model + Gaussian noise + uniform prior, SIPIT is the MAP estimator — it uses the model’s generative structure and is (in that sense) optimal.
    • Softmax is optimal under the model’s training objective (cross-entropy), but it only approximates $ \Pr(v\mid h) $ as learned from data; it can make systematic errors if the training data is biased or inadequate.

5 — Complexity and practical algorithms

  • Softmax decoding cost: one matrix multiply per token after obtaining $ h_t $ (computing $ W^\top h_t $ is $ O(dV) $, usually implemented efficiently). The overall per-sequence cost is dominated by the model forward pass.
  • SIPIT naïve cost: for each token $ t $, evaluate $ \Phi(p_t\oplus v)_t $ for each $ v\in\mathcal{V} $: worst case $ O(T\cdot V \cdot C_{\text{fwd}}) $, where $ C_{\text{fwd}} $ is the cost of one forward pass to compute one token’s state (partial if the prefix computation is cached). Caching prefix computations reduces the repeated work: compute the prefix hidden states once, then for each $ v $ compute only the incremental step(s) (see the sketch below). Still heavy.
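A hedged sketch of the cached variant: pay the prefix forward pass once, then one incremental step per candidate. The interface (`encode_prefix`, `step`) is hypothetical (in practice it would wrap the model's attention key/value caching) and is only meant to show where the one-time $ C_{\text{fwd}} $ cost and the $ O(V) $ per-candidate cost separate.

```python
import torch

def sipit_decode_cached(h_t, prefix, encode_prefix, step, vocab_size):
    """SIPIT with prefix caching.

    encode_prefix(prefix) -> cached prefix state (hypothetical interface,
                             e.g. attention key/value caches)
    step(state, v)        -> hidden state of appended token v at position t,
                             reusing (not mutating) the cached state
    """
    state = encode_prefix(prefix)                # one prefix forward pass
    best_v, best_d = None, float("inf")
    for v in range(vocab_size):                  # V cheap incremental steps
        h_pred = step(state, v)                  # Phi(p_t ⊕ v)_t
        d = torch.linalg.vector_norm(h_pred - h_t).item()
        if d < best_d:
            best_v, best_d = v, d
    return best_v
```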

Practical speedups:

  • Use beam / prefix pruning: restrict candidate set using a cheap classifier or top-k from softmax on a proxy head.
  • Use approximate search: build an index mapping candidate activation vectors $ \Phi(p_t\oplus v)_t $ → token and perform approximate nearest-neighbor (ANN) search. This assumes you can precompute these activation prototypes for the given prefixes (but prefixes vary → expensive).
  • Use layer/representation projection: compare in a low-dimensional space (e.g., PCA) to speed up the distance calculation (see the sketch below).
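A minimal sketch of the low-dimensional comparison idea, using a random Gaussian projection (Johnson–Lindenstrauss style) as a stand-in for a PCA basis. It only cheapens the per-candidate distance computation once the candidate activations are available, so it complements rather than replaces pruning.

```python
import numpy as np

def rank_candidates_projected(h_t, candidate_acts, k=64, seed=0):
    """Rank candidates by distance in a random k-dimensional projection.

    candidate_acts: (V, d) array whose rows are Phi(p_t ⊕ v)_t.
    Random Gaussian projections approximately preserve pairwise distances,
    so the top few candidates can be re-checked exactly in full dimension.
    """
    d = h_t.shape[0]
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((d, k)) / np.sqrt(k)     # projection matrix
    dists = np.linalg.norm(candidate_acts @ P - h_t @ P, axis=1)
    return np.argsort(dists)                         # nearest candidates first
```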

6 — Sources of error & robustness analysis

  1. Noise / quantization: if the observed $ \widetilde h_t = h_t + n $ has large $ n $, the nearest neighbor may pick the wrong $ v $. Softmax may be more robust if trained with noise/regularization.
  2. Model mismatch: SIPIT requires the attacker’s computation of $ \Phi $ to match the real $ \Phi $ that produced $ h_t $. If the attacker only has an approximate model or different weights, matching fails.
  3. Partial observations: if only a pooled embedding (e.g., mean pooling) is leaked rather than the full $ h_t $, the mapping $ v\mapsto \text{pooled}(\Phi(p_t\oplus v)) $ may not be injective, and SIPIT is weaker.
  4. Combinatorial explosion: for long sequences or subword tokenization, errors accumulate across steps.

Mathematically, robustness requires separation: for each prefix $ p_t $,

$$ \min_{v\neq s_t} \|\Phi(p_t\oplus v)_t - \Phi(p_t\oplus s_t)_t\|_2 \;=\; \Delta_{p_t} > 0. $$

If the noise magnitude satisfies $ \|n\| < \tfrac12 \Delta_{p_t} $, SIPIT recovers correctly (nearest-neighbor margin argument).
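A small sketch for measuring $ \Delta_{p_t} $ empirically for one prefix, assuming the candidate activations $ \Phi(p_t\oplus v)_t $ have already been computed (e.g., with the cached interface above):

```python
import numpy as np

def separation_margin(candidate_acts, true_idx):
    """Delta_{p_t}: distance from the true token's activation to the
    nearest other candidate's activation, for one prefix.

    candidate_acts: (V, d) array of Phi(p_t ⊕ v)_t; true_idx: index of s_t.
    """
    dists = np.linalg.norm(candidate_acts - candidate_acts[true_idx], axis=1)
    dists[true_idx] = np.inf            # exclude the true token itself
    return float(dists.min())
```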


7 — Formal comparison (table in formulas)

  • Decision rule:

    • Softmax: $ \hat v = \arg\max_v \; g(h_t)_v $.
    • SIPIT: $ \hat v = \arg\min_v \; \| \Phi(p_t\oplus v)_t - h_t\|_2 $.
  • Assumptions:

    • Softmax: head $ g $ is trained; no extra access needed.
    • SIPIT: white-box forward access to $ \Phi $ (or a model copy), injectivity of $ \Phi $ (or local uniqueness).
  • Optimality condition:

    • Softmax: optimal under cross-entropy given training distribution.
    • SIPIT: MAP under deterministic generative model + noise with known noise model (e.g., Gaussian).
  • Cost:

    • Softmax: $ O(dV) $ to compute logits (or optimized).
    • SIPIT: $ O(V) $ forward computations per token (improvable with caching/indexing).

8 — Theoretical statement (informal theorem)

Theorem (informal). If $ \Phi $ is injective and the observed hidden $ h_t $ is exact, then SIPIT recovers the true token $ s_t $ uniquely by minimizing squared Euclidean distance over candidates.

Proof sketch. Follows the injectivity argument in §3: equality of activations implies equality of sequences, hence only the true token yields zero distance. QED.

With noise: if the additive noise $ n $ satisfies $ \|n\| < \tfrac12 \min_{v\neq s_t} \|\Phi(p_t\oplus v)_t - \Phi(p_t\oplus s_t)_t\| = \tfrac12 \Delta_{p_t} $, then the nearest neighbor still returns the correct answer (margin argument).
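The margin argument in full: write $ \widetilde h_t = h_t + n $. For any $ v\neq s_t $, by the (reverse) triangle inequality, $$ \|\Phi(p_t\oplus v)_t - \widetilde h_t\| \;\ge\; \|\Phi(p_t\oplus v)_t - h_t\| - \|n\| \;\ge\; \Delta_{p_t} - \|n\| \;>\; \tfrac{\Delta_{p_t}}{2} \;>\; \|n\| \;=\; \|\Phi(p_t\oplus s_t)_t - \widetilde h_t\|, $$ so the true token remains the strict minimizer of the distance.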


9 — How to use this in your backdoor/privacy study (mathematical framing)

  • Model inversion via SIPIT is equivalent to solving (for each step $ t $) a nearest-neighbor problem in activation space. You can analyze **separation margins** $ \Delta_{p_t} $ and how they change with layer, quantization, or defenses (noise). Provide statistics of $ \Delta_{p_t} $ across prompts: small $ \Delta $ → easy to break privacy with small noise.
  • For backdoor work: define a backdoor adversary that modifies the model so that certain triggers produce distinct activations (increasing the margin) for easier detection or exfiltration. You can formalize a backdoor objective that maximizes the margin between trigger and non-trigger activations: $$ \max_{\theta_{\text{poison}}} \;\; \Delta_{\text{trigger}}(\theta) \;-\; \lambda \cdot \text{utility\_loss}(\theta). $$
  • For defenses: model adding Gaussian noise $ n\sim\mathcal{N}(0,\sigma^2 I) $ to activations — analyze the $ \sigma $ required to reduce the recovery probability. Use margin-based bounds: once the noise magnitude exceeds $ \Delta/2 $, the recovery guarantee fails and the success probability falls (see the sketch below).
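A hedged Monte-Carlo sketch of that defense analysis: given the candidate activations for one prefix and a noise scale $ \sigma $, estimate how often nearest-neighbor recovery still succeeds. Purely illustrative; the activations would come from the model under study.

```python
import numpy as np

def recovery_rate(candidate_acts, true_idx, sigma, trials=1000, seed=0):
    """Estimate Pr[nearest-neighbor recovery succeeds] under N(0, sigma^2 I)
    noise added to the observed activation (Monte-Carlo)."""
    rng = np.random.default_rng(seed)
    h_true = candidate_acts[true_idx]
    hits = 0
    for _ in range(trials):
        h_obs = h_true + sigma * rng.standard_normal(h_true.shape)
        dists = np.linalg.norm(candidate_acts - h_obs, axis=1)
        hits += int(np.argmin(dists) == true_idx)
    return hits / trials
```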

10 — Concrete pseudomathematical pseudocode (SIPIT single step)

Given prefix $ p_t $ and observed $ \widetilde h_t $:

For each $ v\in\mathcal{V} $:

  1. compute $ \bar h_t(v) = \Phi(p_t\oplus v)_t $.
  2. compute the distance $ d(v) = \|\bar h_t(v) - \widetilde h_t\|_2^2 $.

Return $ \hat v = \arg\min_v d(v) $.

In practice, add pruning: restrict candidates to the top-K by a cheap proxy $ q(v\mid p_t,\widetilde h_t) $ (e.g., the softmax head), then run SIPIT only on that set, as in the sketch below.
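A sketch of that pruned variant, reusing the hypothetical `hidden_at` helper from the sketch in §1: the softmax head supplies the cheap proxy scores, and SIPIT's distance check runs only on the top-K survivors.

```python
import torch

def sipit_decode_pruned(h_t, prefix, hidden_at, W, b, K=50):
    """Top-K pruning via the softmax head, then SIPIT on the survivors."""
    proxy_logits = W.T @ h_t + b                       # cheap proxy, shape (V,)
    candidates = torch.topk(proxy_logits, K).indices.tolist()
    best_v, best_d = None, float("inf")
    for v in candidates:
        h_pred = hidden_at(prefix + [v], len(prefix))  # Phi(p_t ⊕ v)_t
        d = torch.linalg.vector_norm(h_pred - h_t).item()
        if d < best_d:
            best_v, best_d = v, d
    return best_v
```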


11 — Summary (short & sweet for you)

  • SIPIT: nearest-neighbor / MAP decoding in activation space using the generative forward model $ \Phi $. Needs model access; exact under injectivity and low noise. Mathematically it solves $ \arg\min_v \|\Phi(p_t\oplus v)_t - h_t\| $.
  • Classification head: a discriminative learned mapping $ g $ that directly estimates $ \Pr(v\mid h_t) $. Cheaper and built for speed; may be miscalibrated but does not require a model forward pass for each candidate.
  • Key difference: SIPIT exploits the full forward dynamics / geometry of activation manifolds; softmax uses a learned boundary inside that activation manifold. SIPIT is generative / consistency-based; softmax is discriminative.

yep — on the surface they look like they take the same observed activation and spit out a token, but they are fundamentally different algorithms with different inputs in practice, different assumptions, and different failure modes. i’ll show this clearly with math + an intuitive example so you can use it in your paper. 💖


More detail on the DIFFERENCE between SIPIT and the Classification Head


Quick summary (one-line)

  • Softmax / classification head: a *discriminative linear (or small MLP) classifier* that maps the observed hidden $ h_t $ → logits → token.
  • SIPIT: a generative / consistency method that uses the forward model $ \Phi $ together with the prefix $ p_t $ to compute expected activations for each candidate token, then picks the candidate whose predicted activation best matches the observed $ h_t $.

They can produce the same output sometimes, but only under particular conditions. Let’s make that precise.


1. Decision rules (compact math)

Softmax classifier (discriminative):

$$ \hat v_{\text{soft}} \;=\; \arg\max_{v\in\mathcal V}\; g(h_t)_v \quad\text{where}\quad g(h)=W^\top h + b. $$

SIPIT (nearest-neighbour in activation space using model forward):

$$ \hat v_{\text{SIPIT}} \;=\; \arg\min_{v\in\mathcal V}\; \big\|\, \Phi(p_t\oplus v)_t \;-\; h_t \,\big\|_2. $$

Key difference in inputs used:

  • Softmax uses only $ h_t $ (and the learned parameters $ W,b $).
  • SIPIT requires $ \Phi $ and the prefix $ p_t $ so it can compute $ \Phi(p_t\oplus v)_t $ for each $ v $.

2. When are they equivalent? — necessary & sufficient condition

They are equivalent on a particular instance $ (p_t,h_t) $ if the ranking of candidates produced by the two rules is identical:

$$ \arg\max_v g(h_t)_v \;=\; \arg\min_v \| \Phi(p_t\oplus v)_t - h_t\|. $$

A sufficient condition for equivalence for all possible $ h_t $ that arise from prefixes $ p_t \oplus v $ is:

There exist $ W,b $ such that for all prefixes $ p $ and tokens $ v $, $$ g\big(\Phi(p\oplus v)_t\big)_v \;>\; g\big(\Phi(p\oplus v)_t\big)_{v'} \quad\forall v'\neq v, $$ i.e., when the observed hidden state is exactly the prototype $ \Phi(p\oplus v)_t $, the head must rank $ v $ first; the SIPIT score does this automatically, since $ -\|\Phi(p\oplus v)_t - \Phi(p\oplus v)_t\|^2 = 0 > -\|\Phi(p\oplus v')_t - \Phi(p\oplus v)_t\|^2 $ for all $ v'\neq v $. But this holds only if the softmax head can implement the nearest-neighbor ranking induced by the distances between activations. In other words:

Equivalence occurs when the softmax head perfectly approximates the distance-based score used by SIPIT. Formally, if there exists a (learned) scoring function $ g $ such that for every candidate $ v $, $$ g(h)_v \;\propto\; -\| \mu_{p,v} - h\|^2 + C_{p,v}, $$ where $ \mu_{p,v} := \Phi(p\oplus v)_t $ (the prototype activation for prefix $ p $ and token $ v $), then maximizing $ g(h)_v $ is the same as choosing the closest prototype.

This is a restrictive requirement because:

  • $ g $ is typically linear in $ h $: $ g(h)_v = w_v^\top h + b_v $.
  • The nearest-prototype rule is quadratic in $ h $ (because $ \| \mu - h\|^2 = h^\top h - 2\mu^\top h + \mu^\top\mu $); but since $ h^\top h $ is common to all candidates it cancels, so the ranking reduces to comparing $ -2\mu_v^\top h + \mu_v^\top\mu_v $ (to be minimized), equivalently maximizing $ 2\mu_v^\top h - \mu_v^\top\mu_v $, which is linear in $ h $ through the $ \mu_v^\top h $ term. So **if** the prototypes $ \mu_v $ are fixed *for the given prefix*, and the classifier's weights satisfy $ w_v = 2\mu_v $ (up to scaling) with biases absorbing $ -\mu_v^\top\mu_v $, then a linear head can exactly reproduce SIPIT. BUT — crucial caveat: the prototypes $ \mu_{p,v} $ depend on the prefix $ p $. That means the weights $ w_v $ would have to change with $ p $ unless $ \mu_{p,v} $ is independent of $ p $ (which it generally is not). So:

In short: softmax = SIPIT only if the classifier’s weights implicitly encode all prefix-dependent prototypes (or prefixes don’t change prototypes). That is generally not true in real LMs.
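A quick numeric check of this reduction (illustrative numpy only; the prototypes are random stand-ins for $ \mu_{p,v} $ at one fixed prefix): setting $ w_v = 2\mu_{p,v} $ and $ b_v = -\mu_{p,v}^\top\mu_{p,v} $ makes the linear score reproduce the nearest-prototype decision.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 16
mu = rng.standard_normal((V, d))            # prototypes mu_{p,v}, one fixed prefix
h = mu[3] + 0.05 * rng.standard_normal(d)   # observation near prototype 3

nearest = int(np.argmin(np.linalg.norm(mu - h, axis=1)))    # SIPIT pick

W_head = 2 * mu                             # rows: w_v = 2 mu_v
b_head = -np.sum(mu * mu, axis=1)           # b_v = -||mu_v||^2
linear = int(np.argmax(W_head @ h + b_head))                # head pick

print(nearest, linear)                      # same index (3) for this prefix
```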


3. Intuition / toy counterexample

Imagine a toy prefix set $ \{p^A, p^B\} $ and a vocabulary $ \{x, y\} $. Suppose:

  • For prefix $ p^A $, the prototypes at position $ t $ are $ \mu_{A,x} = [1,0] $, $ \mu_{A,y}=[0,1] $.
  • For prefix $ p^B $, the prototypes are swapped: $ \mu_{B,x}=[0,1] $, $ \mu_{B,y}=[1,0] $.

If the classifier head has fixed weights $ (w_x, w_y) $ (and biases), there is no single choice that outputs $ x $ on $ h=[1,0] $ and $ y $ on $ h=[0,1] $ for prefix $ A $ while simultaneously outputting $ x $ on $ h=[0,1] $ and $ y $ on $ h=[1,0] $ for prefix $ B $: the head sees only $ h $, and the decisions required at the same $ h $ under the two prefixes contradict each other. SIPIT, however, compares to prototypes computed for each prefix and correctly picks the matching token in both contexts. So SIPIT can be correct while a fixed linear classifier cannot.

This shows why SIPIT can succeed where a static softmax head fails: SIPIT adapts to prefix-dependent geometry by recomputing prototypes per prefix.
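The same toy example as a runnable check (numbers from the bullets above; the head here is a hypothetical one tuned to prefix $ A $'s geometry, with $ w_v = 2\mu_{A,v} $ and $ b_v = -\|\mu_{A,v}\|^2 $):

```python
import numpy as np

mu_A = {"x": np.array([1.0, 0.0]), "y": np.array([0.0, 1.0])}
mu_B = {"x": np.array([0.0, 1.0]), "y": np.array([1.0, 0.0])}   # swapped

def prototype_pick(h, protos):
    """SIPIT-style: nearest prefix-conditioned prototype."""
    return min(protos, key=lambda v: np.linalg.norm(protos[v] - h))

# Fixed linear head encoding prefix A's prototypes only
w = {v: 2 * mu_A[v] for v in mu_A}
b = {v: -float(mu_A[v] @ mu_A[v]) for v in mu_A}

def linear_pick(h):
    return max(w, key=lambda v: float(w[v] @ h) + b[v])

h_obs = mu_B["x"]                        # token x emitted under prefix B
print(prototype_pick(h_obs, mu_B))       # 'x'  -- correct, uses prefix-B prototypes
print(linear_pick(h_obs))                # 'y'  -- wrong, head frozen to prefix A
```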


4. Another view: discriminative boundary vs prototype matching

  • Softmax learns a decision boundary in activation space. If the manifold of activations for different tokens (across prefixes) is complex and interleaved, a single linear head may misclassify some points.
  • SIPIT effectively performs 1-nearest-neighbour to prefix-conditioned prototypes, which is more flexible because it recomputes the expected activation for each candidate given the prefix.

5. Access & cost matter

  • Access: softmax requires only $ h_t $ and the weights $ W $ (already inside the model). SIPIT requires the ability to compute $ \Phi(p\oplus v)_t $ for each candidate — i.e., forward access to the model and the prefix $ p $.
  • Cost: softmax: one matrix multiply. SIPIT: potentially $ V $ forward computations (though caching prefixes and pruning help).

So even if they can produce the same output sometimes, SIPIT needs stronger attacker access and is costlier — but it’s more powerful in the sense that it can exploit the generative structure and prefix dependence to decode tokens where a fixed discriminative head might be confused.


6. Noise & robustness: who wins?

  • If $ h_t $ is noisy, SIPIT can still be MAP-optimal if you model the noise and solve the nearest-prototype problem under that noise model (and if you have the exact $ \Phi $).
  • A trained softmax head may be more robust to some noise if it was trained with noisy activations; but it may also be systematically biased by the training distribution.

Mathematically: the success of SIPIT requires the separations $ \Delta_{p_t} = \min_{v\neq s_t}\|\mu_{p_t,v}-\mu_{p_t,s_t}\| $ to be large relative to the noise. For softmax, success depends on the margin $ (w_{s_t}-w_v)^\top h_t + (b_{s_t}-b_v) $ being positive for all $ v\neq s_t $.


7. Practical takeaways for your study

  • SIPIT reveals that activations contain recoverable, prefix-conditioned prototypes. Softmax heads may approximate that mapping for frequent contexts, but they are not guaranteed to encode the full, prefix-dependent prototype structure.
  • If an attacker only has access to $ h_t $ and not $ \Phi $ or the prefix, they cannot run SIPIT; they rely on whatever head or side-information exists. This limits real-world attackers but does not eliminate the risk, because activations are often stored or shared.
  • For your backdoor/privacy paper: quantify scenarios where softmax would misclassify but SIPIT recovers (or vice versa). That highlights that storing activations is dangerous even if model heads are available — because inversion may give stronger recovery than simple supervised classification of embeddings.

8. Want a formal equivalence condition?

Short form: for a fixed prefix $ p $, SIPIT reduces to a linear score in $ h $: $$ -\| \mu_{p,v} - h\|^2 = -h^\top h + 2\mu_{p,v}^\top h - \mu_{p,v}^\top\mu_{p,v}. $$ Dropping the common $ -h^\top h $, ranking by SIPIT is equivalent to ranking by $$ s_{p}(v; h) \;=\; 2\mu_{p,v}^\top h - \mu_{p,v}^\top\mu_{p,v}. $$ So if a softmax head can realize weights $ w_v = 2\mu_{p,v} $ and biases $ b_v = -\mu_{p,v}^\top\mu_{p,v} $, it will match SIPIT on that prefix. The obstacle is that $ \mu_{p,v} $ depends on $ p $; to match SIPIT globally, the head must implicitly encode all prefix-dependent prototypes — generally infeasible with a fixed head.