LLM Inference Walkthrough — Tensor Shapes and Core Formulas Across the Whole Pipeline

Lay out a Llama 3-style dense decoder-only model end to end, from embedding to sampling, marking where the common variants (MHA / MQA / GQA / MLA, RMSNorm / LayerNorm, SwiGLU / GeGLU, Flash Attention / Paged Attention) enter the math along the way, so that after reading you can draw this diagram from memory. One fact frames the whole article: the skeleton has barely changed since the original Transformer of 2017; every “inference optimization” is local surgery somewhere on that same skeleton.

Notation Conventions — Llama 3 8B as the Reference

The same notation is used throughout, with Llama 3 8B as the concrete example:

Symbol	Meaning	Example value (Llama 3 8B)
$B$	batch size	2
$S$	prompt length	10
$L$	number of layers	32
$H$	hidden dim	4096
$V$	vocab size	128256
$n_q$	number of Q heads	32
$n_{kv}$	number of KV heads (GQA)	8
$d$	per-head dim $= H/n_q$	128
$I$	FFN intermediate dim	14336
$T$	number of generated tokens	100
$t$	current decode step	$1..T$

All shape annotations follow PyTorch convention, written as [B, ..., H]; weight matrices follow “in × out”, written as $W \in \mathbb{R}^{[\text{in}, \text{out}]}$ .

Core Formulas Quick Reference — Embedding · Norm · Attn · FFN · LM Head

Embedding

\mathbf{x}_i = E[\text{token\_id}_i] \in \mathbb{R}^{H}

$\mathbf{x}_i$ — embedding vector at position $i$
$E \in \mathbb{R}^{V \times H}$ — embedding lookup table ( $V$ tokens, each $H$ -dim)
$\text{token\_id}_i \in \{0, 1, \ldots, V-1\}$ — integer ID of the $i$ -th input token

Many implementations share $E$ with the LM Head’s $W_{\text{lm}}$ (tied embedding), saving memory and providing mild regularization; Llama 3 8B, the reference model here, keeps the two separate (the small Llama 3.2 1B/3B models do tie them).

Normalization: LayerNorm vs RMSNorm

Standard LayerNorm:

\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^{2} + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}, \quad \mu = \tfrac{1}{H}\sum x_i,\ \sigma^{2} = \tfrac{1}{H}\sum (x_i - \mu)^{2}

$\mathbf{x} \in \mathbb{R}^{H}$ — single-token hidden state
$\mu, \sigma^{2} \in \mathbb{R}$ — mean and variance computed across the $H$ dimensions
$\epsilon$ — small constant to avoid divide-by-zero (typically $10^{-5}$ )
$\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^{H}$ — learnable scale / shift
$\odot$ — element-wise multiplication

RMSNorm (the mainstream choice for Llama / Mistral / Qwen):

\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{H}\sum_{i=1}^{H} x_i^{2} + \epsilon}} \odot \boldsymbol{\gamma}

$\mathbf{x} \in \mathbb{R}^{H}$ — single-token hidden state
$\boldsymbol{\gamma} \in \mathbb{R}^{H}$ — learnable per-channel scale (no $\boldsymbol{\beta}$ )
$\epsilon$ — small constant to avoid divide-by-zero
The denominator is the root mean square of $\mathbf{x}$ — hence “RMSNorm”

RMSNorm drops the mean and $\boldsymbol{\beta}$ , halving the norm’s parameters ( $\boldsymbol{\gamma}$ only) and skipping the mean-subtraction pass; empirically the quality cost is nearly zero. Llama-style models place it as pre-norm: the norm lives inside the residual branch, and the residual main path bypasses it.
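To make the difference between the two norms concrete, here is a minimal pure-Python sketch operating on one token's hidden vector. The function names and the `eps` default are illustrative assumptions, not taken from any particular engine.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one token's hidden vector x (a list of floats)."""
    h = len(x)
    mu = sum(x) / h
    var = sum((v - mu) ** 2 for v in x) / h
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root mean square; no mean subtraction, no shift."""
    h = len(x)
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / h + eps)
    return [v * inv * g for v, g in zip(x, gamma)]

x = [1.0, -2.0, 3.0, -4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y_rms = rms_norm(x, ones)          # mean square of the output is ~1
y_ln = layer_norm(x, ones, zeros)  # mean of the output is ~0
```

With unit scale, the RMSNorm output has mean square close to 1 but keeps whatever mean the input had, while the LayerNorm output is also re-centered.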

Q/K/V Projection + Positional Encoding

Q = X W_Q, \quad K = X W_K, \quad V = X W_V

$X \in \mathbb{R}^{B \times S \times H}$ — output of the previous RMSNorm
$W_Q \in \mathbb{R}^{H \times n_q d}$ — Q projection weight
$W_K, W_V \in \mathbb{R}^{H \times n_{kv} d}$ — K, V projection weights (narrower under GQA since $n_{kv} < n_q$ )
$Q \in \mathbb{R}^{B \times S \times n_q d}$ , $K, V \in \mathbb{R}^{B \times S \times n_{kv} d}$ — projection outputs, later reshaped to expose the head dimension

RoPE (Rotary Positional Embedding) core idea: rather than adding “absolute position $m$ ” to the embedding (like the original Transformer’s sinusoidal), let position act as a rotation on Q and K, so that when the two vectors are dotted only the relative position $n - m$ survives.

Split each head’s $d$ dimensions into $d/2$ two-dim subspaces; the $k$ -th subspace ( $k = 0, 1, \ldots, d/2 - 1$ ) corresponds to coordinates $(q_{2k}, q_{2k+1})$ . At position $m$ that pair is multiplied by a 2D rotation matrix with angle $m\theta_k$ :

\begin{pmatrix} q'^{(m)}_{2k} \\ q'^{(m)}_{2k+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_k) & -\sin(m\theta_k) \\ \sin(m\theta_k) & \phantom{-}\cos(m\theta_k) \end{pmatrix} \begin{pmatrix} q_{2k} \\ q_{2k+1} \end{pmatrix}, \quad \theta_k = \text{base}^{-2k/d}

$m \in \{0, 1, \ldots, S-1\}$ — absolute position of the current token
$k \in \{0, 1, \ldots, d/2 - 1\}$ — 2D subspace index (every two dims of the per-head dim $d$ form one pair)
$(q_{2k}, q_{2k+1})$ — the $k$ -th 2D pair of the projected Q vector
$(q'^{(m)}_{2k}, q'^{(m)}_{2k+1})$ — the rotated pair at position $m$
$\theta_k$ — base angular velocity of the $k$ -th subspace
$\text{base}$ — frequency-decay base ( $10000$ in the original RoPE and Llama 1/2; Llama 3 raises it to $500000$ ; NTK-aware scaling enlarges it further at inference time)

Expanded to scalars:

q'^{(m)}_{2k} = q_{2k}\cos(m\theta_k) - q_{2k+1}\sin(m\theta_k), \qquad q'^{(m)}_{2k+1} = q_{2k}\sin(m\theta_k) + q_{2k+1}\cos(m\theta_k)

Same symbols as above — this just unrolls the $2\times 2$ matrix into two scalar identities, easier to map to code.

This is the standard counterclockwise rotation matrix $R(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \phantom{-}\cos\phi \end{pmatrix}$ ; acting on a 2D vector is equivalent to multiplying by $e^{i\phi}$ in the complex plane — which is why the official Llama repo reshapes $(q_{2k}, q_{2k+1})$ to complex64 and multiplies by $\cos(m\theta_k) + i\sin(m\theta_k)$ directly (HuggingFace equivalently pairs $(q_k, q_{k+d/2})$ in a “half-rotation” form; the two differ only by a coordinate permutation). K rotates by the same angle to get $\mathbf{k}'^{(n)}$ .
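The equivalence between the 2×2 rotation and complex multiplication can be checked with a small sketch. The helper names are hypothetical, and `base=10000.0` is just the value used in the formulas above; this is the interleaved-pair layout, not the HuggingFace half-rotation layout.

```python
import cmath
import math

def rope_rotate(vec, m, base=10000.0):
    """Rotate consecutive pairs (vec[2k], vec[2k+1]) by angle m * theta_k."""
    d = len(vec)
    out = [0.0] * d
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = vec[2 * k], vec[2 * k + 1]
        out[2 * k] = a * c - b * s
        out[2 * k + 1] = a * s + b * c
    return out

def rope_rotate_complex(vec, m, base=10000.0):
    """Same rotation, written as multiplication by e^{i m theta_k}."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)
        z = complex(vec[2 * k], vec[2 * k + 1]) * cmath.exp(1j * m * theta)
        out += [z.real, z.imag]
    return out

v = [0.3, -1.2, 0.7, 2.0]
a = rope_rotate(v, 5)
b = rope_rotate_complex(v, 5)
```

Both paths give the same vector up to floating-point error, and the rotation leaves the vector's norm unchanged.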

Why does rotation encode relative position? Let $\mathbf{R}_m$ denote the block-diagonal full- $d$ rotation (the $k$ -th $2\times 2$ block uses angle $m\theta_k$ ). It’s orthogonal and satisfies $\mathbf{R}_m^{\top}\mathbf{R}_n = \mathbf{R}_{n-m}$ , so:

\langle \mathbf{R}_m \mathbf{q},\; \mathbf{R}_n \mathbf{k} \rangle = \mathbf{q}^{\top} \mathbf{R}_m^{\top} \mathbf{R}_n \mathbf{k} = \mathbf{q}^{\top} \mathbf{R}_{n-m} \mathbf{k}

$\mathbf{q}, \mathbf{k} \in \mathbb{R}^{d}$ — per-head Q, K vectors (pre-rotation)
$\mathbf{R}_m, \mathbf{R}_n \in \mathbb{R}^{d \times d}$ — block-diagonal rotation matrices for positions $m$ , $n$
$\langle \cdot, \cdot \rangle$ — standard inner product
Last step uses $\mathbf{R}_m^{\top} \mathbf{R}_n = \mathbf{R}_{n-m}$ : rotations are orthogonal and angles are additive

When attention computes $QK^{\top}$ , position enters every $(q_i, k_j)$ score only through the difference $j - i$ : absolute position cancels inside the dot product, leaving relative position. This is RoPE’s key property and the main reason it behaves more stably than additive positional encodings.
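The relative-position property is easy to verify numerically. The sketch below (hypothetical helper names, toy 4-dim vectors, `base=10000.0` assumed) rotates the same q and k at two different absolute offsets with the same gap and compares the scores.

```python
import math

def rope_rotate(vec, m, base=10000.0):
    """Rotate consecutive pairs of vec by angle m * theta_k."""
    d = len(vec)
    out = [0.0] * d
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = vec[2 * k], vec[2 * k + 1]
        out[2 * k] = a * c - b * s
        out[2 * k + 1] = a * s + b * c
    return out

def score(q, k, m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return sum(a * b for a, b in zip(rope_rotate(q, m), rope_rotate(k, n)))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
near = score(q, k, 3, 7)      # positions 3 and 7, gap 4
far = score(q, k, 103, 107)   # positions 103 and 107, same gap 4
```

Shifting both positions by 100 leaves the score unchanged up to rounding, since only the gap of 4 enters the product.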

Frequency spectrum design: $\theta_k = \text{base}^{-2k/d}$ (the numbers below take $\text{base} = 10000$ ; $d$ is the per-head dim, not model hidden dim) assigns the $d/2$ subspaces angular velocities from fast to slow:

$k = 0$ : $\theta_0 = 1$ , period $2\pi \approx 6.28$ tokens; carries the “near-neighbor” signal.
$k = d/2 - 1$ : $\theta = \text{base}^{-(d-2)/d} \approx 1.15 \times 10^{-4}$ for $d = 128$ , period $2\pi/\theta \approx 54\text{K}$ tokens (approaching $2\pi \cdot 10000 \approx 63\text{K}$ as $d$ grows); carries the “long-range” signal.

The geometric distribution lets one head carry positional signals at many scales simultaneously — isomorphic to the original Transformer’s sinusoidal, just moved from addition to multiplication.
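A short sketch of the spectrum, assuming $d = 128$ and $\text{base} = 10000$ as in the discussion above (the function name is made up for illustration):

```python
import math

def rope_periods(d=128, base=10000.0):
    """Return (thetas, periods): angular velocity and period in tokens per 2D subspace."""
    thetas = [base ** (-2.0 * k / d) for k in range(d // 2)]
    periods = [2 * math.pi / t for t in thetas]
    return thetas, periods

thetas, periods = rope_periods()
fastest = periods[0]    # 2*pi tokens
slowest = periods[-1]   # tens of thousands of tokens
```

The periods grow geometrically from about 6 tokens to tens of thousands, which is the multi-scale coverage described above.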

Long-context scaling: training only saw $m \le L_{\text{train}}$ , and low-frequency subspaces can’t even complete one full period inside $L_{\text{train}}$ ; once inference reaches $m > L_{\text{train}}$ , the low-frequency angles fall outside the training distribution and attention immediately degrades. Three common fixes all modify $\theta_k$ :

Position Interpolation (Chen et al. 2023): $m \to m/s$ , equivalent to $\theta_k \to \theta_k/s$ — compresses all frequencies uniformly. Simple but wastes high-frequency precision.
NTK-aware scaling: scales low frequencies while preserving high ones, equivalent to $\text{base} \to \text{base} \cdot s^{d/(d-2)}$ .
YaRN (Peng et al. 2023): band-wise treatment. High frequencies (period $\ll L_{\text{train}}$ , full periods seen during training) are left alone, low frequencies (period $\gg L_{\text{train}}$ , no full periods seen) are PI-scaled, and the mid band interpolates smoothly; additionally a $1/\sqrt{t}$ temperature correction (the paper’s $t$ , unrelated to the decode step $t$ used here) cancels the attention-entropy drift introduced by the scaling. Llama 3.1 / 3.2 went from 8K → 128K with a closely related band-wise frequency scaling (the `llama3` rope-scaling type in HuggingFace) rather than YaRN’s exact recipe.
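The first two fixes can be compared directly on $\theta_k$. This is a minimal sketch with hypothetical function names; it covers only Position Interpolation and NTK-aware scaling, not YaRN's band-wise ramp or temperature term.

```python
def thetas(d, base):
    """Unscaled RoPE angular velocities."""
    return [base ** (-2.0 * k / d) for k in range(d // 2)]

def pi_thetas(d, base, s):
    """Position Interpolation: every frequency is divided by s."""
    return [t / s for t in thetas(d, base)]

def ntk_thetas(d, base, s):
    """NTK-aware: enlarge the base so only the lowest frequency is divided by s."""
    return thetas(d, base * s ** (d / (d - 2)))

d, base, s = 128, 10000.0, 4.0
orig = thetas(d, base)
pi = pi_thetas(d, base, s)
ntk = ntk_thetas(d, base, s)
```

Under NTK-aware scaling the highest frequency ( $k = 0$ ) is untouched and the lowest is scaled by exactly $1/s$, whereas PI scales every frequency by $1/s$.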

RoPE acts only on Q and K, not on V — V is the weighted value itself and needs no positional signal.

Scaled Dot-Product Attention

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V

$Q, K, V$ — projected and head-split tensors, with the last two dims being $[S, d]$ (head dim broadcast implicit)
$QK^{\top} \in \mathbb{R}^{S \times S}$ — similarity matrix between every $(q_i, k_j)$ pair
$\sqrt{d}$ — scaling factor that keeps softmax out of the near-zero-gradient region
$M \in \mathbb{R}^{S \times S}$ — causal mask with upper triangle set to $-\infty$ (position $i$ can only see $\le i$ )
softmax normalizes along the K dim, producing attention weights that then re-weight V back to $[S, d]$
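For a single head, the formula above can be sketched in a few lines of plain Python. This is a readability-first illustration (quadratic loops, lists instead of tensors), not how any kernel implements it.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries become exactly 0."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    z = sum(e)
    return [v / z for v in e]

def causal_attention(Q, K, V):
    """Q, K, V: lists of S vectors (one head). Returns S output vectors."""
    S, d = len(Q), len(Q[0])
    out = []
    for i in range(S):
        # Causal mask: position i may only attend to positions j <= i.
        scores = [
            sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(d)
            if j <= i else float("-inf")
            for j in range(S)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(S)) for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)
```

Because of the mask, the first position can only see itself, so its output equals the first row of V.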

Multi-Head Variants: MHA / MQA / GQA / MLA

All four variants differ only on the K, V side: Q is always $n_q$ independent heads; what changes is how many independent K, V sets exist, and whether K, V are low-rank compressed. Below we write head-level formulas for original MHA and each of the three variants under one notation — dropping the batch and layer dims, looking at head $h$ at position $i$ .

Variant	$n_{kv}$	Per-token cache (per layer, fp16)	Representative models
MHA	$= n_q$	$2 \cdot n_q \cdot d \cdot 2\text{B}$	GPT-2/3, Llama 1/2 7B
GQA	grouped $< n_q$	$2 \cdot n_{kv} \cdot d \cdot 2\text{B}$	Llama 3, Qwen2, Mistral
MQA	$= 1$	$2 \cdot d \cdot 2\text{B}$	PaLM, Falcon
MLA	low-rank compressed	$(d_c + d_r) \cdot 2\text{B}$	DeepSeek V2/V3

MHA — Multi-Head Attention (original, Vaswani et al. 2017)

Each Q head gets its own K, V — $n_q$ independent sets:

\mathbf{q}^{(h)}_i = W_Q^{(h)} \mathbf{x}_i, \quad \mathbf{k}^{(h)}_i = W_K^{(h)} \mathbf{x}_i, \quad \mathbf{v}^{(h)}_i = W_V^{(h)} \mathbf{x}_i, \quad h = 1, \ldots, n_q

\text{head}^{(h)} = \text{softmax}\!\left(\frac{Q^{(h)} {K^{(h)}}^{\top}}{\sqrt{d}} + M\right) V^{(h)}, \qquad \text{Attn} = [\text{head}^{(1)}; \ldots; \text{head}^{(n_q)}] \cdot W_O

$W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} \in \mathbb{R}^{H \times d}$ — per-head Q, K, V projections (concatenated into a big matrix in practice, mathematically equivalent)
Each token writes $n_q$ pairs $(\mathbf{k}^{(h)}, \mathbf{v}^{(h)})$ into the cache — $2 n_q d$ scalars in total
Llama 2 7B: $n_q = 32, d = 128$ , per-token cache = $2 \cdot 32 \cdot 128 \cdot 2\text{B} = 16\text{ KB}$ / layer

MQA — Multi-Query Attention (Shazeer 2019)

All $n_q$ Q heads share a single K, V — only 1 set left:

\mathbf{q}^{(h)}_i = W_Q^{(h)} \mathbf{x}_i, \quad \mathbf{k}_i = W_K \mathbf{x}_i, \quad \mathbf{v}_i = W_V \mathbf{x}_i

\text{head}^{(h)} = \text{softmax}\!\left(\frac{Q^{(h)} K^{\top}}{\sqrt{d}} + M\right) V, \qquad h = 1, \ldots, n_q

$W_K, W_V \in \mathbb{R}^{H \times d}$ — one shared K, V projection across all heads
Cache shrinks to $2d$ — $n_q\times$ smaller than MHA; but capacity-constrained, large models drop quality if MQA is applied directly — rarely used standalone in practice, mostly superseded by GQA

GQA — Grouped-Query Attention (Ainslie et al. 2023)

Split the $n_q$ Q heads into $n_{kv}$ groups, share K, V within each group — a continuous interpolation between MHA and MQA:

\mathbf{q}^{(h)}_i = W_Q^{(h)} \mathbf{x}_i, \quad \mathbf{k}^{(g)}_i = W_K^{(g)} \mathbf{x}_i, \quad \mathbf{v}^{(g)}_i = W_V^{(g)} \mathbf{x}_i

\text{head}^{(h)} = \text{softmax}\!\left(\frac{Q^{(h)} {K^{(g(h))}}^{\top}}{\sqrt{d}} + M\right) V^{(g(h))}, \qquad g(h) = \lceil h \cdot n_{kv} / n_q \rceil \quad (h = 1, \ldots, n_q)

$W_K^{(g)}, W_V^{(g)} \in \mathbb{R}^{H \times d}$ , $g = 1, \ldots, n_{kv}$ — one K, V set per group
$g(h)$ — group index of Q head $h$
$n_{kv} = n_q$ recovers MHA, $n_{kv} = 1$ recovers MQA
Kernels materialize only $n_{kv}$ K, V sets; during score computation, K and V are broadcast along the group dim up to $n_q$ , not actually replicated
Llama 3 70B: $n_q = 64, n_{kv} = 8, d = 128$ , per-token cache = $2 \cdot 8 \cdot 128 \cdot 2\text{B} = 4\text{ KB}$ / layer — 8× smaller than a same-shape MHA
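The head-to-group mapping is simple enough to sketch. The version below uses 0-indexed heads and groups, as code usually does, and assumes consecutive Q heads share a KV group; the function name is hypothetical.

```python
def kv_group(h, n_q, n_kv):
    """Map 0-indexed Q head h to its 0-indexed KV group."""
    assert n_q % n_kv == 0, "n_q must be a multiple of n_kv"
    return h // (n_q // n_kv)

# Llama 3 8B shape: 32 Q heads share 8 KV heads, 4 Q heads per group.
groups = [kv_group(h, 32, 8) for h in range(32)]
mha = [kv_group(h, 32, 32) for h in range(32)]   # n_kv = n_q: one group per head
mqa = [kv_group(h, 32, 1) for h in range(32)]    # n_kv = 1: a single shared group
```

Setting `n_kv` to `n_q` or to 1 recovers the MHA and MQA endpoints.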

MLA — Multi-Head Latent Attention (DeepSeek V2 2024)

GQA only shrinks cache linearly in head count; MLA jointly compresses K and V into a low-rank latent $\mathbf{c}^{KV}$ , and breaks RoPE off into a shared shallow branch. Four steps:

(1) Content branch — K, V share a down-projection; only the latent $\mathbf{c}^{KV}_i$ goes into the cache:

\mathbf{c}^{KV}_i = W^{DKV} \mathbf{x}_i, \quad \mathbf{k}^{C,(h)}_i = W_K^{U,(h)} \mathbf{c}^{KV}_i, \quad \mathbf{v}^{(h)}_i = W_V^{U,(h)} \mathbf{c}^{KV}_i

(2) Q-side low-rank — saves training memory; Q is never cached at inference:

\mathbf{c}^{Q}_i = W^{DQ} \mathbf{x}_i, \quad \mathbf{q}^{C,(h)}_i = W_Q^{U,(h)} \mathbf{c}^{Q}_i

(3) Decoupled RoPE branch — K side computes one shared $d_r$ -dim vector reused by all heads:

\mathbf{q}^{R,(h)}_i = \text{RoPE}(W_Q^{R,(h)} \mathbf{c}^{Q}_i), \quad \mathbf{k}^{R}_i = \text{RoPE}(W^{KR} \mathbf{x}_i)

(4) Concat + attention — content and RoPE parts are concatenated along the head dim before scoring:

\mathbf{q}^{(h)}_i = [\mathbf{q}^{C,(h)}_i; \mathbf{q}^{R,(h)}_i], \quad \mathbf{k}^{(h)}_i = [\mathbf{k}^{C,(h)}_i; \mathbf{k}^{R}_i]

\text{head}^{(h)} = \text{softmax}\!\left(\frac{\mathbf{q}^{(h)\top}_i \mathbf{k}^{(h)}_{\le i}}{\sqrt{d + d_r}} + M\right) \mathbf{v}^{(h)}_{\le i}

$W^{DKV} \in \mathbb{R}^{H \times d_c}$ — shared KV down-projection (DeepSeek V3 uses $d_c = 512$ )
$W_K^{U,(h)}, W_V^{U,(h)} \in \mathbb{R}^{d_c \times d}$ — per-head up-projections
$W^{KR} \in \mathbb{R}^{H \times d_r}$ — shared RoPE K branch (DeepSeek V3 uses $d_r = 64$ )
$\mathbf{c}^{KV}_i \in \mathbb{R}^{d_c}, \mathbf{k}^{R}_i \in \mathbb{R}^{d_r}$ — these two are all that MLA actually caches — $d_c + d_r$ scalars
DeepSeek V3: $n_q = 128, d = 128, d_c = 512, d_r = 64$ , per-token cache = $(512 + 64) \cdot 2\text{B} = 1152$ bytes / layer, about 3.6× below an 8-KV-head GQA at the same $d$ (4 KB) and about 57× below same-shape MHA (64 KB)

Why does RoPE need its own branch? At inference there’s a “weight-folding” trick — by associativity, fold $W_K^{U,(h)}$ into $W_Q^{U,(h)}$ :

{\mathbf{q}^{C,(h)}_i}^{\top} \mathbf{k}^{C,(h)}_j = {\mathbf{c}^{Q}_i}^{\top} \underbrace{(W_Q^{U,(h)})^{\top} W_K^{U,(h)}}_{\text{pre-multiplied offline}} \mathbf{c}^{KV}_j

K’s content part is never actually reconstructed — attention runs directly on the cached $\mathbf{c}^{KV}$ . But RoPE’s rotation angle depends on absolute position $m$ , so it cannot be folded into fixed weights — applying RoPE on reconstructed K kills the fold. DeepSeek’s fix: pull RoPE out into an independent $d_r$ -dim shallow branch, so “foldable content” and “must-rotate-live position” don’t interfere. Cache compression and relative-position signal both survive.
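The folding identity is plain associativity, which a toy numeric check makes tangible. Sizes, names, and random weights below are illustrative assumptions; the matrices follow the column-vector convention of the formula ( $\mathbf{q} = W \mathbf{c}$ ).

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """W is a list of rows; returns W @ x."""
    return [dot(row, x) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def matmul(A, B):
    Bt = transpose(B)
    return [[dot(r, c) for c in Bt] for r in A]

random.seed(0)
d, d_q, d_c = 4, 6, 5  # toy head dim, Q latent dim, KV latent dim
W_uq = [[random.uniform(-1, 1) for _ in range(d_q)] for _ in range(d)]
W_uk = [[random.uniform(-1, 1) for _ in range(d_c)] for _ in range(d)]
c_q = [random.uniform(-1, 1) for _ in range(d_q)]
c_kv = [random.uniform(-1, 1) for _ in range(d_c)]

# Explicit path: reconstruct q and k, then take the dot product.
explicit = dot(matvec(W_uq, c_q), matvec(W_uk, c_kv))

# Folded path: pre-multiply the two up-projections once, score on the latents.
M = matmul(transpose(W_uq), W_uk)   # shape d_q x d_c
folded = dot(c_q, matvec(M, c_kv))
```

The folded path never materializes the per-head key, which is why the cache only needs the latent. Inserting a position-dependent rotation between the two up-projections would make `M` depend on the position pair, which is what breaks the fold.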

Overview: Params / FLOPs / KV cache

Below we decompose each variant’s cost step by step. Conventions: one attention sublayer per layer, prefill length $S$ , ignoring norm / bias / softmax and other non-matmul terms; a linear projection $X W$ (with $X \in \mathbb{R}^{S \times m}, W \in \mathbb{R}^{m \times n}$ ) counts as $2Smn$ FLOPs (2 FLOPs per MAC); $QK^{\top}$ and $AV$ do not deduct the causal-mask triangular half. MLA additionally uses $d_q'$ = Q-side latent dim (DeepSeek V3 uses $1536$ ).

Parameters (per layer)

Step	MHA	MQA	GQA	MLA
Q proj	$H \cdot n_q d$	$H \cdot n_q d$	$H \cdot n_q d$	$H d_q' + d_q' \cdot n_q d$
K proj	$H \cdot n_q d$	$H \cdot d$	$H \cdot n_{kv} d$	$H d_c + d_c \cdot n_q d$
V proj	$H \cdot n_q d$	$H \cdot d$	$H \cdot n_{kv} d$	$d_c \cdot n_q d$ (shares $W^{DKV}$ with K)
RoPE branch	—	—	—	$d_q' \cdot n_q d_r + H d_r$
$W_O$	$n_q d \cdot H$	$n_q d \cdot H$	$n_q d \cdot H$	$n_q d \cdot H$
Total	$4 H n_q d$	$2 H n_q d + 2 H d$	$2 H n_q d + 2 H n_{kv} d$	sum of rows above
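As a sanity check on the Total row, here is a small sketch that counts attention parameters for the MHA / MQA / GQA family (MLA is left out; the function name is hypothetical), plugged with the Llama 3 8B shape from the notation table.

```python
def attn_params(H, n_q, n_kv, d):
    """Per-layer attention parameter count, biases ignored."""
    q = H * n_q * d          # W_Q
    kv = 2 * H * n_kv * d    # W_K and W_V
    o = n_q * d * H          # W_O
    return q + kv + o

llama3_8b = attn_params(H=4096, n_q=32, n_kv=8, d=128)
same_shape_mha = attn_params(H=4096, n_q=32, n_kv=32, d=128)
same_shape_mqa = attn_params(H=4096, n_q=32, n_kv=1, d=128)
```

For this shape GQA lands at roughly 42 M parameters per layer versus roughly 67 M for same-shape MHA.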

Prefill FLOPs (per layer, sequence length $S$ )

Step	MHA	MQA	GQA	MLA
Q proj	$2 S H n_q d$	$2 S H n_q d$	$2 S H n_q d$	$2 S H d_q' + 2 S d_q' n_q d$
K proj	$2 S H n_q d$	$2 S H d$	$2 S H n_{kv} d$	$2 S H d_c + 2 S d_c n_q d$
V proj	$2 S H n_q d$	$2 S H d$	$2 S H n_{kv} d$	$2 S d_c n_q d$
RoPE branch	$\mathcal{O}(S n_q d)$	$\mathcal{O}(S n_q d)$	$\mathcal{O}(S n_q d)$	$2 S d_q' n_q d_r + 2 S H d_r$
$QK^{\top}$	$2 n_q S^2 d$	$2 n_q S^2 d$	$2 n_q S^2 d$	$2 n_q S^2 (d + d_r)$
softmax · $V$	$2 n_q S^2 d$	$2 n_q S^2 d$	$2 n_q S^2 d$	$2 n_q S^2 d$
$W_O$	$2 S H n_q d$	$2 S H n_q d$	$2 S H n_q d$	$2 S H n_q d$

“RoPE branch” in the MHA/MQA/GQA columns means the elementwise rotation applied to Q, K — $\mathcal{O}(S n_q d)$ , negligible next to matmuls. In MLA it refers specifically to the extra $W_Q^R$ / $W^{KR}$ projections, which are genuine matmuls and must be counted.

KV cache (per token, per layer, fp16)

Variant	What’s cached	Bytes
MHA	$n_q$ pairs of $(\mathbf{k}, \mathbf{v}) \in \mathbb{R}^d$	$2 \cdot n_q \cdot d \cdot 2\text{B}$
MQA	1 pair $(\mathbf{k}, \mathbf{v})$	$2 \cdot d \cdot 2\text{B}$
GQA	$n_{kv}$ pairs $(\mathbf{k}, \mathbf{v})$	$2 \cdot n_{kv} \cdot d \cdot 2\text{B}$
MLA	$\mathbf{c}^{KV} \in \mathbb{R}^{d_c}$ + $\mathbf{k}^R \in \mathbb{R}^{d_r}$	$(d_c + d_r) \cdot 2\text{B}$

Three observations from reading these tables horizontally:

Most of the surgery is on the K/V side. $W_O$ and $AV$ are identical across the four, and the Q projection is identical across MHA/MQA/GQA (only MLA makes it low-rank), so total attention params and FLOPs never differ by an order of magnitude at the same model size.
MHA → MQA/GQA saves params, FLOPs, and cache together; MLA trades params/FLOPs for cache. GQA shrinks $n_{kv}$ , scaling K/V projections, their FLOPs, and cache down linearly. MLA does the opposite: it adds the DKV down-projection, the UK/UV up-projections, and a dedicated RoPE branch, leaving params and prefill FLOPs at the same order as a same-size GQA, in exchange for shrinking per-token cache from tens of KB (same-shape MHA) down to ~1 KB.
The $S^2$ term doesn’t differ across MHA/MQA/GQA. $QK^{\top}$ and $AV$ are both $2 n_q S^2 d$ (K, V sharing is broadcast only and never reduces the quadratic term). Each of these is $S/H$ times a Q-sized projection ( $2 S H n_q d$ ), so once $S \gg H$ the three variants’ prefill FLOPs converge; what truly separates them is how many cache bytes each decoded token must read, which is memory I/O, not compute.
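To see where the quadratic term takes over, the sketch below adds up the projection rows and the two $S^2$ rows of the FLOPs table for the MHA / MQA / GQA family. The function name is hypothetical and the counting conventions are the ones stated above (2 FLOPs per MAC, no causal discount).

```python
def prefill_flops(S, H, n_q, n_kv, d):
    """Return (projection FLOPs, attention S^2 FLOPs) for one layer."""
    proj = 2 * S * H * (n_q * d) * 2 + 2 * S * H * (n_kv * d) * 2  # Q and W_O, K and V
    attn = 2 * n_q * S * S * d * 2                                  # QK^T and AV
    return proj, attn

# Llama 3 8B shape: the two terms are equal at S = H * (n_q + n_kv) / n_q = 5120.
proj_x, attn_x = prefill_flops(5120, 4096, 32, 8, 128)
proj_short, attn_short = prefill_flops(512, 4096, 32, 8, 128)
proj_long, attn_long = prefill_flops(65536, 4096, 32, 8, 128)
```

For this shape the crossover sits at a sequence length on the order of $H$; at 512 tokens the projections dominate, at 64K tokens the $S^2$ terms do.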

Output Projection + Residual

\mathbf{h} = \mathbf{x} + [\text{head}^{(1)}; \ldots; \text{head}^{(n_q)}] \, W_O

$\mathbf{x} \in \mathbb{R}^{H}$ — attention sublayer input (the value before pre-norm)
$\text{head}^{(h)}$ — per-head attention output, computed from the RoPE-rotated $Q', K'$ and the unrotated $V$ ; $W_O$ is applied exactly once, here
$W_O \in \mathbb{R}^{n_q d \times H}$ — output projection that maps concatenated multi-head output back to $H$
$\mathbf{h} \in \mathbb{R}^{H}$ — attention sublayer output + residual

FFN Variants

Classic Two-Layer FFN (GPT-2)

\text{FFN}(\mathbf{x}) = \phi(\mathbf{x} W_1) W_2

$\mathbf{x} \in \mathbb{R}^{H}$ — FFN input
$W_1 \in \mathbb{R}^{H \times I}$ — up-projection
$W_2 \in \mathbb{R}^{I \times H}$ — down-projection
$\phi$ — element-wise scalar nonlinearity (see below)

Activation Functions

The scalar nonlinearity applied after the up-projection. Input and output are both scalars $x$ ; it’s applied independently per token, per hidden dim. Four functions cover all the mainstream choices used in the Transformer era.

ReLU (Nair & Hinton 2010)

\text{ReLU}(x) = \max(0, x)

Identity on $x > 0$ , zero on $x \le 0$
Cheapest to compute; but zero gradient on the negative side — the “dead neuron” problem
Used in the original Transformer and the original T5

Sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}

Compresses $\mathbb{R}$ into $(0, 1)$ , a natural “gate” signal — the original GLU’s $\phi$ is exactly this
Largely abandoned as a standalone FFN activation — saturates at both ends, killing gradients

GeLU (Gaussian Error Linear Unit, Hendrycks & Gimpel 2016)

\text{GeLU}(x) = x \cdot \Phi(x), \quad \Phi(x) = \tfrac{1}{2}\big(1 + \text{erf}(x/\sqrt{2})\big)

In practice, the OpenAI tanh approximation is used (numerical error $< 10^{-3}$ , avoids the erf call):

\text{GeLU}(x) \approx 0.5\, x \left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\, x^{3}\right)\right]\right)

$\Phi$ is the standard normal CDF — intuitively “let $x$ through weighted by its tail probability”
Everywhere differentiable, non-monotonic (a small negative dip on the $x < 0$ side), smoother than ReLU
Used in GPT-2/3, BERT, ViT

SiLU / Swish (Ramachandran et al. 2017)

\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}

Shape very close to GeLU (also smooth, non-monotonic, passes through origin) but with a simpler closed form — no $\text{erf}$ , no cubic term
Self-gated: uses its own sigmoid to control signal throughput
Used by PaLM and the entire Llama family (as the gating $\phi$ in SwiGLU)

Of these four, ReLU and sigmoid have largely exited mainstream FFNs; the active ones are GeLU (GPT era) and SiLU (Llama / PaLM and later). The two curves have nearly the same shape and reported quality differences between them are small, so the choice in a given paper is mostly path-dependent. The real engineering inflection point was switching the activation from “applied to the projection” (classic FFN) to “applied to the gate” (the GLU family below).
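The four activations, plus the tanh approximation of GeLU, fit in a few lines of standard-library Python. The grid used to measure the approximation error is an arbitrary choice for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GeLU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GeLU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    return x * sigmoid(x)

# Largest gap between exact and approximate GeLU on [-6, 6] in steps of 0.01.
grid = [i / 100 for i in range(-600, 601)]
max_err = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)
```

On this grid the approximation error stays below $10^{-3}$, and both GeLU and SiLU dip slightly below zero for negative inputs, unlike ReLU.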

GLU Family (used by Llama, PaLM, Mistral)

\text{GLU}(\mathbf{x}) = \big(\phi(\mathbf{x} W_{\text{gate}}) \odot (\mathbf{x} W_{\text{up}})\big) W_{\text{down}}

$\mathbf{x} \in \mathbb{R}^{H}$ — FFN input
$W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{H \times I}$ — two independent up-projections
$W_{\text{down}} \in \mathbb{R}^{I \times H}$ — down-projection
$\odot$ — element-wise multiplication
$\phi$ — gating activation; choice determines the variant name:

Variant	$\phi$	Representative model
GLU	$\sigma$	Dauphin et al. 2017 original
ReGLU	$\text{ReLU}$	—
GeGLU	$\text{GeLU}$	T5 v1.1
SwiGLU	$\text{SiLU}$	PaLM, Llama 1/2/3
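A minimal sketch of the SwiGLU row for a single token, with weights stored “in × out” as in the notation section. Helper names and the toy weights are illustrative only.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def vecmat(x, W):
    """x: [n_in], W: [n_in][n_out] (in x out). Returns x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """(SiLU(x W_gate) * (x W_up)) W_down for one token."""
    gate = [silu(v) for v in vecmat(x, W_gate)]
    up = vecmat(x, W_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return vecmat(hidden, W_down)

# Toy sizes: H = 2, I = 3.
x = [1.0, 0.0]
W_gate = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
W_up = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
W_down = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
```

Swapping `silu` for another scalar function gives the other rows of the table without touching the rest of the code.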

SwiGLU has one more projection than the classic GeLU-FFN (three matrices vs two); to align parameter budget, implementations typically set $I = \tfrac{2}{3} \cdot 4H$ ( $4H$ is the GPT-2 convention), and Llama 3 8B’s $I = 14336 \approx \tfrac{2}{3}\cdot 4\cdot 4096 \cdot 1.3$ .
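The value 14336 can be reproduced with a short sizing sketch. The 1.3 multiplier comes from the text above; rounding up to a multiple of 1024 is an assumption about how the Llama 3 8B configuration is set, and the function name is hypothetical.

```python
def llama_ffn_dim(H, ffn_dim_multiplier=1.3, multiple_of=1024):
    """SwiGLU intermediate size: 2/3 of 4H, scaled, then rounded up."""
    hidden = int(2 * (4 * H) / 3)                 # parameter-matched to a 4H two-matrix FFN
    hidden = int(ffn_dim_multiplier * hidden)     # model-specific widening
    # Round up to the next multiple (assumed value: 1024).
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

I_llama3_8b = llama_ffn_dim(4096)
```

With $H = 4096$ this gives $I = 14336$, matching the notation table.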

Mixture-of-Experts (MoE)

In a classic dense FFN, every token passes through the same $W_{\text{up}} / W_{\text{down}}$ pair — all parameters used, all FLOPs paid. MoE (Shazeer et al. 2017) replicates the FFN $N$ times (“experts”) and routes each token through only the top $k$ , so total parameters scale linearly while per-token activated params and FLOPs stay almost flat — capacity grows without paying compute.

Formally the FFN sublayer becomes:

\text{MoE}(\mathbf{x}) = \sum_{i \in \mathcal{T}_k(\mathbf{x})} g_i(\mathbf{x}) \cdot \text{FFN}_i(\mathbf{x})

The router (gating network) decides which experts each token visits:

\mathbf{s}(\mathbf{x}) = \text{softmax}(\mathbf{x} W_g), \quad \mathcal{T}_k(\mathbf{x}) = \text{TopK}(\mathbf{s}(\mathbf{x}), k)

g_i(\mathbf{x}) = \frac{s_i(\mathbf{x}) \cdot \mathbb{1}[i \in \mathcal{T}_k(\mathbf{x})]}{\sum_{j \in \mathcal{T}_k(\mathbf{x})} s_j(\mathbf{x})}

$\mathbf{x} \in \mathbb{R}^{H}$ — current token’s hidden state (FFN sublayer input)
$W_g \in \mathbb{R}^{H \times N}$ — router projection mapping the hidden state to $N$ expert logits
$\mathbf{s}(\mathbf{x}) \in \mathbb{R}^{N}$ — router probabilities across all experts
$\mathcal{T}_k(\mathbf{x})$ — index set of top- $k$ selected experts
$g_i(\mathbf{x})$ — combine weight; Mixtral and DeepSeek V3 re-normalize the top- $k$ scores so they sum to 1 (DeepSeek V3 computes $s_i$ with a sigmoid rather than a softmax)
$\text{FFN}_i$ — $i$ -th expert, typically a SwiGLU FFN with its own (unshared) weights
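The routing formulas translate into a short sketch: softmax over router logits, top- $k$ selection, re-normalization, weighted sum of expert outputs. The toy “experts” are plain scaling functions and all names are hypothetical; the router logits are passed in directly instead of being computed from $\mathbf{x} W_g$.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]

def route(logits, k):
    """Return {expert index: combine weight} for the top-k experts."""
    s = softmax(logits)
    top = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
    z = sum(s[i] for i in top)
    return {i: s[i] / z for i in top}

def moe(x, logits, experts, k):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for i, g in route(logits, k).items():
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out

# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
gates = route([2.0, 1.0, 0.0, -1.0], k=2)
y = moe([1.0], [2.0, 1.0, 0.0, -1.0], experts, k=2)
```

Only the two selected experts are ever called, which is where the per-token FLOP savings come from.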

MoE vs Dense FFN

Dimension	Dense FFN (SwiGLU)	MoE (top- $k$ of $N$ )
Params (FFN block)	$3 H I$	$3 H I \cdot N$ + $H N$ (router)
Per-token activated FLOPs	$6 H I$	$6 H I \cdot k$ + $2 H N$ (router)
Per-token weights read from HBM (decode)	$3 H I$ elements	$3 H I \cdot k$ elements
VRAM footprint (weights)	$3 H I$ elements	$3 H I \cdot N$ elements (all experts must fit)
Kernel shape	fixed GEMM	grouped GEMM / token permutation
Multi-GPU comm	—	all-to-all under expert parallelism
Training stability	straightforward	router prone to collapse, needs load-balance

Key trade-off: decoupling capacity from FLOPs. At the same activated parameter count (i.e. same FLOP budget), MoE packs several times to tens of times more total parameters (about 3.6× for Mixtral 8×7B at 47 B / 13 B, about 18× for DeepSeek V3 at 671 B / 37 B), so knowledge capacity grows without extra per-token compute. The costs are VRAM (must fit all experts), routing stability (avoiding hot experts), and inference batching (per-token paths differ).

Load Balance

A naive router collapses onto a few hot experts during training. GShard / Switch use an auxiliary loss as a soft constraint:

\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot \bar{s}_i

$f_i$ — fraction of tokens in the current batch routed to expert $i$
$\bar{s}_i$ — mean router probability assigned to expert $i$ within the batch
$\alpha$ — auxiliary loss weight (Switch uses $\sim 0.01$ )
Intuition: large $f_i$ AND large $\bar{s}_i$ means the expert is chosen often and with high confidence → penalized → gradient pushes that router logit back down
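A small sketch of the auxiliary loss for top-1 routing, comparing a balanced batch with a collapsed one. The function name and the toy probabilities are made up for illustration.

```python
def aux_loss(assignments, probs, n_experts, alpha=0.01):
    """assignments: chosen expert per token (top-1); probs: router probs per token."""
    n_tok = len(assignments)
    f = [sum(1 for a in assignments if a == i) / n_tok for i in range(n_experts)]
    s_bar = [sum(p[i] for p in probs) / n_tok for i in range(n_experts)]
    return alpha * n_experts * sum(fi * si for fi, si in zip(f, s_bar))

# Balanced: 4 tokens spread over 4 experts with uniform router probabilities.
balanced = aux_loss([0, 1, 2, 3], [[0.25] * 4] * 4, n_experts=4)

# Collapsed: every token goes to expert 0 with high confidence.
collapsed = aux_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, n_experts=4)
```

With perfectly uniform routing the loss equals $\alpha$, its minimum; the collapsed batch scores several times higher.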

DeepSeek V3 goes further with auxiliary-loss-free load balance: maintain a bias $b_i$ per expert, base TopK on $s_i + b_i$ ; raise $b_i$ for under-used experts and lower it for over-used ones — affecting selection without polluting gradients, avoiding the main-task accuracy hit that aux loss can cause.
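The bias mechanism can be sketched as two small functions: selection on $s_i + b_i$, and a bias update driven by observed load. The fixed `step` size and the comparison against mean load are illustrative assumptions, not a claim about DeepSeek's exact update rule.

```python
def select_with_bias(scores, bias, k):
    """Pick top-k experts by score + bias; the bias affects selection only."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]

def update_bias(bias, load, step=0.001):
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    mean = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l < mean:
            new_bias.append(b + step)
        elif l > mean:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.5, 0.4, 0.1]
picked_plain = select_with_bias(scores, [0.0, 0.0, 0.0], k=1)
picked_biased = select_with_bias(scores, [0.0, 0.2, 0.0], k=1)
next_bias = update_bias([0.0, 0.0], load=[10, 0], step=0.1)
```

Since the bias never enters the combine weights, it shifts which experts are chosen without adding a gradient term to the training loss.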

Modern Implementations

Model	Total / activated	$N$ routed	top- $k$	Shared	Routing notes
Switch Transformer (2021)	1.6 T / ~26 B	2048	1	—	First stably-trained large MoE; hard top-1 + capacity factor
GLaM (2022)	1.2 T / 97 B	64	2	—	Google decoder-only MoE; halves inference cost vs dense
Mixtral 8×7B (2023)	47 B / 13 B	8	2	—	Open-source “8 big experts” reference; per-layer routing
Mixtral 8×22B (2024)	141 B / 39 B	8	2	—	Scaled-up 8×7B
Qwen1.5-MoE-A2.7B (2024)	14 B / 2.7 B	60	4	4	Alibaba’s first fine-grained MoE
DeepSeek V2 (2024)	236 B / 21 B	160	6	2	Fine-grained + shared-expert paradigm established
DeepSeek V3 (2024)	671 B / 37 B	256	8	1	Aux-loss-free load balance; paired with MLA
Qwen3-MoE 235B-A22B (2025)	235 B / 22 B	128	8	—	DeepSeek-style fine-grained
Llama 4 Scout (2025)	109 B / 17 B	16	1	1	top-1 + 1 shared; extreme sparsity
Llama 4 Maverick (2025)	400 B / 17 B	128	1	1	Same idea, expert count pushed to 128
GPT-4 (rumored)	~1.8 T / ~280 B	16	2	—	Never officially disclosed; semiconductor-analyst reconstructions

Four patterns:

Coarse → fine. Mixtral-era designs are “few-and-fat” (8 experts each near a dense FFN’s size); DeepSeek onward shrinks each expert and pushes count past 100, going “many-and-thin”, so the same activated FLOPs now span combinatorially more expert combinations.
Shared experts become common. DeepSeek, Qwen1.5-MoE, and Llama 4 reserve a few (1–4) “everyone-must-visit” shared experts per layer to absorb generic patterns while the sparse experts handle specialization; Qwen3-MoE drops them.
MoE pairs with KV-cache compression. Pushing total params to hundreds of billions makes long-context decode equally pressured — it’s no coincidence DeepSeek V3 ships MLA + fine-grained MoE together.
top- $k$ goes to the extremes. Switch (k=1) → Mixtral (k=2) → DeepSeek V3 (k=8 with small experts) → Llama 4 (k=1 + shared). Small $k$ makes batching and capacity bounds tractable; fine-graining and shared experts make up the lost expressivity.

Residual

\mathbf{x}_{\text{out}} = \mathbf{h} + \text{FFN}(\text{RMSNorm}(\mathbf{h}))

$\mathbf{h}$ — attention sublayer output (already contains the first residual)
$\mathbf{x}_{\text{out}} \in \mathbb{R}^{H}$ — full Transformer layer output, fed into the next layer
RMSNorm sits inside the residual branch (pre-norm); the main path $\mathbf{h}$ goes straight through

LM Head + Sampling

\text{logits} = \mathbf{x}_{\text{final}} W_{\text{lm}}, \quad \mathbf{p} = \text{softmax}(\text{logits}/\tau)

$\mathbf{x}_{\text{final}} \in \mathbb{R}^{H}$ — hidden state after the final RMSNorm, at the last position
$W_{\text{lm}} \in \mathbb{R}^{H \times V}$ — output embedding matrix (optionally shared with $E$ — tied embedding)
$\text{logits} \in \mathbb{R}^{V}$ — unnormalized score per vocab token
$\tau > 0$ — temperature (written $\tau$ to avoid clashing with $T$ , the number of generated tokens); larger flattens the distribution ( $\tau \to \infty$ → uniform, $\tau \to 0$ → argmax)
$\mathbf{p} \in \mathbb{R}^{V}$ — normalized probability distribution

Several logits transformations are typically applied before sampling:

Repetition penalties (presence / frequency penalty shown; the multiplicative repetition penalty rescales the logit of seen tokens instead of subtracting from it):

\text{logits}'_v = \text{logits}_v - \alpha \cdot \mathbb{1}[v \in \text{history}] - \beta \cdot \text{count}(v)

$v$ — a token in the vocabulary
$\text{logits}_v$ — the raw logit for that token
$\alpha \ge 0$ — presence penalty (subtract once if ever seen)
$\beta \ge 0$ — frequency penalty (subtract linearly by occurrence count)
$\mathbb{1}[\cdot]$ — indicator function (1 if the condition holds, else 0)
$\text{count}(v)$ — number of times token $v$ has appeared in the generated sequence
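The additive penalty formula maps directly to code. In this sketch the vocabulary is just the index range of `logits`, and the function and parameter names are illustrative.

```python
from collections import Counter

def apply_penalties(logits, history, presence=0.0, frequency=0.0):
    """Subtract presence (once if seen) and frequency (per occurrence) penalties."""
    counts = Counter(history)
    return [
        l - presence * (1 if counts[v] > 0 else 0) - frequency * counts[v]
        for v, l in enumerate(logits)
    ]

# Token 0 was generated twice, token 2 once, token 1 never.
penalized = apply_penalties([1.0, 1.0, 1.0], history=[0, 0, 2],
                            presence=0.5, frequency=0.1)
```

Unseen tokens keep their raw logit; repeated tokens are pushed down further with each occurrence.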

Top-k: keep the largest $k$ logits, set the rest to $-\infty$ .

Top-p (nucleus): sort by probability descending, keep the smallest set whose cumulative probability $\ge p$ .

Min-p: keep tokens with $p_v \ge p_{\min} \cdot p_{\max}$ ; friendlier to low-entropy distributions.

Typical-p: truncate by deviation from the conditional entropy, keeping the set where $|-\log p_v - H(\mathbf{p})|$ is small.

All of these truncations only mask entries of the logits or of $\mathbf{p}$ and renormalize before the draw; the skeleton of the formula is unchanged.
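The temperature softmax and three of the truncation rules can be sketched as small filters that return the surviving tokens with renormalized probabilities. Function names are hypothetical, and typical-p is omitted for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    z = sum(e)
    return [v / z for v in e]

def renorm(probs, kept):
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renorm(probs, order[:k])

def top_p_filter(probs, p):
    """Smallest prefix of the sorted tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return renorm(probs, kept)

def min_p_filter(probs, p_min):
    threshold = p_min * max(probs)
    return renorm(probs, [i for i, v in enumerate(probs) if v >= threshold])

probs = [0.5, 0.3, 0.15, 0.05]
kept_k = top_k_filter(probs, 1)
kept_p = top_p_filter(probs, 0.7)
kept_min = min_p_filter(probs, 0.2)
flat = softmax([1.0, 2.0, 3.0], temperature=100.0)
```

Sampling then draws from the returned dictionary; the filters can be chained in whatever order a given engine chooses.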

Prefill Stage Shape Transitions — S Tokens Walk Through Once

Input: input_ids [B, S] = [2, 10]. The figure below shows the forward pass of one Transformer layer, wrapped in 32 layers; the shape stays at [B, S, H] = [2, 10, 4096] throughout, with residual edges shown as orange dashed lines.

Prefill — shape transitions and core formulas of a single Transformer layer. Orange dashed lines are pre-norm residuals; ⊕ marks residual merges; the left bracket marks ”× 32 layers.”

After prefill, the KV Cache state: each layer holds the first 10 positions.

Decode Stage Shape Transitions — Step t Processes Only 1 Token

Prior state: $\text{cache\_len} = S + t - 1$ positions filled.

Input: input_ids [B, 1] = [2, 1] (the 1 token generated at the previous step).

Decode — step t processes only 1 token, with core formulas; the KV Cache keeps growing; the blue dashed line shows the generated next_token feeding back as the next step’s input.

Prefill vs Decode Shape Comparison — GEMM vs GEMV · Compute vs Bandwidth

Position	Prefill	Decode (per step)
input_ids	$[B, S]$	$[B, 1]$
after embedding	$[B, S, H]$	$[B, 1, H]$
Q	$[B, n_q, S, d]$	$[B, n_q, 1, d]$
K_new / V_new	$[B, n_{kv}, S, d]$	$[B, n_{kv}, 1, d]$
K_full / V_full (from cache)	same as K_new	$[B, n_{kv}, \text{cache\_len}, d]$
attention scores	$[B, n_q, S, S]$	$[B, n_q, 1, \text{cache\_len}]$
attention output	$[B, n_q, S, d]$	$[B, n_q, 1, d]$
FFN intermediate	$[B, S, I]$	$[B, 1, I]$
logits	$[B, V]$ (last position)	$[B, V]$
Operation type	GEMM (matrix × matrix)	GEMV (matrix × vector)
Bottleneck	compute	memory bandwidth

This table is the starting point for understanding every inference acceleration effort: prefill is like training’s forward, compute-bound; decode is a chain of GEMVs, memory-bound, with most time spent fetching weights into SMs. The optimization directions of the two are worlds apart.
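The compute-bound versus memory-bound split shows up in a back-of-the-envelope arithmetic-intensity estimate for one linear layer. The sketch assumes fp16 and counts input, weight, and output each moved once, which ignores caching and fusion.

```python
def linear_intensity(tokens, n_in, n_out, elem_bytes=2):
    """FLOPs per byte for a [tokens, n_in] x [n_in, n_out] matmul."""
    flops = 2 * tokens * n_in * n_out
    bytes_moved = elem_bytes * (tokens * n_in + n_in * n_out + tokens * n_out)
    return flops / bytes_moved

# One H x H projection with H = 4096.
decode = linear_intensity(tokens=1, n_in=4096, n_out=4096)      # about 1 FLOP/byte
prefill = linear_intensity(tokens=2048, n_in=4096, n_out=4096)  # about 1000 FLOPs/byte
```

At one token per step the weight read dominates and intensity sits near 1 FLOP/byte; at 2048 tokens the same weights are reused across the whole sequence and intensity rises by three orders of magnitude.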

KV Cache Shape and Growth — Few KB Per Token · Hundreds of MB at Long Context

One pair of caches per layer:

K_{\text{cache}}, V_{\text{cache}} \in \mathbb{R}^{B \times n_{kv} \times S_{\max} \times d}

$S_{\max}$ — pre-allocated max sequence length (typically the model’s context cap or the scheduler’s limit)
Other symbols follow the top-of-article convention ( $B, n_{kv}, d$ ); the 4 dimensions follow PyTorch’s [B, head, seq, head_dim] ordering

Per-token, per-layer cache size (fp16):

2 \times n_{kv} \times d \times 2\ \text{bytes} = 2 \times 8 \times 128 \times 2 = 4\ \text{KB}

Leftmost $2$ — one for K, one for V
$2\ \text{bytes}$ — fp16 element size (fp8 / int8 halves this; 4-bit formats quarter it)
Right side plugs in Llama 3 8B: $n_{kv} = 8$ , $d = 128$

Per-token across the full 32-layer model: $4\ \text{KB} \times 32 = 128\ \text{KB} / \text{token}$ . A 4096-token request: $128\ \text{KB} \times 4096 \approx 512\ \text{MB}$ .
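
The arithmetic above is easy to parameterize. A small sketch, assuming the article's "KB" and "MB" are binary units (1024-based), which is what makes the figures come out exact; the function name is illustrative.

```python
def kv_bytes_per_token(n_layers=32, n_kv=8, d=128, bytes_per_elem=2):
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * n_kv * d * bytes_per_elem * n_layers

per_token = kv_bytes_per_token()
per_request = per_token * 4096

print(per_token / 1024, "KiB per token")
print(per_request / 2**20, "MiB for a 4096-token request")
print(kv_bytes_per_token(n_kv=32) / per_token, "x larger with MHA (n_kv = n_q = 32)")
print(kv_bytes_per_token(bytes_per_elem=1) / per_token, "x with an 8-bit cache")
```

Changing `n_kv` shows directly how much GQA saves relative to MHA for the same head dimension.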

Several engineering optimizations:

Paged Attention (vLLM): split the cache into fixed-size blocks (typically 16 tokens), with a block table mapping virtual to physical addresses, eliminating fragmentation. The formulas don’t change; only the tensor layout and access pattern change.
Sliding Window Attention (Mistral): keep only the most recent $W$ tokens of K and V. Cache cap drops from $S_{\max}$ to $W$ , at the cost of information truncation, with long-range dependencies relayed through cross-layer stacking.
INT8 / FP8 KV Cache: quantize the fp16 cache to 8-bit (int8 or fp8), or further to 4-bit and below, with per-channel or per-token scales; the footprint drops to 1/2 at 8-bit and 1/4 at 4-bit, with controllable error. Representative work: KIVI (2-bit) / KVQuant.
KV compression / eviction (H2O, StreamingLLM, SnapKV): drop unimportant positions based on attention weights; used at very long context lengths.
MLA: mentioned earlier — modifies cache shape at the model-structure level, not as a postprocess.
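
To make the Paged Attention item concrete, here is a toy logical-to-physical block table. It is a sketch of the idea only: the class `BlockTable` and its methods are hypothetical and are not vLLM's actual data structures or API.

```python
BLOCK_SIZE = 16  # tokens per block, the typical value quoted above

class BlockTable:
    """Toy per-request mapping from logical block index to physical block id."""

    def __init__(self, free_blocks):
        self.free = free_blocks   # shared pool of physical block ids
        self.blocks = []          # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free.pop())
        self.num_tokens += 1

    def locate(self, pos):
        """Physical (block id, offset) holding the K/V of token position `pos`."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(8))          # 8 physical blocks shared by all requests
req = BlockTable(pool)
for _ in range(35):
    req.append_token()

wasted_slots = len(req.blocks) * BLOCK_SIZE - req.num_tokens
print(req.blocks, req.locate(20), wasted_slots)
```

The waste per request is bounded by one partially filled block, instead of the gap between the actual length and a pre-allocated $S_{\max}$ .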

Per-Step Compute and Memory Cost — Llama 3 8B fp16 · H100 Knee ~330 FLOPs/byte

The shape diagrams above show shapes but not magnitudes. 90% of inference optimization discussion is about “how many FLOPs does this step cost, how many bytes move,” so let’s lay each step’s cost into tables directly.

Using Llama 3 8B, fp16, $B=1$ as the baseline; for prefill take $S=2048$ ; for decode take $\text{cache\_len}=2048$ (some step during generation around the 2048th token).

Reference hardware knee: H100 SXM dense fp16 tensor-core peak ~989 TFLOP/s, HBM bandwidth ~3 TB/s (3.35 TB/s on the spec sheet), so the roofline knee is $\text{AI}^{*} \approx 989 / 3 \approx 330\ \text{FLOPs/byte}$ . Above it a kernel is compute-bound; below it, memory-bound.

Weight Distribution

Component	Shape	fp16 size	full model (× 32 layers)
Embedding $E$	$[V, H]$	1.0 GB	1.0 GB
$W_Q$	$[H, H]$	32 MB	1.0 GB
$W_K$	$[H, n_{kv}d]$	8 MB	256 MB
$W_V$	$[H, n_{kv}d]$	8 MB	256 MB
$W_O$	$[H, H]$	32 MB	1.0 GB
$W_{\text{gate}}$	$[H, I]$	117 MB	3.7 GB
$W_{\text{up}}$	$[H, I]$	117 MB	3.7 GB
$W_{\text{down}}$	$[I, H]$	117 MB	3.7 GB
RMSNorm $\boldsymbol{\gamma}$ (2 per layer)	$[H]\times 2$	16 KB	512 KB
LM head $W_{\text{lm}}$	$[H, V]$	1.0 GB	1.0 GB
Total		~432 MB / layer	~16 GB

Full-model fp16 weights are ~16 GB; the “floor price” of every forward pass is to scan these 16 GB from HBM. At H100’s ~3 TB/s that takes $16\ \text{GB} / 3\ \text{TB/s} \approx 5.3\ \text{ms}$ , which is the physical lower bound on one single-request decode step (roughly 190 tokens/s at best).
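
The weight table can be recomputed from the shapes alone. A minimal sketch, assuming untied embedding and LM head (as stated earlier for Llama) and the round 3 TB/s bandwidth used in the text:

```python
H, V, I, n_kv, d, L = 4096, 128256, 14336, 8, 128, 32
BYTES = 2  # fp16

# Q and O projections, K and V projections, three FFN matrices, two RMSNorm gains.
per_layer_params = 2 * H * H + 2 * H * n_kv * d + 3 * H * I + 2 * H
# Plus embedding, LM head, and the final RMSNorm gain.
total_params = L * per_layer_params + 2 * V * H + H
weight_bytes = total_params * BYTES

hbm_bytes_per_s = 3e12   # assumption: the ~3 TB/s round figure from the text
floor_ms = weight_bytes / hbm_bytes_per_s * 1e3

print(f"{total_params / 1e9:.2f} B params, {weight_bytes / 1e9:.1f} GB, "
      f"{floor_ms:.2f} ms per full weight scan")
```

A decode step reads only $B$ rows of the embedding table rather than all of it, so the true floor is slightly lower than a full scan; the estimate is meant as an order-of-magnitude check.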

Per-Layer, Per-Step Compute / Memory I/O

Compare the same layer’s substeps under prefill ( $N=S$ ) and decode ( $N=1$ ). “Weight HBM” is the weight bytes fetched from VRAM; “KV HBM” is the KV-cache bytes read/written. Intermediate activations are assumed fused into kernels and not counted separately.

Step	Prefill FLOPs (N=S=2048)	Decode FLOPs (N=1)	Weight HBM	KV HBM
RMSNorm	$5BSH$ ≈ 42 MF	20 KF	$\boldsymbol{\gamma}$ 8 KB	—
$Q_{\text{proj}}$	$2BSH^{2}$ ≈ 68.7 GF	33.5 MF	$W_Q$ 32 MB	—
$K_{\text{proj}}$ (+write cache)	$2BSH \cdot n_{kv}d$ ≈ 17.2 GF	8.4 MF	$W_K$ 8 MB	W 4 MB / 2 KB
$V_{\text{proj}}$ (+write cache)	17.2 GF	8.4 MF	$W_V$ 8 MB	W 4 MB / 2 KB
RoPE	~50 MF	25 KF	—	—
Attn $QK^{\top}$	$2B n_q N S_k d$ ( $S_k$ = number of key positions) ≈ 34.4 GF	16.8 MF	—	R 4 MB (decode)
softmax	~700 MF	~340 KF	—	—
Attn $\cdot V$	34.4 GF	16.8 MF	—	R 4 MB (decode)
$W_O$	$2BSH^{2}$ ≈ 68.7 GF	33.5 MF	$W_O$ 32 MB	—
RMSNorm	42 MF	20 KF	$\boldsymbol{\gamma}$ 8 KB	—
$W_{\text{gate}}$	$2BSHI$ ≈ 241 GF	117 MF	$W_{\text{gate}}$ 117 MB	—
$W_{\text{up}}$	241 GF	117 MF	$W_{\text{up}}$ 117 MB	—
SiLU + gate	~90 MF	45 KF	—	—
$W_{\text{down}}$	$2BSIH$ ≈ 241 GF	117 MF	$W_{\text{down}}$ 117 MB	—
Per-layer total	~960 GFLOPs	~470 MFLOPs	~432 MB	W 8 MB (P) / R 8 MB (D)

A few direct conclusions:

FFN is the real protagonist. $W_{\text{gate}} + W_{\text{up}} + W_{\text{down}}$ consume ~75% of FLOPs and ~80% of weight bandwidth. MoE, sparse activation, and FFN quantization all target this block.
The 4 attention projections (Q/K/V/O) account for ~18%; the actual $QK^{\top}$ and $\cdot V$ only ~7%. At $S=2048$ the attention core is not the prefill bottleneck; the weight matmuls (projections plus FFN) are.
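Decode’s KV reads are 8 MB per layer; at $\text{cache\_len}=2048$ this is only ~2% of weight reads. They grow linearly with context: 32× at 64K (~256 MB per layer) and 64× at 128K (~512 MB per layer), which exceeds the ~432 MB of weights, and they also scale with $B$ while weight reads do not. KV traffic then becomes the new bottleneck (this is why GQA / MLA, sliding window, and KV quantization exist; Paged Attention tackles the capacity side).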
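
The percentages in these conclusions follow from the matmul rows of the table. The sketch below recomputes them and also estimates the context length at which per-layer KV reads equal per-layer weight reads at $B=1$ ; it ignores the small elementwise terms (norms, RoPE, softmax), and all names are illustrative.

```python
B, S, H, I, n_q, n_kv, d = 1, 2048, 4096, 14336, 32, 8, 128

def layer_flops(N, span):
    """Matmul FLOPs of one layer for N new tokens attending over `span` positions."""
    proj = 2 * B * N * H * H * 2 + 2 * B * N * H * n_kv * d * 2   # Q, O and K, V
    attn = 2 * B * n_q * N * span * d * 2                          # QK^T and .V
    ffn = 2 * B * N * H * I * 3                                    # gate, up, down
    return {"proj": proj, "attn": attn, "ffn": ffn}

f = layer_flops(N=S, span=S)
total = sum(f.values())
shares = {name: flops / total for name, flops in f.items()}
print({name: f"{share:.1%}" for name, share in shares.items()})

# Decode at B = 1: bytes of weights vs bytes of KV cache read per layer (fp16).
weight_bytes = 2 * (2 * H * H + 2 * H * n_kv * d + 3 * H * I)
kv_bytes_per_pos = 2 * n_kv * d * 2          # K and V, fp16, one position
crossover = weight_bytes // kv_bytes_per_pos
print("KV reads match weight reads at cache_len =", crossover)
```

With a batch of $B$ requests the crossover context shrinks by roughly a factor of $B$ , since weights are read once per step but each request reads its own cache.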

One Full Forward Pass

Adding 32 layers + embedding + LM head:

Stage	FLOPs	HBM I/O	Arithmetic Intensity	Bottleneck
Prefill S=2048, B=1	~31 TFLOPs	~14 GB (weights) + 256 MB (KV write)	~2200 FLOPs/byte	compute
Decode step, cache_len=2048, B=1	~15 GFLOPs	~14 GB (weights) + 256 MB (KV read)	~1.05 FLOPs/byte	bandwidth
LM head (prefill, last position only)	~1 GFLOP	1 GB	~1 FLOPs/byte	bandwidth
LM head (decode)	~1 GFLOP	1 GB	~1 FLOPs/byte	bandwidth

Decode’s 1.05 FLOPs/byte is 2.5 orders of magnitude below H100’s knee of 330, meaning ideal single-request decode compute utilization is only $1.05/330 \approx 0.3\%$ . This is the mathematical basis for continuous batching: push $B$ to 32 so the same weight read is amortized across 32 requests. The arithmetic intensity of the weight matmuls scales by 32× (KV reads are per-request and do not amortize), and decode throughput grows almost linearly until the attention portion or compute itself becomes the wall.
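
How far batching can push decode intensity can be sketched from the table's round numbers. The model below is a simplification under stated assumptions: weights are read once per step, each request reads its own 256 MB of KV cache, and FLOPs scale linearly with batch size.

```python
WEIGHT_BYTES = 14e9     # per-step weight read, from the table above
KV_BYTES = 256e6        # per-request KV read at cache_len = 2048
FLOPS_PER_REQ = 15e9    # per-request decode FLOPs
KNEE = 330.0            # H100 roofline knee used in the text

def decode_intensity(batch):
    """FLOPs per byte for one decode step over `batch` requests."""
    flops = FLOPS_PER_REQ * batch
    bytes_moved = WEIGHT_BYTES + KV_BYTES * batch   # weights shared, KV per request
    return flops / bytes_moved

for b in (1, 8, 32, 128, 512):
    print(f"B={b:4d}  intensity={decode_intensity(b):6.1f} FLOPs/byte")

ceiling = FLOPS_PER_REQ / KV_BYTES   # limit as B grows without bound
print(f"ceiling set by KV reads: {ceiling:.1f} FLOPs/byte (knee is {KNEE:.0f})")
```

Under these assumptions the intensity saturates well below the knee at this context length, which is one way to see "the attention portion becomes the wall".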

Mental-Math Rules

Two rules cover 90% of inference performance estimation:

FLOPs ≈ $2 P N$ : $P$ is the parameter count (~8B); $N$ is the total number of tokens this forward pass processes. Each parameter is used once per token (one MAC = 2 FLOPs). E.g., prefill $S=2048$ : $2 \times 8\text{B} \times 2048 \approx 33\ \text{TFLOPs}$ , matching the itemized sum of 31 TFLOPs.
Weight HBM I/O ≈ $2 P$ bytes (fp16): one forward pass scans the model once, about 16 GB.

Arithmetic intensity is essentially $\frac{2 P N}{2 P} = N$ — the total number of tokens participating in this forward. Prefill has $S \cdot B$ tokens; decode has only $B$ . This single number directly determines why prefill and decode have different bottlenecks.
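
Both rules fit in a few lines. A rough sketch, assuming $P \approx 8.03\text{B}$ and fp16 weights; it deliberately ignores attention FLOPs and KV traffic, as the rules themselves do.

```python
P = 8.03e9            # parameter count (assumed value for Llama 3 8B)
BYTES_PER_PARAM = 2   # fp16

def forward_estimate(n_tokens):
    """Return (FLOPs, weight bytes read, arithmetic intensity) for one forward pass."""
    flops = 2 * P * n_tokens
    weight_bytes = BYTES_PER_PARAM * P
    return flops, weight_bytes, flops / weight_bytes

prefill = forward_estimate(2048)   # S * B tokens
decode = forward_estimate(1)       # B tokens

print(f"prefill: {prefill[0] / 1e12:.1f} TFLOPs, intensity {prefill[2]:.0f}")
print(f"decode:  {decode[0] / 1e9:.1f} GFLOPs, intensity {decode[2]:.0f}")
```

The intensity equals the token count only because fp16 uses 2 bytes per parameter; with 8-bit weights it would be $2N$ .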

Compute Complexity Overview — With vs Without KV Cache Differ by Three Orders of Magnitude

Prefill (process $S$ tokens at once):

\text{FLOPs} \sim \underbrace{O(L \cdot S \cdot H^{2})}_{\text{linear layers}} + \underbrace{O(L \cdot S^{2} \cdot H)}_{\text{attention}}

$L \cdot S \cdot H^{2}$ — per layer, 4 $H \times H$ projections + 3 $H \times I$ FFN projections (with $I \sim 4H$ ), applied to $S$ tokens
$L \cdot S^{2} \cdot H$ — attention’s $QK^{\top}$ and $\cdot V$ , with the $S \times S$ score matrix
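In big-O terms linear layers dominate at short sequences and the attention quadratic catches up once $S \gtrsim H$ ; with Llama 3 8B’s constants (counting the full $S \times S$ score matrix, as in the cost tables) the two terms break even around $S \approx 6.5H \approx 27\text{K}$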

Decode per step (process 1 token, history $\text{cache\_len}$ ):

\text{FLOPs} \sim \underbrace{O(L \cdot H^{2})}_{\text{linear layers, constant}} + \underbrace{O(L \cdot \text{cache\_len} \cdot H)}_{\text{attention, linear in cache}}

$\text{cache\_len}$ — number of currently cached positions ( $= S + t - 1$ )
Single step processes 1 new token, so linear layers’ $S$ becomes $1$ ; attention still scans the full cache and grows linearly with it

Total complexity to generate $T$ tokens:

\text{FLOPs}_{\text{total}} \sim O\!\left(L \cdot T \cdot H^{2} + L \cdot T \cdot (S + T) \cdot H\right)

$T$ — number of generated tokens (top-of-article convention)
$(S + T)$ — upper bound on the attention span (prompt + generated segment); the average span over the $T$ steps is about $S + T/2$
First term sums linear layers across $T$ decode steps; second term approximates the attention portion summed over $t = 1, \ldots, T$ (exact form is $\sum_t (S + t)$ )

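Without KV cache, step $t$ recomputes all $S + t$ positions, so the total becomes $O\!\left(L \cdot T \cdot (S+T) \cdot H^{2} + L \cdot T \cdot (S+T)^{2} \cdot H\right)$ : roughly $(S+T)$ times more work, i.e. about three orders of magnitude at a few thousand tokens of context.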
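
The gap can be checked numerically with a unit-constant cost model (one unit per $H^{2}$ of linear work and per $\text{span} \cdot H$ of attention work). This is a sketch of the asymptotics only, not a FLOP-accurate count, and the function names are illustrative.

```python
def with_cache(L, H, S, T):
    """Decode cost over T steps: 1 new token per step, attention over S + t positions."""
    return sum(L * (H * H + (S + t) * H) for t in range(1, T + 1))

def without_cache(L, H, S, T):
    """Same generation, but every step re-runs all S + t positions from scratch."""
    total = 0
    for t in range(1, T + 1):
        n = S + t
        total += L * (n * H * H + n * n * H)
    return total

L, H, S, T = 32, 4096, 2048, 100
ratio = without_cache(L, H, S, T) / with_cache(L, H, S, T)
print(f"recomputing without a cache costs about {ratio:.0f}x more")
```

In this model the per-step ratio is exactly $S + t$ , so the overall ratio always lands between $S + 1$ and $S + T$ .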

Reading FLOPs and bandwidth together is even more illuminating, and that’s what the previous section’s table shows: prefill challenges the compute ceiling; decode challenges the bandwidth ceiling; continuous batching’s point is to fuse $B$ requests’ decode GEMVs into one skinny GEMM, amortizing the weight-fetch cost across $B$ requests, with throughput rising almost linearly until compute or the attention portion becomes the bottleneck.

How Engineering Optimizations Plug into the Formulas — Flash Attn / Spec Decode / Continuous Batch

Flash Attention: mathematically equivalent to standard attention — the formulas don’t change a single character. Engineering-wise it fuses softmax and matmul into one kernel, updating softmax’s running statistics (max, sum) in a streaming manner over blocks, avoiding writing the $S \times S$ attention matrix back to HBM. Complexity unchanged; memory drops from $O(S^{2})$ to $O(S)$ ; speedup comes mainly from reduced HBM access. FA-2 shifted the partition granularity from heads to query blocks; FA-3 on H100/H200 adds warpgroup MMA + producer-consumer async pipelining.
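
The streaming update of the running (max, sum) statistics can be illustrated for a single query row in plain Python. This is a numerical sketch of the online-softmax idea only; it says nothing about the real kernel's tiling or memory hierarchy, and the function names and block size are illustrative.

```python
import math
import random

def attention_row(q, K, V):
    """Reference: materialize all scores, softmax, then weighted sum of V."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def attention_row_streaming(q, K, V, block=2):
    """Same result, visiting K/V block by block and keeping only (max, sum, acc)."""
    scale = 1.0 / math.sqrt(len(q))
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        corr = math.exp(m - m_new)            # rescale what was accumulated so far
        w = [math.exp(x - m_new) for x in s]
        z = z * corr + sum(w)
        acc = [a * corr + sum(wi * v[j] for wi, v in zip(w, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]

rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
K = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
V = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

ref = attention_row(q, K, V)
out = attention_row_streaming(q, K, V)
print(max(abs(a - b) for a, b in zip(ref, out)))
```

The streaming version never holds more than one block of scores at a time, which is the property that removes the $O(S^{2})$ intermediate.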

Flash Decoding: at decode, Flash Attention’s Q has only one row, so kernel parallelism is too low. Flash Decoding splits the $\text{cache\_len}$ dimension of K and V into chunks for parallelism and then does a final log-sum-exp reduction. The formula is the same softmax, just split into two passes.
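
The two passes can be sketched as follows: each chunk of the cache produces a partial output plus its log-sum-exp, and a second pass reweights the partial outputs. A plain-Python illustration with hypothetical helper names; in the real kernel the chunks run in parallel on different thread blocks.

```python
import math
import random

def partial_attention(q, K, V):
    """Pass 1 for one chunk: (log-sum-exp of scores, softmax-weighted V in the chunk)."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    out = [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]
    return m + math.log(z), out

def merge(partials):
    """Pass 2: weight each chunk's output by exp(lse_chunk - lse_total)."""
    lses = [lse for lse, _ in partials]
    m = max(lses)
    total = m + math.log(sum(math.exp(l - m) for l in lses))
    dim = len(partials[0][1])
    return [sum(math.exp(lse - total) * out[j] for lse, out in partials)
            for j in range(dim)]

rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
K = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
V = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]

chunk = 4   # split cache_len = 10 into chunks of 4, 4, 2
partials = [partial_attention(q, K[i:i + chunk], V[i:i + chunk])
            for i in range(0, len(K), chunk)]
merged = merge(partials)
_, full = partial_attention(q, K, V)   # one chunk covering everything
print(max(abs(a - b) for a, b in zip(merged, full)))
```

Because each chunk needs only its own slice of K and V, the work over $\text{cache\_len}$ parallelizes even though Q has a single row.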

Speculative Decoding: a small “draft” model generates $k$ tokens sequentially, then the large model verifies them with one prefill over the $k$ positions. Acceptance rule:

\text{accept with prob } \min\!\left(1, \frac{p_{\text{target}}(x)}{p_{\text{draft}}(x)}\right)

$x$ — a candidate token produced by the draft model
$p_{\text{target}}(x)$ — probability the large (target) model assigns to $x$ at that position
$p_{\text{draft}}(x)$ — probability the small (draft) model assigns to $x$ at the same position
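Combined with “on rejection, resample from the normalized residual $\max(0, p_{\text{target}} - p_{\text{draft}}) / \sum_{x'} \max(0, p_{\text{target}}(x') - p_{\text{draft}}(x'))$ ”, this rule provably yields the same sampling distribution as direct target-model decoding, so there is zero quality loss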

The crux is fusing $k$ decode GEMVs into one $k$ -row GEMM: the large model reads its weights once for $k$ positions, so arithmetic intensity rises about $k$ × and the memory-bound step is amortized over several tokens (for typical small $k$ it is still below the roofline knee). With expected $\bar k$ accepted tokens per step, throughput scales by $\bar k$ (minus draft overhead). Variants: Medusa (multi-head prediction), EAGLE (feature-level draft), Lookahead Decoding (no draft model).
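
The "same distribution" claim can be verified exactly for a single position by computing, in closed form, the distribution of the token that one accept/reject round emits. A minimal sketch with made-up probability vectors; the function name is illustrative.

```python
def speculative_output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted by one accept-or-resample round."""
    # Draft proposes x with prob q(x); it is accepted with prob min(1, p(x) / q(x)).
    accept = [q * min(1.0, p / q) if q > 0 else 0.0
              for p, q in zip(p_target, p_draft)]
    reject_prob = 1.0 - sum(accept)
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    norm = sum(residual)
    if norm == 0.0:                 # draft equals target, so nothing is ever rejected
        return accept
    # On rejection, resample from the normalized residual distribution.
    return [a + reject_prob * r / norm for a, r in zip(accept, residual)]

p_target = [0.5, 0.3, 0.15, 0.05]     # made-up target probabilities
p_draft = [0.25, 0.25, 0.25, 0.25]    # made-up draft probabilities

out = speculative_output_distribution(p_target, p_draft)
acceptance_rate = sum(min(p, q) for p, q in zip(p_target, p_draft))
print(out, acceptance_rate)
```

The acceptance rate equals $\sum_x \min(p_{\text{target}}(x), p_{\text{draft}}(x))$ , so the closer the draft is to the target, the larger $\bar k$ becomes.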

Continuous Batching (vLLM, TGI): instead of padding requests so a whole batch starts and finishes together, schedule at the granularity of a single step (iteration-level scheduling). Each step picks a batch of requests (in the simplest schedulers, all in the same phase, prefill or decode), releases finished ones, and admits waiting ones. Mathematically each request is independent; only the ordering changes. Original paper: OSDI’22’s Orca.

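Chunked Prefill: split long prompts’ prefill into chunks and mix them with decode requests in the same step, reducing decode latency jitter. No formula changes. The core scheduling primitive of SARATHI; DistServe attacks the same prefill/decode interference from the opposite direction, by disaggregating the two phases onto separate GPUs.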
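
A toy scheduler shows how chunked prefill shares a step with decodes under a fixed token budget. This is a sketch under simplifying assumptions (one token per decode request, first-come prefill order); `schedule_step` and its arguments are hypothetical and do not correspond to the SARATHI or vLLM implementations.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Build one step's batch as a list of (request id, tokens processed).

    decoding: request ids currently in decode, each consuming 1 token.
    prefilling: dict of request id -> prompt tokens still to prefill (mutated).
    """
    batch = [(rid, 1) for rid in decoding[:token_budget]]
    budget = token_budget - len(batch)
    for rid in list(prefilling):
        if budget == 0:
            break
        chunk = min(prefilling[rid], budget)   # fill leftover budget with prompt tokens
        batch.append((rid, chunk))
        budget -= chunk
        prefilling[rid] -= chunk
        if prefilling[rid] == 0:
            del prefilling[rid]                # prompt done; caller moves it to decoding
    return batch

decoding = ["a", "b", "c"]
prefilling = {"long": 1000}
steps = []
while prefilling:
    steps.append(schedule_step(decoding, prefilling, token_budget=256))

for i, batch in enumerate(steps):
    print(i, batch)
```

The three decode requests advance on every step instead of stalling behind one 1000-token prefill, which is the latency-jitter reduction described above.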

One Sentence Spanning the Whole Process — From Token ID to Next Token

Input token IDs → look up embedding → through $L$ layers (Pre-RMSNorm → Attention with RoPE → residual → Pre-RMSNorm → SwiGLU FFN → residual) → Final RMSNorm → LM Head → logits → sampling. In prefill, $S$ tokens pass in parallel, producing the first token + full KV Cache; in decode, each step inputs 1 token; at attention it reads historical K and V from the cache, while all other operations are per-token independent.

Key Engineering Invariants — 6 Rules Worth Memorizing

The main-line tensor shape is always $[B, S_{\text{current}}, H]$ . Residual structure preserves the dimension; whenever $H$ appears different somewhere, either it’s spread into heads inside attention, or lifted to $I$ inside FFN, and back to $H$ on exit.
K and V, once computed, never change: they’re linear projections $W_K, W_V$ applied to the already-fixed input $\mathbf{x}$ , and the causal structure ensures later positions cannot reach back to modify earlier representations. This is the mathematical basis for KV Cache.
Attention is the only cross-token operation; all others (norm, projection, FFN, activation) are per-token independent. So only K and V — the inputs to cross-token operations — need caching; everything else can be computed and discarded immediately.
Decode’s scores shape is $[B, n_q, 1, \text{cache\_len}]$ . The “1” is the Q side (the current new token), and the $\text{cache\_len}$ dim is eliminated when weighted-summing with $V$ , returning to one row.
Decode’s non-attention compute per step is constant; only attention grows linearly with cache length. So the true reason “generation gets slower over time” is that attention’s $\text{cache\_len}$ keeps growing, and with it the KV Cache bytes that must be read from HBM every step.
Prefill uses GEMM; decode uses GEMV. This one-letter difference dictates that every inference engine has two kernel sets, two scheduling strategies. Internalize this and no inference-optimization paper will lose you.
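
The second and third invariants can be checked on a toy example: appending to a cache gives the same attention output as re-projecting every position from scratch. The sketch uses one layer, one head, random weights, and no RoPE or normalization, so it illustrates the caching argument rather than a real model; all names are illustrative.

```python
import math
import random

rng = random.Random(0)
Hd = 4   # toy hidden size, single head
W_q, W_k, W_v = ([[rng.uniform(-1, 1) for _ in range(Hd)] for _ in range(Hd)]
                 for _ in range(3))

def matvec(x, W):
    """x [in] times W [in, out] -> [out], matching the article's in x out convention."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, K, V):
    s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(s)
    w = [math.exp(v - m) for v in s]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def full_recompute(xs):
    """No cache: project every position again, return the last position's output."""
    K = [matvec(x, W_k) for x in xs]
    V = [matvec(x, W_v) for x in xs]
    return attend(matvec(xs[-1], W_q), K, V)

xs = [[rng.uniform(-1, 1) for _ in range(Hd)] for _ in range(6)]
K_cache, V_cache, outputs = [], [], []
for x in xs:                          # one decode step per token
    K_cache.append(matvec(x, W_k))    # append only; earlier rows are never touched
    V_cache.append(matvec(x, W_v))
    outputs.append(attend(matvec(x, W_q), K_cache, V_cache))

worst = max(abs(a - b)
            for t in range(len(xs))
            for a, b in zip(outputs[t], full_recompute(xs[:t + 1])))
print("max difference vs full recompute:", worst)
```

In a multi-layer model the same argument applies layer by layer, because causal masking keeps each earlier position's layer input unchanged when new tokens arrive.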

Internalize these six and you’ll find that Flash Attention, PagedAttention, MLA, and speculative decoding are all local optimizations at some spot on this skeleton, while the skeleton itself has barely changed since the 2017 Transformer paper.

References — Formulas · Papers · Engineering Blogs

Architecture and Core Operators

Vaswani et al., “Attention Is All You Need” (NeurIPS 2017) — the original Transformer paper. arxiv.org/abs/1706.03762
Shazeer, “GLU Variants Improve Transformer” (2020) — source of SwiGLU / GeGLU / ReGLU. arxiv.org/abs/2002.05202
Zhang & Sennrich, “Root Mean Square Layer Normalization” (NeurIPS 2019) — RMSNorm. arxiv.org/abs/1910.07467
Hendrycks & Gimpel, “Gaussian Error Linear Units (GELUs)” (2016) — GeLU definition and approximation. arxiv.org/abs/1606.08415

Positional Encoding

Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) — RoPE. arxiv.org/abs/2104.09864
Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models” (2023) — long-context RoPE scaling. arxiv.org/abs/2309.00071
bloc97 & emozilla, “NTK-Aware Scaled RoPE” — discussion of early NTK-aware open-source work. reddit / LocalLLaMA

Attention Variants (MQA / GQA / MLA)

Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need” (2019) — MQA. arxiv.org/abs/1911.02150
Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” (EMNLP 2023) — GQA. arxiv.org/abs/2305.13245
DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” (2024) — MLA introduced. arxiv.org/abs/2405.04434
DeepSeek-AI, “DeepSeek-V3 Technical Report” (2024) — MLA + MoE engineering. arxiv.org/abs/2412.19437

Mixture-of-Experts

Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (ICLR 2017) — foundational paper for top- $k$ gated MoE in deep nets. arxiv.org/abs/1701.06538
Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding” (ICLR 2021) — the standard aux-loss load-balancing recipe. arxiv.org/abs/2006.16668
Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” (JMLR 2022) — top-1 + capacity factor. arxiv.org/abs/2101.03961
Du et al., “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts” (ICML 2022) — Google’s 1.2 T MoE. arxiv.org/abs/2112.06905
Jiang et al., “Mixtral of Experts” (Mistral AI, 2024) — Mixtral 8×7B tech report. arxiv.org/abs/2401.04088
DeepSeek-AI, “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models” (2024) — fine-grained + shared-expert paradigm. arxiv.org/abs/2401.06066
Wang et al., “Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts” (2024) — the load-balancing approach used in DeepSeek V3. arxiv.org/abs/2408.15664
Meta AI, “The Llama 4 herd” (2025) — Llama 4 Scout / Maverick / Behemoth technical details. ai.meta.com/blog/llama-4

Flash Attention Series

Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (NeurIPS 2022) — FA-1. arxiv.org/abs/2205.14135
Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (2023) — FA-2. arxiv.org/abs/2307.08691
Shah et al., “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision” (NeurIPS 2024) — FA-3 / Hopper. arxiv.org/abs/2407.08608
Dao et al., “Flash-Decoding for long-context inference” (Stanford / Together blog, 2023) — Flash Decoding. crfm.stanford.edu

Inference Engines and Serving Schedulers

Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023) — vLLM / PagedAttention. arxiv.org/abs/2309.06180
Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models” (OSDI 2022) — original Continuous Batching paper. usenix.org/osdi22
Agrawal et al., “SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills” (2023) — chunked prefill. arxiv.org/abs/2308.16369
Zhong et al., “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving” (OSDI 2024) — prefill/decode disaggregation. arxiv.org/abs/2401.09670
vLLM main repo (PagedAttention engineering implementation). github.com/vllm-project/vllm
Hugging Face Text Generation Inference (TGI). github.com/huggingface/text-generation-inference
NVIDIA TensorRT-LLM documentation (FA / in-flight batching). nvidia.github.io/TensorRT-LLM

Speculative Decoding Family

Leviathan, Kalman, Matias, “Fast Inference from Transformers via Speculative Decoding” (ICML 2023). arxiv.org/abs/2211.17192
Chen et al., “Accelerating Large Language Model Decoding with Speculative Sampling” (DeepMind, 2023) — concurrent independent work. arxiv.org/abs/2302.01318
Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads” (2024). arxiv.org/abs/2401.10774
Li et al., “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty” (ICML 2024). arxiv.org/abs/2401.15077
Fu et al., “Lookahead Decoding: Breaking the Sequential Dependency of LLM Inference” (2024). arxiv.org/abs/2402.02057

KV Cache Compression / Quantization

Xiao et al., “Efficient Streaming Language Models with Attention Sinks” (ICLR 2024) — StreamingLLM. arxiv.org/abs/2309.17453
Zhang et al., “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models” (NeurIPS 2023). arxiv.org/abs/2306.14048
Li et al., “SnapKV: LLM Knows What You are Looking for Before Generation” (2024). arxiv.org/abs/2404.14469
Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache” (ICML 2024). arxiv.org/abs/2402.02750
Hooper et al., “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization” (NeurIPS 2024). arxiv.org/abs/2401.18079

Representative Open-Source Model Technical Reports

Meta AI, “The Llama 3 Herd of Models” (2024) — Llama 3 family. arxiv.org/abs/2407.21783
Jiang et al., “Mistral 7B” (2023) — Sliding Window Attention. arxiv.org/abs/2310.06825
Qwen Team, “Qwen2.5 Technical Report” (2024). arxiv.org/abs/2412.15115

Hardware / Roofline

NVIDIA, “H100 Tensor Core GPU Architecture Whitepaper” (2022). resources.nvidia.com
Williams, Waterman, Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (CACM 2009) — original Roofline paper. dl.acm.org
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla, 2022) — training-side counterpart ( $C \approx 6PN$ , forward + backward) of the $2PN$ forward-pass rule of thumb. arxiv.org/abs/2203.15556