Low-Precision Data Formats in Large Language Models
If you had to summarize the hardware trend in LLM training and inference over the past three years in a single sentence, it would be this: training is moving from BF16 to FP8, and inference is moving from FP8 to FP4. Each time precision drops one notch, roughly twice as many MAC units fit in the same silicon area, so peak training throughput and inference tokens/s roughly double. Underneath this main line is a whole spectrum of data formats: FP32 / TF32 / FP16 / BF16 / FP8 (two variants) / FP6 (two variants) / FP4 (MX and NV versions) / INT8 / INT4.
But “precision moving down” is never as simple as “swap one format for another across the whole model.” Within a single Transformer block, matrix multiplication may run in FP8, Softmax in FP32, and the KV cache may be compressed to INT8 — precision is mixed across operation types, not across layers. To see this clearly, we need to lay out every format’s bit structure, dynamic range, hardware support, and actual position in a Transformer.
This article lays that map out in one place, covering only the formats with native Tensor Core / matrix-engine hardware support — pure software emulation, research-stage, or paper-only formats (INT2, binarization, posits, etc.) are out of scope.
Why It Matters — Not Just Saving Memory
A lot of people’s first reaction to low precision is “it saves memory.” That’s only half right. What really drives NVIDIA / AMD / Google to keep pushing down on precision is another fact: each Tensor Core generation roughly doubles throughput at lower precision.
| Architecture | Flagship | Same-SM Throughput (Primary) | Same-SM Throughput (Lowest) |
|---|---|---|---|
| Ampere (2020) | A100 | 312 TF (FP16) | 624 TOPS (INT8) |
| Hopper (2022) | H100 | 990 TF (FP16) | 1979 TF (FP8) |
| Blackwell (2024) | B200 | 2250 TF (FP16) | 9000 TF (FP4) |
| Rubin (2026) | R100 | ~8000 TF (FP16) | 50,000 TF (FP4) |
Look at this table together — halve the bit width and the MAC count on the same silicon doubles; compute doubles directly. So low precision is not just “the same model uses less memory,” it’s “the same silicon can run bigger models.” This is the fundamental driver behind NVIDIA / AMD / Intel collectively pushing into FP8 / FP4 over the last three years.
Memory savings matter too — BF16 weights are half the size of FP32; INT4 quantized weights are a quarter of FP16. But if the dividend were memory only and not compute, the road wouldn’t go this far.
The Universal Floating-Point Representation — One Formula for All Formats
Any IEEE 754-style floating-point number consists of three parts:
- S (sign): 1 bit, determines positive vs negative
- E (exponent): $e$ bits, stored as a “biased” exponent; the real exponent is $E - \text{bias}$, where $\text{bias} = 2^{e-1} - 1$
- M (mantissa / fraction): $m$ bits, with an implicit leading “1.” for normal numbers
So the key parameters of a floating-point format are $(e, m)$, and a normal number has the value $(-1)^S \times 2^{E - \text{bias}} \times (1 + M / 2^m)$. The exponent count determines the dynamic range; the mantissa count determines the precision. Every format below fits this template.
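The template above can be turned into a small decoder. This is a minimal sketch for strictly IEEE 754-style layouts (all-ones exponent reserved for inf / NaN, subnormals at exponent field 0); the function name `decode` and its parameters `e` and `m` are illustrative, and formats that bend the IEEE rules, such as FP8 E4M3 or the MX element types, would need their special cases handled separately.

```python
def decode(bits: int, e: int, m: int) -> float:
    """Decode a 1 + e + m bit pattern that follows IEEE 754 conventions."""
    bias = (1 << (e - 1)) - 1
    sign = -1.0 if (bits >> (e + m)) & 1 else 1.0
    exp_field = (bits >> m) & ((1 << e) - 1)
    frac = bits & ((1 << m) - 1)

    if exp_field == (1 << e) - 1:
        # All-ones exponent: infinity when the fraction is zero, NaN otherwise.
        return sign * float("inf") if frac == 0 else float("nan")
    if exp_field == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * (frac / (1 << m))
    # Normal number: implicit leading 1.
    return sign * 2.0 ** (exp_field - bias) * (1 + frac / (1 << m))


fp16_one = decode(0x3C00, e=5, m=10)      # 1.0
fp16_max = decode(0x7BFF, e=5, m=10)      # 65504.0
bf16_pi = decode(0x4049, e=8, m=7)        # 3.140625
fp32_one = decode(0x3F800000, e=8, m=23)  # 1.0
```

The same function covers FP32, FP16 and BF16 simply by changing `(e, m)`, which is the sense in which one formula describes every format.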
Integer formats are another beast — no exponent at all, just signed or unsigned integers. Their range is linear, precision is uniform, but the representable range is fixed by the bit width. In practice, an integer format always pairs with an external scale factor that “stretches” or “shrinks” the integer values to the target numeric range. We’ll come back to that in the integer section.
Floating-Point Formats Side by Side — FP32 → FP4 at the Same Scale
First, a single chart with all nine floating-point formats aligned to the same scale — each cell is one bit. The exponent width determines dynamic range; the mantissa width determines precision. The trade-off between these two is the core design choice for every format:
FP32 / TF32 — The Baseline and an Engineering Trick — Same 8-bit Exponent, Different Mantissa
FP32 (1 + 8 + 23 = 32 bits) is the baseline for all modern CPUs / GPUs. 8-bit exponent, bias = 127, dynamic range ~$1.2 \times 10^{-38}$ (min normal) to $3.4 \times 10^{38}$, ~7 decimal digits of precision.
TF32 (1 + 8 + 10 = 19 bits, stored in a 32-bit register) is an engineering trick NVIDIA introduced on A100: the exponent matches FP32 exactly (8 bits), and the mantissa is cut down to match FP16 (10 bits). Despite the name, its effective width is only 19 bits. The win: same dynamic range as FP32 → no loss scaling needed; only ~3 decimal digits of precision → significantly faster compute. On A100 / H100 / Blackwell, TF32 is the Tensor Core path for FP32 GEMMs: the model code stays unchanged, although some frameworks make it an opt-in setting rather than the default (PyTorch, for example, exposes it through `torch.backends.cuda.matmul.allow_tf32`).
FP16 / BF16 — Two 16-bit Formats for the Training Era — One Favours Precision, One Favours Range
FP16 (1 + 5 + 10 = 16 bits) is IEEE 754 half-precision: 5-bit exponent, bias = 15, dynamic range ~$6.1 \times 10^{-5}$ (min normal) to $65504$, 3–4 decimal digits of precision. For training, this range is dangerously narrow: gradients on the order of $10^{-8}$ underflow to zero. So FP16 training must be paired with loss scaling (scaling up the loss before backpropagation so that gradients land at a magnitude that won’t underflow).
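The underflow problem and the loss-scaling fix can be seen with Python's built-in half-precision packing. This is a toy sketch: the helper `to_fp16` and the scale of 1024 are illustrative choices, not values from any particular training recipe.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8
loss_scale = 1024.0  # hypothetical loss scale

plain = to_fp16(grad)                # too small for FP16: becomes 0.0
scaled = to_fp16(grad * loss_scale)  # survives as an FP16 subnormal
recovered = scaled / loss_scale      # unscale in higher precision
```

Without scaling the gradient is lost entirely; with scaling it comes back within a fraction of a percent of the original value once the scale is divided out in FP32 or higher.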
BF16 (1 + 8 + 7 = 16 bits) was introduced by Google Brain on TPU v2. The design idea is straightforward — chop 16 bits off the FP32 mantissa, keep all 8 exponent bits. As a result:
- Dynamic range is essentially the same as FP32 → loss scaling is no longer needed; BF16 is almost a drop-in replacement for FP32
- Precision is only 2–3 decimal digits → even lower than FP16
- FP32 ↔ BF16 conversion is just truncation / padding → essentially free in hardware
This is why, after 2020, large-model training overwhelmingly chose BF16 over FP16 — LLM training is far more sensitive to dynamic range than to precision. That single lesson is the direct prototype for the FP8 design that followed.
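The "truncation / padding" conversion described above can be sketched with plain bit operations. The helper names are illustrative, and this version truncates exactly as the text describes; real hardware and libraries often round to nearest even instead, which this sketch does not attempt.

```python
import struct


def fp32_bits(x: float) -> int:
    """Bit pattern of x after rounding to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fp32_to_bf16_bits(x: float) -> int:
    """Truncate: keep the top 16 bits (sign, 8 exponent bits, 7 mantissa bits)."""
    return fp32_bits(x) >> 16


def bf16_bits_to_fp32(b: int) -> float:
    """Pad: append 16 zero mantissa bits to get a valid FP32 pattern."""
    return bits_to_fp32(b << 16)


pi_bf16 = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265))  # 3.140625
big_bf16 = bf16_bits_to_fp32(fp32_to_bf16_bits(1e38))       # still finite
```

A value near the top of the FP32 range stays finite after the round trip, while π keeps only about three significant digits: range preserved, precision sacrificed.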
FP8 — Two Variants With a Division of Labour — E4M3 Forward · E5M2 Backward
FP8 was jointly proposed by NVIDIA / Arm / Intel in 2022 (arxiv 2209.05433); it has native support on H100 / MI300 / Gaudi 2/3. It defines two variants simultaneously:
| Variant | Bit layout | Dynamic range | Precision | Use |
|---|---|---|---|---|
| E4M3 | 1+4+3 | ±448 (max) · $2^{-6}$ min normal | higher | Forward: weights / activations |
| E5M2 | 1+5+2 | ±57344 (max) · $2^{-14}$ min normal | lower | Backward: gradients |
Why two variants? Because activations and gradients have very different dynamic ranges. Activations, after passing through a LayerNorm, usually cluster in a tight range — well-suited to E4M3’s “precision-heavy, range-light” trade-off. Gradients can span many orders of magnitude, requiring E5M2’s “range-heavy, precision-light” design.
Note that E4M3 doesn’t strictly follow IEEE 754 — it drops inf, repurposing the inf encoding to extend the numeric range (max value 448 instead of 240). E5M2 strictly follows IEEE 754, with both inf and NaN.
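The 448 versus 240 difference falls straight out of the bit layout. The sketch below computes the largest finite value under two conventions; the function names and the `top_exponent` labels are made up for this illustration.

```python
def max_finite(e: int, m: int, top_exponent: str) -> float:
    """Largest finite value of a 1 + e + m format.

    top_exponent:
      "ieee"     -> the all-ones exponent is reserved for inf / NaN
      "nan_only" -> only the all-ones mantissa at the top exponent is NaN
    """
    bias = (1 << (e - 1)) - 1
    if top_exponent == "ieee":
        exp_field, frac = (1 << e) - 2, (1 << m) - 1
    elif top_exponent == "nan_only":
        exp_field, frac = (1 << e) - 1, (1 << m) - 2
    else:
        raise ValueError(top_exponent)
    return 2.0 ** (exp_field - bias) * (1 + frac / (1 << m))


def min_normal(e: int) -> float:
    """Smallest positive normal value for an e-bit exponent."""
    bias = (1 << (e - 1)) - 1
    return 2.0 ** (1 - bias)


e4m3_max = max_finite(4, 3, "nan_only")  # 448.0
e4m3_ieee_max = max_finite(4, 3, "ieee")  # 240.0
e5m2_max = max_finite(5, 2, "ieee")       # 57344.0
```

Giving up the inf encodings buys E4M3 almost one extra binade at the top of its range, which matters when the whole format only spans a handful of binades.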
FP8 scaling: 8 bits alone are nowhere near enough to cover the dynamic range encountered in training. So FP8 is almost always paired with an FP32 per-tensor scaling factor $s$ (auto-maintained by H100’s Transformer Engine); the value actually stored is $x_{\text{fp8}} = \text{round}(x / s)$, and the real value is recovered as $x \approx s \cdot x_{\text{fp8}}$. We’ll come back to this when discussing DeepSeek-V3’s finer-grained version.
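Per-tensor scaling can be illustrated in a few lines. This is a toy sketch, not Transformer Engine's API: `e4m3_grid`, `round_e4m3` and `quantize_per_tensor` are hypothetical helpers, the scale is derived from the current absolute maximum rather than a delayed history, and ties go to the smaller magnitude instead of round-to-nearest-even.

```python
def e4m3_grid():
    """All non-negative finite E4M3 values (bias 7, maximum 448)."""
    vals = set()
    for exp_field in range(16):
        for frac in range(8):
            if exp_field == 15 and frac == 7:
                continue  # the single NaN encoding
            if exp_field == 0:
                vals.add(2.0 ** -6 * frac / 8)  # subnormals and zero
            else:
                vals.add(2.0 ** (exp_field - 7) * (1 + frac / 8))
    return sorted(vals)


GRID = e4m3_grid()


def round_e4m3(x: float) -> float:
    """Nearest representable E4M3 value; magnitudes above 448 saturate."""
    mag = min(GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize_per_tensor(xs):
    """Map the tensor's absolute maximum onto 448 and round every element."""
    amax = max(abs(x) for x in xs)
    scale = amax / 448.0 if amax > 0 else 1.0
    return [round_e4m3(x / scale) for x in xs], scale


xs = [0.02, -1.5, 3.0, 12.0]
q, scale = quantize_per_tensor(xs)
restored = [scale * v for v in q]
```

Because one scale serves the whole tensor, the largest element always lands exactly on 448 and every other element is placed relative to it; a single outlier therefore decides how much resolution the small values get.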
FP6 and FP4 — The Microscaling Era — Unusable Alone · Must Share Block Scales
FP6 and FP4 are part of the OCP (Open Compute Project) Microscaling (MX) Format spec released in 2023; Blackwell is the first generation of hardware to support them in Tensor Cores.
FP6 has two variants: E3M2 (more range, less precision) and E2M3 (less range, more precision), chosen by use case.
FP4 has two flavours in Tensor Cores, both using the E2M1 bit layout, differing in scale granularity:
| Version | Bit layout | Block size | Block scale | Outer scale |
|---|---|---|---|---|
| MXFP4 (OCP) | E2M1 (4 bit) | 32 elements | E8M0 (8 bit, pure exponent) | — |
| NVFP4 (NVIDIA) | E2M1 (4 bit) | 16 elements | FP8 E4M3 (8 bit) | FP32 per-tensor |
Why microscaling? Because FP4 has only 4 bits, which gives 16 encodings and just 15 distinct values, since +0 and −0 coincide. Without sharing a scale across a block, those few values can’t possibly cover the real distribution of model weights. MX turns “per-tensor scaling” into “per-block scaling”: every 32 numbers share an 8-bit exponent scale, building fine-grained quantization right into the hardware.
NVIDIA pushed this further with NVFP4 — block size shrunk to 16, block scale upgraded from pure-exponent E8M0 to the more precise FP8 (E4M3), with an additional FP32 per-tensor scale on top. The three-layer scale structure makes NVFP4 markedly more accurate than MXFP4, at the cost of slightly more metadata overhead. On Blackwell, NVFP4 has been measured to keep inference accuracy close to FP8 — the strongest evidence yet that FP4 has finally become a “usable format.”
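The "slightly more metadata overhead" can be put in numbers from the table above. This is back-of-the-envelope arithmetic; `bits_per_element` is an illustrative helper, and the FP32 per-tensor scale of NVFP4 is ignored because it is amortized over the whole tensor.

```python
def bits_per_element(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Storage cost per element once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


mxfp4_bits = bits_per_element(4, 32, 8)  # 4.25
nvfp4_bits = bits_per_element(4, 16, 8)  # 4.5

# Compression relative to 16-bit weights
mxfp4_ratio = 16 / mxfp4_bits
nvfp4_ratio = 16 / nvfp4_bits
```

NVFP4 pays a quarter of a bit more per element than MXFP4 in exchange for a scale that is both finer-grained (16 elements) and more precise (E4M3 rather than a pure power of two).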
Integer Formats INT8 / INT4 — No Exponent · Entirely Reliant on Scale
You can think of integer formats as a special case of floating-point — “all mantissa, zero exponent.” Precision is uniform, but dynamic range is entirely determined by an external scale. INT8 and INT4 dominate the consumer / deployment side of LLM inference.
The integer-to-float mapping in practice is:
$$x \approx s \cdot (q - z)$$
where $q$ is the stored integer, $s$ is an FP16 / FP32 scale, and $z$ is the zero point (symmetric quantization has $z = 0$; asymmetric has $z \neq 0$). For INT8, per-tensor or per-channel scaling is typically enough; INT4 essentially demands group-wise scaling, where every 32 / 64 / 128 numbers share a scale, otherwise precision loss is too steep. GPTQ (arxiv 2210.17323) and AWQ (arxiv 2306.00978) are at heart “find a set of group-wise scales” algorithms.
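A minimal sketch of symmetric group-wise quantization shows why INT4 needs small groups. The function names are illustrative, the group size of 4 is chosen only to keep the example short (the text's 32 / 64 / 128 are the realistic values), and this is plain absmax scaling, not GPTQ or AWQ.

```python
def quantize_group_symmetric(xs, bits=4):
    """Symmetric quantization of one group: zero point z = 0."""
    qmax = (1 << (bits - 1)) - 1  # 7 for INT4
    amax = max(abs(x) for x in xs)
    s = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / s))) for x in xs]
    return q, s


def quantize_groupwise(xs, group_size, bits=4):
    """Split xs into consecutive groups, each with its own scale."""
    return [
        quantize_group_symmetric(xs[i:i + group_size], bits)
        for i in range(0, len(xs), group_size)
    ]


def dequantize(groups, z=0):
    """Apply x = s * (q - z) group by group."""
    return [s * (qi - z) for q, s in groups for qi in q]


weights = [0.01, -0.02, 0.03, 0.015, 1.0, -2.0, 0.5, 1.5]

per_group = dequantize(quantize_groupwise(weights, group_size=4))
per_tensor = dequantize(quantize_groupwise(weights, group_size=len(weights)))
```

With one scale for all eight weights, the four small values collapse to zero because the scale is set by the 2.0 entry; with a scale per group of four, they keep their own resolution.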
Dynamic Ranges Lined Up — Every Format on the Same Log Axis
Put every floating-point format’s dynamic range on the same log axis and the differences become instantly visible:
A few observations worth pulling out:
- “Range-heavy” vs “precision-heavy” is visible at a glance. At the same bit width, formats with more exponent bits (BF16, FP8-E5M2) suit training — they tolerate gradients spanning many orders of magnitude. Formats with more mantissa bits (FP16, FP8-E4M3, FP6-E2M3) suit inference — weight distributions are known, and precision is more valuable.
- The lower the bit width, the more the range collapses to a point. FP4 can only cover the interval 0.5–6. It can be used in LLM inference only because an external block-scale “stretches” or “shrinks” that little bar to match the target numeric range.
- TF32 / BF16 are “truncated FP32”. Stack them with FP32 and the left and right ends of the bars match exactly — only the precision granularity differs. That’s why FP32 ↔ BF16 / TF32 conversion is essentially free in hardware.
Precision Allocation Inside a Transformer Block — Training vs Inference
A model is never “all one precision” — each component of a Transformer picks its own precision based on the operation’s characteristics. The chart below puts training and inference side by side:
A few non-obvious details:
- GEMMs are essentially the sole battlefield for low precision. Inside a Transformer block, the operations that actually move to lower bit widths are the matrix multiplications — Q/K/V projection, the two batched matmuls in attention, the output projection, and the two FFN linears. The other components (norm / softmax / residual / activation) can also be precision-reduced, but the payoff is small compared to GEMMs — most of an LLM’s compute and parameters live in those matmuls.
- Softmax / Norm / Residual barely move. The common thread is “sensitive to numeric range or accumulated error”: softmax computes $e^{x_i} / \sum_j e^{x_j}$, and exp is exquisitely sensitive to small changes in its input; norm computes sums of squares and risks overflow; residual is the critical accumulation path across N layers, where any error gets amplified. So they all stay in FP32 on the training side.
- The KV cache is an inference-specific optimization point. With long contexts, the KV cache can eat more memory than the model weights themselves. Compressing it separately to FP8 / INT8 is a routine optimization in vLLM, TensorRT-LLM, and SGLang.
- The training “14 bytes/param” rule. BF16 weights 2 bytes + FP32 master 4 bytes + Adam first moment 4 bytes + Adam second moment 4 bytes = 14 bytes/param. That’s why training a 70B model needs at least ~1 TB of GPU memory before activations.
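The 14 bytes/param arithmetic from the last bullet can be checked directly. The dictionary keys are illustrative labels; like the rule in the text, this sketch counts only weights and Adam optimizer state and leaves out gradients and activations.

```python
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "fp32_master_weights": 4,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}


def training_state_bytes(num_params: int) -> int:
    """Weights plus optimizer state, before gradients and activations."""
    return num_params * sum(BYTES_PER_PARAM.values())


total_bytes = training_state_bytes(70_000_000_000)
total_gb = total_bytes / 1e9  # 980.0, i.e. roughly 1 TB
```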
Microscaling MX — The Key to Making FP4 / FP6 Actually Usable
On the dynamic-range log axis above, FP4 / FP6’s bars shrink almost to a point. They can run on hardware only because of an external block-scaling layer.
The OCP MX spec defines a remarkably simple structure — every 32 numbers share an 8-bit exponent scale:
Block size: 32 elements
Per element: FP4 (4 bit) or FP6 (6 bit) or FP8 (8 bit)
Block scale: E8M0 (8 bit, pure exponent, no sign, no mantissa)
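E8M0 is a special “pure exponent” format: all 8 bits store an exponent $E$, with no sign and no mantissa, representing a scaling factor of $2^{E - 127}$. Each element of a block of 32 FP4 numbers carries an actual value of $x_i = 2^{E - 127} \cdot p_i$, where $p_i$ is the stored FP4 value.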
Why 32 elements per block, and why a pure-exponent scale? Engineering trade-offs — smaller blocks improve accuracy but burn more metadata; a pure-exponent scale turns scaling into a shift in hardware, which costs basically nothing. Blackwell’s FP4 Tensor Cores can run this format natively precisely because the block-scale decoder is built into the silicon.
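The block structure above can be sketched as a toy MXFP4 quantizer. The helper names are illustrative; the scale choice (floor of log2 of the block maximum, minus 2, the largest E2M1 exponent) follows the commonly described MX conversion recipe, and rounding here is simple nearest-value with saturation rather than a faithful hardware model.

```python
import math

# Non-negative values representable in FP4 E2M1
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_e2m1(x: float) -> float:
    """Nearest E2M1 value; magnitudes above 6 saturate to 6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize_mx_block(block):
    """Return (E8M0 exponent field, list of FP4 element values) for one block."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 127, [0.0] * len(block)
    # Power-of-two scale that puts amax / scale in [4, 8); values above 6 saturate.
    shared_exp = math.floor(math.log2(amax)) - 2
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    return shared_exp + 127, [round_e2m1(x / scale) for x in block]


def dequantize_mx_block(exp_field, elems):
    """x_i = 2^(E - 127) * p_i"""
    scale = 2.0 ** (exp_field - 127)
    return [scale * p for p in elems]


block = [0.01 * i for i in range(-16, 16)]  # 32 elements
exp_field, elems = quantize_mx_block(block)
restored = dequantize_mx_block(exp_field, elems)
```

Because the scale is a pure power of two, applying it is only an exponent adjustment, which is the "costs basically nothing" property the paragraph above refers to.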
NVFP4, described in the FP4 section above, moves away from both of these defaults: 16-element blocks, an FP8 (E4M3) block scale instead of pure-exponent E8M0, and an extra FP32 per-tensor scale, trading slightly more metadata for better effective precision.
Hardware Support Matrix — V100 → Rubin · TPU · MI300
Low precision can only “take off” with hardware acceleration — without native Tensor Core / matrix-engine support, low precision saves memory but doesn’t run any faster. The matrix below shows, for each format and each accelerator generation, when it first got native support:
A few patterns worth pulling out:
- The green diagonal. Connect all the solid green dots and you get a descending staircase: FP16 (2017) → BF16 (2020) → FP8 (2022) → FP4 (2024). From BF16 onward, the bit width halves from 16 to 8 to 4 at roughly two-year intervals, which is the “bit width halves every two years” hardware cadence.
- FP32 / FP16 / INT8 are the three “common base” rows. Every vendor supports them. If your code sticks to these three, deployment to any accelerator works; go further down and you lock in to specific vendors and generations.
- AMD lags NVIDIA by about one generation. MI300 (2023)’s format coverage matches that of NVIDIA’s H100 from 2022; MI350 (2025) finally adds FP6 / FP4.
- TPU’s “restraint”. Google validated BF16 + INT8 on their own workloads and decided it’s enough, so TPU has skipped FP8 / FP4. This is an interesting design stance — when you’re both the hardware and the model designer, you can choose “software bends to hardware” instead of “hardware chases software.” Gemini’s BF16-based training stack is a direct consequence of that choice.
Rubin — No New Formats · Just Pushing FP4 to 50 PFLOPS
NVIDIA’s next-gen architecture Rubin (first announced at Computex 2024, shipping 2026) has a few points worth recording:
- No new formats introduced. Rubin keeps the same FP64 / FP32 / TF32 / FP16 / BF16 / FP8 / FP6 / FP4 + INT8 / INT4 lineup as Blackwell. NVIDIA didn’t invent new precisions this generation — the engineering focus is doubling FP4 throughput and making FP4 truly usable.
- Performance skew is sharp. Rubin’s FP4 / FP8 throughput is ~3.5× over GB200, while FP16 is only ~1.6×. This clearly reflects NVIDIA’s prediction — most training and inference workloads will migrate from BF16 / TF32 to FP8 / FP4. If you’re still on BF16, Rubin’s speedup is much smaller than the Hopper → Blackwell jump.
- Adaptive sparsity replaces 2:4 sparsity. Earlier generations’ structured sparsity was barely used (forcing half of values to zero hurt accuracy). Rubin’s adaptive sparsity engine dynamically detects and removes zero values in the dataflow without forcing non-zero values to zero — preserving model accuracy while improving performance. The 50 PFLOPS FP4 figure NVIDIA quotes is driven by this engine — the closer to maximum sparsity, the closer to the peak.
- Rubin CPX is an interesting design. A companion chip dedicated to long-context prefill, skipping expensive HBM4 in favour of 128 GB GDDR7. This reflects the industry’s recognition that prefill and decode have very different hardware needs — prefill is compute-bound and wants FP4 throughput; decode is memory-bandwidth-bound and wants large memory.
Precision Choices in Mainstream LLMs — Llama 4 · DeepSeek-V3 · Gemini · Claude
Putting the major LLMs’ training and inference precisions side by side:
A few observations worth unpacking:
- DeepSeek-V3 was the FP8-training icebreaker (arxiv 2412.19437). Before it, the industry was sceptical that FP8 — with 8 fewer bits than FP16 — could stably train a large model at all. DeepSeek-V3’s key engineering contribution was fine-grained quantization — 1×128 tile-wise grouping for activations and 128×128 block-wise grouping for weights, using finer scale granularity to suppress outliers. This was the inflection point from “FP8 training is lab-feasible” to “a 671B-parameter MoE finishes training stably in FP8.”
- Llama 4 made FP8 training mainstream. Meta pre-trained Llama 4 Behemoth (in the 2-trillion-parameter range) in FP8, hitting ~390 TFLOPS/GPU on a 32K H100 cluster. That’s the moment FP8 training went from “one DeepSeek outlier” to “established big-lab method.”
- Why Gemini stays on BF16. The TPU generations in the hardware matrix above have no native FP8 matrix units, so Gemini’s training stack is built around BF16. This matches the TPU row’s “restraint” from that matrix.
- Closed models are a black box on precision. OpenAI and Anthropic don’t publish training-precision details. The “BF16 (assumed)” tags for GPT-4 and Claude are inferences from hardware generation, not public information. Whether GPT-5 uses FP8 is similarly unclear.
- Inference is already “two-track”. Server deployments (vLLM, TensorRT-LLM, SGLang) lean toward FP8, which saves memory and runs faster on H100/B200. Local deployments lean toward 4-bit integer weights (GGUF quantization in llama.cpp and Ollama, GPTQ / AWQ in GPU-side toolchains), because FP4 hardware is still rare in consumer GPUs.
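The fine-grained scaling idea from the DeepSeek-V3 bullet above can be illustrated with a 1-D toy: one E4M3 scale per 128 consecutive activations versus one scale for the whole tensor. This is a sketch under simplifying assumptions, not DeepSeek-V3's implementation; `fake_quant` and the synthetic activations are made up for the example.

```python
def e4m3_grid():
    """All non-negative finite E4M3 values (bias 7, maximum 448)."""
    vals = set()
    for exp_field in range(16):
        for frac in range(8):
            if (exp_field, frac) == (15, 7):
                continue  # NaN encoding
            if exp_field == 0:
                vals.add(2.0 ** -6 * frac / 8)
            else:
                vals.add(2.0 ** (exp_field - 7) * (1 + frac / 8))
    return sorted(vals)


GRID = e4m3_grid()


def fake_quant(xs, tile):
    """Quantize to E4M3 with one scale per `tile` elements, then dequantize."""
    out = []
    for i in range(0, len(xs), tile):
        chunk = xs[i:i + tile]
        amax = max(abs(x) for x in chunk) or 1.0
        scale = amax / 448.0
        for x in chunk:
            mag = min(GRID, key=lambda g: abs(g - abs(x) / scale))
            out.append((-mag if x < 0 else mag) * scale)
    return out


# 256 small activations with a single large outlier in the second tile
acts = [1e-4 * (1 + (i % 7)) for i in range(256)]
acts[200] = 500.0

per_tensor = fake_quant(acts, tile=256)  # one scale for everything
per_tile = fake_quant(acts, tile=128)    # one scale per 128 elements
```

With a single scale, the outlier forces every small activation to zero; with 128-element tiles, the damage is confined to the tile that contains the outlier and the other tile keeps its values.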
Summary — Training BF16 → FP8 · Inference FP8 → FP4
The current LLM data-format trend in one sentence: training is on the BF16 → FP8 road; inference is on the FP8 → FP4 road; and each landing depends on a new generation of Tensor Core hardware support.
Boiling the article down:
- Bit-width reduction isn’t just memory savings; it’s compute conversion. On the same silicon, every precision step down roughly doubles the MAC count. This is the root reason Tensor Core throughput grows geometrically as bit width shrinks.
- The exponent vs mantissa trade-off runs through every format. Training prefers more exponent (BF16, FP8-E5M2) to tolerate gradient range; inference prefers more mantissa (FP8-E4M3, FP6-E2M3) because weight distributions are known and precision is more valuable.
- The lower the bit width, the more it needs block-scale. FP8 can still get away with per-tensor scaling; FP6 / FP4 must use block-wise (MX / NVFP4), or single values simply can’t express enough. Integer formats are perpetually dependent on external scales.
- Precision mixes across operations, not across layers. Every component in a Transformer block chooses precision based on its operation — GEMMs are the main battlefield for low precision; softmax / norm / residual stay high-precision almost always.
- Three threads drive the next hardware generation. NVIDIA pushes FP4 / FP6 throughput (Rubin); AMD chases FP8 / FP4 (MI350); Google “restrains” on BF16 + INT8 (TPU). These three paths set the training and deployment cost curve for the next 3–5 years of large models.
Threads to dig further on next time — DeepSeek-V3’s FP8 fine-grained scaling engineering details; NVFP4 vs MXFP4 accuracy comparison across workloads on Blackwell; the “software / hardware co-design” philosophy behind TPU’s BF16 stance.
References — Standards · Papers · Engineering Blogs
Standards and White Papers
- OCP Microscaling Formats (MX) Specification v1.0 (2023) — Official spec for MXFP4 / MXFP6 / MXFP8 and the E8M0 block scale. opencompute.org · MX v1.0 spec
- NVIDIA / Arm / Intel FP8 White Paper — “FP8 Formats for Deep Learning” (2022) — design motivation for the E4M3 / E5M2 variants and training experiments. arxiv.org/abs/2209.05433
- NVIDIA Blackwell Architecture Whitepaper (2024) — second-gen Transformer Engine, NVFP4 block structure, FP4 Tensor Core design details.
- NVIDIA Hopper Architecture Whitepaper (2022) — first-gen Transformer Engine, FP8 Tensor Cores, per-tensor scaling mechanism.
Papers — Training Side
- DeepSeek-V3 Technical Report (2024) — the first published engineering report on ultra-large-scale FP8 training, with detailed fine-grained tile-wise / block-wise scaling and outlier handling. arxiv.org/abs/2412.19437
- The Llama 3 Herd of Models (Meta, 2024) — engineering practice of training a 405B model in BF16 on a 16K H100 cluster. arxiv.org/abs/2407.21783
- Transformer Engine: Mixed-Precision FP8 Training (NVIDIA, 2022+) — delayed scaling mechanism for FP8 training on H100, the default dependency for Megatron-LM / NeMo. github.com/NVIDIA/TransformerEngine
Papers — Inference Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers, 2022) — the first stable INT8 quantization scheme at LLM scale; introduces the outlier observation and its handling. arxiv.org/abs/2208.07339
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) — INT4 weight quantization based on second-order information, layer-by-layer optimal scale search. arxiv.org/abs/2210.17323
- AWQ: Activation-aware Weight Quantization (Lin et al., 2023) — group-wise scale choice based on activation magnitudes; consistently outperforms GPTQ. arxiv.org/abs/2306.00978
- SmoothQuant: Accurate and Efficient Post-Training Quantization for LLMs (Xiao et al., 2022) — “smooths” activations to migrate outliers into weights, making W8A8 quantization viable. arxiv.org/abs/2211.10438
Engineering Blogs and Docs
- NVIDIA Developer Blog · Transformer Engine — practical tutorials on FP8 / FP4 training on H100 / B200. developer.nvidia.com
- vLLM Docs · Quantization — open-source implementation details for INT8 / FP8 / INT4 inference quantization. docs.vllm.ai · Quantization
- Tim Dettmers Blog — the most authoritative engineering perspective on LLM quantization; author of the bitsandbytes library. timdettmers.com
- Hugging Face · Quantization Guide — unified entry point for GPTQ / AWQ / BitsAndBytes in the transformers library. huggingface.co · quantization