Storage in CPUs and GPUs — Types, Process, and Why It Is Designed This Way

If you had to sum up storage in CPUs and GPUs in one sentence — it is a pyramid that runs from registers to mechanical disks: the higher you go the faster, more expensive, closer to the compute units, and more volatile it gets; the lower you go the slower, cheaper, farther from compute, and more persistent. The whole edifice exists for one purpose only — to keep the compute units from sitting idle waiting for data.

But behind this main thread hide a few easily overlooked facts. First, the thing called “SRAM” is tuned very differently in the L1 cache versus the L3 cache. Second, the DDR sticks in a PC, the GDDR on a graphics card, and the HBM on an AI accelerator are all fundamentally DRAM; the difference lies in the interface and packaging, not in the storage cell. Third, the explosive capacity growth of SSDs in recent years comes not from “shrinking the cell” but from “stacking more layers.” Cell tuning, packaging, and layer stacking are three independent process dimensions.

This article unfolds along two threads. By hardware — crack open a computer and see what storage lives inside the CPU, the memory sticks, the graphics card, the SSD, and the HDD. By type — explain the circuit, process, and variants of each storage type, and along the way answer “why is this storage so fast (slow) / expensive (cheap) / volatile (persistent).” The two threads meet in the middle via a mapping table, so you can enter from either side and cross over to the other.

I · Overview — What Storage Exists · Where It Is Used

The Storage Pyramid — Seven Tiers from Fast to Slow

The whole storage system is arranged by “speed from fast to slow, capacity from small to large, cost per unit from expensive to cheap” — roughly seven tiers. This spectrum starts inside the silicon of the CPU/GPU chip and extends all the way out of the case, even out of the data center:

Tier | Storage | Type | Latency / Bandwidth | Capacity scale
1 | Register (operated directly by instructions) | SRAM class (latch / FF), 6-8 transistors per bit | < 1 ns, sub-ns latency | KB
2 | L1 cache (private per core) | SRAM (high-performance 6T), large transistors + multi-port | ≈ 1 ns / 4-5 cycles, several TB/s | tens of KB
3 | L2 cache (per core on CPU / shared across SMs on GPU) | SRAM (high-density 6T), compromise sizing | ≈ 3 ns / 10-15 cycles | hundreds of KB to MB
4 | L3 cache (CPU only; shared across all cores · GPUs usually none) | SRAM (densest 6T), stackable since 3D V-Cache | ≈ 10 ns | tens of MB to 1 GB
5 | Main memory / VRAM (DDR / LPDDR / GDDR / HBM) | DRAM (1T1C), separate fab line · needs refresh | ≈ 50-100 ns, tens of GB/s to several TB/s | GB to hundreds of GB
6 | Solid-state drive (SSD), system disk / data disk | NAND Flash (3D stacked), floating gate / charge trap | 10-100 μs · several GB/s | hundreds of GB to tens of TB
7 | Mechanical disk (HDD), cold data / archive | magnetic recording (non-semiconductor), head + spinning platter | 5-15 ms · hundreds of MB/s | several TB to tens of TB

Going up: faster · pricier per unit capacity · smaller · lost on power-off. Going down: slower · larger · cheaper · survives power-off.
Legend: SRAM class (on-die) · DRAM (off-die) · NAND Flash · magnetic recording.
The seven-tier storage pyramid — the top four tiers (orange) are the SRAM family, etched into the silicon of the CPU/GPU; tier 5 (blue) is the DRAM family, independent chips plugged into the motherboard or stuck next to the GPU; tier 6 (green) is NAND Flash; tier 7 (red) is magnetic recording, which is not even on the same track as semiconductors. Latency climbs from sub-nanosecond all the way to tens of milliseconds, spanning 7 orders of magnitude.

An Analogy — Desk · Bookshelf · Bookcase

You can picture this hierarchy as a person working at a desk — the registers are the sheet of paper you are writing on, the cache is the few books lying open within reach, main memory is the desk, and the SSD/HDD is the bookcase in the room. You can keep the least on hand but grab it fastest; the bookcase holds the most but every trip means getting up and walking over. The central design problem of a computer is to figure out how to keep the things you use most often as close to your hand as possible, minimizing the number of trips to the bookcase.

This also explains a common phenomenon — when memory runs short, the system temporarily moves some memory data onto the disk (swap / virtual memory), and things get noticeably sluggish. It is as if something you should have been able to grab off the desk has been stuffed into the bookcase, and you have to walk over to fetch it every time.
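To put rough numbers on the "trips to the bookcase," here is a minimal sketch of the usual average-access-time calculation. The latencies are the approximate figures from the pyramid above; every hit rate, and the 1% page-fault rate in the swapping case, is an illustrative assumption rather than a measurement.

```python
# Average access time across a storage hierarchy.
# Latencies (ns) are the rough figures from the pyramid above;
# every hit rate below is an illustrative assumption, not a measurement.

def average_access_ns(levels):
    """levels: list of (name, latency_ns, hit_rate), fastest first.

    An access pays the latency of each level it probes until it hits,
    so the last level should have hit_rate 1.0.
    """
    total = 0.0
    reach_prob = 1.0  # probability that an access gets as far as this level
    for _name, latency_ns, hit_rate in levels:
        total += reach_prob * latency_ns
        reach_prob *= 1.0 - hit_rate
    return total

on_chip = [("L1", 1.0, 0.95), ("L2", 3.0, 0.80), ("L3", 10.0, 0.75)]

fits_in_dram = on_chip + [("DRAM", 80.0, 1.0)]
swapping = on_chip + [
    ("DRAM", 80.0, 0.99),          # assumed: 1% of DRAM accesses page-fault
    ("SSD swap", 50_000.0, 1.0),   # 50 us, inside the 10-100 us SSD range
]

print(f"fits in DRAM: {average_access_ns(fits_in_dram):.2f} ns")
print(f"swapping    : {average_access_ns(swapping):.2f} ns")
```

Under these assumed numbers the average rises from roughly 1.45 ns to roughly 2.7 ns, even though only about one access in 40,000 ever reaches the SSD. That is the arithmetic behind the sluggishness.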

Two Perspectives — How to Read This Article

Understanding storage takes more than one pyramid diagram. The same kind of storage can differ enormously across different hardware and across different processes. So the following uses two perspectives to dig in:

  • By hardware — crack open a computer and see what storage lives inside the CPU chip, the memory sticks, the graphics card, the SSD, and the HDD, and what type each one is. This section answers “what storage is on this piece of hardware in my hands.”
  • By type — explain the circuit principle, family variants, manufacturing process, and “why it is this way” of each storage type all at once. This section answers “why is this storage so fast / slow / expensive / cheap / volatile.”

The two sides connect via the mapping table in Part II, so you can enter from either side and cross over to the other.

II · By Hardware — What Storage Each Piece of Hardware Holds

An Ownership Map — Crack Open a Computer

Understanding storage means knowing not just “what it is” but also “where it is.” Crack open a single machine and storage is scattered across several independent pieces of hardware, each serving a different master.

[Diagram: motherboard layout. CPU chip (single die or chiplets): register + L1 and L2 per core, L3 shared by all cores, all SRAM etched on the die. Memory stick (DIMM, removable): DDR5 dies ×8-16, serving CPU main memory. Firmware chip (soldered down): NOR Flash holding BIOS / UEFI. Graphics card / AI accelerator (separate PCB, attached via PCIe): GPU chip with register + L1 + L2 + SMEM / TMEM, plus GDDR (consumer) or HBM stacks (data center). SSD (M.2 / U.2): 3D NAND Flash dies + controller + DRAM cache. HDD (SATA): magnetic recording, non-semiconductor.]
Hardware ownership of storage in a typical computer — the orange block (CPU chip) integrates registers + three levels of cache, all SRAM; green is the memory stick, soldered with DDR dies, an independent removable DRAM chip; the blue graphics card has a GPU chip holding register/L1/L2 + SMEM, surrounded by GDDR dies (consumer) or HBM stacks pressed right against the GPU (data center); red is the SSD, internally NAND Flash dies. The HDD is yet another standalone device, using magnetic recording. The BIOS / UEFI is a small piece of NOR Flash soldered onto the motherboard.

Summarizing the hardware in this diagram into a table — the left side is the hardware part, the right side is the storage type it holds. Part 3 unfolds by these types, and the rightmost column carries anchors that jump straight there:

Hardware | Internal storage part | Storage type (variant level) | Volatile/non-volatile | See type section
CPU chip | Register | register file (multi-port flip-flop) | volatile | Register File
CPU chip | L1 / L2 cache | cache 6T SRAM (high-performance tuning) | volatile | Cache 6T SRAM
CPU chip | L3 cache | cache 6T SRAM (high-density) / 3D V-Cache | volatile | 3D V-Cache
Memory stick (DIMM) | DDR5 dies | DDR | volatile | DDR
Graphics card / AI accelerator | GPU on-die register / L1 / L2 / SMEM | register file + cache 6T SRAM | volatile | Cache 6T SRAM
Graphics card / AI accelerator | GDDR dies (consumer) | GDDR | volatile | GDDR
Graphics card / AI accelerator | HBM stack (data center) | HBM | volatile | HBM
SSD | NAND Flash dies (main storage) | 3D NAND Flash | non-volatile | 3D NAND Flash
SSD | DRAM cache (optional) | DDR / LPDDR | volatile | DDR
HDD | platters + heads | magnetic recording (PMR / HAMR) | non-volatile | HDD Magnetic Recording
HDD | internal DRAM cache | DDR | volatile | DDR
Motherboard | BIOS / UEFI chip | NOR Flash | non-volatile | NOR Flash
Motherboard | embedded EEPROM (NIC etc.) | NOR Flash / Flash family | non-volatile | NOR Flash
Embedded / automotive | MCU embedded non-volatile | MRAM / ReRAM (NOR replacement) | non-volatile | MRAM

A few seemingly counterintuitive details: an SSD is actually a composite of NAND + DRAM + controller; an HDD also has a small DRAM cache; the GDDR/HBM on a graphics card and the DDR on a memory stick are all fundamentally DRAM; the BIOS on the motherboard uses NOR Flash rather than NAND, because NOR supports byte-granularity random reads and is suited to running code directly, whereas NAND is read a whole page at a time (and erased a whole block at a time) and is suited to storing large chunks of data.

The hardware is unpacked piece by piece below.

CPU Chip — Register + L1/L2/L3 All SRAM

A CPU chip is a single piece of silicon (or several chiplets packaged together). Besides the execution units, instruction decoders, and other logic circuits, all storage inside the CPU — registers, L1, L2, L3 cache — is SRAM, all etched directly onto the silicon.

  • Register — pressed against the execution units, sub-nanosecond latency, only KB-scale capacity. Almost every instruction touches it.
  • L1 cache — private to each core, tens of KB, ~1 ns latency. Split into instruction cache (I-Cache) and data cache (D-Cache).
  • L2 cache — private to each core, hundreds of KB to MB scale, ~3 ns latency.
  • L3 cache — shared across all cores, tens of MB up to a GB, ~10 ns latency. AMD’s 3D V-Cache can stack another layer of SRAM on top.

Why doesn’t the CPU integrate a larger-capacity DRAM inside the chip to serve as main memory? Because the DRAM process and the logic process are incompatible (capacitor process vs standard transistors), and forcing it into the CPU chip would be enormously costly. So the storage inside the CPU can only be SRAM — process-compatible and integrable, but with large cells that eat area. The L3 SRAM cache often occupies a sizable fraction of the die on a modern CPU.

Real die shot of an Intel i9-13900K Raptor Lake
Delidded die shot of an Intel Core i9-13900K (Raptor Lake, Intel 7 process) — the large bright region of regular texture in the middle is the L3 SRAM cache array, flanked by the P-Core / E-Core clusters. The SRAM cache takes up a sizable fraction of the die area, which is precisely the physical embodiment of “SRAM is expensive per unit area.” (Source: Wikimedia Commons, Fritzchens Fritz, CC0 public domain)

Memory Stick (DIMM) — DDR5 DRAM Dies

A memory stick is an independent module plugged into a motherboard memory slot, serving the CPU’s main memory. It is essentially a small PCB with 8-16 DRAM dies soldered on, communicating with the motherboard through a 288-pin edge connector.

Real photo of a DDR5 memory stick — a DRAM module seated in a motherboard slot
A range of DDR5 memory stick form factors shown by SK Hynix (UDIMM / SODIMM / CAMM2 / MRDIMM, etc.) — each stick has multiple DRAM dies soldered on, inserted into a motherboard slot via the edge connector. (Source: Wikimedia Commons, 4300streetcar, CC BY 4.0)

It is worth noting that the DRAM dies on a memory stick, the GDDR dies on a graphics card, and the HBM stacks on an AI accelerator are all fundamentally DRAM — only the packaging and interface processes differ. The specific differences are in the DDR / GDDR / HBM sections.

Graphics Card / AI Accelerator — GPU + VRAM

A graphics card is a standalone PCB attached to the motherboard through PCIe. It carries two kinds of storage:

The first kind is the GPU chip itself. Like the CPU, all storage inside the GPU chip is also SRAM — registers, L1 / L2 cache, shared memory (SMEM), and the Tensor Memory (TMEM) newly added since Blackwell. The difference is that GPUs generally have no L3 cache, and SMEM / TMEM are SRAM that the programmer can explicitly manage (CPU caches are transparent to software). The GPU register file is made especially large (256 KB per SM) because it has to feed thousands of threads at once.

The second kind is VRAM, i.e. the DRAM stuck next to the GPU. There are two kinds by packaging method:

  • GDDR (consumer) — independent DRAM dies soldered onto the PCB around the GPU, optimized for graphics bandwidth. Gaming cards like the RTX 5090 use GDDR7.
  • HBM (data center) — multiple DRAM dies stacked vertically, co-packaged with the GPU pressed tightly together through a silicon interposer. The VRAM on AI accelerators like the H100 / B200 / MI300 / TPU is all HBM.

The photo below gives a direct view of how HBM sits with the GPU:

Real photo of HBM stacks and interposer on an AMD Fiji GPU package
Real photo of an AMD Fiji GPU package: the central square die is the GPU, the four rectangular stacks around it are the HBM VRAM, and the GPU and HBM all sit on a single silicon interposer. The whole package is only palm-sized. Fiji's first-generation HBM delivered 512 GB/s; the same side-by-side layout on current data-center accelerators reaches several TB/s. (Source: Wikimedia Commons, C. Spille / pcgameshardware.de, CC BY-SA 4.0)

The storage cells inside GDDR and HBM are fundamentally the same as those in the dies on a DDR memory stick; the difference lies in the interface and packaging process. The TSV stacking + silicon interposer details of HBM are in the HBM section.

SSD — NAND Flash Dies + Controller

An SSD is a standalone device attached to the motherboard through SATA (the old interface) or NVMe / PCIe (the mainstream). Internally it is quite simple — a small PCB with a few NAND Flash dies soldered on, plus an SSD controller chip, plus a small DRAM cache (optional, used to cache the mapping table).

Real photo of NAND Flash dies on the PCB inside a SanDisk SSD
The PCB of a SanDisk SDSSDA-120G SSD after disassembly: a few NAND Flash packages and a controller chip are soldered on. In current-generation drives, a single die contains 200+ layers of 3D NAND stacking. (Source: Wikimedia Commons, Raimond Spekking, CC BY-SA 4.0)

Note that an SSD is a “composite” — the main storage is NAND Flash (non-volatile, the place that actually stores data), but inside there is also a small DRAM acting as a cache to speed up metadata access, plus an ARM core running the controller firmware. A single modern NAND die holds 200+ layers of 3D NAND stacking — this is the root cause of the rapid capacity inflation of SSDs in recent years, see the 3D NAND Flash section.

HDD — Platter + Head · The Only Non-Semiconductor

A mechanical disk is yet another standalone device, attached to the motherboard through SATA. It is the odd one out in the entire storage pyramid — it is not a semiconductor at all, but magnetic recording. Inside there are: several high-speed spinning metal platters (7200 / 15000 RPM, surfaces coated with magnetic material), plus a floating head (a few nanometers above the platter).

Real photo of the platters and head inside a Seagate mechanical disk
A Seagate Barracuda Green mechanical disk with the casing removed — the high-speed spinning metal platters + the read/write head floating on its arm, the head only a few nanometers above the platter surface. This is the only tier in the entire storage pyramid that does not work on semiconductors. (Source: Wikimedia Commons, Raimond Spekking, CC BY-SA 4.0)

Reading relies on sensing changes in the magnetic field; writing relies on changing the magnetization direction of a small magnetic domain below. The mechanical motion dictates its millisecond-scale access latency, but its cost per unit capacity is the lowest of the bunch.
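The millisecond scale follows directly from the spindle speed. The sketch below computes only the average rotational latency, assuming the usual half-revolution average; seek time (moving the arm) comes on top of it and is not modeled here. The function name is illustrative.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector to rotate under the head.

    Assumes the sector is, on average, half a revolution away.
    Seek time is not included.
    """
    seconds_per_rev = 60.0 / rpm
    return 0.5 * seconds_per_rev * 1000.0

for rpm in (7200, 15000):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

Even before the arm moves, a 7200 RPM drive spends about 4 ms per random access just waiting for the platter, which no amount of electronics can remove.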

Other — Motherboard NOR Flash · Embedded Storage

There are a few more inconspicuous but necessary bits of storage in a computer:

  • The BIOS / UEFI firmware chip on the motherboard — a small piece of NOR Flash soldered onto the board, holding the first chunk of code executed at boot. NOR Flash works on the same principle as the NAND Flash used in SSDs, but with a different circuit topology (NOR cells are in parallel, allowing byte-granularity reads), making it suited to firmware-style low-capacity, low-speed scenarios.
  • The EEPROM in NICs, sound cards, and USB controllers — stores configuration and small chunks of code, from a few KB to a few MB, all in the Flash family.
  • The “VRAM” of an integrated GPU (iGPU) — it has no dedicated VRAM and carves out a chunk of system memory to use as VRAM. This is why “VRAM” and “main memory” are the same kind of DRAM in the integrated-GPU case.
  • The microcode ROM of CPUs / GPUs — a small read-only store holding instruction-decode logic, burned in at the factory and not updatable.

III · By Type — Circuit · Process · Variants

An Overview Table — Six Types at a Glance

Type | Cell structure | Volatile/non-volatile | Typical speed | Typical capacity | Mainly used where
SRAM | 6T bistable (or 8T) | volatile | < 1 ns | KB ~ tens of MB / chip | CPU/GPU cache, registers
DRAM | 1T + 1C | volatile | ≈ 50-100 ns | GB ~ hundreds of GB / card | main memory (DDR), VRAM (GDDR / HBM)
NAND Flash | floating gate / charge trap (3D stacked) | non-volatile | μs ~ ms | TB / drive | SSD, USB stick
NOR Flash | floating gate (cells in parallel) | non-volatile | tens of ns read | KB ~ MB | BIOS / UEFI, embedded firmware
Magnetic recording (HDD) | magnetic domain direction | non-volatile | ms scale | tens of TB / drive | cold data / archive
Emerging NVM (MRAM / ReRAM / PCM) | varied | non-volatile | ns ~ μs | small capacity | embedded, research

Each variant gets its own section below, explaining the circuit, process, variants, and why it is this way. Starting with the fastest, the register, and working down the pyramid.

Register File — Multi-Port Flip-Flops · Faster Than Cache SRAM

The storage cell of a register file is not a 6T SRAM but a flip-flop (D flip-flop, often 16-24 transistors) or a master-slave latch: more transistors, larger area, but strong drive, non-destructive read/write, and natural support for same-cycle multi-port access — a cache 6T cell has only one pair of bitlines, and making it multi-port means either switching to 8T/10T or time-multiplexing, neither of which can sustain sub-nanosecond latency.

The cost of multi-port is an area explosion. An N-port register file needs N wordlines + N sets of bitlines + N read paths pulled out of every bit cell, and the area grows roughly as N² (dictated by wire-mesh crossings). A superscalar CPU typically wants 6R3W to 8R4W, and a GPU SM has to feed 32 lanes; piling on ports naively would make the register file bigger than the execution units.
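A back-of-the-envelope sketch of that N² growth, under the simplifying assumption that the bit cell is entirely wire-limited (each port adds one wordline in one direction and one bitline set in the other). Real layouts deviate from this; the function name and the choice of a 2-port 1R1W cell as the baseline are illustrative.

```python
def relative_bitcell_area(ports, base_ports=2):
    """Wire-limited model: cell width and height each grow linearly with
    the number of ports, so area grows with ports squared.
    The result is relative to a cell with base_ports ports."""
    return (ports / base_ports) ** 2

for label, reads, writes in [("1R1W", 1, 1), ("2R1W", 2, 1),
                             ("6R3W", 6, 3), ("8R4W", 8, 4)]:
    ports = reads + writes
    ratio = relative_bitcell_area(ports)
    print(f"{label}: {ports} ports, about {ratio:.1f}x the area of a 1R1W cell")
```

In this model a monolithic 8R4W cell costs about 36 times the area of a 1R1W cell.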

Banking is the industry-standard solution — chop the whole file into 4 to 8 small banks, each only 1R1W or 2R1W, and synthesize true multi-port by “accessing different banks in the same cycle”; the moment two instructions collide on the same bank, stall one cycle. Itanium, the Alpha 21264, and modern GPUs all do this.
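The bank-conflict rule can be sketched in a few lines. This is a toy model, not any particular chip: the bank count, the modulo mapping from register index to bank, and the function names are all assumptions made for illustration, and two reads of the same register are counted as two separate port uses.

```python
from collections import Counter

NUM_BANKS = 4        # assumed bank count
READS_PER_BANK = 2   # each bank is 2R1W

def bank_of(reg_index):
    # Hypothetical mapping: low-order bits of the register index pick the bank.
    return reg_index % NUM_BANKS

def read_cycles(reg_indices):
    """Cycles needed to read all operands when each bank can serve
    READS_PER_BANK reads per cycle. 1 means no conflict."""
    if not reg_indices:
        return 0
    per_bank = Counter(bank_of(r) for r in reg_indices)
    worst = max(per_bank.values())
    return -(-worst // READS_PER_BANK)  # ceiling division

print(read_cycles([0, 1, 2, 3, 4, 5, 6, 7]))  # spread evenly over 4 banks
print(read_cycles([0, 4, 8, 12]))             # all four land in bank 0
```

Eight reads spread evenly finish in one cycle, which is how four 2R1W banks behave like an 8-read-port file; four reads that all map to the same bank need a second cycle, which is the stall.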

The CPU adds one more layer — the physical register file (PRF) + register renaming. There are only 16-32 architectural registers (fixed by the ISA), but Skylake’s PRF has 180 integer + 168 floating-point entries, and the rename table maps each instruction’s destination register to a free physical entry, resolving WAR/WAW false dependencies so the out-of-order window can open up.
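A minimal sketch of the rename step, to show why WAR/WAW hazards disappear. The class and method names are hypothetical, the register counts are tiny for readability, and returning physical registers to the free list at retirement is deliberately left out.

```python
class Renamer:
    """Toy rename table: architectural register -> physical register."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.table = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys_dst, phys_srcs)."""
        phys_srcs = [self.table[s] for s in srcs]  # sources use current mapping
        if not self.free:
            raise RuntimeError("no free physical register: rename stalls")
        phys_dst = self.free.pop(0)                # every write gets a fresh entry
        self.table[dst] = phys_dst
        return phys_dst, phys_srcs

renamer = Renamer(num_arch=8, num_phys=16)
program = [
    (1, [2, 3]),  # r1 = r2 op r3
    (4, [1, 5]),  # r4 = r1 op r5   (true dependency on the first r1)
    (1, [6, 7]),  # r1 = r6 op r7   (WAW with instr 0, WAR with instr 1)
]
for dst, srcs in program:
    print(renamer.rename(dst, srcs))
```

The two writes to r1 land in different physical entries, so the third instruction no longer has to wait for the first two; only the true dependency (the second instruction reading the first write) survives renaming.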

The GPU takes a completely different road — the H100’s single-SM register file is 256 KB (65536 32-bit entries), larger than the L1. It doesn’t chase speed through multi-port; instead it statically partitions the registers among dozens of resident warps. When one warp stalls on memory it instantly switches to the next, hiding latency through warp switching, and the register file just needs to be “big enough to go around” — which is exactly why GPU occupancy is limited by register usage.
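The register limit on occupancy is simple division. The sketch below uses the 65536-entry figure from the text; the cap of 64 resident warps per SM is an assumption, the per-thread register counts are hypothetical kernels, and real hardware also rounds allocations to a granularity that is ignored here.

```python
REGS_PER_SM = 65536      # 256 KB / 4 bytes per 32-bit register
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64    # assumed hardware cap on resident warps

def resident_warps(regs_per_thread):
    """Warps that fit in the register file, ignoring allocation granularity
    and other limits such as shared memory."""
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)

for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:3d} regs/thread -> {warps:2d} warps "
          f"({100 * warps // MAX_WARPS_PER_SM}% of the assumed cap)")
```

Doubling a kernel's register use halves the number of warps available to hide memory latency, which is the trade-off behind register-limited occupancy.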

[Diagram: physical register file (PRF) built from 4 banks of 2R1W D flip-flop cells (≈ 64 entries × 64 bit each) behind address decode + port arbitration (crossbar), synthesizing 8R4W. 8 read ports (R0-R7) feed ALU / AGU / FPU source operands; 4 write ports (W0-W3) take execution-unit writeback / load completion; a same-bank conflict stalls 1 cycle. CPU side: the rename table maps architectural registers r0-r31 to any free physical entry, breaking false dependencies. GPU side: the whole 256 KB file is statically split among dozens of resident warps; switching warp = swapping a base pointer, done in 1 cycle.]
The industrial implementation of a multi-port register file — rather than stacking a single giant 8R4W array, it is chopped into 4 small 2R1W banks, and the external multi-port is synthesized through crossbar arbitration. The bit cell uses flip-flops (blue read bitlines + red write bitlines, wordlines/bitlines growing linearly with port count, area ≈ N²). The CPU layers rename + PRF on top to break false dependencies; the GPU simply splits the whole file among many warps and hides latency through switching.

Among the on-die tiers, the register file is the one that is not a cache in any sense: it caches nothing; it is the very state that compute instructions operate on directly.

Cache 6T SRAM — The Common Substrate of L1/L2/L3 · Scaling Stalls at 5nm

The core of SRAM (Static RAM) is the 6T cell: 6 transistors forming a cross-coupled bistable loop that, as long as it is powered, holds a 0 or 1 indefinitely and needs no refresh. This is where its “static” comes from. Reading directly senses the level at the storage node, with extremely low latency (sub-nanosecond); writing forcibly flips the loop state, which is just a matter of a few transistor switches.

But the drawback is equally fatal: each bit takes 6 transistors, so density is low, area is large, and cost per unit capacity is sky-high. This is why SRAM can only be used in the most premium spots — the CPU’s L1/L2/L3 cache, the register file, the GPU’s shared memory and L2 cache. Reaching MB-scale capacity is already the upper limit.

High-resolution micrograph of the die surface of a decapped Mikron 16Mbit SRAM chip
The die surface of a decapped Mikron 1663RU1 16 Mbit SRAM chip (90 nm process) — the large area of regular grid texture is the SRAM cell array, and the irregularly shaped small blocks around the edges are the address decoders, sense amplifiers, and I/O circuits. Almost the entire die is SRAM array, a vivid illustration of “the 6T cell has low density and takes up a lot of space.” (Source: Wikimedia Commons, ZeptoBars, CC BY 3.0)

Drawing the SRAM 6T cell and the DRAM 1T1C cell side by side, you can immediately see “why SRAM is fast and expensive while DRAM is slow and cheap”:

[Diagram: SRAM 6T cell (cross-coupled bistable M1-M4 between VDD and GND, storage node Q, access transistors M5 / M6 on WL, BL, and BL̄) beside a DRAM 1T1C cell (access transistor M1 on WL / BL plus capacitor C to GND). SRAM: fast · no refresh · logic-process compatible · expensive · small capacity. DRAM: slow · needs refresh · separate process · cheap · large capacity.]
To store the same single bit, SRAM uses 6 transistors to build a cross-coupled bistable loop: locked the moment it is powered, needing no capacitor and no refresh, but eating area. DRAM uses just 1 transistor + 1 capacitor: the charge leaks, so every row must be refreshed within a window of tens of milliseconds (64 ms is the classic figure), but the cell is far smaller. This one diagram explains why SRAM goes in cache and DRAM goes in memory.

Process Details — 6T Physical Size · Peripheral Circuits · Write/Read Assist

The physical size of the 6T cell — the layout area of a 6T bit cell is the core metric for SRAM process. In Intel’s 22nm era the cell area was ≈ 0.092 μm², at 7nm ≈ 0.027 μm², TSMC N5 ≈ 0.021 μm², and N3 only shrank to ≈ 0.0199 μm² (just 5% smaller); the N2 GAA node is, per public data, about 0.0175 μm². SRAM scaling has essentially stalled — this is the root reason L3 capacity growth relies on stacking (3D V-Cache) rather than shrinking the cell.
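The stall is easiest to see as per-step shrink percentages. The sketch below uses only the cell areas quoted in the paragraph above (which mix Intel and TSMC nodes, so the steps are indicative rather than a like-for-like comparison); the function name is illustrative.

```python
# 6T bit-cell areas in um^2, as quoted in the text above.
cell_area_um2 = [
    ("22nm", 0.092),
    ("7nm", 0.027),
    ("N5", 0.021),
    ("N3", 0.0199),
    ("N2", 0.0175),
]

def shrink_percent(old, new):
    """How much smaller `new` is than `old`, in percent."""
    return (1.0 - new / old) * 100.0

for (prev_name, prev), (name, area) in zip(cell_area_um2, cell_area_um2[1:]):
    print(f"{prev_name} -> {name}: {shrink_percent(prev, area):.1f}% smaller")
```

The N5 to N3 step comes out at about 5%, against roughly 22% for the 7nm to N5 step before it.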

Peripheral circuits eat half the area — within the array, only the cell itself is effective storage. The ring of peripheral logic — row decoder, wordline driver, column mux, sense amplifier, precharge, I/O — often takes 30% to 50% of the total macro area. L1 uses small macros (a few KB per block) with a high peripheral ratio and high speed; L3 uses large macros (hundreds of KB per block) to amortize the peripheral overhead, high density but slow access.
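To see what the peripheral overhead means in silicon, here is a rough area estimate for a cache data array. It assumes the N5 cell area quoted earlier and an array efficiency of 50% to 70% (the complement of the 30% to 50% peripheral share); tags, ECC bits, redundancy, and routing between macros are ignored, so real caches are larger. The function name is illustrative.

```python
MIB = 1024 * 1024

def sram_area_mm2(capacity_bytes, cell_um2, array_efficiency):
    """Estimated macro area for the data bits only.

    array_efficiency = cell-array area / total macro area (between 0 and 1).
    """
    bits = capacity_bytes * 8
    cell_array_um2 = bits * cell_um2
    return cell_array_um2 / array_efficiency / 1e6  # um^2 -> mm^2

for efficiency in (0.7, 0.5):
    area = sram_area_mm2(32 * MIB, cell_um2=0.021, array_efficiency=efficiency)
    print(f"32 MiB at {efficiency:.0%} array efficiency: {area:.2f} mm^2")
```

Under these assumptions, 32 MiB of data bits alone needs on the order of 8 to 11 mm², and dropping the efficiency from 70% to 50% costs about 3 mm² with no gain in capacity.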

Assist circuits, mandatory once cells shrink: at advanced nodes, Vt mismatch, leakage, and read/write noise margin all degrade, and SRAM must pair with write assist (negative bitline NBL, wordline boost, or a VDD collapse that momentarily pulls down the cell supply to ease writes) and read assist (wordline underdrive, which lowers the wordline level during reads to prevent read disturbs from flipping the cell) to work at low Vdd. GPU SMEM and register files commonly use the 8T cell to support multi-port / low voltage (an independent read port, trading density for noise margin).

The impact of FinFET to GAA — in the FinFET era the fin count dictates drive strength, and the fin ratio of the PU/PD/PG transistors inside the 6T directly sets the β/γ ratio. After N3 the move to GAA nanosheets makes the sheet width continuously tunable, in theory restarting SRAM scaling, but in practice the N2 gain is still quite limited.

ECC is nearly universal — modern CPU L2/L3 commonly carries SECDED (single-bit correct, double-bit detect) or the stronger DECTED, and L1 data often carries parity too; adding 8 bits of ECC per 64 bits of data is already an industry standard — the price extracted by the soft error rate and Vt jitter below 5nm.
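The "8 bits per 64" figure falls out of the Hamming bound. The sketch below only counts check bits for an extended Hamming (SECDED) code; it does not encode or decode anything, and the function name is illustrative.

```python
def secded_check_bits(data_bits):
    """Check bits needed for single-error-correct, double-error-detect.

    A Hamming code needs r check bits with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (32, 64, 128):
    check = secded_check_bits(width)
    print(f"{width} data bits -> {check} check bits "
          f"({100 * check / width:.1f}% overhead)")
```

For 64 data bits this gives 7 + 1 = 8 check bits, the familiar 72-bit word; wider words amortize the overhead better, which is one reason some designs protect 128 bits at a time.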

[Diagram, left: top view of an SRAM macro showing the area split. 6T cell array ≈ 62%; row decoder + wordline driver ≈ 12%; sense amp · column mux · precharge; I/O + write assist ≈ 10%; peripheral total ≈ 38%. Right: bit-cell area by node in μm² (smaller is denser): 22nm 0.092, 14nm 0.059, 10nm 0.031, N7 0.027, N5 0.021, N3 0.020.]
Left — the top-view footprint of one SRAM macro: the orange center is the 6T cell array (about 60% of the area), and the blue/green/red around it are, in turn, the row decode and wordline driver, the sense amp and column mux, and the I/O and write-assist peripheral circuits, together taking 30 to 40%. Right — bit-cell area across process nodes: from 0.092 μm² at 22nm all the way down to 0.021 μm² at N5, but N3 only shrinks further to 0.0199 μm² (just 5% less) — this is the process backdrop for 3D V-Cache and stacked cache replacing single-layer scaling.

The Two Ends, L1 and L3 — Two Tunings of the Same SRAM

Though both are called SRAM, both are 6T cells, and both use the same process node, the SRAM of an L1 cache and the SRAM of an L3 cache are built noticeably differently. L1 is the “fast” end of this SRAM spectrum, L3 the “dense” end:

[Diagram: same SRAM 6T cell, two tunings. High-cost end (large and sparse, for speed): larger transistors for strong drive and fast charge/discharge; loaded assist circuits with multi-port + dense sense amps; small capacity with short addressing wires and low latency. L1 ≈ tens of KB · ≈ 1 ns · priciest per area. Low-cost end (small and dense, for capacity): transistors as small as the process allows; lean shared assist circuits reused across cells; large capacity for a high backstop hit rate. L3 ≈ tens of MB · ≈ 10 ns · cheapest per area.]
Left: the SRAM cell of an L1 cache — larger transistors, wider spacing, and multi-port assist circuits, traded for speed; large area, can’t be made big. Right: the SRAM cell of an L3 cache — cells made as small as possible, packed tightly, with shared peripherals, traded for high density and large capacity; an order of magnitude slower. Same 6T circuit, same process node, just engineers picking a different trade-off point.

L1’s “extravagance” shows up in three things — larger transistors (strong drive current, fast charge/discharge), loaded assist circuits (multi-port, dense sense amps, with some designs simply using 8T so reads and writes don’t interfere), and capacity deliberately kept small (SRAM access latency rises as capacity grows, so L1 is intentionally only tens of KB — not because it can’t be made bigger, but because it is actively kept small for speed). L3 goes the other way — transistors as small as possible, cells packed as tight as possible, peripheral circuits lean and shared, traded for large capacity and low cost, at the price of latency reaching the ≈10 ns scale.

It is precisely because the 6T cell can no longer shrink that large caches can only grow by stacking (AMD 3D V-Cache, Apple/Intel chiplet L3) rather than by a more advanced node.

3D V-Cache — Hybrid Bonding Stacks L3 On Top

Making SRAM into a separate die stacked on the CCD relies on one key process — hybrid bonding (Cu-Cu direct bonding): the copper pads of the two dies are polished to atomic-level flatness, aligned and pressed together at room temperature, then heated so the copper atoms inter-diffuse and grow directly together. No microbumps (solder balls), no underfill, with a bonding pitch of ≈ 9 μm — more than 10× denser than HBM’s microbump TSV stacking, with far lower parasitic capacitance, essentially equivalent to on-die interconnect. This is TSMC’s SoIC process; Intel’s counterpart is Foveros Direct.
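Pitch translates into connection density as the inverse square. The sketch below assumes pads on a simple square grid and a microbump pitch of about 55 μm for comparison (an assumed typical figure, not a value for any specific product); the function name is illustrative.

```python
def pads_per_mm2(pitch_um):
    """Connections per mm^2 for pads laid out on a square grid."""
    pads_per_mm = 1000.0 / pitch_um
    return pads_per_mm ** 2

hybrid = pads_per_mm2(9)      # hybrid bonding pitch from the text
microbump = pads_per_mm2(55)  # assumed microbump pitch for comparison

print(f"hybrid bond: {hybrid:,.0f} pads/mm^2")
print(f"microbump  : {microbump:,.0f} pads/mm^2")
print(f"ratio      : {hybrid / microbump:.1f}x")
```

A roughly 6× finer pitch gives a roughly 37× higher areal density, comfortably past the "more than 10×" mark.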

AMD’s three-generation evolution: the 2022 Zen 3 5800X3D debuted it, stacking a 64 MB L3 SRAM die on top of the 8-core CCD, expanding L3 from 32 MB to 96 MB and sharply raising hit rates in gaming. But with the V-Cache sandwiched between the CCD and the heatsink, thermal resistance rose, and the frequency was forced down to 4.5 GHz with overclocking locked out. The 2023 Zen 4 7800X3D / 7950X3D kept the V-Cache on top of the CCD, with the 7800X3D boosting to 5.0 GHz. The late-2024 Zen 5 9800X3D flipped the direction, moving the V-Cache below the CCD so the CCD touches the IHS directly for cooling, raising frequency to 5.2 GHz and unlocking full-core overclocking for the first time.

Not every product does it, because hybrid bonding is yield-sensitive and an extra SRAM die isn’t cheap, so it only pays off on niche SKUs like gaming / HPC that are extremely sensitive to L3 hit rate.

[Diagram: two 3D V-Cache package cross-sections. Left: V-Cache die (64 MB L3 SRAM, N7 process, ≈ 41 mm²) Cu-Cu bonded on top of the 8-core CCD (32 MB L3, TSVs through the CCD for signal / power), sitting under the heatsink / IHS, so the cache die raises thermal resistance. Right: the order reversed, with the CCD on top touching the IHS directly and the V-Cache die below it, same Cu-Cu hybrid bond. Process note: TSMC SoIC (used by AMD) · Intel counterpart Foveros Direct · HBM microbump pitch ≈ 55 μm versus hybrid bond ≈ 9 μm.]
3D V-Cache cross-section: left, Zen 3, V-Cache on top of the CCD, blocking the thermal path; right, the reversed arrangement introduced with the Zen 5 9800X3D, where the V-Cache sinks down and the CCD touches the IHS directly. Both arrangements use TSMC SoIC’s Cu-Cu hybrid bonding, bonding pitch ≈ 9 μm, no microbumps and no underfill, far denser than HBM’s TSV + microbump stacking.

In one sentence: V-Cache routes around the dead-end of “advanced processes can no longer shrink SRAM” in the vertical direction using hybrid bonding — at the price of yield sensitivity and high cost, appearing only on the SKUs most hungry for L3 hit rate.

DDR — Motherboard-Slot DRAM · On-Die ECC Since DDR5

DDR is the memory stick that goes into a motherboard slot in a desktop / server. DDR4 (2014) peaked at a data rate of 3200 MT/s; DDR5 (2020 standard) starts at 4800 and now ships at 5600 / 6400, with the JEDEC roadmap pointing straight at 8400+ MT/s. The most critical internal changes in DDR5: the traditional 64-bit channel is split into two independent 32-bit sub-channels, each with its own command/address, improving concurrency for small accesses; on-die ECC is made mandatory (to mask the bit flips brought by cell miniaturization); the PMIC moves from the motherboard onto the DIMM itself, for steadier fine-grained power delivery.
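Those MT/s figures convert to peak bandwidth with one multiplication. The sketch below counts data bits only (ECC bits excluded) and gives the theoretical peak for one DIMM channel; sustained bandwidth is lower because of refresh, bank conflicts, and read/write turnaround. The function name is illustrative.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_bits=64):
    """Theoretical peak in GB/s: transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

for name, rate in [("DDR4-3200", 3200), ("DDR5-4800", 4800),
                   ("DDR5-6400", 6400), ("DDR5-8400", 8400)]:
    print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s per 64-bit channel")

# One DDR5 sub-channel is 32 data bits wide.
print(f"DDR5-4800 sub-channel: {peak_bandwidth_gbs(4800, bus_bits=32):.1f} GB/s")
```

Splitting the channel into two 32-bit sub-channels does not change the per-DIMM total; it gives the controller two independent command streams, each carrying half of it.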

Unlike logic chips, DRAM does not march to a 3 nm node: the heart of the cell is a capacitor that has to hold enough charge to be read reliably. So DRAM has its own fab lines and node codenames: 2016 to 2019 saw 1x / 1y / 1z (about 18 / 17 / 16 nm equivalent); from 2021 came 1α (the three makers ranging from 18 to 14 nm); 2022 to 2023 brought 1β with heavier EUV use; 2024 to 2025 brings 1γ (Samsung and SK Hynix’s first batch with full EUV on key layers). The difficulty isn’t shrinking the linewidth but making the capacitor three-dimensional (deep trench / high aspect ratio): the deeper the etch and the smaller the cell, the harder the process.

DIMM form factors are also diverging: UDIMM for consumers (no buffer, direct connection); RDIMM for servers (command/address goes through the RCD register buffer); LRDIMM adds buffers on the data lines too; MRDIMM (2024+) does 2:1 multiplexing on the DIMM, doubling external bandwidth to 8800 MT/s, with a single stick reaching 256 GB; the new laptop form factor CAMM2 is specified in separate DDR5 and LPDDR5X variants that are not pin-compatible with each other. Consumer capacity has gone from 32 GB / DIMM up to 64 GB, and server RDIMM mainstream is now 128 GB+.

[Figure: DDR5 RDIMM top view · one DIMM = two independent 32-bit sub-channels. Sub-channel A and sub-channel B each carry 32 data bits + 8 ECC bits across four DRAM dies (x4 / x8, 1γ node); RCD (command/address register buffer) in the center; PMIC (1.1 V, power management on the DIMM), SPD EEPROM (timing / capacity / vendor), and temperature sensor alongside; on-die ECC built into the dies. Key changes DDR4 → DDR5: 3200 → 4800 to 8400+ MT/s, dual sub-channel, PMIC on DIMM. Edge connector: 288 pins with a center keying notch, CH-A DQ[0:31] + ECC[0:7] on the left, CH-B DQ[0:31] + ECC[0:7] on the right.]
DDR5 RDIMM top view. DDR5 splits the traditional 64-bit channel into two independent 32-bit sub-channels (orange / blue), each side with 4 DRAM dies and 8 ECC bits carried alongside its 32 data bits (this side-band ECC is separate from the on-die ECC built into every die), the green RCD in the middle registering command/address, and the red PMIC at lower left moving power management from the motherboard onto the DIMM, with the SPD EEPROM and temperature sensor beside it. The 288-pin edge connector has a center keying notch, with one sub-channel’s data pins on each side of it.

At the form-factor level, MRDIMM adds a layer of 2:1 multiplexing in the DIMM’s buffer chips to double the external rate again, which is the dominant theme of server memory after 2024.

LPDDR — Low-Power DRAM for Mobile · Hugging the SoC Ever Closer

The storage cell of LPDDR is identical to desktop DDR — both are 1T1C, with the differences entirely in voltage, refresh strategy, and interface. LPDDR4 (2014) → LPDDR4X (2017, Vddq dropped from 1.1 V to 0.6 V) → LPDDR5 (2019) → LPDDR5X (2022, 8533 MT/s) → SK Hynix’s LPDDR5T (2023, 9600 MT/s) → LPDDR6 (standard under development 2024-2025, targeting 12800+ MT/s). Each generation mainly stretches the data rate, lowers the voltage, and adds low-power hooks.

The low-power mechanisms are its soul: Deep Sleep / Deep Power Down shuts off most circuits when idle, keeping only the necessary state; Partial Array Self-Refresh (PASR) refreshes only the banks that still need to retain data, leaving empty banks unrefreshed; Temperature-Compensated Self-Refresh (TCSR) dynamically adjusts refresh frequency by temperature (at low temperatures the capacitor leaks slowly, so the refresh interval can be stretched); since LPDDR5 there is also Sub-Bank — each bank is split into two groups internally, allowing finer-grained parallel access, lowering power + improving bandwidth utilization.

Packaging follows three routes — phones use PoP (Package-on-Package), with LPDDR dies stacked directly on top of the SoC; laptops use the newly arrived CAMM2 (2023+), making LPDDR into a mezzanine module attached to the motherboard, achieving replaceable + large capacity + low latency to replace SO-DIMM; Apple’s M series and Intel’s Lunar Lake simply build the LPDDR dies into the SoC package itself, sharing the fabric (so-called unified memory) — the CPU / GPU / NPU all access the same block of LPDDR, eliminating data copies.

[Figure: three LPDDR packaging form factors, hugging the SoC ever closer. Left: PoP (phone), 8 to 24 GB of LPDDR stacked right on the SoC; most area-efficient, not replaceable. Middle: CAMM2 (laptop), a single-sided, removable mezzanine module with short traces beside the CPU; a compromise between replaceability and near-soldered performance. Right: in-SoC integration (Apple M / Lunar Lake), LPDDR on the same package substrate as the CPU + GPU + NPU with a shared fabric; shortest traces, shared address space.]
The three LPDDR packaging routes, left to right, hug the SoC ever closer. Left: PoP — in a phone the LPDDR dies are stacked directly on top of the SoC, soldered down, with the shortest traces and the smallest motherboard footprint, but never upgradable. Middle: CAMM2 — the new laptop standard since 2023, making LPDDR into a single-sided mezzanine module attached to the motherboard, both replaceable and capable of high frequency. Right: Apple’s M series and Intel’s Lunar Lake co-package the LPDDR dies right with the SoC, with the CPU / GPU / NPU sharing the same physical memory, eliminating cross-chip copies — this is the hardware meaning of “unified memory.”

The LPDDR roadmap looks similar to DDR’s on the surface but is fundamentally different — DDR optimizes for “replaceable + large capacity,” LPDDR for “close to the SoC + bandwidth per watt.” So the same 1T1C cell, in packaging philosophy, runs from PoP all the way to unified memory, hugging closer and closer.

GDDR — Consumer Graphics VRAM · GDDR7 PAM3 Runs 32 Gbps

GDDR is the VRAM solution for consumer graphics cards — independent DRAM dies soldered onto the PCB around the GPU. The cell is no different in essence from DDR; the difference is entirely in the interface process. Generational leaps are driven almost entirely by “signal modulation.”

GDDR6 (2018, made by all three of Samsung / Micron / SK Hynix) — data rate 14 to 18 Gbps, NRZ two-level signaling (high / low = 1 bit / symbol), BL16, 8 / 16 Gb per die.

GDDR6X (from 2020) is supplied exclusively by Micron for NVIDIA’s RTX 30 / 40 series and runs 19 to 24 Gbps. It switches to PAM4 four-level signaling (2 bit / symbol), which keeps the symbol rate the same but moves 1 more bit per symbol. The cost is that four levels must fit into the same voltage swing, so the spacing between adjacent levels shrinks to a third of NRZ’s, the SNR degrades sharply, and power rises.

GDDR7 (2024-2025) — 28 to 32+ Gbps, with JEDEC switching to PAM3 three levels (about 1.5 bit / symbol). It looks like a step back from PAM4, but the three-level spacing is 50% wider, so the SNR and bit error rate are both better, and at equal bandwidth power is actually lower. BL32, 16 / 24 Gb per die.

The difference from HBM — GDDR is planar packaging, PCB-soldered, with a 32-bit interface per die; a typical RTX 4090 uses 12 dies to form a 384-bit bus; HBM is vertical stacking + silicon interposer, 1024-bit per stack. GDDR is 4 to 10× cheaper per unit bandwidth, so consumer cards can afford it — which is why gaming cards use GDDR throughout, and only AI accelerators go to HBM.

The PCB cost — at signal rates of 30 Gbps, traces must be extremely short (within a few centimeters of the GPU) and strictly length-matched; the RTX 4090’s VRAM subsystem alone draws about 100 W.

[Figure: signal modulation across three GDDR generations. GDDR6 · NRZ · 2 levels · 1 bit / symbol · 14-18 Gbps. GDDR6X · PAM4 · 4 levels · 2 bit / symbol · 19-24 Gbps (Micron exclusive for NVIDIA RTX 30 / 40). GDDR7 · PAM3 · 3 levels · ≈1.5 bit / symbol · 28-32+ Gbps. Trend: once frequency tops out, add bandwidth via modulation.]
The essential difference among the three modulation schemes: NRZ carries 1 bit per symbol, so the only way to add bandwidth is to switch faster; PAM4 splits the voltage into four levels and carries 2 bits per symbol, but adjacent levels sit so close together that a small disturbance can cause a misread; GDDR7 switches to PAM3’s three levels, and although each symbol carries only about 1.5 bits (fewer than PAM4), the level spacing is 50% wider, the SNR and bit error rate are both better, and at equal bandwidth power is actually lower. This is why the move from GDDR6X to GDDR7 looks like a “step back” in level count.
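
The trade-off between bits per symbol and level spacing can be checked with a few lines of arithmetic. This is an illustrative sketch under two stated assumptions: all three schemes share the same total voltage swing, and their levels are evenly spaced. The names `SCHEMES`, `level_spacing`, and `symbol_rate_gbaud` are hypothetical, and the 32 Gbps target is simply the GDDR7 figure quoted above.

```python
# name: (voltage levels, data bits carried per symbol)
SCHEMES = {
    "NRZ": (2, 1.0),
    "PAM3": (3, 1.5),  # roughly 1.5 bits per symbol, as quoted in the text
    "PAM4": (4, 2.0),
}


def level_spacing(levels: int, swing: float = 1.0) -> float:
    """Gap between adjacent levels when `levels` evenly share one voltage swing."""
    return swing / (levels - 1)


def symbol_rate_gbaud(data_rate_gbps: float, bits_per_symbol: float) -> float:
    """Symbol rate one pin needs in order to reach the given data rate."""
    return data_rate_gbps / bits_per_symbol


for name, (levels, bits) in SCHEMES.items():
    print(f"{name}: spacing {level_spacing(levels):.3f} of the swing, "
          f"{symbol_rate_gbaud(32.0, bits):.1f} GBaud for 32 Gbps per pin")

pam3_vs_pam4 = level_spacing(3) / level_spacing(4)
print(f"PAM3 spacing is {pam3_vs_pam4:.2f}x the PAM4 spacing")
```

Under the equal-swing assumption, PAM3’s gap is half the swing and PAM4’s is a third, which is where the “50% wider” figure comes from; in exchange, PAM3 needs a higher symbol rate than PAM4 for the same data rate.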

In one sentence: from GDDR6 onward, GDDR’s generational leaps no longer rely on “a smaller process,” but on signal modulation schemes squeezing the bandwidth limit out of the same PCB trace.

HBM — TSV Stacking + Interposer · Base Die Goes Logic-Process Since HBM4

HBM (High Bandwidth Memory) is the most process-intensive branch of the DRAM family. Its storage cell is no different in essence from ordinary DRAM; the expense is in “how it is stacked”:

  • multiple DRAM dies (4-16 layers) stacked vertically;
  • Through-Silicon Vias (TSVs) drilled through the chip to make vertical inter-layer connections;
  • the whole stack is co-packaged tightly against the GPU through a silicon interposer;
  • an extremely wide interface (1024 bit per stack, far beyond GDDR’s 32 bit / chip).

Cut open HBM’s physical structure and you can see directly “why the bandwidth is so high”:

[Figure: cross-section of an HBM stack + silicon interposer + GPU co-package. Center: GPU die (compute cores + L2, tens of billions of transistors, advanced 3-4 nm node). Either side: an HBM stack of DRAM layers on a base die with a 1024-bit bus, linked vertically by TSVs (Through-Silicon Vias). Beneath: the silicon interposer (a few-hundred-μm-thick silicon die carrying thousands of fine metal traces), then the package substrate and the Ball Grid Array (BGA) down to the motherboard PCB.]
Cross-section of an HBM stack + GPU + silicon interposer co-package: the orange die in the center is the GPU, and each blue “column” on either side is one HBM stack (4-16 DRAM dies + 1 base die layer). The red vertical lines are TSVs (Through-Silicon Vias) that route each layer’s signals straight down to the base die at the bottom. HBM and the GPU are interconnected by thousands of traces on the silicon interposer below (green); because the traces are extremely short and the interface is extremely wide (1024 bit per stack), aggregate bandwidth across several stacks can reach several TB/s. This is the physical reason HBM is expensive but fast.

Generational Numbers + Base Die Revolution — HBM1 to HBM4, 16× in a Decade

The cross-section explains “why it is so fast,” but not how many times faster it got over a decade, where the expense lies across the process steps, or why the three suppliers’ yields differ so wildly.

Generational numbers: per-stack bandwidth rose ≈ 16× in a decade.

  • HBM1 (2015, Fiji): 128 GB/s · 1 GB per stack (4 GB across Fiji’s four stacks) · 4 layers · 1 Gbps/pin;
  • HBM2 (2016): 256 GB/s · 8 GB · 8 layers · 2 Gbps;
  • HBM2e (2019): 460 GB/s · 16 GB · 8 to 12 layers · 3.6 Gbps;
  • HBM3 (2022, H100): 819 GB/s · 24 GB · 12 layers · 6.4 Gbps;
  • HBM3e (2024, H200/B200): 1.2 TB/s · 36 GB · 12 layers · 9.2 Gbps;
  • HBM4 (2026, JEDEC JESD270-4): 2 TB/s · 48 GB · 16 layers · 8 Gbps/pin. The pin rate actually drops, and bandwidth is propped up by doubling the bus to 2048-bit/stack.

TSV process — diameter shrinks from ≈ 10 μm to ≈ 6 μm, pitch from ≈ 40 μm to ≈ 25 μm; a 16-Hi stack has to drill 1024 data + a few hundred control + test, totaling thousands of TSVs, and any single open via scraps the whole stack — this is the core yield bottleneck.

The base die revolution — from HBM4, the base die shifts from a DRAM process to a logic process (TSMC N5/N3), able to integrate a custom controller or even compute-in-memory units — NVIDIA and AMD have begun three-way co-design directly with foundry and memory maker, marking HBM’s move from “generic commodity” to “customized co-design.”

The three CoWoS variants: TSMC’s CoWoS-S (full silicon interposer, used by H100) / CoWoS-L (LSI bridges, cost-reduced, placing small silicon dies only at the HBM-GPU interfaces, used by B200) / CoWoS-R (RDL redistribution, thin profile). CoWoS capacity is the bottleneck and the physical root of NVIDIA’s long-running H100/B200 shortages, since a single 12-inch wafer yields only dozens of large interposers.

Cooling and bonding: once 16 layers are stacked, heat from the middle layers can’t escape. SK Hynix uses MR-MUF (Mass Reflow Molded Underfill: the whole stack is reflowed in a single pass, then liquid molding compound is injected between the dies as underfill), giving good cooling and high yield; Samsung uses NCF (Non-Conductive Film, bonding with a film layer by layer), which suffered on HBM3e yield and kept it from landing NVIDIA’s big orders. The supplier landscape: SK Hynix supplies most of the H100/H200/B200 orders, Micron broke through with HBM3e to grab a slice, and Samsung is betting on catching up with HBM4.

[Figure: HBM generational evolution · per-stack bandwidth (GB/s, 0 to 2000) by year. HBM1 128 GB/s (2015, Fiji) · HBM2 256 GB/s (2016, V100) · HBM2e 460 GB/s (2019, A100) · HBM3 819 GB/s (2022, H100) · HBM3e 1.2 TB/s (2024, H200 / B200) · HBM4 2 TB/s (2026, 2048-bit bus). Process annotations: TSV diameter ≈ 10 μm → ≈ 6 μm and pitch ≈ 40 μm → ≈ 25 μm, thousands of TSVs per stack; HBM4 base die shifts from a DRAM process to a TSMC N5/N3 logic process and can integrate a custom controller; bonding process: SK Hynix MR-MUF · Samsung NCF · Micron.]
A decade of HBM generational evolution: the horizontal axis is time (2015 to 2026), the vertical axis per-stack bandwidth. Orange dots are shipping generations, the red dot is HBM4 (2026, JEDEC standard JESD270-4); per-stack bandwidth grew ≈ 16× in a decade, yet HBM4’s pin rate actually drops from 9.2 Gbps to 8 Gbps, with bandwidth propped up to 2 TB/s by doubling the bus width to 2048-bit/stack. The three lines of annotation at the bottom are the parallel process threads: TSVs drilled ever finer and denser, the base die switching to a logic process from HBM4, and the bonding process directly determining the suppliers’ yield ranking.
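
The per-stack numbers can be reproduced from just two inputs per generation, pin rate and bus width. The sketch below is a minimal check of the figures quoted in this section, not a model of achievable throughput. `HBM_GENERATIONS` and `stack_bandwidth_gbs` are hypothetical names, and the 1024-bit width for every generation before HBM4 is taken from the per-stack interface width stated earlier.

```python
# (generation, pin rate in Gbps, bus width in bits per stack)
HBM_GENERATIONS = [
    ("HBM1", 1.0, 1024),
    ("HBM2", 2.0, 1024),
    ("HBM2e", 3.6, 1024),
    ("HBM3", 6.4, 1024),
    ("HBM3e", 9.2, 1024),
    ("HBM4", 8.0, 2048),  # slower pins, twice the bus width
]


def stack_bandwidth_gbs(pin_rate_gbps: float, bus_bits: int) -> float:
    """Peak per-stack bandwidth in GB/s: every pin moves pin_rate_gbps gigabits per second."""
    return pin_rate_gbps * bus_bits / 8


for name, rate, width in HBM_GENERATIONS:
    print(f"{name:6s} {rate:4.1f} Gbps x {width:4d} bit = "
          f"{stack_bandwidth_gbs(rate, width):7.1f} GB/s")

first = stack_bandwidth_gbs(*HBM_GENERATIONS[0][1:])
last = stack_bandwidth_gbs(*HBM_GENERATIONS[-1][1:])
print(f"HBM1 to HBM4: {last / first:.0f}x")
```

Holding the bus at 1024 bit, 8 Gbps pins would land below HBM3e; the doubling to 2048 bit is what carries HBM4 to 2 TB/s.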

In one sentence: HBM is not “faster DRAM,” but “an engineering marvel that uses advanced packaging to wire ordinary DRAM into an ultra-wide bus” — and this engineering chain (TSMC CoWoS + SK Hynix MR-MUF + custom base die) is precisely the deepest bottleneck in today’s supply of AI compute.

3D NAND Flash — Charge Trap + String Stacking + CMOS Bonding

The core mechanism of NAND Flash (used in SSDs and USB sticks) is the floating-gate transistor: a conductive “island” wrapped in insulating oxide. Writing uses a high voltage to inject electrons into the floating gate; once the voltage is removed, the insulating layer keeps the electrons locked inside, and they stay put even when power is off. This is the root of its “non-volatility.” Stored electrons raise the transistor’s threshold voltage, so reading applies a reference voltage to the gate and senses whether the cell conducts, which tells a 0 from a 1. The “NAND” in the name refers to the cell’s connection topology (cells strung in series into a long chain, like the transistors of a NAND gate), so reading a single cell requires first “opening” the whole string, and random reads are not fast.

NAND has a few unique characteristics: writing is much slower than reading (injecting electrons is slow, tens to hundreds of microseconds); writing has a lifespan (the high voltage stresses the insulating layer, leaving a bit of damage each time, and a cell ages out after a few thousand to tens of thousands of program/erase cycles); and it cannot be modified in place (NAND can only “erase then write,” and erasing is done by “block,” so changing one byte means erasing the whole block first).
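
The “erase then write” constraint is easiest to see in a toy model. The sketch below is an illustration only: `NandBlock` and `update_byte` are hypothetical names, the 4-page by 4-byte geometry is far smaller than any real device, and the naive read-modify-write shown here is the worst case (real SSD controllers usually write the updated data to a fresh page and remap it instead).

```python
class NandBlock:
    """Toy NAND block: programmed page by page, erased only as a whole block."""

    def __init__(self, pages: int = 4, page_size: int = 4):
        self.page_size = page_size
        self.pages = [None] * pages  # None marks an erased, programmable page
        self.erase_count = 0

    def erase(self) -> None:
        """Erase every page at once and record one unit of wear."""
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def program(self, index: int, data: bytes) -> None:
        """Program one erased page; a programmed page cannot be overwritten."""
        if self.pages[index] is not None:
            raise ValueError("page already programmed; erase the block first")
        if len(data) != self.page_size:
            raise ValueError("data must fill exactly one page")
        self.pages[index] = bytes(data)


def update_byte(block: NandBlock, page: int, offset: int, value: int) -> None:
    """Change one byte via read-modify-write of the whole block."""
    snapshot = list(block.pages)          # 1. read every page out
    modified = bytearray(snapshot[page])  # 2. patch the one byte in a copy
    modified[offset] = value
    snapshot[page] = bytes(modified)
    block.erase()                         # 3. erase the whole block
    for i, data in enumerate(snapshot):   # 4. program every used page back
        if data is not None:
            block.program(i, data)


block = NandBlock()
block.program(0, b"abcd")
block.program(1, b"efgh")
update_byte(block, 0, 1, ord("X"))
print(block.pages[0], block.pages[1], "erases:", block.erase_count)
```

One changed byte costs a full block erase plus a rewrite of every live page, which is why both write speed and endurance suffer.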

The turning point in NAND Flash process evolution was around 2013 — planar (2D) NAND shrank to a dozen-odd nanometers and hit a physical limit: cells interfered with each other, lifespan dropped, and charge retention time fell short. The solution was to “stand the whole cell up,” switching to 3D NAND — no longer shrinking on the plane but stacking the storage cells layer by layer, sharing one vertical channel.

[Figure: 2D NAND vs. 3D NAND. Left: 2D NAND planar layout, a flat grid of cells that expands capacity by shrinking the cell, hits a physical wall at ≈ 15 nm (cell interference, lifespan drop), and stopped evolving around 2013. Right: 3D NAND vertical stack cross-section, one vertical channel between the bitline (BL) and source line (SL) passing through wordlines WL0 to WL7, one wordline per layer; 200+ layers stacked, now at 321. Orthogonal dimension: SLC 1 b · MLC 2 b · TLC 3 b · QLC 4 b · PLC 5 b (↑ capacity · ↓ speed / lifespan).]
Left: 2D NAND is a single-layer planar cell, expanding capacity by shrinking size, hitting a wall at 15 nm. Right: 3D NAND stands the storage cells up, with one vertical channel running through all layers, one wordline per layer, forming a cell at each intersection — now stacked to 321 layers. 3D NAND and “how many bits per cell” (SLC/TLC/QLC) are two orthogonal process dimensions that together determine an SSD’s tier.

Three Orthogonal Process Dimensions — FG/CT · String Stacking · CUA/CBA

By 2024, “standing it up” alone was no longer enough for 3D NAND — internally it must also solve three things: how charge is stored, how to stack hundreds of layers, and where to put the peripheral CMOS.

Floating gate → charge trap (FG → CT): some early 3D NAND still used the 2D-era floating gate (FG), a small island of conductive polysilicon that locks in electrons. But once the layer count is high, the parasitic capacitive coupling between adjacent floating gates becomes severe, and cell-to-cell interference eats into the threshold window. Samsung’s V-NAND used charge trap (CT) from the start, SK Hynix followed, and Micron moved over with its replacement-gate generations. CT replaces the conductive silicon island with a layer of insulating silicon nitride (SiN) and traps electrons in dielectric defects so they can’t escape, which means less leakage, stronger read-disturb immunity, and a thin film better suited to vertical deposition.

String stacking (segmented stacking): a 300+ layer channel hole with a 60:1 aspect ratio can’t be etched in one pass — the edge verticality collapses and the alignment accuracy can’t hold. The trick is to stack one segment, etch the channel, then stack a second segment and splice the channel. Samsung’s V8 is 2-stack 236 layers, SK Hynix’s V9 is 3-stack 321 layers (in production 2024), Samsung V9 290+ layers, Micron G9 276 layers, YMTC X4-9070 294 layers. Each added segment accumulates alignment error, the wordline resistance grows linearly with layer count, and the erase voltage has to be raised to match.

CUA vs CBA (where the peripheral CMOS goes): CUA / CuA (CMOS-under-Array, Micron / Intel; SK Hynix calls its version PUC) tucks the peripheral logic (decoders, sense amps) beneath the NAND array on the same wafer, saving ≈25% area. CBA / Xtacking (Kioxia / YMTC) takes the other road: the CMOS is made on a separate wafer, then wafer-on-wafer bonded on. The two can be optimized independently: logic uses an advanced node to cut latency, NAND uses a thick-film process to retain charge, at the cost of running two production lines.

TLC vs QLC: QLC has to divide the same charge window into 16 levels, with more verify steps per program, writing 2 to 3× slower, and program/erase endurance drops to ≈1000 PE cycles (TLC is 3000 to 5000). So consumer SSD mainstream is still TLC (1 to 8 TB), enterprise eTLC reaches 30+ TB per drive, and single-die capacity is 1 to 2 Tb (2024).
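
Why each extra bit hurts so much follows from counting levels. The sketch below assumes, purely for illustration, that all cell types share one fixed threshold-voltage window and space their levels evenly inside it (real devices do neither exactly). `CELL_TYPES`, `levels`, and `level_gap` are hypothetical names.

```python
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}


def levels(bits_per_cell: int) -> int:
    """Distinct charge levels needed to encode the given number of bits."""
    return 2 ** bits_per_cell


def level_gap(bits_per_cell: int, window: float = 1.0) -> float:
    """Gap between adjacent levels when they evenly share one threshold window."""
    return window / (levels(bits_per_cell) - 1)


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s) per cell, {levels(bits):2d} levels, "
          f"gap {level_gap(bits):.3f} of the window")

print(f"TLC gap / QLC gap = {level_gap(3) / level_gap(4):.2f}")
```

Going from TLC to QLC adds a third more capacity per cell but more than halves the gap between levels under this model, which is consistent with the extra verify steps and lower endurance described above.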

[Figure: two peripheral-CMOS integration routes. Left: CUA · CMOS-under-Array, stacked top-to-bottom on one wafer: the 3D NAND array (200 to 321 layers) sits above the CMOS peripheral logic (decoder / sense amp / page buffer) on one silicon substrate, ≈ 25% area saved. Pro: done in one pass on one wafer, low cost. Con: CMOS locked to the NAND process. Right: CBA / Xtacking · CMOS-bonded-Array, two wafers bonded: wafer A carries the 3D NAND array (thick-film process), wafer B the CMOS peripheral logic (advanced node), joined at a Cu-Cu hybrid bonding interface. Pro: two wafers independently optimized, lower latency. Con: two production lines, bonding-yield sensitive.]
The two 3D NAND peripheral-CMOS integration routes — left: CUA puts the decoders / sense amps directly beneath the NAND array, done in one pass on the same wafer, saving about 25% area, but the CMOS must follow NAND’s thick-film process. Right: CBA / Xtacking uses two independent wafers, and the logic wafer can use a more advanced node; once done, it is joined to the NAND wafer through hybrid bonding (Cu-Cu bonding), at the cost of two production lines + bonding yield. The thin green layer is the bonding interface — the small red dots are the conductive pads passing through, numbering up to hundreds of thousands per square millimeter.

Stack these three dimensions together: FG → CT decides whether the cell can be stacked high, string stacking decides how many layers can be stacked, and CUA / CBA decides how big the whole die is — three things each orthogonal, and any vendor’s current-generation product is one combination of these three.

NOR Flash — Byte-Granularity Random Read · XIP Runs Code Directly

NOR shares the floating-gate principle with NAND, but the cell topology is utterly different: each NOR cell has one end on the bitline and one on the source line, individually addressable, like NOR gates in parallel; NAND strings 32 to 128 cells into one string, and reading any single bit requires opening the whole string. This difference is what lets NOR do eXecute-In-Place (XIP) — after the CPU powers on, the reset vector lands directly in the NOR address space and fetches and executes instructions byte by byte without first DMA-ing into DRAM, which is precisely the fundamental reason BIOS / UEFI must use NOR; NAND’s page / block access simply can’t do this.

On the interface side, the early Parallel NOR (Intel 28F, AMD 29F series) used parallel address / data buses with 40+ pins and is largely obsolete; the current mainstream is Serial NOR / SPI NOR (Winbond W25Q, Macronix MX25, GigaDevice GD25 series), with only 4 to 8 pins, and in QSPI four-line or OSPI eight-line DDR mode the read bandwidth can reach 100 to 400 MB/s — enough for XIP and enough for boot.
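
The quoted bandwidth range falls out of three interface parameters: clock, number of data lines, and whether data moves on both clock edges. The sketch below is a rough upper bound that ignores command, address, and dummy cycles. The 100 MHz and 200 MHz clocks are assumed example values, not figures from this article, and `spi_read_bandwidth_mbs` is a hypothetical helper.

```python
def spi_read_bandwidth_mbs(clock_mhz: float, data_lines: int, ddr: bool) -> float:
    """Peak read bandwidth in MB/s, ignoring command, address, and dummy cycles."""
    bits_per_clock = data_lines * (2 if ddr else 1)
    return clock_mhz * bits_per_clock / 8


# Assumed example clocks; actual limits depend on the specific part.
print("QSPI, 4 lines, DDR, 100 MHz:", spi_read_bandwidth_mbs(100, 4, ddr=True), "MB/s")
print("OSPI, 8 lines, DDR, 200 MHz:", spi_read_bandwidth_mbs(200, 8, ddr=True), "MB/s")
print("Single line, SDR, 100 MHz:  ", spi_read_bandwidth_mbs(100, 1, ddr=False), "MB/s")
```

With these assumed clocks the two wide modes bracket the 100 to 400 MB/s range, while a plain single-line interface stays more than an order of magnitude lower.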

On the market side, NOR has always been a fragmented fabless / small-and-medium-maker business — split among Winbond, Macronix, GigaDevice, Cypress (acquired by Infineon), and Microchip, while Samsung, Micron, and SK Hynix left long ago because the capacity is small, unit prices are low, and margins are thin. The process node has stalled at 45 nm to 28 nm and advances no further; mainstream capacity is 1 Mb to 256 Mb (0.125 to 32 MB), with the flagship maxing at 2 Gb (256 MB) — three orders of magnitude below NAND’s 1 Tb single die. Typical uses: an 8 to 32 MB SPI NOR on the motherboard storing UEFI, automotive ECU firmware, IoT MCU embedded code flash, the SSD controller’s own boot ROM, and the bootloader of switches / routers.

[Figure: NOR vs. NAND cell topology. Left: NOR · parallel · byte random read · XIP; every cell sits between a bitline (BL0 to BL2) and the source line (SL) and is independently selected for a single-cell read; ≈ 70 ns random read, SPI / OSPI at 100+ MB/s. Right: NAND · serial · by page / block; cells on WL0 to WL6 are chained between the SSL and GSL select transistors, so reading 1 bit means opening the whole 32-128-cell string; ≈ 25 μs page read, block erase ≈ 2 ms. Vendors: Winbond W25Q · Macronix MX25 · GigaDevice GD25 · Infineon · Microchip, stalled at 28-45 nm nodes.]
NOR cells each connect to a bitline, so any byte can be read at will; NAND’s serial structure must first open the whole string, which is why BIOS / UEFI can’t do without NOR.

For this reason, in the foreseeable future, that 8 to 32 MB SPI NOR on the motherboard won’t disappear — as long as the CPU still needs to fetch its first instruction from a fixed address, someone has to backstop XIP.

HDD Magnetic Recording — The Only Non-Semiconductor · HAMR Burns a Laser into the Platter

The mechanical disk is the only non-semiconductor storage at the bottom of the pyramid, and its process competition happens over magnetic domain size and medium coercivity, not process node.

LMR → PMR (2005) — early magnetic domains lay flat on the platter surface, adjacent domains squeezing each other, with density hitting a wall at ≈100 Gb/in². PMR stands the magnetic domains up, immediately shrinking the footprint, and debuted at an areal density of ≈200 Gb/in²; later ePMR adds a bias current at the write head to stabilize switching, and in 2024 WD’s Ultrastar HC780 / HC790 pushed ePMR to 30 / 32 TB per drive.

SMR (shingled) — adjacent tracks partially overlap like roof shingles, adding another +25% density. Reading is unaffected, but writing one track requires erasing and rewriting N downstream tracks, causing severe write amplification, so it suits only cold data / archive and can’t host a system disk.

HAMR (heat-assisted) — to keep shrinking the magnetic domain, the medium must switch to high-coercivity FePt, which is impossible to write at room temperature. Seagate integrates an ≈800 nm laser diode + near-field optical lens (NFT) into the write head, instantly heating a target spot tens of nm in diameter to ≈450 to 500 ℃ (approaching the Curie point), briefly “softening” the material while the head simultaneously applies a magnetic field to lock the direction, setting in nanosecond-scale cooling. From 2024, Seagate’s Mozaic 3+ is in production at 30 / 32 TB, with an areal density of ≈3 Tb/in² (30× PMR), the roadmap pointing to 5+ Tb/in² and 50 TB per drive.

MAMR (microwave-assisted) — WD / Toshiba’s approach, adding a Spin Torque Oscillator (STO) to the write head to emit microwaves, lowering the switching threshold through ferromagnetic resonance. The process is gentler, but the density gain is limited, and it has been overtaken by HAMR — WD has now pivoted to HAMR to follow.

The physical bottleneck is elsewhere — no matter how areal density climbs, seek ≈4 to 5 ms + 7200 RPM rotational latency ≈4 ms means total access latency ≈8 to 9 ms is forever stuck there. What HDDs improve is $/TB, not IOPS.
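
The latency floor is plain mechanics: on average the head waits half a revolution for the sector to arrive, on top of the seek. The sketch below works through that arithmetic with the seek times quoted above. It ignores command queuing, caching, and transfer time, and the function names are hypothetical.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the target sector: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_iops(seek_ms: float, rpm: float) -> float:
    """Random accesses per second when each one pays seek plus rotational latency."""
    return 1000.0 / (seek_ms + avg_rotational_latency_ms(rpm))


rotation = avg_rotational_latency_ms(7200)
print(f"7200 RPM average rotational latency: {rotation:.2f} ms")
for seek in (4.0, 5.0):
    print(f"seek {seek:.1f} ms: access {seek + rotation:.2f} ms, "
          f"about {random_iops(seek, 7200):.0f} random IOPS")
```

None of the inputs depends on areal density, which is why HAMR can multiply capacity without moving random IOPS at all.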

[Figure: HAMR (heat-assisted magnetic recording) write head cross-section. Write head slider flying ≈ 1 to 5 nm above the platter; laser diode (≈ 800 nm, InGaAs / GaAs) feeding a waveguide and the NFT near-field optical lens, which squeezes the spot to ≈ 25 nm; main write pole (field source); TMR read sensor; FePt high-coercivity medium (Curie point ≈ 750 K) on a glass / Al substrate with a ≈ 470 ℃ hotspot; platter spinning at 7200 RPM. Write sequence: ① laser heats the spot to near the Curie point on a ns scale → ② medium softens as FePt coercivity plunges → ③ pole applies the write field in sync → ④ rapid cooling locks the domain direction. Areal density: 2005 PMR ≈ 100 Gb/in² → 2024 HAMR ≈ 3 Tb/in² → target 5+ Tb/in² · Seagate Mozaic 30 / 32 TB in production.]
HAMR write head cross-section — the laser diode (red) sends light through a waveguide to the near-field optical lens NFT (orange), squeezing the spot to ≈ 25 nm and instantly heating the FePt medium to ≈ 470 ℃ (near the Curie point) so the coercivity plunges; at the same moment the main pole (green) applies the write field to set the magnetic domain direction, locked in by nanosecond-scale cooling. The arrows on the left half are already-written stable domains, the right half is the to-be-written zone that is “too hard to write” at room temperature.

In one sentence: HAMR doesn’t make magnetic recording more precise; it turns “writing” from a purely magnetic process into a thermo-magnetic process that combines “laser + magnetic field” — at which point the HDD moves from a purely mechanical craft fully into opto-electro-mechanical integration.

MRAM — Magnetic Tunnel Junction · Already in Automotive Embedded Use

The storage cell of MRAM is the MTJ (Magnetic Tunnel Junction) — two layers of ferromagnetic metal sandwiching a ≈1 nm thick MgO tunnel barrier. The lower reference layer is pinned by an antiferromagnetic layer, with a fixed magnetization direction; the upper free layer can be flipped. The two magnetic moments parallel = low resistance (stores 0), antiparallel = high resistance (stores 1), read out via the TMR (tunnel magnetoresistance) effect — the resistance ratio between the two states can reach over 200%, so only a small current is needed to sense it.
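
The read operation can be sketched from the TMR definition, TMR = (R_AP − R_P) / R_P, so a 200% ratio means the antiparallel resistance is three times the parallel one. The code below is an illustration under assumed values: the 5 kΩ parallel resistance and 0.1 V read voltage are hypothetical, and a real sense amplifier compares against a reference cell rather than a computed midpoint.

```python
def antiparallel_resistance(r_parallel_ohm: float, tmr_ratio: float) -> float:
    """R_AP from the TMR definition: TMR = (R_AP - R_P) / R_P."""
    return r_parallel_ohm * (1.0 + tmr_ratio)


def read_bit(current_a: float, v_read: float,
             r_parallel_ohm: float, tmr_ratio: float) -> int:
    """Decide 0 (parallel, low R) or 1 (antiparallel, high R) from the sensed current."""
    i_parallel = v_read / r_parallel_ohm
    i_antiparallel = v_read / antiparallel_resistance(r_parallel_ohm, tmr_ratio)
    threshold = (i_parallel + i_antiparallel) / 2.0  # midpoint reference current
    return 0 if current_a > threshold else 1


R_P = 5_000.0  # hypothetical parallel-state resistance in ohms
V_READ = 0.1   # hypothetical read voltage in volts
TMR = 2.0      # a 200% ratio, as quoted in the text

r_ap = antiparallel_resistance(R_P, TMR)
print(f"R_AP / R_P = {r_ap / R_P:.1f}")
print("parallel cell reads as    ", read_bit(V_READ / R_P, V_READ, R_P, TMR))
print("antiparallel cell reads as", read_bit(V_READ / r_ap, V_READ, R_P, TMR))
```

The wide gap between the two currents is what lets a small read voltage suffice, keeping the read well below the current needed to disturb the free layer.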

Writing has two generations of technology. STT-MRAM (Spin-Transfer Torque) is the current commercial mainstream: a spin-polarized current passes vertically through the MTJ, transferring spin angular momentum to the free-layer moment to flip its direction, with a write current of ≈10 to 100 μA and write latency of 10 to 50 ns. SOT-MRAM (Spin-Orbit Torque) runs the write current into a Pt / W / Ta heavy-metal line beneath the free layer, flipping it laterally via the spin Hall effect — read and write paths are separated, lifespan is nearly unlimited, and switching can be as low as 1 ns, but it is still in early commercial use.

Commercial deployment is concentrated in embedded. TSMC eMRAM has advanced from 22 nm (2019) to 16 nm, 12 nm, and on to N5 (2024) as an automotive-grade embedded-Flash replacement; Samsung offers eMRAM for automotive SoCs on 28 nm FD-SOI and 14 nm nodes; GlobalFoundries 22FDX eMRAM serves IoT MCUs; Everspin ships standalone MRAM with DDR3 / DDR4 interfaces, used in industrial control and metadata protection for enterprise SSDs. The advantages are program/erase endurance of ≈10¹² to 10¹⁵ cycles (NAND is only 10⁴), nanosecond read/write, non-volatility, and good radiation tolerance. The shortcomings are also clear — cell area of 50 to 100 F² (NAND ≈4 F²), with total per-chip capacity currently only in the few-MB to 1-Gb range, unable to serve as GB-scale main memory.

[Figure: three panels. Left: MTJ cross-section, from the top electrode through the free layer (flippable), MgO tunnel barrier (≈ 1 nm), reference layer (pinned), and antiferromagnetic pinning layer to the bottom electrode; state 0 = parallel, low tunnel resistance; state 1 = antiparallel, TMR ratio ≈ 200%; read by applying a small voltage and measuring the current. Middle: STT write, a spin-polarized current of 10 to 100 μA through the MTJ flips the free layer; latency 10 to 50 ns, endurance ≈ 10¹² to 10¹⁵ cycles; read and write share one path, so large currents wear the MgO; TSMC N5 · Samsung 14 nm. Right: SOT write, a lateral current in a heavy-metal line (Pt / W / Ta) flips the free layer via the spin Hall effect; read and write paths separated, switching as low as 1 ns, larger three-terminal cell, still at the research / pilot stage.]
Left: the MTJ sandwich structure — a pinned reference layer + an MgO tunnel barrier + a flippable free layer, the two parallel / antiparallel setting the resistance high or low. Middle: STT runs a spin-polarized current through the MTJ to flip the free layer, and since read and write share one path the MgO gets worn. Right: SOT routes the write current through the heavy-metal strip below the free layer, flipping it laterally via the spin Hall effect, with read and write separated for longer lifespan and higher speed.

The reason automotive and IoT are the first to take up MRAM is that it lands precisely on the intersection of five requirements — “small capacity, must be non-volatile, must be fast, must withstand high temperature and radiation, must endure unlimited writes” — exactly the scale at which NOR Flash and SRAM-with-backup-battery no longer pay off.

ReRAM / PCM — Resistive + Phase-Change · After Optane Left, It Went Cold

These two emerging NVMs are covered in one section — by now their stories essentially trace the same arc of “almost taking over, ultimately failing.”

ReRAM (Resistive RAM): two metal electrodes sandwich a layer of insulating oxide (commonly HfO₂, Ta₂O₅, or TiO₂). A sufficiently high forward voltage “grows” a conductive filament in the oxide, a low-resistance path formed by oxygen vacancies lining up; a reverse pulse partially breaks the filament and returns the cell to the high-resistance state. Low resistance = 1, high resistance = 0, read by measuring resistance at a small voltage. There are two major branches: OxRAM uses transition-metal oxides + oxygen-vacancy filaments, represented by Crossbar and Weebit Nano (IP licensed to GlobalFoundries 22FDX); CBRAM swaps the top electrode for Cu or Ag, whose ions migrate into the insulating layer to form a metal bridge, represented by Adesto’s CBRAM. ReRAM’s biggest process advantage is its compatibility with CMOS back-end-of-line (BEOL) and its low processing temperature, so it can be stacked directly on a logic chip as embedded NVM. But commercial use is still dominated by KB-to-MB-scale embedded IP blocks, and standalone high-capacity products are rare.

PCM (Phase-Change Memory): the core material is GST (Ge₂Sb₂Te₅, a germanium-antimony-tellurium alloy). A strong, short pulse followed by rapid cooling leaves it amorphous (high resistance) = 0; a longer, weaker pulse followed by slow cooling leaves it crystalline (low resistance) = 1. The most famous product was Intel + Micron’s 3D XPoint / Optane (announced 2015, debuting in data-center SSDs in 2017, with persistent-memory DIMMs following in 2019). Intel never disclosed whether it was pure PCM, but the process was widely held to be a same-origin derivative. In 2022 Intel officially announced its exit from the Optane business and ended the product line. Earlier still there was Numonyx’s embedded PCM (2008 to 2010, discontinued after acquisition by Micron). PCM’s fatal flaws are a large write current (> 100 μA), thermal disturbance that destabilizes adjacent cells, and cell material that degrades with cycling (lifespan about 10⁸ to 10⁹ cycles).

Why neither became mainstream: cell density can’t beat NAND (already stacked to 300+ layers at up to 4 bit/cell); write speed and endurance sit between DRAM and NAND, but the cost undercuts neither; and MRAM grabbed the “embedded NVM to replace Flash” market first, so ReRAM / PCM lost ground in automotive and IoT step by step. PCM is now commercially stalled, with only academia still exploiting its analog resistance states for in-memory computing and MAC-operation research.

[Figure: cell cross-sections. Left, ReRAM: top electrode (Ti / TiN / Cu) / HfO₂ or Ta₂O₅ / bottom electrode (Pt / TiN), shown in the high-resistance state (RESET · 0) and the low-resistance state (SET · 1) with the filament formed; R_high / R_low ≈ 10². Right, PCM: top electrode / GST (Ge₂Sb₂Te₅) / heater electrode / bottom electrode, shown amorphous (RESET · 0) and crystalline (SET · 1); R_high / R_low ≈ 10³; write current > 100 μA; lifespan ≈ 10⁸ to 10⁹ cycles. Bottom, commercialization timeline: ReRAM: Crossbar / Weebit / Adesto → embedded IP blocks, KB-MB scale; PCM: Numonyx (2008-2010) → Intel Optane (2015-2022) → academic in-memory computing.]
Cross-section comparison of ReRAM and PCM cells. In the left half, ReRAM grows a conductive filament of lined-up oxygen vacancies in an oxide layer such as HfO₂ (low resistance = 1) and returns to the high-resistance state (0) once a reverse pulse breaks the filament; the CBRAM variant replaces the oxygen vacancies with a Cu/Ag metal-ion bridge. In the right half, PCM uses a heater electrode to switch a GST alloy between crystalline (regular lattice · low resistance) and amorphous (random scatter · high resistance): a strong short pulse melts and quenches it to write 0, and a long low pulse holds it at crystallization temperature and cools it slowly to write 1. Both are read by measuring resistance at a small voltage; the mechanisms differ but the interface can be the same. At the bottom is the commercialization timeline: after Optane exited in 2022, these two retain little beyond the residual warmth of embedded IP and academic in-memory computing.
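The remark that the mechanisms differ but the interface can be the same can be sketched as a toy model. Below, a hypothetical `ResistiveCell` base class exposes `write` and `read`, and the two subclasses differ only in the pulse they describe and in their resistance window. The R_high / R_low ratios (10² for ReRAM, 10³ for PCM) come from the figure; the absolute resistances, the class names, and the geometric-mean threshold are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class ResistiveCell(ABC):
    """Toy model of a cell that stores one bit as a resistance level."""

    r_low_ohm: float   # low-resistance state (SET, bit 1)
    r_high_ohm: float  # high-resistance state (RESET, bit 0)

    def __init__(self) -> None:
        self._resistance = self.r_high_ohm  # start in RESET (0)

    @abstractmethod
    def pulse_for(self, bit: int) -> str:
        """Describe the write pulse this technology needs for the given bit."""

    def write(self, bit: int) -> str:
        self._resistance = self.r_low_ohm if bit else self.r_high_ohm
        return self.pulse_for(bit)

    def read(self) -> int:
        # Threshold at the geometric mean, i.e. midway on a log scale.
        threshold = (self.r_low_ohm * self.r_high_ohm) ** 0.5
        return 1 if self._resistance < threshold else 0


class ReRAMCell(ResistiveCell):
    r_low_ohm, r_high_ohm = 1e4, 1e6   # assumed values, ratio 1e2

    def pulse_for(self, bit: int) -> str:
        return ("SET: forward voltage, filament forms" if bit
                else "RESET: reverse pulse, filament partially breaks")


class PCMCell(ResistiveCell):
    r_low_ohm, r_high_ohm = 1e4, 1e7   # assumed values, ratio 1e3

    def pulse_for(self, bit: int) -> str:
        return ("SET: long low pulse, slow cool, crystalline" if bit
                else "RESET: strong short pulse, quench, amorphous")


if __name__ == "__main__":
    for cell in (ReRAMCell(), PCMCell()):
        print(type(cell).__name__, "|", cell.write(1), "| reads", cell.read())
        print(type(cell).__name__, "|", cell.write(0), "| reads", cell.read())
```

A controller written against `ResistiveCell` does not need to know which physics sits underneath, which is the sense in which the two technologies can share a read interface.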

The write mechanisms are interesting but the engineering math doesn’t add up — the story of emerging NVM is, by now, essentially the same script.

Process Cheatsheet — Each Type’s Process Path Squeezed Into One Row

Squeezing the process path of all the types above into one table:

Storage | Process naming | Cell structure | Evolution path | Core difficulty
SRAM (on-die) | follows logic (3 nm / 5 nm) | 6T bistable (or 8T) | advanced logic node + 3D stacked cache | advanced node can’t shrink
DRAM (DDR/GDDR) | 1α / 1β / 1γ (codenames) | 1T1C deep-trench capacitor | making the capacitor 3D | shrinking the cell while keeping capacitor capacity
HBM | same cell process as DRAM | DRAM cell + TSV + interposer | stack 4 → 8 → 12 → 16 layers | TSV yield · cooling
3D NAND | “layer count + bit/cell” | floating gate / charge trap (3D) | 200 → 300 → 500+ layers | channel accuracy · yield
NOR Flash | stalled at 28-45 nm | floating gate in parallel | evolution has stopped | small capacity · thin margins
HDD | no process node | magnetic recording (non-semiconductor) | PMR → SMR → HAMR / MAMR | domain stability · thermal disturbance
MRAM | embedded, 22 nm to N5 | MTJ (MgO tunnel barrier) | STT → SOT | large cell area
ReRAM / PCM | BEOL compatible | filament / GST phase change | commercialization essentially stalled | density loses to NAND · lifespan loses to MRAM

Summary — One Chain of Cause and Effect Through the Whole Pyramid

Squeezing the whole article into one sentence: all the differences among storage come from the same trade-off — the closer to compute and the more you chase speed, the more you must sacrifice capacity and raise cost per unit; the farther from compute and the more you chase size and cheapness, the more slowness you must tolerate.

Concretely:

  • What sustains the data decides volatile vs not: volatile memories (SRAM, DRAM, HBM) rely on an electrical state maintained by power, which collapses when power is cut; non-volatile ones (Flash, HDD, emerging NVM) rely on a physically trapped or fixed state (electrons trapped in the floating gate, domain direction pinned on the platter), which survives power loss. This is the most fundamental dividing line.
  • Cell complexity decides speed and cost per unit: SRAM uses 6 transistors per bit, so it is fast and expensive; DRAM uses 1 transistor + 1 capacitor per bit, so it is cheap and can be made large; NAND uses a single floating-gate or charge-trap transistor per cell, stores several bits in that cell, and stacks 300+ layers, so its capacity explodes; the HDD is not a semiconductor at all and lowers cost per bit step by step by raising areal density through new recording materials and techniques.
  • How close it sits and how far the signal travels decides bandwidth and latency: registers, pressed against the execution units, are the fastest; L1 is faster than L3 because its small capacity means short wires; HBM’s cell is as slow as ordinary DRAM, but it pulls bandwidth to several TB/s by sitting right next to the GPU and using a 1024-bit ultra-wide interface per stack; the HDD is slow because mechanical motion is a fundamental physical bottleneck.
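The bandwidth and latency claims in the last bullet reduce to two small formulas: peak bandwidth is interface width times per-pin data rate, and average rotational latency is the time for half a platter revolution. The sketch below plugs in example numbers; the 6.4 Gb/s per-pin rate (HBM3 and DDR5-6400), the 64-bit DDR5 channel, and the 7,200 rpm spindle are assumptions chosen for illustration and do not come from the article.

```python
import math


def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s = width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_width_bits * gbit_per_s_per_pin / 8.0


def stacks_needed(target_gb_s: float, per_stack_gb_s: float) -> int:
    """Smallest whole number of stacks that reaches the target bandwidth."""
    return math.ceil(target_gb_s / per_stack_gb_s)


def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the sector to come around: half a revolution."""
    return 0.5 * 60_000.0 / rpm


if __name__ == "__main__":
    hbm_stack = peak_bandwidth_gb_s(1024, 6.4)  # one 1024-bit HBM stack
    ddr_channel = peak_bandwidth_gb_s(64, 6.4)  # one 64-bit DDR5 channel
    print(f"HBM stack:    {hbm_stack:.1f} GB/s")
    print(f"DDR5 channel: {ddr_channel:.1f} GB/s")
    print(f"Stacks for 3 TB/s: {stacks_needed(3000.0, hbm_stack)}")
    print(f"7200 rpm rotational latency: {avg_rotational_latency_ms(7200):.2f} ms")
```

At the same per-pin rate the 16x difference between the two interfaces comes purely from width, which matches the point that HBM's advantage is packaging and proximity, not a faster cell. The rotational term alone puts a 7,200 rpm drive in the millisecond range before any seek time is added.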

Once you understand this main thread, when you see news like “AMD 3D V-Cache gets bigger again,” “HBM4 has arrived,” “321-layer 3D NAND in production,” “HAMR 30 TB drive launched,” and “MRAM enters automotive grade,” you can immediately tell which process line it advances along — all of them are just one more cut at some spot along the lines of “change the principle / change the process / push stacking / raise density.”
