AI / Data Infra · Knowledge Map

A study guide for systems engineers: each chapter has an intro, the questions you should be able to answer, and the concepts, labs, reading, and tools to get you there.

01

Languages · C++ / Rust / Go

Almost all the infrastructure that matters — databases, browsers, game engines, inference runtimes — is written in C++. Rust is steadily replacing it in databases, browsers, and operating systems. Go dominates cloud-native services, RPC, and CLI tooling. This module isn't about becoming fluent in all three; it's about putting their mental models side by side: how memory and lifetimes are managed, how concurrency and async are scheduled, and which toolchain actually gets your code onto the CPU.

After this module you should be able to answer

C++
  1. What problem does RAII actually solve, and why is it considered more thorough than try/finally or defer?
  2. When do you reach for unique_ptr vs shared_ptr vs weak_ptr? shared_ptr's refcount is thread-safe — is the pointee?
  3. What does move semantics buy you over copying? When does an object you expected to move actually get copied, and when do RVO / NRVO eliminate the copy entirely?
  4. Name three common sources of UB (signed overflow, OOB access, strict aliasing). Why is the compiler allowed to delete code because of them?
  5. In the memory model, how do memory_order_relaxed / acquire / release / seq_cst differ? When is relaxed actually safe to use?
  6. What are the engineering trade-offs between exceptions and error codes? Why do Google, LLVM, and many embedded projects disable exceptions entirely?
  7. What does a vtable implementation look like? Where do `final` and `override` actually help the compiler with devirtualization?
  8. Why can two threads updating their own counters still slow each other down? How do you confirm false sharing with perf c2c or cache-miss events?
  9. How do template instantiation and C++20 concepts compare? Why are libraries replacing enable_if with concepts?
  10. What does a C++20 coroutine actually look like under the hood? What roles do promise_type, suspend points, and coroutine_handle play?
  11. How do constexpr, consteval, and `if consteval` differ in compile-time vs runtime behavior?
  12. What does the pimpl idiom buy you on ABI stability and compilation isolation? What is the cost?
Rust
  1. What does ownership + borrow checker forbid that C++ permits? Why does that eliminate use-after-free at compile time?
  2. What do lifetime annotations accomplish? When does Rust force you to write 'a explicitly versus inferring it via the elision rules?
  3. Why does Rust not need a garbage collector? How do RAII and ownership combine to guarantee resource release?
  4. What do the Send and Sync marker traits actually say? Why is Rc<T> not Sync but Arc<T> is?
  5. What is the core of async/await + Future in Rust? Why is Rust async zero-cost yet still needs an executor? What does Pin solve?
  6. What privileges does unsafe Rust unlock, and when is it unavoidable? Which violations does Miri catch?
  7. What is the type-system contract behind interior mutability (Cell / RefCell / Mutex)?
  8. What does a trait-object fat pointer look like? Why can some traits not be `dyn` (what is object safety)?
  9. When do declarative macros fit, and when do procedural macros? Why are serde / tokio impossible without proc-macros?
  10. Why are errors modeled as Result<T, E> rather than exceptions? What does the `?` operator desugar to, and how does it relate to the Try trait?
Go
  1. How light is a goroutine compared to an OS thread? How does the GMP scheduler decide when to preempt, and what did signal-based async preemption (1.14+) solve?
  2. What's the underlying data structure behind a channel (hchan)? How do buffered and unbuffered channels differ in wake-up semantics?
  3. The GC is concurrent mark-and-sweep — what does the write barrier solve? What's a typical STW pause in modern versions?
  4. Why does Go use error values instead of exceptions? Why did errors.Is / errors.As / wrapping only arrive in 1.13?
  5. How is happens-before defined in Go's memory model? What synchronization guarantees do channel send and receive give you?
  6. How should context.Context be used correctly? How does a cancel signal propagate all the way down to a blocking syscall?
  7. What is an interface itab? How much more expensive is an interface method call than a direct function call?
  8. How are generics (1.18+) implemented? Why did Go pick GC shape stenciling over per-type specialization?
  9. What does sync.Pool solve? What happens to its contents during GC?
  10. When are sync.Map and map + Mutex each faster? Why should ordinary code not default to sync.Map?
Cross-language comparisons
  1. How do the three error-handling models (C++ exceptions / Rust Result / Go error) shape API design and ABI stability differently?
  2. What are the runtime costs and mental models of Rust async, C++ coroutines, and Go goroutines? Why did Rust and C++ pick stackless while Go went stackful?
  3. How do RAII (C++), ownership (Rust), and defer (Go) compare as resource-management strategies? Which scenarios does each handle poorly?
  4. How do C++ templates, Rust generics + traits, and Go generics + type constraints differ in compiled output and error messages?
  5. How do the three build systems (CMake / Bazel, cargo, go build) trade off incremental compilation, dependency management, and reproducibility?

C++

Core concepts
  • RAII: tie resource release to object lifetime. It's the foundation C++ resource management is built on; smart pointers, lock guards, and file handles all follow this shape.

  • Smart pointers: unique_ptr / shared_ptr / weak_ptr make raw-pointer ownership explicit. Using them correctly eliminates most leaks and double-frees.

  • Move semantics: transfer expensive resources instead of deep-copying them. You can't read modern C++ without rvalue references, std::move, and RVO.

  • const correctness: const is a read-only contract enforced by the type system, not a comment. Where you put it in an API signature decides what callers can do.

  • Templates and the STL: templates power C++'s generics and zero-overhead abstractions. STL containers and algorithms are vocabulary you'll meet in every C++ codebase.

  • The build pipeline: understand what preprocessing, compiling, assembling, and linking each do. Linker errors, ODR violations, and symbol visibility all need this mental model.

  • Undefined behavior: UB isn't a bug; it's a promise the compiler assumes you keep. Knowing the common pitfalls (out-of-bounds access, uninitialized reads, aliasing) is how you avoid them.

  • The C++11 memory model: a formal model governing visibility and ordering of atomic operations. Required reading before you write any lock-free code.

  • Cache lines: the minimum unit of transfer between CPU caches and memory, usually 64 bytes. Alignment, padding, and hot-field clustering all start here.

  • False sharing: when threads write different fields on the same cache line, the line ping-pongs between cores. One of the subtlest killers in concurrent code.

  • Branch prediction: modern CPUs keep the pipeline full by guessing the next instruction. Keep hot-path branches predictable, or go branchless.

  • SIMD: one instruction, many data lanes. The foundation of vectorization and the reason ClickHouse, numpy, and ffmpeg inner loops look the way they do.

  • NUMA: on multi-socket boxes, local memory is fast and remote is slow. Databases, JVMs, and inference engines all care about NUMA binding.

  • Memory bandwidth: many 'CPU-bound' workloads are actually DRAM-bound. Use STREAM benchmarks and the roofline model to know your ceiling.

  • CPU microarchitecture: out-of-order, superscalar, retirement. Without these you can't reason about IPC, stalls, or port contention.
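False sharing (question 8 above) is easy to reproduce in any of the three languages. A Go sketch with invented names; note Go does not guarantee 64-byte struct alignment, so the padding trick is a demonstration rather than a hard guarantee:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Two layouts for a pair of independent counters. In packed, both
// fields usually share one 64-byte cache line; in padded, filler pushes
// b onto the next line, so the two writers stop invalidating each other.
type packed struct{ a, b int64 }

type padded struct {
	a int64
	_ [56]byte // fill out the rest of a's 64-byte line
	b int64
}

const iters = 50_000_000

// bump increments one counter in a tight loop; each goroutine touches
// only its own variable, so any slowdown comes from cache-line traffic.
func bump(x *int64, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < iters; i++ {
		*x++
	}
}

func race(a, b *int64) time.Duration {
	var wg sync.WaitGroup
	wg.Add(2)
	start := time.Now()
	go bump(a, &wg)
	go bump(b, &wg)
	wg.Wait()
	return time.Since(start)
}

func main() {
	var p packed
	var q padded
	// On a multicore machine the padded layout is typically several
	// times faster; confirm the cache traffic with perf c2c on Linux.
	fmt.Println("packed:", race(&p.a, &p.b))
	fmt.Println("padded:", race(&q.a, &q.b))
}
```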

Lab
  • Write a small allocator

    Hand-roll a bump, freelist, or slab allocator. Forces you to confront alignment, fragmentation, and placement new that libraries usually hide.

  • Implement a hash map

    Skip std::unordered_map and write open-addressing yourself. You'll come away with real intuition for cache locality, rehashing, and load factor.

  • LRU cache

    The classic hash-map-plus-doubly-linked-list combo. Common interview question and the minimum viable prototype for any memory/disk cache.

  • Minimal shell

    A full pass through fork, exec, wait, pipe, and dup2. Afterwards bash or systemd source won't look alien.

  • Crafting Interpreters: build a bytecode interpreter with GC in C from scratch. You'll finally understand how language runtimes, stack frames, and garbage collection work.

  • Build Your Own X: curated tutorials for building your own database, Redis, Git, etc. A reliable source of engineering-grade labs.

  • Industrial-strength performance-engineering projects, from bit hacks to cache-aware algorithms.

  • Optimize a matrix multiply: the classic warm-up. Start from the naive triple loop (typically 10-100x off BLAS), then add blocking, vectorization, and threading in one pass.

  • Hand-written SIMD: pick an op (dot product, memcpy, argmax), write AVX2 or AVX-512 intrinsics, and compare against the compiler's auto-vectorized output.
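The LRU lab above is often prototyped in whichever language is handy; a compact Go sketch of the hash-map-plus-doubly-linked-list design using container/list (type names invented):

```go
package main

import (
	"container/list"
	"fmt"
)

// LRU is the classic cache from the lab: the map gives O(1) lookup,
// the list keeps recency order (front = most recently used).
type LRU struct {
	cap   int
	order *list.List
	items map[string]*list.Element // key -> node in order
}

type entry struct {
	key string
	val int
}

func NewLRU(cap int) *LRU {
	return &LRU{cap: cap, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *LRU) Get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el) // a hit refreshes recency
	return el.Value.(*entry).val, true
}

func (c *LRU) Put(key string, val int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() == c.cap { // full: evict the least recently used
		last := c.order.Back()
		c.order.Remove(last)
		delete(c.items, last.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
}

func main() {
	c := NewLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Get("a")    // touch a, so b is now least recent
	c.Put("c", 3) // evicts b
	_, ok := c.Get("b")
	fmt.Println(ok) // false
}
```

A disk or memory cache adds sizing by bytes, concurrency, and TTLs on top, but the eviction skeleton stays exactly this.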

Reading
Tools
  • gdb: the standard Linux debugger. Stacks, breakpoints, registers, core-dump analysis: table stakes for anyone doing systems work.

  • lldb: the LLVM-family debugger and macOS default. Its commands differ from gdb's; worth learning if you live in Clang land.

  • ASan / UBSan / TSan: compile-time instrumentation that catches memory errors, undefined behavior, and data races respectively. Running them in CI saves countless debugging nights.

  • perf + flame graphs: a sampling profiler on Linux plus a visualization that makes CPU hotspots obvious. First stop for any performance tuning.

  • Valgrind: dynamic binary translation that checks for memory errors. Slower than ASan but needs no recompile, so it works on shipped binaries.

  • clang-tidy: AST-based static checking and auto-fixing from Clang. Good for mass-applying Core Guidelines or Google Style across a codebase.

  • -Wall / -Wextra: warning flags that turn GCC/Clang into a free static analyzer. Pair with -Werror in any CI worth its name.

Rust

Core concepts
  • Ownership and borrowing: a value has exactly one owner; you may have many shared &T references or one exclusive &mut T, never both. Every lifetime bug class from C++ is forbidden up front.

  • Lifetimes: the compiler needs to know how long references live to prevent dangling. Inferred most of the time, explicit 'a the rest.

  • Traits and generics: Haskell type classes meet C++ concepts. Zero-cost abstraction and trait bounds both live here.

  • Result and the ? operator: no exceptions; errors go through the type system, and ? makes propagation boilerplate-free.

  • Send and Sync: marker traits that tell the compiler which types may move across threads and which may be shared between them. Data races become compile-time errors.

  • async/await and Future: a Future is a lazy state machine; tokio / async-std are the executors that poll it. Zero-cost, but you still need a runtime.

  • unsafe: an unsafe block buys you raw pointers, FFI, and low-level data structures. Most of the standard library sits on top of it.

  • Interior mutability: Cell / RefCell / Mutex / RwLock express "looks immutable from outside, actually mutable inside" within the type system.

  • Enums and pattern matching: match plus tagged unions let you make illegal states unrepresentable.

  • Macros: Rust's macro system underpins serde, tokio, and sqlx, and is the official compile-time metaprogramming path.

Lab
Reading
Tools
  • cargo: the unified entry point for packaging, building, testing, and benchmarking.

  • clippy: the official linter, catching suspect patterns and style issues.

  • rustfmt: the official formatter. The Rust community basically stopped arguing about style.

  • rust-analyzer: the LSP server that powers IDE completion, hover, and go-to-definition.

  • Miri: a MIR interpreter that catches UB in unsafe code: out-of-bounds access, uninitialized reads, data races.

  • cargo-expand: expands macros into plain Rust. Indispensable when debugging proc-macros.

  • cargo-flamegraph: one command generates a flame graph for hotspot localization.

Go

Core concepts
Lab
Reading
Tools
02

Operating Systems

The OS is the floor every piece of infrastructure stands on: how processes get scheduled, how memory is mapped, how I/O is multiplexed — all of it bounds what your services can do. The goal here isn't memorizing concepts but tearing apart a teaching kernel (xv6) that you can actually compile and run, so Linux's behavior has something concrete to compare against.

After this module you should be able to answer

Processes / virtual memory / fork
  1. What do processes and threads share, and what don't they share? How is copy-on-write implemented after fork?
  2. What problem does a multi-level page table solve? Roughly how expensive is a TLB miss, and what is a TLB shootdown?
  3. Walk through how a page fault is handled. What's the difference between a major fault and a minor fault?
  4. From a userspace syscall to kernel execution, what steps happen (trap, context switch, return)?
  5. What happens to copy-on-write pages after fork? Why is fork + exec on a huge-memory process still relatively cheap?
  6. What problem do huge pages and transparent huge pages solve? Why do databases often recommend disabling THP?
  7. What do process states R / S / D / Z mean? Why can't even SIGKILL terminate a process in D state?
  8. At which layer do ASLR, NX, and KASLR each operate, and which class of attacks do they block?
File systems / I/O / multiplexing
  1. How does mmap's performance differ from read/write? When should you reach for mmap, and when is it actually slower?
  2. What does epoll give you over select/poll? When do you pick edge-triggered vs level-triggered?
  3. How does io_uring differ from epoll at its core? What problems does it solve that epoll can't?
  4. Zero-copy syscalls (sendfile, splice, tee, MSG_ZEROCOPY) — which one fits which scenario?
  5. The disk write path is write → page cache → writeback → device queue. Where exactly does fsync vs fdatasync wait?
  6. How do ext4, XFS, and Btrfs differ in journaling, metadata concurrency, and snapshots? What should databases weigh when choosing?
  7. What are the costs and benefits of direct I/O (O_DIRECT) bypassing the page cache? Why does PostgreSQL avoid it while MySQL/InnoDB uses it?
Scheduling / cgroups / containers
  1. How do Linux's CFS and EEVDF schedulers allocate CPU time? What's the relationship between the nice value and cgroup cpu.weight?
  2. What are the core differences between cgroup v1 and v2? Which controllers implement CPU and memory limits inside a container?
  3. How does the OOM killer score processes (oom_score / oom_score_adj)? Why is the biggest memory hog not always killed first?
  4. What's the relationship between futex and a user-space mutex? Why does the uncontended path avoid the kernel almost entirely?
  5. Which namespaces (pid / net / mnt / uts / ipc / user) provide container isolation? Why is the user namespace the linchpin for security?
  6. At which layer do seccomp, capabilities, AppArmor, and SELinux each restrict processes? Which syscalls does the default runc policy block?
  7. What throttling artifact does combining cpu.cfs_period_us and cpu.cfs_quota_us produce? Why do teams often drop CPU limits on latency-sensitive services?
Performance & observability
  1. How would you use strace to diagnose a stuck process? What kinds of problems are better suited to perf sampling vs bpftrace?
  2. How does the USE method (Utilization / Saturation / Errors) work step by step when a box is thrashing?
  3. Why does perf record need to pick among frame pointers / libunwind / DWARF for stack unwinding? How do they differ in overhead and accuracy?
  4. Why is eBPF called "safely running code in the kernel"? What classes of bugs does the verifier prevent?
  5. How does Intel's Top-down methodology (Frontend / Bad Speculation / Backend / Retiring) pinpoint where a hot loop actually stalls?

Core concepts

  • Processes and threads: the kernel's two basic scheduling units, one for isolation and one for sharing. Nail down what each owns: address space, file descriptors, signals.

  • Virtual memory: every process sees its own contiguous address space, mapped to physical RAM via page tables and the MMU. Prerequisite for understanding fork, mmap, and OOM.

  • File systems and the page cache: inodes, directories, journals, the page cache. Once it clicks you can explain why fsync is expensive and why lots of small files are slow.

  • Synchronization primitives: locks, condition variables, semaphores, lock-free structures. Unavoidable for multithreaded code and the direct target of the xv6 lock lab.

  • epoll: Linux's efficient I/O multiplexing, the foundation under Nginx and Redis event loops. Understanding LT vs ET modes is table stakes.

  • io_uring: Linux's newer async I/O interface that batches syscalls through shared ring buffers. High-throughput storage and network stacks are migrating to it.

  • mmap: maps files or anonymous memory into a process's address space; the usual tool for shared memory and random access on large files. Misuse it and you get SIGBUS and weird write-back behavior.

  • /proc: the kernel exposes runtime state as a virtual filesystem. In production triage, /proc/<pid>/maps, status, and stack are the files you open most.

  • The USE method: Brendan Gregg's triage framework: check Utilization, Saturation, and Errors on every resource. Your first sweep when a box hangs.

  • Top-down analysis: Intel's Frontend / Bad Speculation / Backend / Retiring four-bucket methodology. Tells you exactly which CPU stage a hot loop is stalling in.

Lab

  • xv6 labs: a lab suite built on a teaching Unix-like kernel. Working through it turns syscalls, page tables, locks, and filesystems from words into code you've modified yourself.

  • util lab: write small tools (xargs, find, etc.) using xv6 syscalls. Warm-up lab to get familiar with the workflow.

  • syscall lab: add a new syscall to the kernel. Walks you through the full trap table, argument passing, and return path.

  • pgtbl lab: hand-manipulate RISC-V's three-level page table. Your mental model of address translation goes from slideware to something you can draw at the register level.

  • lock lab: refactor coarse kernel locks for better concurrency. Forces you to confront lock contention and the trade-offs of splitting locks.

  • mmap lab: implement mmap/munmap inside xv6. After this, Linux mmap behavior stops feeling mysterious.

Reading

  • OSTEP (Operating Systems: Three Easy Pieces): three-part structure (virtualization, concurrency, persistence), free online, plainspoken. The best candidate for a primary textbook.

  • MIT 6.1810 (formerly 6.S081): home page for MIT's OS course; notes, videos, and labs all open. Pairs extremely well with OSTEP.

  • The xv6 book: the line-by-line companion to xv6's source. Your most-consulted reference while doing the labs.

  • LWN.net: the authoritative news source for Linux kernel development. If you want to follow scheduler, memory, or io_uring evolution, this is the only game in town.

  • The Linux Programming Interface: Michael Kerrisk's encyclopedic reference on Linux systems programming. When you have a question about syscall behavior, it usually has the answer.

  • Systems Performance: Brendan Gregg's encyclopedia of Linux system performance. CPU, memory, disk, network, all through the USE method.

Tools

  • strace: traces syscalls and signals of a running process. Stuck process, file not opening, network not connecting: strace first, ask questions later.

  • ltrace: strace's library-call counterpart, showing libc and dynamic-library calls. Useful for debugging weird behavior at the glibc layer.

  • perf: Linux's native performance tool for sampling CPU, hardware events, and scheduler latency. The workhorse profiler.

  • bpftrace: a high-level tracing language on top of eBPF; one-liners can observe kernel events. Lighter than strace, more flexible than perf.

  • eBPF / bcc: infrastructure for safely running sandboxed programs in the kernel; bcc is its Python toolkit. The de facto standard for modern Linux observability.

  • QEMU: open-source machine emulator and what xv6 runs on top of. Indispensable for kernel debugging or playing with other architectures (RISC-V, ARM).

  • ftrace: in-kernel function tracer. Workhorse for syscall, scheduler, and I/O stack investigations alongside bpftrace.

  • FlameGraph: folds stack samples into a visualization. First move after perf record.

03

Networking & RPC

The network is the layer every backend engineer uses daily but few have actually read the RFCs for. The goal here is not to turn you into a protocol expert, but to let you read a packet capture, explain what happens in a single HTTPS request, and tell whether a performance issue lives in the handshake, congestion control, or the application layer.

After this module you should be able to answer

TCP / transport
  1. Why does a three-way handshake suffice while a two-way one does not? What is TIME_WAIT in the four-way close actually protecting against?
  2. When does a fresh TCP connection transition from slow start to congestion avoidance? How do Reno and CUBIC react differently to loss?
  3. What are the core differences between BBR and CUBIC? Why is BBR better on long fat pipes but potentially worse on short-connection-dominated workloads?
  4. What latency anti-pattern does Nagle combined with delayed ACK cause? When should TCP_NODELAY be on, and when off?
  5. What's the relationship between MSS, MTU, and Path MTU Discovery? Why does a 1500-MTU link with VPN overhead often stall in the 1400-byte range?
  6. In Wireshark, how do you visually distinguish retransmission, out-of-order delivery, and zero-window conditions?
  7. How do Linux tcp_wmem / tcp_rmem / tcp_mem cooperate? How do you tune them for high-BDP links?
HTTP / QUIC
  1. What HTTP/1.1 problem does HTTP/2 multiplexing solve? Why is it still vulnerable to TCP head-of-line blocking while HTTP/3 is not?
  2. Why is QUIC built on UDP instead of as a new L4 protocol? How does it implement connection migration?
  3. For a single `curl https://example.com`, which syscalls and network round trips happen in order from DNS lookup to first byte?
  4. What do chunked, keep-alive, and pipelining each solve in HTTP/1.1? Why did pipelining never catch on?
  5. How is HTTP/3 0-RTT resumption related to TLS 1.3 0-RTT? How should applications defend against replays?
TLS / crypto
  1. What round trip did TLS 1.3 eliminate compared to 1.2? What is the security cost of 0-RTT?
  2. What does mTLS add over one-way TLS? What does the typical sidecar-injected mTLS flow look like in a service mesh?
  3. In a TLS handshake, what do ECDHE, certificate verification, and Finished each accomplish? Which step guarantees forward secrecy?
  4. What availability and latency problems do CRL, OCSP, and OCSP stapling each introduce?
RPC / application layer
  1. How does gRPC map bidirectional streams and status codes onto HTTP/2 streams and trailers?
  2. Which HTTP methods are idempotent, and why can naive client retries trigger cascading failures?
  3. How does a gRPC deadline propagate down the call chain? How are metadata, headers, and context carried?
  4. What are the trade-offs between client-side load balancing (lookaside / proxyless xDS) and classic L4/L7 proxies?
Kernel networking / load balancing
  1. What problem does proxy protocol solve? How does the real client IP survive through an L4 load balancer?
  2. What are the typical symptoms of a full ip_conntrack table? When is it better to simply disable conntrack?
  3. How does Cilium / eBPF as a kube-proxy replacement compare to iptables / IPVS on performance and observability?
  4. What's the difference between SO_REUSEPORT and plain bind? When multiple processes listen on the same port, how does the kernel distribute connections?
  5. At which layer do RSS / RPS / RFS / XPS each distribute traffic on a Linux NIC? What do you tune at high packets-per-second?
  6. How do DPDK, XDP, and AF_XDP compare as user-space packet paths on performance and programming model?

Core concepts

  • TCP (RFC 9293): the 2022 consolidated TCP spec. Read it to internalize the SYN/ACK/FIN state machine, not to memorize trivia.

  • Congestion control: slow start, congestion avoidance, fast retransmit, fast recovery. Almost every weird throughput problem traces back to this diagram.

  • HTTP/1.1: still the protocol most servers actually handle. Understanding keep-alive, chunked encoding, and pipelining pitfalls is the floor.

  • HTTP/2: introduces binary framing, multiplexing, and HPACK. Needed to read gRPC and to see why one stuck TCP connection stalls all streams.

  • HTTP/3: HTTP semantics over QUIC. Focus on how it avoids TCP head-of-line blocking and supports connection migration.

  • QUIC: reliable transport plus crypto plus multiplexing, rebuilt on UDP. The new foundation for the modern network stack.

  • TLS 1.3: drops a round trip from the handshake and removes legacy algorithms. Knowing the 1.2 vs 1.3 handshake difference unlocks almost any capture.

  • DNS: the first thing in your request path that can fail. Recursive resolution, TTL, and caching are prerequisites for debugging.

  • gRPC over HTTP/2: the one-page mapping of gRPC onto HTTP/2 frames. Faster than reading source.

Lab

Reading

Tools

  • Command-line packet capture. First tool to reach for on a server; invest time in its filter syntax.

  • GUI protocol analyzer. Open a tcpdump capture locally and see TLS and HTTP/2 frames decoded with a click.

  • The standard tool for measuring bandwidth and loss. Run it first when deciding whether a problem is network or app.

  • Traceroute plus ping in one view. Invaluable for diagnosing cross-region or cross-ISP packet loss.

  • The universal HTTP client. With -v and --trace it prints almost everything you might want to inspect.

  • Proxy that injects latency, loss, and bandwidth limits at the TCP layer. Lightweight option for network fault drills.

04

Distributed Systems

Distributed systems is the study of what guarantees you can still offer when machines, networks, and clocks are all free to lie to you. The goal of this module is to build the vocabulary — consistency models, consensus, replication, failure models — so you can read papers, design storage, and reason about production anomalies in shared language.

After this module you should be able to answer

Consistency models & theory
  1. Why is the C in CAP not the same C as in ACID? In a real partition, how do engineers actually trade A against C?
  2. FLP says consensus is impossible in an asynchronous system. Why does Raft work anyway — which assumption does it quietly relax?
  3. Give concrete scenarios that distinguish linearizable, sequential, causal, and eventual consistency from one another.
  4. What's the threshold difference between BFT and CFT (n ≥ 3f+1 vs n ≥ 2f+1)? Outside of blockchains, where else does BFT become a hard requirement?
  5. What extra dimension does PACELC add to CAP? Which side of real-world design does it illuminate better?
Consensus & replication
  1. Why does Raft grant a vote only to candidates whose log is at least as up-to-date as the voter's own? What goes wrong without that check?
  2. How many failures can a 5-node cluster tolerate? What do you gain and lose by going to 7 nodes?
  3. What exact scenario causes 2PC to block? How do Paxos/Raft avoid that class of blocking?
  4. How do Raft's read index and lease read each implement linearizable reads, and what are the costs?
  5. What is joint consensus? Why does Raft use it for config changes instead of an atomic cutover?
  6. What problem do Raft snapshots + log compaction solve? What's the install-snapshot flow when a follower falls too far behind?
  7. How do Multi-Paxos, EPaxos, and Raft differ in latency profile for geo-distributed deployments?
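The cluster-size arithmetic behind question 2, as a runnable sketch (function names invented):

```go
package main

import "fmt"

// faultTolerance returns how many crash failures an n-node
// majority-quorum cluster (Raft, Multi-Paxos) survives: a quorum needs
// n/2+1 live nodes, so f = (n-1)/2 can be lost.
func faultTolerance(n int) int { return (n - 1) / 2 }

// quorum is the majority size needed to elect a leader or commit.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5, 6, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
	// n=5 and n=6 both tolerate 2 failures: even sizes buy no extra
	// safety. Going from 5 to 7 buys f=3 at the cost of a bigger quorum,
	// meaning more replicas on every commit's critical path.
}
```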
Transactions & MVCC
  1. Which anomalies do transaction isolation levels (RC / RR / SI / SSI / Serializable) each permit? Why isn't Snapshot Isolation serializable?
  2. What's the core idea behind cross-shard transaction models like Percolator and Omid? How do they differ from 2PC?
  3. Under MVCC, how is the snapshot visible to a transaction determined? Why are vacuum/GC unavoidable costs?
  4. How do pessimistic and optimistic locking compare on tail latency in cross-shard transactions? Why does OCC collapse under high contention?
  5. How does deterministic database design (Calvin / FaunaDB) sidestep 2PC? What is the trade-off?
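The visibility rule in question 3 fits in a few lines: a snapshot taken at readTS sees, per key, the newest version whose commit timestamp is at or below readTS. A Go sketch assuming plain integer timestamps and a newest-first version chain (all names invented):

```go
package main

import "fmt"

// version is one MVCC entry: a value plus the timestamp of the
// transaction that committed it.
type version struct {
	commitTS int
	val      string
}

// visible returns what a snapshot at readTS should see: the newest
// version with commitTS <= readTS. The chain is ordered newest-first,
// as an LSM level or heap chain typically would be.
func visible(chain []version, readTS int) (string, bool) {
	for _, v := range chain {
		if v.commitTS <= readTS {
			return v.val, true
		}
	}
	return "", false // the key did not exist yet at readTS
}

func main() {
	// Three committed versions of one row, newest first.
	chain := []version{{30, "v3"}, {20, "v2"}, {10, "v1"}}
	for _, ts := range []int{5, 15, 25, 35} {
		v, ok := visible(chain, ts)
		fmt.Println(ts, v, ok)
	}
	// Old versions can only be garbage-collected once no live snapshot
	// has a readTS below their successor's commitTS: that retention is
	// exactly the vacuum/GC cost the question asks about.
}
```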
Clocks & time
  1. What guarantee does Spanner's TrueTime provide? How would a team without GPS plus atomic clocks approximate external consistency?
  2. What guarantees do Lamport clocks, vector clocks, and HLC each provide? What role does HLC play inside CockroachDB?
  3. In an eventually consistent system, how do version vectors or CRDTs merge concurrent writes without relying on wall clocks?
  4. What can go wrong in a typical distributed DB if NTP drifts by 50 ms? How short can a leader lease realistically be?
Engineering & failures
  1. What are the trade-offs between consistent hashing and range-based sharding? What problem do virtual nodes (vnodes) solve?
  2. Under what conditions does split-brain happen? How do leases and fencing tokens combine to prevent two 'leaders' from writing concurrently?
  3. What are the five most common classes of consistency bugs in Jepsen reports? Why did early MongoDB keep tripping on them?
  4. Why is gray failure (half-dead nodes) harder than crash failure? Which systems explicitly design around it?
  5. What preconditions do you need before chaos engineering is safe in production? How does it complement Jepsen-style model checking?
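A minimal consistent-hash ring with virtual nodes (question 1), sketched in Go with invented names; real systems add weights, replication, and membership changes on top:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// ring is a consistent-hash ring with virtual nodes: each physical
// node owns `vnodes` points on the ring, which smooths out load and
// lets a bigger machine take proportionally more keys.
type ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // position -> physical node
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash32(n + "#" + strconv.Itoa(i))
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup walks clockwise to the first vnode at or after the key's hash.
// Removing one node only reassigns that node's arcs; everything else
// stays put, which is the whole point versus mod-N hashing.
func (r *ring) lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"a", "b", "c"}, 64)
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[r.lookup("key-"+strconv.Itoa(i))]++
	}
	fmt.Println(counts) // roughly even, thanks to the vnodes
}
```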

Core concepts

  • CAP: under a partition you pick consistency or availability. One of the most abused concepts in the field; clarify what it does and does not claim.

  • FLP impossibility: no deterministic consensus in an asynchronous network with even one crash. Knowing it explains why Raft needs timeouts.

  • Linearizability: the strongest single-object model: every op appears to take effect atomically at some instant. This is what etcd and ZooKeeper give you.

  • Eventual consistency: the weak-consistency model behind Dynamo and Cassandra. Jepsen's consistency-model map is the clearest reference diagram in the area.

  • Leader election: the first step of most consensus and replication protocols. Figuring out 'who is in charge' is where every consistency discussion starts.

  • Replicated state machines: turn operations into a replayable log, ship it to a majority, then apply it to the state machine. The underlying pattern of modern storage.

  • Quorums: arithmetic constraints like R + W > N. Understanding them lets you derive the consistency strength of any given configuration.

  • Two-phase commit: the textbook approach to cross-resource transactions and the textbook cautionary tale about coordinator blocking.
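The quorum arithmetic above can be played with directly. A sketch with invented function names, checking the two overlap conditions Dynamo-style systems care about:

```go
package main

import "fmt"

// strongReads reports whether an N/R/W configuration guarantees every
// read quorum overlaps the latest write quorum: R + W > N.
func strongReads(n, r, w int) bool { return r+w > n }

// orderedWrites reports whether any two write quorums overlap
// (W + W > N), which is needed to serialize conflicting writes.
func orderedWrites(n, w int) bool { return w+w > n }

func main() {
	// Classic configurations over N = 3 replicas.
	for _, c := range []struct{ r, w int }{{1, 1}, {2, 2}, {1, 3}, {3, 1}} {
		fmt.Printf("R=%d W=%d  read-overlap=%v write-overlap=%v\n",
			c.r, c.w, strongReads(3, c.r, c.w), orderedWrites(3, c.w))
	}
	// R=1 W=1 is fast but a read can miss the latest write entirely;
	// R=2 W=2 always overlaps, which is what 'quorum reads and writes'
	// means in Dynamo-descended stores.
}
```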

Lab

  • MIT 6.824/6.5840 Raft lab: implement Raft from scratch in Go. Only after finishing this do election, log matching, and safety really click.

  • Fault-tolerant KV lab: build a linearizable KV on top of your Lab 2 Raft. Teaches idempotency and client sessions against duplicate requests.

  • Sharded KV lab: multiple Raft groups plus shard migration. Closely mirrors production KVs like TiKV and CockroachDB.

  • TinyKV: PingCAP's engineering-flavored distributed systems course. A good complement to the MIT labs with a more practical tilt.

  • dragonboat examples: sample projects for a production-grade Go multi-Raft library. Comparing it to your Lab 2 teaches real engineering nuance.

  • A distributed database running deterministic simulations in the browser. Understanding it shows you the ceiling of modern fault-injection testing.

Reading

  • GFS: the original Google File System paper. HDFS and a generation of distributed storage descend from it; read for the pattern, not the details.

  • MapReduce: kicked off the big-data era. You may not use MR today, but its take on failure handling and retries shaped a whole generation of systems.

  • The Raft paper: a consensus algorithm deliberately designed for understandability. Reading it plus doing Lab 2 essentially nails down consensus.

  • Paxos Made Simple: Lamport's own 'simple' version. Still brain-bending, but required reading to understand why Raft looks the way it does.

  • Spanner: globally distributed, strongly consistent database; TrueTime is the key innovation. Every strong-consistency cloud DB today chases it.

  • Dynamo: the blueprint for eventually consistent KVs. NWR, vector clocks, gossip, and consistent hashing all in one place.

  • DDIA: Kleppmann's Designing Data-Intensive Applications. The only book in this module worth reading cover to cover.

  • Martin Kleppmann's blog: his takes on consistency, clocks, and stream processing are sharper and more current than the book.

Tools

  • Jepsen: Kyle Kingsbury's distributed-systems fault-testing framework and blog. Required reading for anyone shipping distributed storage.

  • etcd: production-grade Raft implementation and the backing store for Kubernetes. Reading its source is closer to engineering than reading the paper.

  • ZooKeeper: the veteran coordination service, built on ZAB. A huge pile of legacy systems rely on it for locks and leader election.

  • TLA+: Lamport's formal specification language. Use it to model a protocol before you code it and let the model checker find bugs for you.

  • Chaos Mesh: a chaos-engineering platform for Kubernetes. Inject partitions, node crashes, and latency to verify your system actually survives.

  • Toxiproxy: lightweight network fault-injection proxy. Handy for simulating partitions and latency inside integration tests.

05

Data Infra

Data infrastructure is the plumbing between raw writes and analytical queries: how storage engines trade off read vs write amplification, how columnar formats accelerate scans, how stream processors produce correct results under out-of-order events and failures, and how lakehouse table formats bolt ACID onto object stores. Understanding this layer is what lets you reason about modern OLAP, real-time warehouses, and lakehouses instead of just using them.

After this module you should be able to answer

Storage engines (LSM / B+ Tree)
  1. Why are LSM-Trees fast to write but potentially slow to read? What problem is compaction solving, and what new amplification does it introduce?
  2. Compare B+ Tree and LSM-Tree on write amplification, read amplification, and space amplification. Which workloads favor which?
  3. Which write patterns suit RocksDB's leveled, universal, and FIFO compaction? How do you balance the three amplifications?
  4. Why does LSM default to bloom filters? When can a bloom filter actually slow reads down?
  5. RocksDB's LSM compaction amplifies writes, so why does it still beat a B-Tree on SSDs? Which section of compaction_job.cc makes this clearest?
  6. Why did Redis pick single-threaded IO multiplexing over threads? Point to the latency vs throughput trade-off in ae.c and networking.c.
  7. Why does Redis Cluster use fixed 16384 hash slots instead of consistent hashing?
  8. How do deterministic-simulation databases (FoundationDB / TigerBeetle) differ from RocksDB-style engines at the storage layer?
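
The questions above can be grounded with a toy model. A minimal LSM sketch in Python (illustrative names, not RocksDB's API): writes go to a sorted in-memory memtable that is flushed into immutable SSTables, reads check newest-first, and compaction merges tables, trading extra writes for cheaper reads and reclaimed space.

```python
# Toy LSM-tree: memtable + flushed SSTables (names are illustrative, not RocksDB's API).
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # in-memory, mutable
        self.sstables = []            # on-"disk", immutable, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # cheap write: no in-place disk update
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable into a sorted, immutable SSTable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Read amplification: memtable first, then every SSTable newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

    def compact(self):
        # Merge all SSTables into one, dropping shadowed versions:
        # reclaims space and speeds reads, but rewrites data (write amplification).
        merged = {}
        for sst in self.sstables:
            merged.update(sst)
        self.sstables = [dict(sorted(merged.items()))]

db = ToyLSM()
for i in range(5):
    db.put(f"k{i}", i)
db.put("k0", 99)                      # newer version shadows the old one
assert db.get("k0") == 99
db.compact()
assert len(db.sstables) == 1 and db.get("k0") == 99
```

Bloom filters slot into `get` right before each SSTable probe, which is why they help point reads but do nothing for range scans.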
Columnar & query execution
  1. Why does Parquet split data into row group + column chunk + page? When do dictionary encoding and RLE actually pay off, and when do they hurt?
  2. What problem do Dremel's repetition and definition levels solve for nested data? Why can't you just flatten everything into scalar columns?
  3. Why is vectorized execution faster than the Volcano iterator model? Why can ClickHouse outrun Spark SQL by an order of magnitude?
  4. How do ORC and Parquet differ in footer / stripe / bloom filter layouts? Why does Hive lean toward ORC while the Spark ecosystem leans Parquet?
  5. What does Arrow (in-memory columnar) solve for zero-copy data movement compared to Parquet? Where does Flight fit?
  6. What flavor of 'LSM' is ClickHouse's MergeTree? What roles do the primary index, skip index, and parts play?
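
To see when dictionary encoding and RLE pay off, here is a minimal sketch (illustrative, not Parquet's actual wire format): a low-cardinality column with long runs collapses dramatically, while a high-cardinality one gains nothing and pays dictionary overhead.

```python
# Dictionary + run-length encoding sketch (illustrative, not Parquet's wire format).
def dict_rle_encode(column):
    # Dictionary: map each distinct value to a small integer code.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
    codes = [dictionary[v] for v in column]
    # RLE over the codes: (code, run_length) pairs.
    runs, prev, count = [], codes[0], 1
    for c in codes[1:]:
        if c == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = c, 1
    runs.append((prev, count))
    return dictionary, runs

def dict_rle_decode(dictionary, runs):
    rev = {i: v for v, i in dictionary.items()}
    return [rev[c] for c, n in runs for _ in range(n)]

# Low cardinality + long runs: huge win (1000 values -> 1 dict entry + 1 run).
col = ["US"] * 1000
d, runs = dict_rle_encode(col)
assert len(d) == 1 and len(runs) == 1
assert dict_rle_decode(d, runs) == col

# High cardinality, no runs: the dictionary adds overhead instead of saving space.
col = [f"id-{i}" for i in range(1000)]
d, runs = dict_rle_encode(col)
assert len(d) == 1000 and len(runs) == 1000
```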
Stream processing (Kafka / Flink)
  1. How is Kafka exactly-once actually implemented? What is the division of labor between idempotent producer, transactions, and read_committed consumers?
  2. What is a Flink watermark really? When a late event arrives, which knobs (allowedLateness, side output, trigger) decide what happens to the window?
  3. What do Kafka's ISR, leader election, and unclean leader election each mean for consistency?
  4. Once Kafka tiered storage moves cold data to S3, what new metadata and latency problems appear, and how are they solved?
  5. What do Flink's aligned vs unaligned checkpoints each solve, and what are the trade-offs?
  6. What pitfalls come from mixing event time and processing time? How do you design for late data and corrections?
  7. How do Flink's state backends (memory / rocksdb) trade off checkpoint size against recovery time?
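
A bounded-out-of-orderness watermark can be sketched in a few lines (illustrative logic, not Flink's API): the watermark trails the max event time seen by a fixed delay, a tumbling window fires once the watermark passes its end, and anything arriving after that is routed to a late-data side output.

```python
# Watermark sketch (illustrative, not Flink's API): a bounded-out-of-orderness
# watermark lags the max event time by a fixed delay; a tumbling window fires
# once the watermark passes its end, and anything after that is "late".
WINDOW, DELAY = 10, 3
windows, fired, late = {}, [], []
max_ts = 0

def on_event(ts, value):
    global max_ts
    watermark = max_ts - DELAY
    start = (ts // WINDOW) * WINDOW
    if start + WINDOW <= watermark:
        late.append((ts, value))           # window already fired: side output
        return
    windows.setdefault(start, []).append(value)
    max_ts = max(max_ts, ts)
    watermark = max_ts - DELAY
    # Fire every window whose end the watermark has now passed.
    for s in sorted(list(windows)):
        if s + WINDOW <= watermark:
            fired.append((s, sum(windows.pop(s))))

for ts, v in [(1, 1), (4, 1), (12, 1), (9, 1), (14, 1)]:
    on_event(ts, v)                        # (9,1) is out of order but not late
assert fired == [(0, 3)]                   # [0,10) fires once watermark >= 10
on_event(2, 1)                             # arrives after [0,10) fired -> late
assert late == [(2, 1)]
```

`allowedLateness` would widen the "not yet late" band past the watermark, and a custom trigger would change when `fired` gets appended; both are knobs on exactly this loop.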
Lakehouse / transactions
  1. How do Iceberg and Delta Lake achieve ACID on an object store that only offers PUT? Walk through snapshots, manifests, and how commit conflicts are resolved.
  2. What does a 'transaction' on an S3-backed lakehouse actually mean? How do Iceberg, Delta, and Hudi differ in resolving commit conflicts?
  3. How does Iceberg's three-tier metadata (metadata JSON → manifest list → manifests) enable time travel and partition evolution?
  4. Which read/write ratios fit Hudi's CoW vs MoR tables? How does Hudi MoR compare to Iceberg MoR on merge strategy?
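
The commit protocol behind these formats reduces to optimistic concurrency on a single pointer. A toy sketch (the real systems swap a metadata-file location in a catalog; here it is just an integer snapshot id):

```python
# Optimistic commit sketch (illustrative; real table formats like Iceberg swap
# a metadata-file pointer in a catalog, here it is just an integer snapshot id).
class ToyTable:
    def __init__(self):
        self.snapshot_id = 0
        self.manifests = []            # list of data-file lists, one per commit

    def commit(self, based_on, new_files):
        # Compare-and-swap: succeed only if nobody committed since we read.
        if based_on != self.snapshot_id:
            return False               # conflict: caller must rebase and retry
        self.manifests.append(new_files)
        self.snapshot_id += 1
        return True

table = ToyTable()
base = table.snapshot_id               # both writers read snapshot 0
assert table.commit(base, ["a.parquet"])               # writer 1 wins
assert not table.commit(base, ["b.parquet"])           # writer 2 conflicts...
assert table.commit(table.snapshot_id, ["b.parquet"])  # ...rereads and retries
assert table.snapshot_id == 2
```

Since every snapshot's manifests stay on the object store, time travel is just reading an older entry of `manifests` instead of the latest.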
MVCC / schema evolution / CDC
  1. Under MVCC, how is the snapshot visible to a transaction determined? Why are vacuum/GC unavoidable costs of MVCC?
  2. Why does CDC (Debezium et al.) parse binlog / WAL directly? How does the downstream stay stable across upstream schema changes?
  3. What compatibility rules do Avro, Protobuf, and JSON each define for schema evolution (add / drop / retype fields)?
  4. When do Debezium's snapshot + incremental mode vs log-only mode each fit?
  5. What is the core value of SQL-native ETL tools like dbt over traditional Airflow + hand-written SQL? How do they handle testing and lineage?
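
MVCC visibility can be sketched with Postgres-flavored xmin/xmax version stamps (heavily simplified; the real rules also involve hint bits, subtransactions, and command ids):

```python
# MVCC visibility sketch (Postgres-flavored, simplified): each row version
# carries the creating txid (xmin) and deleting txid (xmax); a snapshot is
# "all txids below snapshot_xid, minus those still active when it was taken".
def visible(version, snapshot_xid, active):
    xmin, xmax = version["xmin"], version["xmax"]
    # Creator must have committed before our snapshot (and not still be active).
    if xmin >= snapshot_xid or xmin in active:
        return False
    # If a deleter committed before our snapshot, the version is gone for us.
    if xmax is not None and xmax < snapshot_xid and xmax not in active:
        return False
    return True

# txid 5 inserted the row, txid 8 later deleted it; txid 7 was still active.
row = {"xmin": 5, "xmax": 8, "val": "old"}
assert visible(row, snapshot_xid=6, active={7})        # delete not yet visible
assert not visible(row, snapshot_xid=9, active=set())  # delete committed: gone
# Versions invisible to every live snapshot are exactly what vacuum/GC reclaims.
```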

Core concepts

  • The write-optimized, level-structured store. Memtable/SSTable/compaction is the shared foundation of every modern KV engine (RocksDB, Cassandra, ScyllaDB).

  • The default index of classic relational databases. Read it alongside LSM to see why OLTP and KV systems make different engineering trade-offs.

  • The de-facto columnar file format for analytics. Knowing its row group / page / encoding layout is why OLAP scans beat row stores by an order of magnitude.

  • The idea behind Parquet's and BigQuery's nested columnar storage. Repetition/definition levels are the trick for columnar-izing JSON-like data.

  • What lets Postgres, InnoDB, and most modern OLTP engines keep reads and writes from blocking each other. Understanding snapshots and visibility is table stakes for reading their code.

  • The canonical append-only log + segment + index design. It is the substrate for stream processing, CDC, and event-sourced architectures.

  • The mechanism for reasoning about out-of-order events in a stream. Without it, event-time windows and late-data handling make no sense.

  • One of the most misunderstood terms in streaming. This post ties idempotent producer, transactions, and consumer isolation into one coherent story.

  • One of the two dominant open lakehouse specs. Its snapshot/manifest/metadata layout is what enables schema evolution and time travel on object storage.

  • Databricks's competing lakehouse protocol. Comparing it to Iceberg highlights different trade-offs in transaction-log design.

Labs

  • Buffer pool, B+ Tree, concurrency control, transactions — the best open course for systematically building a database kernel from scratch.

  • Implement memtable, SSTable, and compaction step by step in Rust. Nothing teaches LSM internals like writing one.

  • Dozens of lines of code, but it shows up in LSM, databases, and caches everywhere. Writing one cements the hash-count vs false-positive trade-off.

  • Run the official quickstart end to end to build a real mental model of topic/partition/offset before tackling stream processing.

  • The smallest runnable DataStream example. Use it to feel how keyBy, windows, and watermarks actually compose.

  • Embedded OLAP that runs vectorized queries over Parquet on your laptop. Running EXPLAIN on real datasets makes query plans concrete fast.

  • Start with ae.c, networking.c, t_string.c to see the single-threaded event loop and the memory packing tricks in SDS and ziplist.

  • Industrial reference for LSM-trees. Read LevelDB for the skeleton, then jump to RocksDB for compaction and prefix bloom in production.

  • A textbook for columnar storage plus vectorized execution. Study the inner loops of AggregatingTransform and ColumnVector.

Reading

  • The founding paper for nested columnar storage and interactive SQL engines, and the intellectual core of BigQuery.

  • A landmark attempt at HTAP columnar storage; clearly explains why OLAP column stores struggle with low-latency random writes.

  • The seminal columnar OLAP paper and ancestor of Vertica. Its projection/compression/sort-column ideas are still in use today.

  • The 2011 original design paper. Short, but it lays out the trade-offs of a log-centric architecture clearly.

  • The reference for TrueTime and globally consistent transactions. It directly shaped CockroachDB, YugabyteDB, and others.

  • Kleppmann's Designing Data-Intensive Applications, the single most recommended survey book in the data-systems space.

Tools

  • A single-node vectorized OLAP engine. Great for local Parquet/CSV analysis and a fun target for studying columnar execution.

  • A production-grade columnar OLAP database with best-in-class query speed. Its MergeTree source code is a goldmine of engineering tricks.

  • The most widely embedded LSM KV engine in industry — MySQL, TiDB, CockroachDB, Kafka Streams, and many more rely on it.

  • kafka-console-producer/consumer/topics are your first line of defense when debugging a Kafka cluster.

  • The de-facto batch engine and the main compute layer in the lakehouse stack. Catalyst/Tungsten are how you learn modern SQL optimizers.

  • The de-facto stream processor and the industrial reference for state, checkpoints, and exactly-once execution.

06

AI Infra

AI infrastructure is about how models actually run on hardware: how training shards parameters, gradients, and optimizer state across GPUs and nodes; how inference keeps the KV cache and scheduler efficient; and how compilers lower operator graphs onto CUDA cores and Tensor Cores. This layer is what explains why the same model has wildly different latency across frameworks, and where cost optimization actually lives.

After this module you should be able to answer

GPU & CUDA basics
  1. How do CUDA threads / warps / blocks / grids map onto SMs? Why are block sizes usually a multiple of the 32-thread warp, typically 128 or 256?
  2. What is memory coalescing, and by what factor can a non-coalesced warp access waste bandwidth?
  3. What are the rough bandwidth and latency gaps between HBM, L2, SMEM, and registers? Which level does a "slow kernel" usually bottleneck on?
  4. Under the roofline model, what does it mean for a kernel to be compute-bound vs memory-bound, and how does that change the optimization direction?
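
The roofline model is just a min() of two ceilings. A sketch with illustrative A100-class numbers (assumed specs, not exact):

```python
# Roofline sketch: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved from HBM) sits below peak_flops / peak_bandwidth.
PEAK_TFLOPS = 312e12        # illustrative fp16 tensor-core peak
PEAK_BW = 2e12              # illustrative HBM bytes/s
ridge = PEAK_TFLOPS / PEAK_BW          # ~156 FLOP/byte: the compute/memory ridge

def attainable_flops(intensity):
    return min(PEAK_TFLOPS, intensity * PEAK_BW)

# Elementwise add: 1 FLOP per 12 bytes (2 fp32 reads + 1 write) -> memory-bound,
# so the fix is fusion / fewer bytes, not more math throughput.
add_intensity = 1 / 12
# A large GEMM reuses operands heavily -> intensity far above the ridge,
# so the fix is feeding the tensor cores, not saving bandwidth.
gemm_intensity = 300
assert add_intensity < ridge < gemm_intensity
assert attainable_flops(add_intensity) < 0.01 * PEAK_TFLOPS
assert attainable_flops(gemm_intensity) == PEAK_TFLOPS
```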
Training parallelism
  1. What exactly do data, tensor, and pipeline parallelism partition? Why must a 70B training run combine all three rather than rely on DP alone?
  2. What do ZeRO stages 1/2/3 shard? How does ZeRO relate to FSDP, and how much does communication cost grow compared to vanilla DP?
  3. Where does the pipeline-parallel bubble come from? How do 1F1B, interleaved 1F1B, and zero-bubble schemes each shrink it?
  4. What gap does sequence / context parallelism fill that tensor parallelism can't cover?
  5. How is compute/comm overlap actually achieved? Which operators most often end up on the critical path when you combine NCCL streams and buffers?
  6. Gradient checkpointing trades what for what? What's the typical ratio between saved memory and extra compute?
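
The ZeRO stages are easiest to remember as bytes-per-parameter arithmetic. A sketch assuming mixed-precision Adam (2-byte fp16 weights and grads, 12 bytes of fp32 master weight + momentum + variance per parameter):

```python
# ZeRO memory sketch: bytes per parameter per GPU under mixed-precision Adam.
# fp16 weights (2) + fp16 grads (2) + fp32 master/momentum/variance (12) = 16.
def bytes_per_param(stage, n_gpus):
    w, g, opt = 2, 2, 12
    if stage == 0:                      # vanilla DP: everything replicated
        return w + g + opt
    if stage == 1:                      # shard optimizer state
        return w + g + opt / n_gpus
    if stage == 2:                      # ... plus shard gradients
        return w + (g + opt) / n_gpus
    if stage == 3:                      # ... plus shard the parameters too
        return (w + g + opt) / n_gpus

# 70B params on 64 GPUs: vanilla DP would need ~1.12 TB *per GPU*; ZeRO-3 ~17.5 GB,
# which is why a 70B run cannot rely on plain DP.
params = 70e9
assert round(params * bytes_per_param(0, 64) / 1e12, 2) == 1.12
assert round(params * bytes_per_param(3, 64) / 1e9, 1) == 17.5
```

The price is communication: stage 3 must all-gather parameters for every forward and backward pass, which is the extra cost FSDP's prefetching and overlap exist to hide.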
Inference optimization
  1. What is the formula for KV cache memory, and why does it exceed the model weights for long-context inference?
  2. What specific problem does PagedAttention solve for KV cache, and why does it push memory utilization from 20-40% to 90%+?
  3. How does continuous batching differ from static batching, and why is the throughput win so large specifically for LLM inference?
  4. Why is FlashAttention fast? Is it an algorithmic win or a memory-access win, and why does it claim to leave attention's math unchanged?
  5. How do prefill and decode phases differ computationally? How much throughput does PD-disaggregated serving actually gain?
  6. How do Medusa, EAGLE, and Lookahead generate draft tokens in speculative decoding? What caps their achievable speedup?
  7. How much KV can prefix caching save when system prompts and multi-turn dialogs are shared? How does that relate to SGLang's radix tree?
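
The KV-cache formula from question 1 is worth writing out. A sketch with an assumed Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128, fp16):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * dtype bytes.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
weights_gb = 7e9 * 2 / 1e9                         # ~14 GB of fp16 weights
kv_gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 1e9
assert round(kv_gb, 1) == 17.2                     # the cache alone beats the weights
assert kv_gb > weights_gb
# GQA shrinks it linearly in kv_heads: 8 KV heads -> 1/4 the cache.
assert kv_cache_bytes(32, 8, 128, 4096, 8) == kv_cache_bytes(32, 32, 128, 4096, 8) / 4
```

Because the term grows with seq_len * batch while the weights are fixed, long-context, high-concurrency serving is exactly where the cache overtakes the model and PagedAttention's allocation strategy starts to matter.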
Precision & quantization
  1. What are the numerical-range differences between fp16, bf16, and fp8? Why does fp16 need loss scaling when bf16 usually doesn't?
  2. How do INT8, INT4, AWQ, and GPTQ quantization schemes differ? Where does accuracy degradation become unacceptable?
  3. How much does fp8 training (Hopper E4M3 / E5M2) save over bf16, and what are the convergence risks?
  4. How does quantizing the KV cache to INT8 / INT4 affect long-context inference latency, and which architectures are most prone to accuracy loss?
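
The range differences fall straight out of the bit layouts. A sketch computing the max normal value from exponent/mantissa widths (IEEE-style formats; fp8 E4M3 deliberately breaks the pattern):

```python
# Max normal value of an IEEE-style binary float from its field widths
# (top exponent code reserved for inf/NaN): (2 - 2^-m) * 2^(2^(e-1) - 1).
def max_normal(exp_bits, man_bits):
    max_exp = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -man_bits) * 2 ** max_exp

assert max_normal(5, 10) == 65504.0              # fp16: overflows easily -> loss scaling
assert 3.38e38 < max_normal(8, 7) < 3.40e38      # bf16: fp32-like range, less precision
assert max_normal(5, 2) == 57344.0               # fp8 E5M2 (IEEE-like variant)
# fp8 E4M3 reclaims the top exponent code (no inf), so its real max is 448,
# larger than the plain IEEE formula would predict for e=4, m=3.
```

The asserts make the fp16-vs-bf16 story concrete: bf16 trades 3 mantissa bits for fp32's exponent range, which is why it usually skips loss scaling while fp16 cannot.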
Model architecture (MoE / GQA)
  1. How do LoRA / QLoRA trade memory for quality vs full fine-tuning? Which layers give the best bang for the buck?
  2. What are the three hardest problems in MoE expert-parallel implementations (routing, all-to-all, load imbalance)?
  3. What do GQA / MQA save compared to MHA? Why do nearly all new models pick GQA?
  4. How do DeepSeek's fine-grained experts + shared experts differ in engineering from Mixtral's top-2 routing?
  5. What does multi-head latent attention (MLA) add beyond GQA on KV-cache compression?
Metrics & evaluation
  1. What's the difference between MFU and HFU? What levels does industrial training usually hit?
  2. How does vLLM's PagedAttention eliminate KV-cache fragmentation, and which OS mechanism is it borrowing from?
  3. How should you decompose an inference SLO across TTFT, TPOT, and end-to-end latency when request sizes vary wildly?

Core concepts

  • The foundational abstraction of GPU programming. Without it, kernel tuning and occupancy are just magic numbers.

  • A warp is the minimal scheduling/execution unit on a GPU; shared memory is the fast intra-block channel. Together they are the core tools for high-performance kernels.

  • Fusing 32 threads of a warp into one memory transaction is a prerequisite for saturating GPU bandwidth.

  • The Megatron-LM systems paper is the clearest treatment of combining all three. You cannot train modern LLMs without it.

  • Shards optimizer state, gradients, and parameters across a data-parallel group — the theoretical basis of DeepSpeed and FSDP.

  • The per-token state that makes autoregressive generation tractable. It sets the floor of LLM inference memory and the ceiling of scheduling.

  • vLLM's key contribution: apply OS-style virtual-memory paging to the KV cache and kill fragmentation.

  • Form batches at token granularity rather than waiting for whole requests. Typically 2–10x throughput for LLM serving.

  • A small draft model proposes, the big model verifies in parallel. The mainstream way to cut latency without changing model quality.

  • Uses tiling and recomputation to minimize HBM traffic for attention. One of the most important kernel-level wins of the past few years.

Labs

  • Tianqi Chen's deep-learning-systems course. You build from autograd up to CUDA kernels and see a DL framework end to end.

  • The classic parallel-computing course with labs in ISPC, CUDA, and MPI. The best way to build a GPU mental model.

  • Karpathy's pure-C/CUDA GPT training. Reading it turns PyTorch from a black box into a glass box.

  • Andrew Chan's pure C++/CUDA single-GPU inference engine with zero external deps. Hits 63.8 tok/s on Mistral-7B, matching or beating llama.cpp. If llm.c is training, yalm is inference — the complementary project on the same axis.

  • Sasha Rush's 14 interactive puzzles targeting warps, shared memory, and reductions. A few hours well spent for GPU intuition.

  • ~100 lines of Python that implement backprop. After this, PyTorch's computation graph stops feeling magical.

  • OpenAI's GPU DSL with a much lower barrier than CUDA. The lingua franca for FlashAttention and fused kernels today.

  • Open-source implementation of PagedAttention and continuous batching. Primary source for LLM inference performance.

Reading

  • Vaswani et al.'s 2017 transformer paper. The shared foundation under all of LLM infrastructure — self-attention, multi-head, and positional encoding all come from here.

  • NVIDIA's 2017 fp16 training paper. Loss scaling and master weights — the defaults in every modern training framework — originate here.

  • Google's founding paper on pipeline parallelism: slice the model by layers and fill the bubble with micro-batches. Every subsequent schedule — 1F1B, interleaved 1F1B, zero-bubble — is an iteration on it.

  • NVIDIA's large-model training systems paper and the canonical source for tensor parallelism.

  • Megatron team's SC'21 follow-up that combines data / tensor / pipeline parallelism on thousands of A100s. The de-facto reference architecture for modern LLM training — virtually every team starts from its blueprint.

  • DeepSpeed's sharded-optimizer paper and the theoretical basis for FSDP and modern large-scale training.

  • Google scaling MoE to trillion parameters. Its design choices — Top-1 routing, load-balancing loss, expert capacity — shaped an entire generation of MoE models including Mixtral and DeepSeek MoE.

  • A textbook example of IO-aware kernel design. Reading it teaches how HBM vs SRAM bandwidth dictates kernel structure.

  • Tri Dao's v2: re-partitions thread blocks and cuts non-matmul work, giving another 2x on attention kernels on A100. The default attention impl in virtually every inference framework today.

  • Hopper-specific redesign using wgmma, asynchronous TMA, and warp specialization to hit near hardware speed-of-light on H100 for fp16/fp8 attention.

  • The foundational paper on continuous batching + iteration-level scheduling. Every LLM scheduler in vLLM, TensorRT-LLM, and SGLang traces its core idea back here.

  • The milestone systems paper for LLM inference. It formalizes scheduling and memory management as first-class problems.

  • The canonical paper on prefill/decode disaggregation. Pushes a new Pareto frontier between latency and throughput by decoupling the two phases — one of the most important architectural shifts in current inference stacks.

  • Leviathan et al.'s 2022 original paper. The 'small drafter + large verifier in parallel' paradigm — Medusa, EAGLE, and Lookahead are all descendants.

  • Single-pass INT4 post-training quantization with negligible accuracy loss. Almost every INT4 weight in the llama.cpp / HuggingFace ecosystem goes through this.

  • Activation-aware weight quantization: identify the 'salient' channels from activation statistics and keep them in higher precision. Splits the production INT4 market with GPTQ.

  • The LLVM team's new IR infrastructure aimed at AI and heterogeneous compilers. Essential background for understanding modern compiler stacks.

  • The seminal deep-learning compiler paper. Its schedule/compute split deeply influenced Triton and the Halide community.

  • Andrew Chan's long-form companion to yalm. Builds from naive code up to production-competitive performance, covering OpenMP + AVX, warp reductions, kernel fusion, attention kernels, KV cache quantization, and manual unrolling and prefetching. The clearest single-node writeup on inference optimization in recent years.

Tools

  • The de-facto deep-learning framework — virtually every open-source model ships as a PyTorch checkpoint. torch.compile, FSDP2, and DTensor are the current pillars of the ecosystem.

  • Google's functional DL framework on top of XLA. Shines on TPUs and at large-scale training (Gemini, Mixtral training, etc.).

  • NVIDIA's open-source training framework. Reference implementation of tensor / pipeline / sequence parallelism, and the common starting point for GPT-3-scale training and above.

  • PyTorch team's native 4D-parallel training library (2024+). Skips Megatron by building directly on DTensor / FSDP2 — PyTorch's own reference training stack.

  • Microsoft's training optimization library and the primary implementation of ZeRO. A common choice for large-model training.

  • NVIDIA's fp8 training / inference kernel library and the de-facto standard for fp8 on Hopper and Blackwell. Tightly integrated with Megatron-LM.

  • HuggingFace's thin wrapper over multi-GPU, mixed precision, and FSDP. The easiest path from single-GPU to multi-GPU training.

  • Official implementation of the FlashAttention papers (v1 / v2 / v3). Directly integrated into PyTorch 2.2+ SDPA.

  • Meta's high-performance transformer operators — memory-efficient attention, SwiGLU, ALiBi, and more.

  • An order of magnitude easier than CUDA for writing GPU kernels. The main implementation language for production fused attention and quantization kernels.

  • The de-facto LLM inference engine today and the industrial reference for PagedAttention plus continuous batching.

  • NVIDIA's flagship LLM inference engine. At high concurrency it runs 30-50% faster than vLLM, paid for with a heavier compilation step and narrower model coverage. Powers most large cloud serving stacks.

  • The rising star: RadixAttention keeps prefix-cache hit rates very high, and it often beats vLLM on newer architectures like DeepSeek-V3. The first pick for structured generation and multi-turn workloads.

  • Pure C/C++ local inference engine running on Mac Metal, CPU, CUDA, and Vulkan. GGUF + INT4 quantization is the de-facto standard for consumer-grade local LLMs.

  • HuggingFace's inference server — once the most popular production option. Now in official maintenance mode with vLLM/SGLang recommended instead, but still widely deployed.

  • De-facto runtime for int8 / nf4 quantization. Behind QLoRA, HuggingFace Transformers' load_in_8bit / load_in_4bit, and most consumer-grade quantization.

  • Custom Triton kernels for LoRA / QLoRA delivering 2x speed and 50%+ VRAM reduction over plain HuggingFace. The first choice for fine-tuning on consumer GPUs.

  • Distributed Python computing — Ray Train / Ray Serve / Ray Data form a full ML infrastructure stack. Multi-node vLLM and SGLang both run on top of it.

  • The first tool to reach for when analyzing training/inference performance. Gives per-op CPU/GPU time and memory allocations.

  • NVIDIA's official GPU profiler. Kernel timing, SM occupancy, and memory-bandwidth bottlenecks all live here.

07

CUDA / GPU Programming

A GPU isn't a 'many-core CPU' — it's a throughput-oriented massively parallel machine: thousands of registers per SM, hundreds of KB of shared memory, tens of thousands of threads in flight. This module pulls apart CUDA's execution model, memory hierarchy, and sync primitives from the hardware angle, so you can write your own kernels, read the inner loops of CUTLASS and FlashAttention, and know from a single Nsight Compute metric where to optimize.

After this module you should be able to answer

Execution model & warps
  1. What does a kernel launch go through between the host call and the GPU starting execution (driver, runtime, queue, command processor, SM)? Roughly what fixed overhead does one launch cost?
  2. How many instructions can an SM's warp scheduler issue per cycle? Is 100% occupancy always fastest — why did occupancy become less critical after Volta?
  3. What is warp divergence? If 16 threads in a warp take the if and 16 take the else, how does the hardware execute it, and what is the cost?
  4. When do Cooperative Groups and grid.sync fit? How does combining them with persistent kernels eliminate launch overhead?
  5. Where do CUDA Graph gains over per-launch dispatch actually come from? When is there no meaningful win?
Memory hierarchy & access
  1. What causes a shared-memory bank conflict? When 32 threads in a warp hit different addresses in the same bank, how many serialized transactions does it become?
  2. What are the rough latencies and bandwidths of register / shared / L1 / L2 / HBM? Given a data-reuse pattern, where would you place the data?
  3. With ~192 KB combined shared memory and L1 on an SM, how do you configure a kernel to avoid register spills into local memory?
  4. How much bandwidth do vectorized loads (float4, ldmatrix) recover? Why is it almost mandatory for writing GEMM?
  5. What's the cost of unified memory (cudaMallocManaged) page migration? Which workloads should fall back to explicit cudaMemcpy?
  6. What does async copy (cp.async) save compared to the classic global → shared path through registers? How does Hopper extend it over Ampere?
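
Bank-conflict cost is countable. A sketch that maps word addresses to 32 banks and reports how many serialized passes a warp access needs (simplified: 4-byte words, same-word accesses broadcast for free):

```python
# Shared-memory bank conflict sketch: 32 banks, one 4-byte word per bank per
# cycle; distinct words in the same bank from one warp serialize.
def conflict_degree(word_addrs, n_banks=32):
    per_bank = {}
    for a in set(word_addrs):                  # same word -> broadcast, no conflict
        bank = a % n_banks
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())              # serialized transactions needed

stride1 = [tid for tid in range(32)]           # each thread its own bank: 1 pass
stride32 = [tid * 32 for tid in range(32)]     # all 32 threads hit bank 0: 32 passes
padded = [tid * 33 for tid in range(32)]       # pad the row to 33 words: back to 1
assert conflict_degree(stride1) == 1
assert conflict_degree(stride32) == 32
assert conflict_degree(padded) == 1
```

The `padded` case is the classic fix: padding a shared-memory tile from 32 to 33 words per row (or swizzling the layout) spreads column accesses across all banks.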
Tensor Cores / MMA
  1. What's the difference between a Tensor Core and a regular CUDA core? When do you use the wmma API versus writing mma PTX directly?
  2. How does CUTLASS 3.x's CuTe layout abstraction differ qualitatively from the 2.x tile iterator?
  3. How does wgmma (Hopper) differ from mma (Ampere) in execution granularity and asynchrony?
  4. Why is FlashAttention-3 another 1.5-2x faster on Hopper? Which features of warp specialization and wgmma does it exploit?
Multi-GPU communication
  1. How is concurrency across cudaStreams implemented? What are the relative costs of event, graph, and barrier synchronization?
  2. Why does NCCL ring all-reduce hit near peak NVLink bandwidth? At 64-GPU scale, would a tree algorithm do better?
  3. What are the rough bandwidth/latency levels of NVSwitch, NVLink, PCIe, and InfiniBand? How should a training cluster be topologically organized?
  4. What does SHARP (NVIDIA's in-network reduction) save over traditional ring all-reduce, and what are the deployment constraints?
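
The ring all-reduce numbers come from a simple traffic formula: each GPU sends and receives 2(N-1)/N of the buffer. A sketch with illustrative link bandwidth:

```python
# Ring all-reduce cost sketch: per-GPU traffic is 2*(N-1)/N of the buffer
# (reduce-scatter pass + all-gather pass), so bandwidth cost is nearly flat in N.
def ring_allreduce_seconds(bytes_total, n_gpus, link_bytes_per_sec):
    per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * bytes_total
    return per_gpu_traffic / link_bytes_per_sec

# 1 GB gradient buffer over ~150 GB/s effective NVLink (illustrative numbers).
t8 = ring_allreduce_seconds(1e9, 8, 150e9)
t64 = ring_allreduce_seconds(1e9, 64, 150e9)
assert t64 / t8 < 1.13         # near-flat in N: 2*(63/64) vs 2*(7/8)
# The catch: the ring takes 2*(N-1) latency-bound steps, which is why tree
# algorithms (and in-network reduction like SHARP) win for small messages at scale.
```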
New hardware (Hopper / Blackwell)
  1. What new capabilities do Hopper's thread block cluster and distributed shared memory unlock?
  2. What flavors of TMA swizzling exist? Why must you pick the right one to pair with wgmma?
  3. What extra precision constraints do fp8 tensor cores (Hopper FP8 / Blackwell FP4) impose on training and inference?
  4. What does Blackwell's 2nd-gen Transformer Engine plus FP4 tensor core add on top of Hopper?
  5. What does MPS (Multi-Process Service) solve? How does it differ from MIG on usage and isolation?
Tuning tools
  1. How do you read Nsight Compute's Speed of Light (SOL) metric? What root causes do 'long scoreboard', 'short scoreboard', and 'barrier' stalls indicate?
  2. On the Nsight Systems timeline, how do you tell 'CPU launching too slowly' apart from 'GPU actually idle'?
  3. Which fields of nvcc's --ptxas-options=-v output are most useful when tuning occupancy and register budget?
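
The ptxas register report feeds directly into an occupancy estimate. A sketch with illustrative Ampere-class per-SM limits (assumed: 64K registers, ~163 KB shared memory, 16 blocks, 2048 threads; check your GPU's actual limits):

```python
# Occupancy sketch: resident blocks per SM are capped by whichever resource
# runs out first - registers, shared memory, or the hardware block/thread limits.
def blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block,
                  regfile=65536, smem=166912, max_blocks=16, max_threads=2048):
    by_regs = regfile // (regs_per_thread * threads_per_block)
    by_smem = smem // smem_per_block if smem_per_block else max_blocks
    by_threads = max_threads // threads_per_block
    return min(by_regs, by_smem, by_threads, max_blocks)

def occupancy(regs_per_thread, smem_per_block, threads_per_block, max_threads=2048):
    b = blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block)
    return b * threads_per_block / max_threads

# 32 regs/thread, no shared memory, 256-thread blocks: full occupancy.
assert occupancy(32, 0, 256) == 1.0
# At 96 regs/thread the register file becomes the limiter: occupancy drops to 25%.
assert occupancy(96, 0, 256) == 0.25
```

This is exactly the arithmetic behind reading `--ptxas-options=-v` output: the reported registers and shared memory per block plug straight into the min().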

Core concepts

  • The execution units, register file, and L1 / shared layout inside a single SM drive kernel design. Knowing SM count, warp width, and register budget is the starting point for all tuning.

  • Which SM a block lands on and how warps get scheduled determine occupancy and latency hiding. Not just a concept — see how it maps onto the hardware.

  • Register / shared / L1 / L2 / global / HBM differ by 2–3 orders of magnitude in bandwidth and latency. 90% of GPU perf work is answering 'which tier does this data belong to?'

  • Merging 32 threads of a warp into a single global memory transaction is the precondition for saturating HBM bandwidth.

  • Shared memory is split into 32 banks; different words in the same bank serialize. Padding and swizzling are the two standard workarounds.

  • shfl_sync / ballot_sync / reduce_sync and friends. High-performance reduce, scan, and transpose kernels all depend on them.

  • A C++ API unifying synchronization at thread / warp / block / grid scope. Grid-level sync is the building block for persistent kernels.

  • Streams are GPU work queues — different streams run concurrently. Overlapping compute, H2D, and D2H on three streams is a baseline training / inference skill.

  • The matrix multiply-accumulate units introduced in Volta. wmma is the C++ API; mma PTX is the lower layer. cuBLAS, CUTLASS, and FlashAttention all lean on them.

  • Ampere's cp.async bypasses registers when copying global → shared; Hopper's TMA batches it further. Modern high-perf kernels use them by default.

  • PTX is NVIDIA's virtual ISA; SASS is the actual machine code. The last mile of perf debugging often ends up reading one or the other.

Labs

  • Official samples cover dozens of examples from vectorAdd to cooperative groups and CUDA Graph. Read, tweak, and measure — the fastest on-ramp.

  • Simon Boehm's classic 10-step walkthrough from naive matmul to near-cuBLAS performance. Follow along and shared-memory tiling, register blocking, and double buffering become second nature.

  • Mark Harris's classic seven-step optimization. Each step — from naive pairwise add to warp shuffle — exposes one hardware constraint.

  • Community-run lecture archive covering warp primitives through FlashAttention and Triton.

  • From vector add all the way to fused attention. Beyond CUDA itself, Triton is the modern starting point for production kernels.

  • Read the official implementation, then write your own tiled attention. Gives you visceral understanding of IO-aware kernel design.

  • Pure C++/CUDA LLM inference — focus on matmul warp reductions, kernel fusion, attention kernels, and manual unroll / prefetch. Great for internalizing the Nsight-metric → kernel-rewrite loop end to end.

Reading

  • The authoritative reference. Programming model, hardware implementation, and performance guidelines are the three chapters you revisit constantly.

  • Andrew Chan's long-form companion to yalm — the clearest walkthrough of CUDA inference optimization available: warp reductions, kernel fusion, KV cache quantization, and why hand-written unroll/prefetch beats the compiler output.

  • Frames perf work as APOD (Assess / Parallelize / Optimize / Deploy). Skim it once before writing a new kernel to avoid the classic traps.

  • Hwu & Kirk's GPU programming textbook — parallel thinking through stencil, reduce, scan, and GEMM patterns in one coherent sweep.

  • Required reading for inline PTX, hand-written mma, or reading disassembly.

  • NVIDIA's open GEMM template library. Reading its tile iterator, pipeline, and shape templates is basically studying modern GEMM alongside NVIDIA engineers.

  • Currently the clearest online walkthrough of CUDA matmul optimization. Spells out the metric and trade-off at every step.

Tools

  • The CUDA compiler driver. Understanding -arch / -code and --ptxas-options=-v (prints register use and occupancy) is the first step of tuning.

  • Kernel-level profiler giving you the roofline, warp stall reasons, and memory chart. First thing to run after you finish a kernel.

  • System-level timeline profiler for CPU / GPU / CUDA stream / NCCL alignment. Primary tool for finding stream dependencies and idle gaps.

  • GPU-side equivalent of ASan — catches out-of-bounds, races, and uninitialized memory. Running it in CI saves many memory-trashing nights.

  • gdb for the GPU — set breakpoints inside a running kernel, inspect warp state. Last-resort weapon for diagnosing illegal memory access.

  • Baseline command for SM utilization, memory, thermals, and power. DCGM is the cluster-oriented upgrade that emits Prometheus metrics.

  • Official benchmark for multi-GPU / multi-node collective bandwidth. Run all_reduce_perf first when diagnosing training comms bottlenecks.

08

Eng & Observability

Writing the code is only half the job; keeping it alive in production is the other half — containers, K8s, Prometheus, OpenTelemetry, SLOs. This module trains you to deploy a service to a cluster on your own, define sensible SLIs and SLOs, and at 3 a.m. walk from the four golden signals and a trace down to the exact line of code that broke.

After this module you should be able to answer

Containers / K8s basics
  1. Where do image layers and a container's writable layer actually live? Why is an image produced by `docker commit` usually bigger than one built from a Dockerfile?
  2. What scheduler decisions are driven by Pod requests vs. limits? What happens if you set a limit without a request, or the other way around?
  3. Ten Pods sit behind one Service — how does connection distribution differ between kube-proxy's iptables and IPVS modes?
  4. Under what conditions does a Deployment rolling update get stuck? When `kubectl rollout status` says stuck, which resources do you check, in order?
  5. What's the override precedence for Helm values? What changes if you drop the `-` from `{{- if .Values.x -}}` in a template?
  6. Where do init containers, sidecars, and ephemeral containers each fit in the Pod lifecycle?
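Question 3 above is easy to internalize with a toy simulation: in iptables mode kube-proxy spreads new connections probabilistically (the "statistic random" match), while the IPVS rr scheduler keeps per-service state and rotates deterministically. A hedged sketch — pod names and connection counts are made up:

```python
import random
from collections import Counter

PODS = [f"pod-{i}" for i in range(10)]

def iptables_pick(rng):
    # iptables mode: each new connection lands on a uniformly random
    # endpoint, so the spread is even only in expectation.
    return rng.choice(PODS)

def ipvs_rr():
    # IPVS mode with the rr scheduler: strict round-robin over endpoints.
    while True:
        for pod in PODS:
            yield pod

rng = random.Random(0)
iptables_dist = Counter(iptables_pick(rng) for _ in range(1000))
rr = ipvs_rr()
ipvs_dist = Counter(next(rr) for _ in range(1000))
print("iptables spread:", max(iptables_dist.values()) - min(iptables_dist.values()))
print("ipvs counts:", set(ipvs_dist.values()))  # exactly 100 per pod
```

The same intuition explains why long-lived connections defeat both modes: balancing happens at connection setup, not per request.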
Advanced K8s patterns
  1. How does the kube-scheduler plugin architecture (filter / score) extend? What's the typical path for deploying a custom scheduler?
  2. How do Pod priority and preemption work? What common traps appear at large scale?
  3. What are the core differences between StatefulSet and Deployment? Why do stateful services usually still need an Operator?
  4. How do you write an idempotent reconcile loop in the Operator pattern? What's the trade-off between level-triggered and edge-triggered designs?
  5. What does the typical CRD + admission webhook extension path look like? At what stage do mutating vs validating webhooks intercept requests?
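The level-triggered reconcile loop at the heart of question 4 can be sketched in a few lines: recompute actions from the full desired/observed diff on every pass, so rerunning it against converged state is a no-op. The dict-based state model below is a simplification for illustration, not controller-runtime's API:

```python
def reconcile(desired: dict, observed: dict) -> list:
    """One pass of a level-triggered reconcile: diff full desired state
    against full observed state. Edge-triggered designs react to single
    events and must never miss one; a level-triggered loop can always
    recover by recomputing from scratch."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def apply(observed: dict, actions: list) -> dict:
    for action in actions:
        if action[0] == "delete":
            observed.pop(action[1])
        else:
            observed[action[1]] = action[2]
    return observed

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
observed = apply(observed, reconcile(desired, observed))
# Idempotence: a second pass over converged state emits no actions.
print(reconcile(desired, observed))  # []
```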
Releases & GitOps
  1. How does GitOps (ArgoCD / Flux) reconcile 'cluster drifted from Git'? When does auto-sync fit, and when manual?
  2. What are the differences between canary, blue/green, and rolling releases? What does Istio traffic shifting add on top of plain Service + Deployment?
  3. How does the metric-driven rollback loop work in progressive delivery (Flagger / Argo Rollouts)?
  4. In multi-cluster, multi-region deployments, how do ArgoCD ApplicationSet and Flux Kustomization differ as engineering models?
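The metric-driven rollback loop from question 3 reduces to a small control loop: shift a traffic weight, query a metric, promote or roll back. A sketch under simplified assumptions — the step weights and the `error_rate_at` callback are hypothetical stand-ins for an analysis run against Prometheus:

```python
def progressive_rollout(weights, error_rate_at, budget=0.01):
    """Sketch of a Flagger / Argo Rollouts style loop: at each traffic
    step, gate on a metric; any breach aborts and shifts traffic back
    to the stable version."""
    for w in weights:
        if error_rate_at(w) > budget:
            return ("rolled_back", w)
    return ("promoted", 100)

# Healthy canary: errors stay under budget at every step.
print(progressive_rollout([10, 25, 50, 100], lambda w: 0.002))
# -> ('promoted', 100)

# Canary that only degrades once it sees real load at 50% weight --
# the case that makes stepped weights worth the extra machinery:
print(progressive_rollout([10, 25, 50, 100],
                          lambda w: 0.05 if w >= 50 else 0.002))
# -> ('rolled_back', 50)
```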
SLO / error budget
  1. You're defining an SLO for a latency-sensitive API: p99 or p999? What is the team obligated to do once the error budget is exhausted?
  2. Mapping the four golden signals (latency, traffic, errors, saturation) to Prometheus, which metric types (counter, gauge, histogram) do you pick for each?
  3. How do multi-window multi-burn-rate SLO alerts compose? Why does a single burn rate produce false positives?
  4. How does 'freeze feature work once error budget is spent' play out in engineering culture? When is an exception justified?
  5. When does histogram_quantile in Prometheus mislead you? Where does summary fit better?
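The multi-window, multi-burn-rate rule from question 3 becomes concrete with the thresholds commonly cited from the Google SRE Workbook (14.4x over 1h paired with 5m, 6x over 6h paired with 30m). A sketch assuming a 99.9% SLO and pre-computed per-window error rates:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # Burn rate 1.0 == spending the budget exactly over the SLO window.
    return error_rate / (1 - slo)

def should_page(rates: dict, slo: float = 0.999) -> bool:
    """Pair a long window (proves the problem is sustained) with a
    short one (lets the alert reset quickly once it stops)."""
    fast = burn_rate(rates["1h"], slo) > 14.4 and burn_rate(rates["5m"], slo) > 14.4
    slow = burn_rate(rates["6h"], slo) > 6 and burn_rate(rates["30m"], slo) > 6
    return fast or slow

# A short error spike that already ended: the 1h window still looks bad,
# but the 5m window has recovered -- exactly the false positive a single
# burn rate would page on.
print(should_page({"1h": 0.02, "5m": 0.0001, "6h": 0.004, "30m": 0.0001}))  # False
# A sustained breach: both windows of the fast pair are burning hard.
print(should_page({"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.004}))     # True
```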
Observability / tracing
  1. How are trace, metric, and log contexts correlated in OpenTelemetry? Given an error log, how do you jump to the matching trace?
  2. What does a production OpenTelemetry Collector pipeline (receivers / processors / exporters) typically look like?
  3. How do you choose between head-based and tail-based distributed tracing sampling? Where does the engineering complexity of tail sampling live?
  4. When logs land in Loki, traces in Tempo, and metrics in Prometheus, which label/attribute conventions make "jump from trace id to logs and metrics" work?
  5. Why is continuous profiling (pprof / Parca / Pyroscope) considered the fourth observability pillar? How does it complement metrics/logs/traces?
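The trace/log correlation in question 1 hinges on propagating the W3C `traceparent` header and stamping its trace id onto every log record. A minimal sketch using only the standard library — the header value is the example from the W3C Trace Context spec, and the log format is illustrative:

```python
import logging

def parse_traceparent(header: str):
    """Split a W3C traceparent header: version-traceid-spanid-flags,
    with a 32-hex-char trace id and a 16-hex-char span id."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id

trace_id, span_id = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

# Stamping the id onto every log line is what makes "jump from an error
# log to the matching trace" work in Grafana's Loki/Tempo pairing.
logging.basicConfig(format="%(levelname)s trace_id=%(trace_id)s %(message)s")
logging.error("payment failed", extra={"trace_id": trace_id})
```

In practice the OpenTelemetry SDK does this injection for you; the point is that correlation is just a shared id, not magic.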

Core concepts

  • Process isolation via namespaces and cgroups; images are stacks of read-only layers. First, accept that a container is not a VM.

  • The scheduler binds Pods to Nodes using resources, affinity, and taints/tolerations. Unavoidable in both interviews and incident response.

  • Requests drive scheduling and QoS class; limits set the cgroup ceiling. Most OOMKilled incidents trace back to mistakes here.
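The QoS classification behind that bullet can be written down directly. A simplified sketch covering only cpu/memory — the real kubelet logic also handles hugepages and API-server defaulting, so treat this as the shape of the rules, not the implementation:

```python
def qos_class(containers: list) -> str:
    """Simplified Kubernetes QoS rules:
    - BestEffort: no container sets any request or limit.
    - Guaranteed: every container sets cpu and memory limits, and its
      requests (defaulted to the limits when unset) equal the limits.
    - Burstable: everything in between.
    """
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        requests = c.get("requests", {})
        for r in ("cpu", "memory"):
            if r not in limits:
                return "Burstable"
            if requests.get(r, limits[r]) != limits[r]:
                return "Burstable"
    return "Guaranteed"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Burstable
print(qos_class([{}]))                                              # BestEffort
```

The class matters under memory pressure: BestEffort Pods are evicted first, Guaranteed last — which is why "just set limits" is not a complete answer.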

  • Service gives you a stable in-cluster VIP with load balancing; Ingress handles L7 entry. You can't debug cluster networking without understanding kube-proxy.

  • The three observability pillars. Metrics say 'is something wrong', logs say 'what happened', traces say 'where in the call chain'.

  • SLI / SLO / SLA — indicator, objective, agreement. The error budget is the shared language engineering and product use to balance speed and reliability.
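The error budget is just arithmetic on the SLO window, and the numbers are worth memorizing; a quick sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (or full-outage-equivalent error volume) an availability
    SLO permits over the window -- the concrete number behind the budget."""
    return window_days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%}: {error_budget_minutes(slo):.1f} min / 30 days")
# 99%   -> 432.0 min
# 99.9% -> 43.2 min  (why "three nines" still allows a bad afternoon)
# 99.99% -> 4.3 min  (why "four nines" rules out manual response)
```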

  • Deployments handle rolling updates and rollbacks. Combined with readiness probes and PDBs, that's what actually makes releases safe.

  • Latency, traffic, errors, saturation. Every service dashboard should start from these four.

Lab

Reading

  • Kubernetes Patterns — abstracts the recurring designs of K8s into named patterns: Sidecar, Ambassador, Init Container, and friends.

  • Site Reliability Engineering (the Google SRE book) — the methodological foundation: SLOs, error budgets, on-call, and post-mortems all originate here.

  • The Kubernetes documentation — unusually good official docs. Read the Concepts section end to end at least once.

  • The Prometheus documentation — data model, PromQL, recording rules, Alertmanager. Internalize the model before you build monitoring.

  • The OpenTelemetry specification — the unified standard for traces, metrics, and logs. Keep the API / SDK / Collector layers straight.

Tools

  • kubectl — the Swiss Army knife of K8s. describe, logs, exec, port-forward, debug: five subcommands you'll reach for daily.

  • Helm — the package manager for K8s. Almost no one deploys raw YAML in production.

  • Prometheus — the de-facto metrics system. Pull model + labels + PromQL is the foundation of cloud-native monitoring.

  • Grafana — the dashboard and alerting frontend. Not just for Prometheus: it also fronts Loki, Tempo, and assorted databases.

  • OpenTelemetry — a vendor-neutral standard for observability data. SDK instrumentation plus Collector forwarding is the recommended path today.

  • Jaeger — open-source distributed tracing backend. The workhorse for navigating trace trees and pinpointing slow cross-service calls.