AI / Data Infra · Knowledge Map

A study guide for systems engineers: each chapter has an intro, the questions you should be able to answer, and the concepts, labs, reading, and tools to get you there.

01

Languages · C++ / Rust / Go

Almost all the infrastructure that matters — databases, browsers, game engines, inference runtimes — is written in C++. Rust is steadily replacing it in databases, browsers, and operating systems. Go dominates cloud-native services, RPC, and CLI tooling. This module isn't about becoming fluent in all three; it's about putting their mental models side by side: how memory and lifetimes are managed, how concurrency and async are scheduled, and which toolchain actually gets your code onto the CPU.

After this module you should be able to answer

C++
  1. What problem does RAII actually solve, and why is it considered more thorough than try/finally or defer?
  2. When do you reach for unique_ptr vs shared_ptr vs weak_ptr? shared_ptr's refcount is thread-safe — is the pointee?
  3. What does move semantics buy you over copying? When does an object you expected to move actually get copied, and when do RVO / NRVO eliminate the copy entirely?
  4. Name three common sources of UB (signed overflow, OOB access, strict aliasing). Why is the compiler allowed to delete code because of them?
  5. In the memory model, how do memory_order_relaxed / acquire / release / seq_cst differ? When is relaxed actually safe to use?
  6. What are the engineering trade-offs between exceptions and error codes? Why do Google, LLVM, and many embedded projects disable exceptions entirely?
  7. What does a vtable implementation look like? Where do `final` and `override` actually help the compiler with devirtualization?
  8. Why can two threads updating their own counters still slow each other down? How do you confirm false sharing with perf c2c or cache-miss events?
  9. How do template instantiation and C++20 concepts compare? Why are libraries replacing enable_if with concepts?
  10. What does a C++20 coroutine actually look like under the hood? What roles do promise_type, suspend points, and coroutine_handle play?
  11. How do constexpr, consteval, and `if consteval` differ in compile-time vs runtime behavior?
  12. What does the pimpl idiom buy you on ABI stability and compilation isolation? What is the cost?
Rust
  1. What does ownership + borrow checker forbid that C++ permits? Why does that eliminate use-after-free at compile time?
  2. What do lifetime annotations accomplish? When does Rust force you to write 'a explicitly versus inferring it via the elision rules?
  3. Why does Rust not need a garbage collector? How do RAII and ownership combine to guarantee resource release?
  4. What do the Send and Sync marker traits actually say? Why is Rc<T> not Sync but Arc<T> is?
  5. What is the core of async/await + Future in Rust? Why is Rust async zero-cost yet still needs an executor? What does Pin solve?
  6. What privileges does unsafe Rust unlock, and when is it unavoidable? Which violations does Miri catch?
  7. What is the type-system contract behind interior mutability (Cell / RefCell / Mutex)?
  8. What does a trait-object fat pointer look like? Why can some traits not be `dyn` (what is object safety)?
  9. When do declarative macros fit, and when do procedural macros? Why are serde / tokio impossible without proc-macros?
  10. Why are errors modeled as Result<T, E> rather than exceptions? What does the `?` operator desugar to, and how does it relate to the Try trait?
Go
  1. How light is a goroutine compared to an OS thread? How does the GMP scheduler decide when to preempt, and what did signal-based async preemption (1.14+) solve?
  2. What's the underlying data structure behind a channel (hchan)? How do buffered and unbuffered channels differ in wake-up semantics?
  3. The GC is concurrent mark-and-sweep — what does the write barrier solve? What's a typical STW pause in modern versions?
  4. Why does Go use error values instead of exceptions? Why did errors.Is / errors.As / wrapping only arrive in 1.13?
  5. How is happens-before defined in Go's memory model? What synchronization guarantees do channel send and receive give you?
  6. How should context.Context be used correctly? How does a cancel signal propagate all the way down to a blocking syscall?
  7. What is an interface itab? How much more expensive is an interface method call than a direct function call?
  8. How are generics (1.18+) implemented? Why did Go pick GC shape stenciling over per-type specialization?
  9. What does sync.Pool solve? What happens to its contents during GC?
  10. When are sync.Map and map + Mutex each faster? Why should ordinary code not default to sync.Map?
Cross-language comparisons
  1. How do the three error-handling models (C++ exceptions / Rust Result / Go error) shape API design and ABI stability differently?
  2. What are the runtime costs and mental models of Rust async, C++ coroutines, and Go goroutines? Why did Rust and C++ pick stackless while Go went stackful?
  3. How do RAII (C++), ownership (Rust), and defer (Go) compare as resource-management strategies? Which scenarios does each handle poorly?
  4. How do C++ templates, Rust generics + traits, and Go generics + type constraints differ in compiled output and error messages?
  5. How do the three build systems (CMake / Bazel, cargo, go build) trade off incremental compilation, dependency management, and reproducibility?

C++

Core concepts
  • RAII: tie resource release to object lifetime. It's the foundation C++ resource management is built on; smart pointers, lock guards, and file handles all follow this shape.

  • Smart pointers: unique_ptr / shared_ptr / weak_ptr make raw-pointer ownership explicit. Using them correctly eliminates most leaks and double-frees.

  • Move semantics: transfer expensive resources instead of deep-copying them. You can't read modern C++ without rvalue references, std::move, and RVO.

  • const correctness: const is a read-only contract enforced by the type system, not a comment. Where you put it in an API signature decides what callers can do.

  • Templates and the STL: templates power C++'s generics and zero-overhead abstractions. STL containers and algorithms are vocabulary you'll meet in every C++ codebase.

  • The build pipeline: understand what preprocessing, compiling, assembling, and linking each do. Linker errors, ODR violations, and symbol visibility all need this mental model.

  • Undefined behavior: UB isn't a bug; it's a promise the compiler assumes you keep. Knowing the common pitfalls (out-of-bounds access, uninitialized reads, aliasing) is how you avoid them.

  • The C++11 memory model: a formal model governing visibility and ordering of atomic operations. Required reading before you write any lock-free code.

  • Cache lines: the minimum unit of transfer between CPU caches and memory, usually 64 bytes. Alignment, padding, and hot-field clustering all start here.

  • False sharing: when threads write different fields on the same cache line, the line ping-pongs between cores. One of the subtlest killers in concurrent code.

  • Branch prediction: modern CPUs keep the pipeline full by guessing the next instruction. Keep hot-path branches predictable, or go branchless.

  • SIMD: one instruction, many data lanes. The foundation of vectorization and the reason ClickHouse, numpy, and ffmpeg inner loops look the way they do.

  • NUMA: on multi-socket boxes, local memory is fast and remote is slow. Databases, JVMs, and inference engines all care about NUMA binding.

  • Memory bandwidth: many 'CPU-bound' workloads are actually DRAM-bound. Use STREAM benchmarks and the roofline model to know your ceiling.

  • CPU microarchitecture: out-of-order, superscalar, retirement. Without these you can't reason about IPC, stalls, or port contention.
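False sharing (question 8 above) is easy to reproduce in any of the three languages. A Go sketch with invented names; note Go does not guarantee 64-byte struct alignment, so the padding trick is a demonstration rather than a hard guarantee:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Two layouts for a pair of independent counters. In packed, both
// fields usually share one 64-byte cache line; in padded, filler pushes
// b onto the next line, so the two writers stop invalidating each other.
type packed struct{ a, b int64 }

type padded struct {
	a int64
	_ [56]byte // fill out the rest of a's 64-byte line
	b int64
}

const iters = 50_000_000

// bump increments one counter in a tight loop; each goroutine touches
// only its own variable, so any slowdown comes from cache-line traffic.
func bump(x *int64, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < iters; i++ {
		*x++
	}
}

func race(a, b *int64) time.Duration {
	var wg sync.WaitGroup
	wg.Add(2)
	start := time.Now()
	go bump(a, &wg)
	go bump(b, &wg)
	wg.Wait()
	return time.Since(start)
}

func main() {
	var p packed
	var q padded
	// On a multicore machine the padded layout is typically several
	// times faster; confirm the cache traffic with perf c2c on Linux.
	fmt.Println("packed:", race(&p.a, &p.b))
	fmt.Println("padded:", race(&q.a, &q.b))
}
```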

Lab
  • Write a small allocator

    Hand-roll a bump, freelist, or slab allocator. Forces you to confront alignment, fragmentation, and placement new that libraries usually hide.

  • Implement a hash map

    Skip std::unordered_map and write open-addressing yourself. You'll come away with real intuition for cache locality, rehashing, and load factor.

  • LRU cache

    The classic hash-map-plus-doubly-linked-list combo. Common interview question and the minimum viable prototype for any memory/disk cache.

  • Minimal shell

    A full pass through fork, exec, wait, pipe, and dup2. Afterwards bash or systemd source won't look alien.

  • Crafting Interpreters: build a bytecode interpreter with GC in C from scratch. You'll finally understand how language runtimes, stack frames, and garbage collection work.

  • Build Your Own X: curated tutorials for building your own database, Redis, Git, etc. A reliable source of engineering-grade labs.

  • Industrial-strength performance-engineering projects, from bit hacks to cache-aware algorithms.

  • Optimize a matrix multiply: the classic warm-up. Start from the naive triple loop (typically 10-100x off BLAS), then add blocking, vectorization, and threading in one pass.

  • Hand-written SIMD: pick an op (dot product, memcpy, argmax), write AVX2 or AVX-512 intrinsics, and compare against the compiler's auto-vectorized output.
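The LRU lab above is often prototyped in whichever language is handy; a compact Go sketch of the hash-map-plus-doubly-linked-list design using container/list (type names invented):

```go
package main

import (
	"container/list"
	"fmt"
)

// LRU is the classic cache from the lab: the map gives O(1) lookup,
// the list keeps recency order (front = most recently used).
type LRU struct {
	cap   int
	order *list.List
	items map[string]*list.Element // key -> node in order
}

type entry struct {
	key string
	val int
}

func NewLRU(cap int) *LRU {
	return &LRU{cap: cap, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *LRU) Get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el) // a hit refreshes recency
	return el.Value.(*entry).val, true
}

func (c *LRU) Put(key string, val int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() == c.cap { // full: evict the least recently used
		last := c.order.Back()
		c.order.Remove(last)
		delete(c.items, last.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
}

func main() {
	c := NewLRU(2)
	c.Put("a", 1)
	c.Put("b", 2)
	c.Get("a")    // touch a, so b is now least recent
	c.Put("c", 3) // evicts b
	_, ok := c.Get("b")
	fmt.Println(ok) // false
}
```

A disk or memory cache adds sizing by bytes, concurrency, and TTLs on top, but the eviction skeleton stays exactly this.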

Reading
Tools
  • gdb: the standard Linux debugger. Stacks, breakpoints, registers, core-dump analysis: table stakes for anyone doing systems work.

  • lldb: the LLVM-family debugger and macOS default. Its commands differ from gdb's; worth learning if you live in Clang land.

  • ASan / UBSan / TSan: compile-time instrumentation that catches memory errors, undefined behavior, and data races respectively. Running them in CI saves countless debugging nights.

  • perf + flame graphs: a sampling profiler on Linux plus a visualization that makes CPU hotspots obvious. First stop for any performance tuning.

  • Valgrind: dynamic binary translation that checks for memory errors. Slower than ASan but needs no recompile, so it works on shipped binaries.

  • clang-tidy: AST-based static checking and auto-fixing from Clang. Good for mass-applying Core Guidelines or Google Style across a codebase.

  • -Wall / -Wextra: warning flags that turn GCC/Clang into a free static analyzer. Pair with -Werror in any CI worth its name.

Rust

Core concepts
  • Ownership and borrowing: a value has exactly one owner; you may have many shared &T references or one exclusive &mut T, never both. Every lifetime bug class from C++ is forbidden up front.

  • Lifetimes: the compiler needs to know how long references live to prevent dangling. Inferred most of the time, explicit 'a the rest.

  • Traits and generics: Haskell type classes meet C++ concepts. Zero-cost abstraction and trait bounds both live here.

  • Result and the ? operator: no exceptions; errors go through the type system, and ? makes propagation boilerplate-free.

  • Send and Sync: marker traits that tell the compiler which types may move across threads and which may be shared between them. Data races become compile-time errors.

  • async/await and Future: a Future is a lazy state machine; tokio / async-std are the executors that poll it. Zero-cost, but you still need a runtime.

  • unsafe: an unsafe block buys you raw pointers, FFI, and low-level data structures. Most of the standard library sits on top of it.

  • Interior mutability: Cell / RefCell / Mutex / RwLock express "looks immutable from outside, actually mutable inside" within the type system.

  • Enums and pattern matching: match plus tagged unions let you make illegal states unrepresentable.

  • Macros: Rust's macro system underpins serde, tokio, and sqlx, and is the official compile-time metaprogramming path.

Lab
Reading
Tools
  • cargo: the unified entry point for packaging, building, testing, and benchmarking.

  • clippy: the official linter, catching suspect patterns and style issues.

  • rustfmt: the official formatter. The Rust community basically stopped arguing about style.

  • rust-analyzer: the LSP server that powers IDE completion, hover, and go-to-definition.

  • Miri: a MIR interpreter that catches UB in unsafe code: out-of-bounds access, uninitialized reads, data races.

  • cargo-expand: expands macros into plain Rust. Indispensable when debugging proc-macros.

  • cargo-flamegraph: one command generates a flame graph for hotspot localization.

Go

Core concepts
Lab
Reading
Tools
02

Operating Systems

The OS is the floor every piece of infrastructure stands on: how processes get scheduled, how memory is mapped, how I/O is multiplexed — all of it bounds what your services can do. The goal here isn't memorizing concepts but tearing apart a teaching kernel (xv6) that you can actually compile and run, so Linux's behavior has something concrete to compare against.

After this module you should be able to answer

Processes / virtual memory / fork
  1. What do processes and threads share, and what don't they share? How is copy-on-write implemented after fork?
  2. What problem does a multi-level page table solve? Roughly how expensive is a TLB miss, and what is a TLB shootdown?
  3. Walk through how a page fault is handled. What's the difference between a major fault and a minor fault?
  4. From a userspace syscall to kernel execution, what steps happen (trap, context switch, return)?
  5. What happens to copy-on-write pages after fork? Why is fork + exec on a huge-memory process still relatively cheap?
  6. What problem do huge pages and transparent huge pages solve? Why do databases often recommend disabling THP?
  7. What do process states R / S / D / Z mean? Why can't even SIGKILL terminate a process in D state?
  8. At which layer do ASLR, NX, and KASLR each operate, and which class of attacks do they block?
File systems / I/O / multiplexing
  1. How does mmap's performance differ from read/write? When should you reach for mmap, and when is it actually slower?
  2. What does epoll give you over select/poll? When do you pick edge-triggered vs level-triggered?
  3. How does io_uring differ from epoll at its core? What problems does it solve that epoll can't?
  4. Zero-copy syscalls (sendfile, splice, tee, MSG_ZEROCOPY) — which one fits which scenario?
  5. The disk write path is write → page cache → writeback → device queue. Where exactly does fsync vs fdatasync wait?
  6. How do ext4, XFS, and Btrfs differ in journaling, metadata concurrency, and snapshots? What should databases weigh when choosing?
  7. What are the costs and benefits of direct I/O (O_DIRECT) bypassing the page cache? Why does PostgreSQL avoid it while MySQL/InnoDB uses it?
Scheduling / cgroups / containers
  1. How do Linux's CFS and EEVDF schedulers allocate CPU time? What's the relationship between the nice value and cgroup cpu.weight?
  2. What are the core differences between cgroup v1 and v2? Which controllers implement CPU and memory limits inside a container?
  3. How does the OOM killer score processes (oom_score / oom_score_adj)? Why is the biggest memory hog not always killed first?
  4. What's the relationship between futex and a user-space mutex? Why does the uncontended path avoid the kernel almost entirely?
  5. Which namespaces (pid / net / mnt / uts / ipc / user) provide container isolation? Why is the user namespace the linchpin for security?
  6. At which layer do seccomp, capabilities, AppArmor, and SELinux each restrict processes? Which syscalls does the default runc policy block?
  7. What throttling artifact does combining cpu.cfs_period_us and cpu.cfs_quota_us produce? Why do teams often drop CPU limits on latency-sensitive services?
Performance & observability
  1. How would you use strace to diagnose a stuck process? What kinds of problems are better suited to perf sampling vs bpftrace?
  2. How does the USE method (Utilization / Saturation / Errors) work step by step when a box is thrashing?
  3. Why does perf record need to pick among frame pointers / libunwind / DWARF for stack unwinding? How do they differ in overhead and accuracy?
  4. Why is eBPF called "safely running code in the kernel"? What classes of bugs does the verifier prevent?
  5. How does Intel's Top-down methodology (Frontend / Bad Speculation / Backend / Retiring) pinpoint where a hot loop actually stalls?

Core concepts

  • Processes and threads: the kernel's two basic scheduling units, one for isolation and one for sharing. Nail down what each owns: address space, file descriptors, signals.

  • Virtual memory: every process sees its own contiguous address space, mapped to physical RAM via page tables and the MMU. Prerequisite for understanding fork, mmap, and OOM.

  • File systems and the page cache: inodes, directories, journals, the page cache. Once it clicks you can explain why fsync is expensive and why lots of small files are slow.

  • Synchronization primitives: locks, condition variables, semaphores, lock-free structures. Unavoidable for multithreaded code and the direct target of the xv6 lock lab.

  • epoll: Linux's efficient I/O multiplexing, the foundation under Nginx and Redis event loops. Understanding LT vs ET modes is table stakes.

  • io_uring: Linux's newer async I/O interface that batches syscalls through shared ring buffers. High-throughput storage and network stacks are migrating to it.

  • mmap: maps files or anonymous memory into a process's address space; the usual tool for shared memory and random access on large files. Misuse it and you get SIGBUS and weird write-back behavior.

  • /proc: the kernel exposes runtime state as a virtual filesystem. In production triage, /proc/<pid>/maps, status, and stack are the files you open most.

  • The USE method: Brendan Gregg's triage framework: check Utilization, Saturation, and Errors on every resource. Your first sweep when a box hangs.

  • Top-down analysis: Intel's Frontend / Bad Speculation / Backend / Retiring four-bucket methodology. Tells you exactly which CPU stage a hot loop is stalling in.

Lab

  • xv6 labs: a lab suite built on a teaching Unix-like kernel. Working through it turns syscalls, page tables, locks, and filesystems from words into code you've modified yourself.

  • util lab: write small tools (xargs, find, etc.) using xv6 syscalls. Warm-up lab to get familiar with the workflow.

  • syscall lab: add a new syscall to the kernel. Walks you through the full trap table, argument passing, and return path.

  • pgtbl lab: hand-manipulate RISC-V's three-level page table. Your mental model of address translation goes from slideware to something you can draw at the register level.

  • lock lab: refactor coarse kernel locks for better concurrency. Forces you to confront lock contention and the trade-offs of splitting locks.

  • mmap lab: implement mmap/munmap inside xv6. After this, Linux mmap behavior stops feeling mysterious.

Reading

  • OSTEP (Operating Systems: Three Easy Pieces): three-part structure (virtualization, concurrency, persistence), free online, plainspoken. The best candidate for a primary textbook.

  • MIT 6.1810 (formerly 6.S081): home page for MIT's OS course; notes, videos, and labs all open. Pairs extremely well with OSTEP.

  • The xv6 book: the line-by-line companion to xv6's source. Your most-consulted reference while doing the labs.

  • LWN.net: the authoritative news source for Linux kernel development. If you want to follow scheduler, memory, or io_uring evolution, this is the only game in town.

  • The Linux Programming Interface: Michael Kerrisk's encyclopedic reference on Linux systems programming. When you have a question about syscall behavior, it usually has the answer.

  • Systems Performance: Brendan Gregg's encyclopedia of Linux system performance. CPU, memory, disk, network, all through the USE method.

Tools

  • strace: traces syscalls and signals of a running process. Stuck process, file not opening, network not connecting: strace first, ask questions later.

  • ltrace: strace's library-call counterpart, showing libc and dynamic-library calls. Useful for debugging weird behavior at the glibc layer.

  • perf: Linux's native performance tool for sampling CPU, hardware events, and scheduler latency. The workhorse profiler.

  • bpftrace: a high-level tracing language on top of eBPF; one-liners can observe kernel events. Lighter than strace, more flexible than perf.

  • eBPF / bcc: infrastructure for safely running sandboxed programs in the kernel; bcc is its Python toolkit. The de facto standard for modern Linux observability.

  • QEMU: open-source machine emulator and what xv6 runs on top of. Indispensable for kernel debugging or playing with other architectures (RISC-V, ARM).

  • ftrace: in-kernel function tracer. Workhorse for syscall, scheduler, and I/O stack investigations alongside bpftrace.

  • FlameGraph: folds stack samples into a visualization. First move after perf record.

03

Networking & RPC

The network is the layer every backend engineer uses daily but few have actually read the RFCs for. The goal here is not to turn you into a protocol expert, but to let you read a packet capture, explain what happens in a single HTTPS request, and tell whether a performance issue lives in the handshake, congestion control, or the application layer.

After this module you should be able to answer

TCP / transport
  1. Why does a three-way handshake suffice while a two-way one does not? What is TIME_WAIT in the four-way close actually protecting against?
  2. When does a fresh TCP connection transition from slow start to congestion avoidance? How do Reno and CUBIC react differently to loss?
  3. What are the core differences between BBR and CUBIC? Why is BBR better on long fat pipes but potentially worse on short-connection-dominated workloads?
  4. What latency anti-pattern does Nagle combined with delayed ACK cause? When should TCP_NODELAY be on, and when off?
  5. What's the relationship between MSS, MTU, and Path MTU Discovery? Why does a 1500-MTU link with VPN overhead often stall in the 1400-byte range?
  6. In Wireshark, how do you visually distinguish retransmission, out-of-order delivery, and zero-window conditions?
  7. How do Linux tcp_wmem / tcp_rmem / tcp_mem cooperate? How do you tune them for high-BDP links?
HTTP / QUIC
  1. What HTTP/1.1 problem does HTTP/2 multiplexing solve? Why is it still vulnerable to TCP head-of-line blocking while HTTP/3 is not?
  2. Why is QUIC built on UDP instead of as a new L4 protocol? How does it implement connection migration?
  3. For a single `curl https://example.com`, which syscalls and network round trips happen in order from DNS lookup to first byte?
  4. What do chunked, keep-alive, and pipelining each solve in HTTP/1.1? Why did pipelining never catch on?
  5. How is HTTP/3 0-RTT resumption related to TLS 1.3 0-RTT? How should applications defend against replays?
TLS / crypto
  1. What round trip did TLS 1.3 eliminate compared to 1.2? What is the security cost of 0-RTT?
  2. What does mTLS add over one-way TLS? What does the typical sidecar-injected mTLS flow look like in a service mesh?
  3. In a TLS handshake, what do ECDHE, certificate verification, and Finished each accomplish? Which step guarantees forward secrecy?
  4. What availability and latency problems do CRL, OCSP, and OCSP stapling each introduce?
RPC / application layer
  1. How does gRPC map bidirectional streams and status codes onto HTTP/2 streams and trailers?
  2. Which HTTP methods are idempotent, and why can naive client retries trigger cascading failures?
  3. How does a gRPC deadline propagate down the call chain? How are metadata, headers, and context carried?
  4. What are the trade-offs between client-side load balancing (lookaside / proxyless xDS) and classic L4/L7 proxies?
Kernel networking / load balancing
  1. What problem does proxy protocol solve? How does the real client IP survive through an L4 load balancer?
  2. What are the typical symptoms of a full ip_conntrack table? When is it better to simply disable conntrack?
  3. How does Cilium / eBPF as a kube-proxy replacement compare to iptables / IPVS on performance and observability?
  4. What's the difference between SO_REUSEPORT and plain bind? When multiple processes listen on the same port, how does the kernel distribute connections?
  5. At which layer do RSS / RPS / RFS / XPS each distribute traffic on a Linux NIC? What do you tune at high packets-per-second?
  6. How do DPDK, XDP, and AF_XDP compare as user-space packet paths on performance and programming model?

Core concepts

  • TCP (RFC 9293): the 2022 consolidated TCP spec. Read it to internalize the SYN/ACK/FIN state machine, not to memorize trivia.

  • Congestion control: slow start, congestion avoidance, fast retransmit, fast recovery. Almost every weird throughput problem traces back to this diagram.

  • HTTP/1.1: still the protocol most servers actually handle. Understanding keep-alive, chunked encoding, and pipelining pitfalls is the floor.

  • HTTP/2: introduces binary framing, multiplexing, and HPACK. Needed to read gRPC and to see why one stuck TCP connection stalls all streams.

  • HTTP/3: HTTP semantics over QUIC. Focus on how it avoids TCP head-of-line blocking and supports connection migration.

  • QUIC: reliable transport plus crypto plus multiplexing, rebuilt on UDP. The new foundation for the modern network stack.

  • TLS 1.3: drops a round trip from the handshake and removes legacy algorithms. Knowing the 1.2 vs 1.3 handshake difference unlocks almost any capture.

  • DNS: the first thing in your request path that can fail. Recursive resolution, TTL, and caching are prerequisites for debugging.

  • gRPC over HTTP/2: the one-page mapping of gRPC onto HTTP/2 frames. Faster than reading source.

Lab

Reading

Tools

  • Command-line packet capture. First tool to reach for on a server; invest time in its filter syntax.

  • GUI protocol analyzer. Open a tcpdump capture locally and see TLS and HTTP/2 frames decoded with a click.

  • The standard tool for measuring bandwidth and loss. Run it first when deciding whether a problem is network or app.

  • Traceroute plus ping in one view. Invaluable for diagnosing cross-region or cross-ISP packet loss.

  • The universal HTTP client. With -v and --trace it prints almost everything you might want to inspect.

  • Proxy that injects latency, loss, and bandwidth limits at the TCP layer. Lightweight option for network fault drills.

04

Distributed Systems

Distributed systems is the study of what guarantees you can still offer when machines, networks, and clocks are all free to lie to you. The goal of this module is to build the vocabulary — consistency models, consensus, replication, failure models — so you can read papers, design storage, and reason about production anomalies in shared language.

After this module you should be able to answer

Consistency models & theory
  1. Why is the C in CAP not the same C as in ACID? In a real partition, how do engineers actually trade A against C?
  2. FLP says consensus is impossible in an asynchronous system. Why does Raft work anyway — which assumption does it quietly relax?
  3. Give concrete scenarios that distinguish linearizable, sequential, causal, and eventual consistency from one another.
  4. What's the threshold difference between BFT and CFT (n ≥ 3f+1 vs n ≥ 2f+1)? Outside of blockchains, where else does BFT become a hard requirement?
  5. What extra dimension does PACELC add to CAP? Which side of real-world design does it illuminate better?
Consensus & replication
  1. Why does Raft grant a vote only to candidates whose log is at least as up-to-date as the voter's own? What goes wrong without that check?
  2. How many failures can a 5-node cluster tolerate? What do you gain and lose by going to 7 nodes?
  3. What exact scenario causes 2PC to block? How do Paxos/Raft avoid that class of blocking?
  4. How do Raft's read index and lease read each implement linearizable reads, and what are the costs?
  5. What is joint consensus? Why does Raft use it for config changes instead of an atomic cutover?
  6. What problem do Raft snapshots + log compaction solve? What's the install-snapshot flow when a follower falls too far behind?
  7. How do Multi-Paxos, EPaxos, and Raft differ in latency profile for geo-distributed deployments?
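The cluster-size arithmetic behind question 2, as a runnable sketch (function names invented):

```go
package main

import "fmt"

// faultTolerance returns how many crash failures an n-node
// majority-quorum cluster (Raft, Multi-Paxos) survives: a quorum needs
// n/2+1 live nodes, so f = (n-1)/2 can be lost.
func faultTolerance(n int) int { return (n - 1) / 2 }

// quorum is the majority size needed to elect a leader or commit.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5, 6, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
	// n=5 and n=6 both tolerate 2 failures: even sizes buy no extra
	// safety. Going from 5 to 7 buys f=3 at the cost of a bigger quorum,
	// meaning more replicas on every commit's critical path.
}
```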
Transactions & MVCC
  1. Which anomalies do transaction isolation levels (RC / RR / SI / SSI / Serializable) each permit? Why isn't Snapshot Isolation serializable?
  2. What's the core idea behind cross-shard transaction models like Percolator and Omid? How do they differ from 2PC?
  3. Under MVCC, how is the snapshot visible to a transaction determined? Why are vacuum/GC unavoidable costs?
  4. How do pessimistic and optimistic locking compare on tail latency in cross-shard transactions? Why does OCC collapse under high contention?
  5. How does deterministic database design (Calvin / FaunaDB) sidestep 2PC? What is the trade-off?
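The visibility rule in question 3 fits in a few lines: a snapshot taken at readTS sees, per key, the newest version whose commit timestamp is at or below readTS. A Go sketch assuming plain integer timestamps and a newest-first version chain (all names invented):

```go
package main

import "fmt"

// version is one MVCC entry: a value plus the timestamp of the
// transaction that committed it.
type version struct {
	commitTS int
	val      string
}

// visible returns what a snapshot at readTS should see: the newest
// version with commitTS <= readTS. The chain is ordered newest-first,
// as an LSM level or heap chain typically would be.
func visible(chain []version, readTS int) (string, bool) {
	for _, v := range chain {
		if v.commitTS <= readTS {
			return v.val, true
		}
	}
	return "", false // the key did not exist yet at readTS
}

func main() {
	// Three committed versions of one row, newest first.
	chain := []version{{30, "v3"}, {20, "v2"}, {10, "v1"}}
	for _, ts := range []int{5, 15, 25, 35} {
		v, ok := visible(chain, ts)
		fmt.Println(ts, v, ok)
	}
	// Old versions can only be garbage-collected once no live snapshot
	// has a readTS below their successor's commitTS: that retention is
	// exactly the vacuum/GC cost the question asks about.
}
```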
Clocks & time
  1. What guarantee does Spanner's TrueTime provide? How would a team without GPS plus atomic clocks approximate external consistency?
  2. What guarantees do Lamport clocks, vector clocks, and HLC each provide? What role does HLC play inside CockroachDB?
  3. In an eventually consistent system, how do version vectors or CRDTs merge concurrent writes without relying on wall clocks?
  4. What can go wrong in a typical distributed DB if NTP drifts by 50 ms? How short can a leader lease realistically be?
Engineering & failures
  1. What are the trade-offs between consistent hashing and range-based sharding? What problem do virtual nodes (vnodes) solve?
  2. Under what conditions does split-brain happen? How do leases and fencing tokens combine to prevent two 'leaders' from writing concurrently?
  3. What are the five most common classes of consistency bugs in Jepsen reports? Why did early MongoDB keep tripping on them?
  4. Why is gray failure (half-dead nodes) harder than crash failure? Which systems explicitly design around it?
  5. What preconditions do you need before chaos engineering is safe in production? How does it complement Jepsen-style model checking?
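A minimal consistent-hash ring with virtual nodes (question 1), sketched in Go with invented names; real systems add weights, replication, and membership changes on top:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// ring is a consistent-hash ring with virtual nodes: each physical
// node owns `vnodes` points on the ring, which smooths out load and
// lets a bigger machine take proportionally more keys.
type ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // position -> physical node
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash32(n + "#" + strconv.Itoa(i))
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup walks clockwise to the first vnode at or after the key's hash.
// Removing one node only reassigns that node's arcs; everything else
// stays put, which is the whole point versus mod-N hashing.
func (r *ring) lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"a", "b", "c"}, 64)
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[r.lookup("key-"+strconv.Itoa(i))]++
	}
	fmt.Println(counts) // roughly even, thanks to the vnodes
}
```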

Core concepts

  • CAP: under a partition you pick consistency or availability. One of the most abused concepts in the field; clarify what it does and does not claim.

  • FLP impossibility: no deterministic consensus in an asynchronous network with even one crash. Knowing it explains why Raft needs timeouts.

  • Linearizability: the strongest single-object model: every op appears to take effect atomically at some instant. This is what etcd and ZooKeeper give you.

  • Eventual consistency: the weak-consistency model behind Dynamo and Cassandra. Jepsen's consistency-model map is the clearest reference diagram in the area.

  • Leader election: the first step of most consensus and replication protocols. Figuring out 'who is in charge' is where every consistency discussion starts.

  • Replicated state machines: turn operations into a replayable log, ship it to a majority, then apply it to the state machine. The underlying pattern of modern storage.

  • Quorums: arithmetic constraints like R + W > N. Understanding them lets you derive the consistency strength of any given configuration.

  • Two-phase commit: the textbook approach to cross-resource transactions and the textbook cautionary tale about coordinator blocking.
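The quorum arithmetic above can be played with directly. A sketch with invented function names, checking the two overlap conditions Dynamo-style systems care about:

```go
package main

import "fmt"

// strongReads reports whether an N/R/W configuration guarantees every
// read quorum overlaps the latest write quorum: R + W > N.
func strongReads(n, r, w int) bool { return r+w > n }

// orderedWrites reports whether any two write quorums overlap
// (W + W > N), which is needed to serialize conflicting writes.
func orderedWrites(n, w int) bool { return w+w > n }

func main() {
	// Classic configurations over N = 3 replicas.
	for _, c := range []struct{ r, w int }{{1, 1}, {2, 2}, {1, 3}, {3, 1}} {
		fmt.Printf("R=%d W=%d  read-overlap=%v write-overlap=%v\n",
			c.r, c.w, strongReads(3, c.r, c.w), orderedWrites(3, c.w))
	}
	// R=1 W=1 is fast but a read can miss the latest write entirely;
	// R=2 W=2 always overlaps, which is what 'quorum reads and writes'
	// means in Dynamo-descended stores.
}
```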

Lab

  • MIT 6.824/6.5840 Raft lab: implement Raft from scratch in Go. Only after finishing this do election, log matching, and safety really click.

  • Fault-tolerant KV lab: build a linearizable KV on top of your Lab 2 Raft. Teaches idempotency and client sessions against duplicate requests.

  • Sharded KV lab: multiple Raft groups plus shard migration. Closely mirrors production KVs like TiKV and CockroachDB.

  • TinyKV: PingCAP's engineering-flavored distributed systems course. A good complement to the MIT labs with a more practical tilt.

  • dragonboat examples: sample projects for a production-grade Go multi-Raft library. Comparing it to your Lab 2 teaches real engineering nuance.

  • A distributed database running deterministic simulations in the browser. Understanding it shows you the ceiling of modern fault-injection testing.

Reading

  • GFS: the original Google File System paper. HDFS and a generation of distributed storage descend from it; read for the pattern, not the details.

  • MapReduce: kicked off the big-data era. You may not use MR today, but its take on failure handling and retries shaped a whole generation of systems.

  • The Raft paper: a consensus algorithm deliberately designed for understandability. Reading it plus doing Lab 2 essentially nails down consensus.

  • Paxos Made Simple: Lamport's own 'simple' version. Still brain-bending, but required reading to understand why Raft looks the way it does.

  • Spanner: globally distributed, strongly consistent database; TrueTime is the key innovation. Every strong-consistency cloud DB today chases it.

  • Dynamo: the blueprint for eventually consistent KVs. NWR, vector clocks, gossip, and consistent hashing all in one place.

  • DDIA: Kleppmann's Designing Data-Intensive Applications. The only book in this module worth reading cover to cover.

  • Martin Kleppmann's blog: his takes on consistency, clocks, and stream processing are sharper and more current than the book.

Tools

  • Jepsen: Kyle Kingsbury's distributed-systems fault-testing framework and blog. Required reading for anyone shipping distributed storage.

  • etcd: production-grade Raft implementation and the backing store for Kubernetes. Reading its source is closer to engineering than reading the paper.

  • ZooKeeper: the veteran coordination service, built on ZAB. A huge pile of legacy systems rely on it for locks and leader election.

  • TLA+: Lamport's formal specification language. Use it to model a protocol before you code it and let the model checker find bugs for you.

  • Chaos Mesh: a chaos-engineering platform for Kubernetes. Inject partitions, node crashes, and latency to verify your system actually survives.

  • Toxiproxy: lightweight network fault-injection proxy. Handy for simulating partitions and latency inside integration tests.

05

Data Infra

Data infrastructure is the plumbing between raw writes and analytical queries: how storage engines trade off read vs write amplification, how columnar formats accelerate scans, how stream processors produce correct results under out-of-order events and failures, and how lakehouse table formats bolt ACID onto object stores. Understanding this layer is what lets you reason about modern OLAP, real-time warehouses, and lakehouses instead of just using them.

After this module you should be able to answer

Storage engines (LSM / B+ Tree)
  1. Why are LSM-Trees fast to write but potentially slow to read? What problem is compaction solving, and what new amplification does it introduce?
  2. Compare B+ Tree and LSM-Tree on write amplification, read amplification, and space amplification. Which workloads favor which?
  3. Which write patterns suit RocksDB's leveled, universal, and FIFO compaction? How do you balance the three amplifications?
  4. Why does LSM default to bloom filters? When can a bloom filter actually slow reads down?
  5. RocksDB's LSM compaction amplifies writes, so why does it still beat a B-Tree on SSDs? Which section of compaction_job.cc makes this clearest?
  6. Why did Redis pick single-threaded IO multiplexing over threads? Point to the latency vs throughput trade-off in ae.c and networking.c.
  7. Why does Redis Cluster use fixed 16384 hash slots instead of consistent hashing?
  8. How do deterministic-simulation databases (FoundationDB / TigerBeetle) differ from RocksDB-style engines at the storage layer?
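
The questions above can be grounded with a toy model. A minimal LSM sketch in Python (illustrative names, not RocksDB's API): writes go to a sorted in-memory memtable that is flushed into immutable SSTables, reads check newest-first, and compaction merges tables, trading extra writes for cheaper reads and reclaimed space.

```python
# Toy LSM-tree: memtable + flushed SSTables (names are illustrative, not RocksDB's API).
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # in-memory, mutable
        self.sstables = []            # on-"disk", immutable, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # cheap write: no in-place disk update
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable into a sorted, immutable SSTable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Read amplification: memtable first, then every SSTable newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

    def compact(self):
        # Merge all SSTables into one, dropping shadowed versions:
        # reclaims space and speeds reads, but rewrites data (write amplification).
        merged = {}
        for sst in self.sstables:
            merged.update(sst)
        self.sstables = [dict(sorted(merged.items()))]

db = ToyLSM()
for i in range(5):
    db.put(f"k{i}", i)
db.put("k0", 99)                      # newer version shadows the old one
assert db.get("k0") == 99
db.compact()
assert len(db.sstables) == 1 and db.get("k0") == 99
```

Bloom filters slot into `get` right before each SSTable probe, which is why they help point reads but do nothing for range scans.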
Columnar & query execution
  1. Why does Parquet split data into row group + column chunk + page? When do dictionary encoding and RLE actually pay off, and when do they hurt?
  2. What problem do Dremel's repetition and definition levels solve for nested data? Why can't you just flatten everything into scalar columns?
  3. Why is vectorized execution faster than the Volcano iterator model? Why can ClickHouse outrun Spark SQL by an order of magnitude?
  4. How do ORC and Parquet differ in footer / stripe / bloom filter layouts? Why does Hive lean toward ORC while the Spark ecosystem leans Parquet?
  5. What does Arrow (in-memory columnar) solve for zero-copy data movement compared to Parquet? Where does Flight fit?
  6. What flavor of 'LSM' is ClickHouse's MergeTree? What roles do the primary index, skip index, and parts play?
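
To see when dictionary encoding and RLE pay off, here is a minimal sketch (illustrative, not Parquet's actual wire format): a low-cardinality column with long runs collapses dramatically, while a high-cardinality one gains nothing and pays dictionary overhead.

```python
# Dictionary + run-length encoding sketch (illustrative, not Parquet's wire format).
def dict_rle_encode(column):
    # Dictionary: map each distinct value to a small integer code.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
    codes = [dictionary[v] for v in column]
    # RLE over the codes: (code, run_length) pairs.
    runs, prev, count = [], codes[0], 1
    for c in codes[1:]:
        if c == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = c, 1
    runs.append((prev, count))
    return dictionary, runs

def dict_rle_decode(dictionary, runs):
    rev = {i: v for v, i in dictionary.items()}
    return [rev[c] for c, n in runs for _ in range(n)]

# Low cardinality + long runs: huge win (1000 values -> 1 dict entry + 1 run).
col = ["US"] * 1000
d, runs = dict_rle_encode(col)
assert len(d) == 1 and len(runs) == 1
assert dict_rle_decode(d, runs) == col

# High cardinality, no runs: the dictionary adds overhead instead of saving space.
col = [f"id-{i}" for i in range(1000)]
d, runs = dict_rle_encode(col)
assert len(d) == 1000 and len(runs) == 1000
```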
Stream processing (Kafka / Flink)
  1. How is Kafka exactly-once actually implemented? What is the division of labor between idempotent producer, transactions, and read_committed consumers?
  2. What is a Flink watermark really? When a late event arrives, which knobs (allowedLateness, side output, trigger) decide what happens to the window?
  3. What do Kafka's ISR, leader election, and unclean leader election each mean for consistency?
  4. Once Kafka tiered storage moves cold data to S3, what new metadata and latency problems appear, and how are they solved?
  5. What do Flink's aligned vs unaligned checkpoints each solve, and what are the trade-offs?
  6. What pitfalls come from mixing event time and processing time? How do you design for late data and corrections?
  7. How do Flink's state backends (memory / rocksdb) trade off checkpoint size against recovery time?
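
A bounded-out-of-orderness watermark can be sketched in a few lines (illustrative logic, not Flink's API): the watermark trails the max event time seen by a fixed delay, a tumbling window fires once the watermark passes its end, and anything arriving after that is routed to a late-data side output.

```python
# Watermark sketch (illustrative, not Flink's API): a bounded-out-of-orderness
# watermark lags the max event time by a fixed delay; a tumbling window fires
# once the watermark passes its end, and anything after that is "late".
WINDOW, DELAY = 10, 3
windows, fired, late = {}, [], []
max_ts = 0

def on_event(ts, value):
    global max_ts
    watermark = max_ts - DELAY
    start = (ts // WINDOW) * WINDOW
    if start + WINDOW <= watermark:
        late.append((ts, value))           # window already fired: side output
        return
    windows.setdefault(start, []).append(value)
    max_ts = max(max_ts, ts)
    watermark = max_ts - DELAY
    # Fire every window whose end the watermark has now passed.
    for s in sorted(list(windows)):
        if s + WINDOW <= watermark:
            fired.append((s, sum(windows.pop(s))))

for ts, v in [(1, 1), (4, 1), (12, 1), (9, 1), (14, 1)]:
    on_event(ts, v)                        # (9,1) is out of order but not late
assert fired == [(0, 3)]                   # [0,10) fires once watermark >= 10
on_event(2, 1)                             # arrives after [0,10) fired -> late
assert late == [(2, 1)]
```

`allowedLateness` would widen the "not yet late" band past the watermark, and a custom trigger would change when `fired` gets appended; both are knobs on exactly this loop.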
Lakehouse / transactions
  1. How do Iceberg and Delta Lake achieve ACID on an object store that only offers PUT? Walk through snapshots, manifests, and how commit conflicts are resolved.
  2. What does a 'transaction' on an S3-backed lakehouse actually mean? How do Iceberg, Delta, and Hudi differ in resolving commit conflicts?
  3. How does Iceberg's three-tier metadata (metadata JSON → manifest list → manifests) enable time travel and partition evolution?
  4. Which read/write ratios fit Hudi's CoW vs MoR tables? How does Hudi MoR compare to Iceberg MoR on merge strategy?
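
The commit protocol behind these formats reduces to optimistic concurrency on a single pointer. A toy sketch (the real systems swap a metadata-file location in a catalog; here it is just an integer snapshot id):

```python
# Optimistic commit sketch (illustrative; real table formats like Iceberg swap
# a metadata-file pointer in a catalog, here it is just an integer snapshot id).
class ToyTable:
    def __init__(self):
        self.snapshot_id = 0
        self.manifests = []            # list of data-file lists, one per commit

    def commit(self, based_on, new_files):
        # Compare-and-swap: succeed only if nobody committed since we read.
        if based_on != self.snapshot_id:
            return False               # conflict: caller must rebase and retry
        self.manifests.append(new_files)
        self.snapshot_id += 1
        return True

table = ToyTable()
base = table.snapshot_id               # both writers read snapshot 0
assert table.commit(base, ["a.parquet"])               # writer 1 wins
assert not table.commit(base, ["b.parquet"])           # writer 2 conflicts...
assert table.commit(table.snapshot_id, ["b.parquet"])  # ...rereads and retries
assert table.snapshot_id == 2
```

Since every snapshot's manifests stay on the object store, time travel is just reading an older entry of `manifests` instead of the latest.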
MVCC / schema evolution / CDC
  1. Under MVCC, how is the snapshot visible to a transaction determined? Why are vacuum/GC unavoidable costs of MVCC?
  2. Why does CDC (Debezium et al.) parse binlog / WAL directly? How does the downstream stay stable across upstream schema changes?
  3. What compatibility rules do Avro, Protobuf, and JSON each define for schema evolution (add / drop / retype fields)?
  4. When do Debezium's snapshot + incremental mode vs log-only mode each fit?
  5. What is the core value of SQL-native ETL tools like dbt over traditional Airflow + hand-written SQL? How do they handle testing and lineage?
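
MVCC visibility can be sketched with Postgres-flavored xmin/xmax version stamps (heavily simplified; the real rules also involve hint bits, subtransactions, and command ids):

```python
# MVCC visibility sketch (Postgres-flavored, simplified): each row version
# carries the creating txid (xmin) and deleting txid (xmax); a snapshot is
# "all txids below snapshot_xid, minus those still active when it was taken".
def visible(version, snapshot_xid, active):
    xmin, xmax = version["xmin"], version["xmax"]
    # Creator must have committed before our snapshot (and not still be active).
    if xmin >= snapshot_xid or xmin in active:
        return False
    # If a deleter committed before our snapshot, the version is gone for us.
    if xmax is not None and xmax < snapshot_xid and xmax not in active:
        return False
    return True

# txid 5 inserted the row, txid 8 later deleted it; txid 7 was still active.
row = {"xmin": 5, "xmax": 8, "val": "old"}
assert visible(row, snapshot_xid=6, active={7})        # delete not yet visible
assert not visible(row, snapshot_xid=9, active=set())  # delete committed: gone
# Versions invisible to every live snapshot are exactly what vacuum/GC reclaims.
```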

Core concepts

  • The write-optimized, level-structured store. Memtable/SSTable/compaction is the shared foundation of every modern KV engine (RocksDB, Cassandra, ScyllaDB).

  • The default index of classic relational databases. Read it alongside LSM to see why OLTP and KV systems make different engineering trade-offs.

  • The de-facto columnar file format for analytics. Knowing its row group / page / encoding layout is why OLAP scans beat row stores by an order of magnitude.

  • The idea behind Parquet's and BigQuery's nested columnar storage. Repetition/definition levels are the trick for columnar-izing JSON-like data.

  • What lets Postgres, InnoDB, and most modern OLTP engines keep reads and writes from blocking each other. Understanding snapshots and visibility is table stakes for reading their code.

  • The canonical append-only log + segment + index design. It is the substrate for stream processing, CDC, and event-sourced architectures.

  • The mechanism for reasoning about out-of-order events in a stream. Without it, event-time windows and late-data handling make no sense.

  • One of the most misunderstood terms in streaming. This post ties idempotent producer, transactions, and consumer isolation into one coherent story.

  • One of the two dominant open lakehouse specs. Its snapshot/manifest/metadata layout is what enables schema evolution and time travel on object storage.

  • Databricks's competing lakehouse protocol. Comparing it to Iceberg highlights different trade-offs in transaction-log design.

Labs

  • Buffer pool, B+ Tree, concurrency control, transactions — the best open course for systematically building a database kernel from scratch.

  • Implement memtable, SSTable, and compaction step by step in Rust. Nothing teaches LSM internals like writing one.

  • Dozens of lines of code, but it shows up in LSM, databases, and caches everywhere. Writing one cements the hash-count vs false-positive trade-off.

  • Run the official quickstart end to end to build a real mental model of topic/partition/offset before tackling stream processing.

  • The smallest runnable DataStream example. Use it to feel how keyBy, windows, and watermarks actually compose.

  • Embedded OLAP that runs vectorized queries over Parquet on your laptop. Running EXPLAIN on real datasets makes query plans concrete fast.

  • Start with ae.c, networking.c, t_string.c to see the single-threaded event loop and the memory packing tricks in SDS and ziplist.

  • Industrial reference for LSM-trees. Read LevelDB for the skeleton, then jump to RocksDB for compaction and prefix bloom in production.

  • A textbook for columnar storage plus vectorized execution. Study the inner loops of AggregatingTransform and ColumnVector.

Reading

  • The founding paper for nested columnar storage and interactive SQL engines, and the intellectual core of BigQuery.

  • A landmark attempt at HTAP columnar storage; clearly explains why OLAP column stores struggle with low-latency random writes.

  • The seminal columnar OLAP paper and ancestor of Vertica. Its projection/compression/sort-column ideas are still in use today.

  • The 2011 original design paper. Short, but it lays out the trade-offs of a log-centric architecture clearly.

  • The reference for TrueTime and globally consistent transactions. It directly shaped CockroachDB, YugabyteDB, and others.

  • Kleppmann's Designing Data-Intensive Applications, the single most recommended survey book in the data-systems space.

Tools

  • A single-node vectorized OLAP engine. Great for local Parquet/CSV analysis and a fun target for studying columnar execution.

  • A production-grade columnar OLAP database with best-in-class query speed. Its MergeTree source code is a goldmine of engineering tricks.

  • The most widely embedded LSM KV engine in industry — MySQL, TiDB, CockroachDB, Kafka Streams, and many more rely on it.

  • kafka-console-producer/consumer/topics are your first line of defense when debugging a Kafka cluster.

  • The de-facto batch engine and the main compute layer in the lakehouse stack. Catalyst/Tungsten are how you learn modern SQL optimizers.

  • The de-facto stream processor and the industrial reference for state, checkpoints, and exactly-once execution.

06

AI Infra

AI infrastructure is about how models actually run on hardware: how training shards parameters, gradients, and optimizer state across GPUs and nodes; how inference keeps the KV cache and scheduler efficient; and how compilers lower operator graphs onto CUDA cores and Tensor Cores. This layer is what explains why the same model has wildly different latency across frameworks, and where cost optimization actually lives.

After this module you should be able to answer

GPU & CUDA basics
  1. How do CUDA threads / warps / blocks / grids map onto SMs? Why are block sizes usually a multiple of the 32-thread warp, typically 128 or 256?
  2. What is memory coalescing, and by what factor can a non-coalesced warp access waste bandwidth?
  3. What are the rough bandwidth and latency gaps between HBM, L2, SMEM, and registers? Which level does a "slow kernel" usually bottleneck on?
  4. Under the roofline model, what does it mean for a kernel to be compute-bound vs memory-bound, and how does that change the optimization direction?
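
The roofline model is just a min() of two ceilings. A sketch with illustrative A100-class numbers (assumed specs, not exact):

```python
# Roofline sketch: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved from HBM) sits below peak_flops / peak_bandwidth.
PEAK_TFLOPS = 312e12        # illustrative fp16 tensor-core peak
PEAK_BW = 2e12              # illustrative HBM bytes/s
ridge = PEAK_TFLOPS / PEAK_BW          # ~156 FLOP/byte: the compute/memory ridge

def attainable_flops(intensity):
    return min(PEAK_TFLOPS, intensity * PEAK_BW)

# Elementwise add: 1 FLOP per 12 bytes (2 fp32 reads + 1 write) -> memory-bound,
# so the fix is fusion / fewer bytes, not more math throughput.
add_intensity = 1 / 12
# A large GEMM reuses operands heavily -> intensity far above the ridge,
# so the fix is feeding the tensor cores, not saving bandwidth.
gemm_intensity = 300
assert add_intensity < ridge < gemm_intensity
assert attainable_flops(add_intensity) < 0.01 * PEAK_TFLOPS
assert attainable_flops(gemm_intensity) == PEAK_TFLOPS
```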
Training parallelism
  1. What exactly do data, tensor, and pipeline parallelism partition? Why must a 70B training run combine all three rather than rely on DP alone?
  2. What do ZeRO stages 1/2/3 shard? How does ZeRO relate to FSDP, and how much does communication cost grow compared to vanilla DP?
  3. Where does the pipeline-parallel bubble come from? How do 1F1B, interleaved 1F1B, and zero-bubble schemes each shrink it?
  4. What gap does sequence / context parallelism fill that tensor parallelism can't cover?
  5. How is compute/comm overlap actually achieved? Which operators most often end up on the critical path when you combine NCCL streams and buffers?
  6. Gradient checkpointing trades what for what? What's the typical ratio between saved memory and extra compute?
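
The ZeRO stages are easiest to remember as bytes-per-parameter arithmetic. A sketch assuming mixed-precision Adam (2-byte fp16 weights and grads, 12 bytes of fp32 master weight + momentum + variance per parameter):

```python
# ZeRO memory sketch: bytes per parameter per GPU under mixed-precision Adam.
# fp16 weights (2) + fp16 grads (2) + fp32 master/momentum/variance (12) = 16.
def bytes_per_param(stage, n_gpus):
    w, g, opt = 2, 2, 12
    if stage == 0:                      # vanilla DP: everything replicated
        return w + g + opt
    if stage == 1:                      # shard optimizer state
        return w + g + opt / n_gpus
    if stage == 2:                      # ... plus shard gradients
        return w + (g + opt) / n_gpus
    if stage == 3:                      # ... plus shard the parameters too
        return (w + g + opt) / n_gpus

# 70B params on 64 GPUs: vanilla DP would need ~1.12 TB *per GPU*; ZeRO-3 ~17.5 GB,
# which is why a 70B run cannot rely on plain DP.
params = 70e9
assert round(params * bytes_per_param(0, 64) / 1e12, 2) == 1.12
assert round(params * bytes_per_param(3, 64) / 1e9, 1) == 17.5
```

The price is communication: stage 3 must all-gather parameters for every forward and backward pass, which is the extra cost FSDP's prefetching and overlap exist to hide.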
Inference optimization
  1. What is the formula for KV cache memory, and why does it exceed the model weights for long-context inference?
  2. What specific problem does PagedAttention solve for KV cache, and why does it push memory utilization from 20-40% to 90%+?
  3. How does continuous batching differ from static batching, and why is the throughput win so large specifically for LLM inference?
  4. Why is FlashAttention fast? Is it an algorithmic win or a memory-access win, and why does it claim to leave attention's math unchanged?
  5. How do prefill and decode phases differ computationally? How much throughput does PD-disaggregated serving actually gain?
  6. How do Medusa, EAGLE, and Lookahead generate draft tokens in speculative decoding? What caps their achievable speedup?
  7. How much KV can prefix caching save when system prompts and multi-turn dialogs are shared? How does that relate to SGLang's radix tree?
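
The KV-cache formula from question 1 is worth writing out. A sketch with an assumed Llama-2-7B-like shape (32 layers, 32 KV heads, head_dim 128, fp16):

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * dtype bytes.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
weights_gb = 7e9 * 2 / 1e9                         # ~14 GB of fp16 weights
kv_gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 1e9
assert round(kv_gb, 1) == 17.2                     # the cache alone beats the weights
assert kv_gb > weights_gb
# GQA shrinks it linearly in kv_heads: 8 KV heads -> 1/4 the cache.
assert kv_cache_bytes(32, 8, 128, 4096, 8) == kv_cache_bytes(32, 32, 128, 4096, 8) / 4
```

Because the term grows with seq_len * batch while the weights are fixed, long-context, high-concurrency serving is exactly where the cache overtakes the model and PagedAttention's allocation strategy starts to matter.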
Precision & quantization
  1. What are the numerical-range differences between fp16, bf16, and fp8? Why does fp16 need loss scaling when bf16 usually doesn't?
  2. How do INT8, INT4, AWQ, and GPTQ quantization schemes differ? Where does accuracy degradation become unacceptable?
  3. How much does fp8 training (Hopper E4M3 / E5M2) save over bf16, and what are the convergence risks?
  4. How does quantizing the KV cache to INT8 / INT4 affect long-context inference latency, and which architectures are most prone to accuracy loss?
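
The range differences fall straight out of the bit layouts. A sketch computing the max normal value from exponent/mantissa widths (IEEE-style formats; fp8 E4M3 deliberately breaks the pattern):

```python
# Max normal value of an IEEE-style binary float from its field widths
# (top exponent code reserved for inf/NaN): (2 - 2^-m) * 2^(2^(e-1) - 1).
def max_normal(exp_bits, man_bits):
    max_exp = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -man_bits) * 2 ** max_exp

assert max_normal(5, 10) == 65504.0              # fp16: overflows easily -> loss scaling
assert 3.38e38 < max_normal(8, 7) < 3.40e38      # bf16: fp32-like range, less precision
assert max_normal(5, 2) == 57344.0               # fp8 E5M2 (IEEE-like variant)
# fp8 E4M3 reclaims the top exponent code (no inf), so its real max is 448,
# larger than the plain IEEE formula would predict for e=4, m=3.
```

The asserts make the fp16-vs-bf16 story concrete: bf16 trades 3 mantissa bits for fp32's exponent range, which is why it usually skips loss scaling while fp16 cannot.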
Model architecture (MoE / GQA)
  1. How do LoRA / QLoRA trade memory for quality vs full fine-tuning? Which layers give the best bang for the buck?
  2. What are the three hardest problems in MoE expert-parallel implementations (routing, all-to-all, load imbalance)?
  3. What do GQA / MQA save compared to MHA? Why do nearly all new models pick GQA?
  4. How do DeepSeek's fine-grained experts + shared experts differ in engineering from Mixtral's top-2 routing?
  5. What does multi-head latent attention (MLA) add beyond GQA on KV-cache compression?
Metrics & evaluation
  1. What's the difference between MFU and HFU? What levels does industrial training usually hit?
  2. How does vLLM's PagedAttention eliminate KV-cache fragmentation, and which OS mechanism is it borrowing from?
  3. How should you decompose an inference SLO across TTFT, TPOT, and end-to-end latency when request sizes vary wildly?

Core concepts

  • The foundational abstraction of GPU programming. Without it, kernel tuning and occupancy are just magic numbers.

  • A warp is the minimal scheduling/execution unit on a GPU; shared memory is the fast intra-block channel. Together they are the core tools for high-performance kernels.

  • Fusing 32 threads of a warp into one memory transaction is a prerequisite for saturating GPU bandwidth.

  • The Megatron-LM systems paper is the clearest treatment of combining all three. You cannot train modern LLMs without it.

  • Shards optimizer state, gradients, and parameters across a data-parallel group — the theoretical basis of DeepSpeed and FSDP.

  • The per-token state that makes autoregressive generation tractable. It sets the floor of LLM inference memory and the ceiling of scheduling.

  • vLLM's key contribution: apply OS-style virtual-memory paging to the KV cache and kill fragmentation.

  • Form batches at token granularity rather than waiting for whole requests. Typically 2–10x throughput for LLM serving.

  • A small draft model proposes, the big model verifies in parallel. The mainstream way to cut latency without changing model quality.

  • Uses tiling and recomputation to minimize HBM traffic for attention. One of the most important kernel-level wins of the past few years.

Labs

  • Tianqi Chen's deep-learning-systems course. You build from autograd up to CUDA kernels and see a DL framework end to end.

  • The classic parallel-computing course with labs in ISPC, CUDA, and MPI. The best way to build a GPU mental model.

  • Karpathy's pure-C/CUDA GPT training. Reading it turns PyTorch from a black box into a glass box.

  • Andrew Chan's pure C++/CUDA single-GPU inference engine with zero external deps. Hits 63.8 tok/s on Mistral-7B, matching or beating llama.cpp. If llm.c is training, yalm is inference — the complementary project on the same axis.

  • Sasha Rush's 14 interactive puzzles targeting warps, shared memory, and reductions. A few hours well spent for GPU intuition.

  • ~100 lines of Python that implement backprop. After this, PyTorch's computation graph stops feeling magical.

  • OpenAI's GPU DSL with a much lower barrier than CUDA. The lingua franca for FlashAttention and fused kernels today.

  • Open-source implementation of PagedAttention and continuous batching. Primary source for LLM inference performance.

Reading

  • Vaswani et al.'s 2017 transformer paper. The shared foundation under all of LLM infrastructure — self-attention, multi-head, and positional encoding all come from here.

  • NVIDIA's 2017 fp16 training paper. Loss scaling and master weights — the defaults in every modern training framework — originate here.

  • Google's founding paper on pipeline parallelism: slice the model by layers and fill the bubble with micro-batches. Every subsequent schedule — 1F1B, interleaved 1F1B, zero-bubble — is an iteration on it.

  • NVIDIA's large-model training systems paper and the canonical source for tensor parallelism.

  • Megatron team's SC'21 follow-up that combines data / tensor / pipeline parallelism on thousands of A100s. The de-facto reference architecture for modern LLM training — virtually every team starts from its blueprint.

  • DeepSpeed's sharded-optimizer paper and the theoretical basis for FSDP and modern large-scale training.

  • Google scaling MoE to trillion parameters. Its design choices — Top-1 routing, load-balancing loss, expert capacity — shaped an entire generation of MoE models including Mixtral and DeepSeek MoE.

  • A textbook example of IO-aware kernel design. Reading it teaches how HBM vs SRAM bandwidth dictates kernel structure.

  • Tri Dao's v2: re-partitions thread blocks and cuts non-matmul work, giving another 2x on attention kernels on A100. The default attention impl in virtually every inference framework today.

  • Hopper-specific redesign using wgmma, asynchronous TMA, and warp specialization to hit near hardware speed-of-light on H100 for fp16/fp8 attention.

  • The foundational paper on continuous batching + iteration-level scheduling. Every LLM scheduler in vLLM, TensorRT-LLM, and SGLang traces its core idea back here.

  • The milestone systems paper for LLM inference. It formalizes scheduling and memory management as first-class problems.

  • The canonical paper on prefill/decode disaggregation. Pushes a new Pareto frontier between latency and throughput by decoupling the two phases — one of the most important architectural shifts in current inference stacks.

  • Leviathan et al.'s 2022 original paper. The 'small drafter + large verifier in parallel' paradigm — Medusa, EAGLE, and Lookahead are all descendants.

  • Single-pass INT4 post-training quantization with negligible accuracy loss. Almost every INT4 weight in the llama.cpp / HuggingFace ecosystem goes through this.

  • Activation-aware weight quantization: identify the 'salient' channels from activation statistics and keep them in higher precision. Splits the production INT4 market with GPTQ.

  • The LLVM team's new IR infrastructure aimed at AI and heterogeneous compilers. Essential background for understanding modern compiler stacks.

  • The seminal deep-learning compiler paper. Its schedule/compute split deeply influenced Triton and the Halide community.

  • Andrew Chan's long-form companion to yalm. Builds from naive code up to production-competitive performance, covering OpenMP + AVX, warp reductions, kernel fusion, attention kernels, KV cache quantization, and manual unrolling and prefetching. The clearest single-node writeup on inference optimization in recent years.

Tools

  • The de-facto deep-learning framework — virtually every open-source model ships as a PyTorch checkpoint. torch.compile, FSDP2, and DTensor are the current pillars of the ecosystem.

  • Google's functional DL framework on top of XLA. Shines on TPUs and at large-scale training (Gemini, Mixtral training, etc.).

  • NVIDIA's open-source training framework. Reference implementation of tensor / pipeline / sequence parallelism, and the common starting point for GPT-3-scale training and above.

  • PyTorch team's native 4D-parallel training library (2024+). Skips Megatron by building directly on DTensor / FSDP2 — PyTorch's own reference training stack.

  • Microsoft's training optimization library and the primary implementation of ZeRO. A common choice for large-model training.

  • NVIDIA's fp8 training / inference kernel library and the de-facto standard for fp8 on Hopper and Blackwell. Tightly integrated with Megatron-LM.

  • HuggingFace's thin wrapper over multi-GPU, mixed precision, and FSDP. The easiest path from single-GPU to multi-GPU training.

  • Official implementation of the FlashAttention papers (v1 / v2 / v3). Directly integrated into PyTorch 2.2+ SDPA.

  • Meta's high-performance transformer operators — memory-efficient attention, SwiGLU, ALiBi, and more.

  • An order of magnitude easier than CUDA for writing GPU kernels. The main implementation language for production fused attention and quantization kernels.

  • The de-facto LLM inference engine today and the industrial reference for PagedAttention plus continuous batching.

  • NVIDIA's flagship LLM inference engine. At high concurrency it runs 30-50% faster than vLLM, paid for with a heavier compilation step and narrower model coverage. Powers most large cloud serving stacks.

  • The rising star: RadixAttention keeps prefix-cache hit rates very high, and it often beats vLLM on newer architectures like DeepSeek-V3. The first pick for structured generation and multi-turn workloads.

  • Pure C/C++ local inference engine running on Mac Metal, CPU, CUDA, and Vulkan. GGUF + INT4 quantization is the de-facto standard for consumer-grade local LLMs.

  • HuggingFace's inference server — once the most popular production option. Now in official maintenance mode with vLLM/SGLang recommended instead, but still widely deployed.

  • De-facto runtime for int8 / nf4 quantization. Behind QLoRA, HuggingFace Transformers' load_in_8bit / load_in_4bit, and most consumer-grade quantization.

  • Custom Triton kernels for LoRA / QLoRA delivering 2x speed and 50%+ VRAM reduction over plain HuggingFace. The first choice for fine-tuning on consumer GPUs.

  • Distributed Python computing — Ray Train / Ray Serve / Ray Data form a full ML infrastructure stack. Multi-node vLLM and SGLang both run on top of it.

  • The first tool to reach for when analyzing training/inference performance. Gives per-op CPU/GPU time and memory allocations.

  • NVIDIA's official GPU profiler. Kernel timing, SM occupancy, and memory-bandwidth bottlenecks all live here.

07

CUDA / GPU Programming

A GPU isn't a 'many-core CPU' — it's a throughput-oriented massively parallel machine: thousands of registers per SM, hundreds of KB of shared memory, tens of thousands of threads in flight. This module pulls apart CUDA's execution model, memory hierarchy, and sync primitives from the hardware angle, so you can write your own kernels, read the inner loops of CUTLASS and FlashAttention, and know from a single Nsight Compute metric where to optimize.

After this module you should be able to answer

Execution model & warps
  1. What does a kernel launch go through between the host call and the GPU starting execution (driver, runtime, queue, command processor, SM)? Roughly what fixed overhead does one launch cost?
  2. How many instructions can an SM's warp scheduler issue per cycle? Is 100% occupancy always fastest — why did occupancy become less critical after Volta?
  3. What is warp divergence? If 16 threads in a warp take the if and 16 take the else, how does the hardware execute it, and what is the cost?
  4. When do Cooperative Groups and grid.sync fit? How does combining them with persistent kernels eliminate launch overhead?
  5. Where do CUDA Graph gains over per-launch dispatch actually come from? When is there no meaningful win?
Memory hierarchy & access
  1. What causes a shared-memory bank conflict? When 32 threads in a warp hit different addresses in the same bank, how many serialized transactions does it become?
  2. What are the rough latencies and bandwidths of register / shared / L1 / L2 / HBM? Given a data-reuse pattern, where would you place the data?
  3. With ~192 KB combined shared memory and L1 on an SM, how do you configure a kernel to avoid register spills into local memory?
  4. How much bandwidth do vectorized loads (float4, ldmatrix) recover? Why is it almost mandatory for writing GEMM?
  5. What's the cost of unified memory (cudaMallocManaged) page migration? Which workloads should fall back to explicit cudaMemcpy?
  6. What does async copy (cp.async) save compared to the classic global → shared path through registers? How does Hopper extend it over Ampere?
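
Bank-conflict cost is countable. A sketch that maps word addresses to 32 banks and reports how many serialized passes a warp access needs (simplified: 4-byte words, same-word accesses broadcast for free):

```python
# Shared-memory bank conflict sketch: 32 banks, one 4-byte word per bank per
# cycle; distinct words in the same bank from one warp serialize.
def conflict_degree(word_addrs, n_banks=32):
    per_bank = {}
    for a in set(word_addrs):                  # same word -> broadcast, no conflict
        bank = a % n_banks
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())              # serialized transactions needed

stride1 = [tid for tid in range(32)]           # each thread its own bank: 1 pass
stride32 = [tid * 32 for tid in range(32)]     # all 32 threads hit bank 0: 32 passes
padded = [tid * 33 for tid in range(32)]       # pad the row to 33 words: back to 1
assert conflict_degree(stride1) == 1
assert conflict_degree(stride32) == 32
assert conflict_degree(padded) == 1
```

The `padded` case is the classic fix: padding a shared-memory tile from 32 to 33 words per row (or swizzling the layout) spreads column accesses across all banks.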
Tensor Cores / MMA
  1. What's the difference between a Tensor Core and a regular CUDA core? When do you use the wmma API versus writing mma PTX directly?
  2. How does CUTLASS 3.x's CuTe layout abstraction differ qualitatively from the 2.x tile iterator?
  3. How does wgmma (Hopper) differ from mma (Ampere) in execution granularity and asynchrony?
  4. Why is FlashAttention-3 another 1.5-2x faster on Hopper? Which features of warp specialization and wgmma does it exploit?
Multi-GPU communication
  1. How is concurrency across cudaStreams implemented? What are the relative costs of event, graph, and barrier synchronization?
  2. Why does NCCL ring all-reduce hit near peak NVLink bandwidth? At 64-GPU scale, would a tree algorithm do better?
  3. What are the rough bandwidth/latency levels of NVSwitch, NVLink, PCIe, and InfiniBand? How should a training cluster be topologically organized?
  4. What does SHARP (NVIDIA's in-network reduction) save over traditional ring all-reduce, and what are the deployment constraints?
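
The ring all-reduce numbers come from a simple traffic formula: each GPU sends and receives 2(N-1)/N of the buffer. A sketch with illustrative link bandwidth:

```python
# Ring all-reduce cost sketch: per-GPU traffic is 2*(N-1)/N of the buffer
# (reduce-scatter pass + all-gather pass), so bandwidth cost is nearly flat in N.
def ring_allreduce_seconds(bytes_total, n_gpus, link_bytes_per_sec):
    per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * bytes_total
    return per_gpu_traffic / link_bytes_per_sec

# 1 GB gradient buffer over ~150 GB/s effective NVLink (illustrative numbers).
t8 = ring_allreduce_seconds(1e9, 8, 150e9)
t64 = ring_allreduce_seconds(1e9, 64, 150e9)
assert t64 / t8 < 1.13         # near-flat in N: 2*(63/64) vs 2*(7/8)
# The catch: the ring takes 2*(N-1) latency-bound steps, which is why tree
# algorithms (and in-network reduction like SHARP) win for small messages at scale.
```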
New hardware (Hopper / Blackwell)
  1. What new capabilities do Hopper's thread block cluster and distributed shared memory unlock?
  2. What flavors of TMA swizzling exist? Why must you pick the right one to pair with wgmma?
  3. What extra precision constraints do fp8 tensor cores (Hopper FP8 / Blackwell FP4) impose on training and inference?
  4. What does Blackwell's 2nd-gen Transformer Engine plus FP4 tensor core add on top of Hopper?
  5. What does MPS (Multi-Process Service) solve? How does it differ from MIG on usage and isolation?
Tuning tools
  1. How do you read Nsight Compute's Speed of Light (SOL) metric? What root causes do 'long scoreboard', 'short scoreboard', and 'barrier' stalls indicate?
  2. On the Nsight Systems timeline, how do you tell 'CPU launching too slowly' apart from 'GPU actually idle'?
  3. Which fields of nvcc's --ptxas-options=-v output are most useful when tuning occupancy and register budget?
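
The ptxas register report feeds directly into an occupancy estimate. A sketch with illustrative Ampere-class per-SM limits (assumed: 64K registers, ~163 KB shared memory, 16 blocks, 2048 threads; check your GPU's actual limits):

```python
# Occupancy sketch: resident blocks per SM are capped by whichever resource
# runs out first - registers, shared memory, or the hardware block/thread limits.
def blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block,
                  regfile=65536, smem=166912, max_blocks=16, max_threads=2048):
    by_regs = regfile // (regs_per_thread * threads_per_block)
    by_smem = smem // smem_per_block if smem_per_block else max_blocks
    by_threads = max_threads // threads_per_block
    return min(by_regs, by_smem, by_threads, max_blocks)

def occupancy(regs_per_thread, smem_per_block, threads_per_block, max_threads=2048):
    b = blocks_per_sm(regs_per_thread, smem_per_block, threads_per_block)
    return b * threads_per_block / max_threads

# 32 regs/thread, no shared memory, 256-thread blocks: full occupancy.
assert occupancy(32, 0, 256) == 1.0
# At 96 regs/thread the register file becomes the limiter: occupancy drops to 25%.
assert occupancy(96, 0, 256) == 0.25
```

This is exactly the arithmetic behind reading `--ptxas-options=-v` output: the reported registers and shared memory per block plug straight into the min().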

Core concepts

  • The execution units, register file, and L1 / shared layout inside a single SM drive kernel design. Knowing SM count, warp width, and register budget is the starting point for all tuning.

  • Which SM a block lands on and how warps get scheduled determine occupancy and latency hiding. Not just a concept — see how it maps onto the hardware.

  • Register / shared / L1 / L2 / global / HBM differ by 2–3 orders of magnitude in bandwidth and latency. 90% of GPU perf work is answering 'which tier does this data belong to?'

  • Merging 32 threads of a warp into a single global memory transaction is the precondition for saturating HBM bandwidth.

  • Shared memory is split into 32 banks; different words in the same bank serialize. Padding and swizzling are the two standard workarounds.

  • shfl_sync / ballot_sync / reduce_sync and friends. High-performance reduce, scan, and transpose kernels all depend on them.

  • A C++ API unifying synchronization at thread / warp / block / grid scope. Grid-level sync is the building block for persistent kernels.

  • Streams are GPU work queues — different streams run concurrently. Overlapping compute, H2D, and D2H on three streams is a baseline training / inference skill.

  • The matrix multiply-accumulate units introduced in Volta. wmma is the C++ API; mma PTX is the lower layer. cuBLAS, CUTLASS, and FlashAttention all lean on them.

  • Ampere's cp.async bypasses registers when copying global → shared; Hopper's TMA batches it further. Modern high-perf kernels use them by default.

  • PTX is NVIDIA's virtual ISA; SASS is the actual machine code. The last mile of perf debugging often ends up reading one or the other.

Labs

  • Official samples cover dozens of examples from vectorAdd to cooperative groups and CUDA Graph. Read, tweak, and measure — the fastest on-ramp.

  • Simon Boehm's classic 10-step walkthrough from naive matmul to near-cuBLAS performance. Follow along and shared-memory tiling, register blocking, and double buffering become second nature.

  • Mark Harris's classic seven-step optimization. Each step — from naive pairwise add to warp shuffle — exposes one hardware constraint.

  • Community-run lecture archive covering warp primitives through FlashAttention and Triton.

  • From vector add all the way to fused attention. Beyond CUDA itself, Triton is the modern starting point for production kernels.

  • Read the official implementation, then write your own tiled attention. Gives you visceral understanding of IO-aware kernel design.

  • Pure C++/CUDA LLM inference — focus on matmul warp reductions, kernel fusion, attention kernels, and manual unroll / prefetch. Great for internalizing the Nsight-metric → kernel-rewrite loop end to end.

Reading

  • The authoritative reference. Programming model, hardware implementation, and performance guidelines are the three chapters you revisit constantly.

  • Andrew Chan's long-form companion to yalm — the clearest walkthrough of CUDA inference optimization available: warp reductions, kernel fusion, KV cache quantization, and why hand-written unroll/prefetch beats the compiler output.

  • Frames perf work as APOD (Assess / Parallelize / Optimize / Deploy). Skim it once before writing a new kernel to avoid the classic traps.

  • Hwu & Kirk's GPU programming textbook — parallel thinking through stencil, reduce, scan, and GEMM patterns in one coherent sweep.

  • Required reading for inline PTX, hand-written mma, or reading disassembly.

  • NVIDIA's open GEMM template library. Reading its tile iterator, pipeline, and shape templates is basically studying modern GEMM alongside NVIDIA engineers.

  • Currently the clearest online walkthrough of CUDA matmul optimization. Spells out the metric and trade-off at every step.

Tools

  • The CUDA compiler driver. Understanding -arch / -code and --ptxas-options=-v (prints register use and occupancy) is the first step of tuning.

  • Kernel-level profiler giving you the roofline, warp stall reasons, and memory chart. First thing to run after you finish a kernel.

  • System-level timeline profiler for CPU / GPU / CUDA stream / NCCL alignment. Primary tool for finding stream dependencies and idle gaps.

  • GPU-side equivalent of ASan — catches out-of-bounds, races, and uninitialized memory. Running it in CI saves many memory-trashing nights.

  • gdb for the GPU — set breakpoints inside a running kernel, inspect warp state. Last-resort weapon for diagnosing illegal memory access.

  • Baseline command for SM utilization, memory, thermals, and power. DCGM is the cluster-oriented upgrade that emits Prometheus metrics.

  • Official benchmark for multi-GPU / multi-node collective bandwidth. Run all_reduce_perf first when diagnosing training comms bottlenecks.

08

Eng & Observability

Writing the code is only half the job; keeping it alive in production is the other half — containers, K8s, Prometheus, OpenTelemetry, SLOs. This module trains you to deploy a service to a cluster on your own, define sensible SLIs and SLOs, and at 3 a.m. walk from the four golden signals and a trace down to the exact line of code that broke.

After this module you should be able to answer

Containers / K8s basics
  1. Where do image layers and a container's writable layer actually live? Why is an image produced by `docker commit` usually bigger than one built from a Dockerfile?
  2. What scheduler decisions are driven by Pod requests vs. limits? What happens if you set a limit without a request, or the other way around?
  3. Ten Pods sit behind one Service — how does connection distribution differ between kube-proxy's iptables and IPVS modes?
  4. Under what conditions does a Deployment rolling update get stuck? When `kubectl rollout status` says stuck, which resources do you check, in order?
  5. What's the override precedence for Helm values? What changes if you drop the `-` from `{{- if .Values.x -}}` in a template?
  6. Where do init containers, sidecars, and ephemeral containers each fit in the Pod lifecycle?
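Question 3 above is easy to internalize with a toy simulation: in iptables mode kube-proxy spreads new connections probabilistically (the "statistic random" match), while the IPVS rr scheduler keeps per-service state and rotates deterministically. A hedged sketch — pod names and connection counts are made up:

```python
import random
from collections import Counter

PODS = [f"pod-{i}" for i in range(10)]

def iptables_pick(rng):
    # iptables mode: each new connection lands on a uniformly random
    # endpoint, so the spread is even only in expectation.
    return rng.choice(PODS)

def ipvs_rr():
    # IPVS mode with the rr scheduler: strict round-robin over endpoints.
    while True:
        for pod in PODS:
            yield pod

rng = random.Random(0)
iptables_dist = Counter(iptables_pick(rng) for _ in range(1000))
rr = ipvs_rr()
ipvs_dist = Counter(next(rr) for _ in range(1000))
print("iptables spread:", max(iptables_dist.values()) - min(iptables_dist.values()))
print("ipvs counts:", set(ipvs_dist.values()))  # exactly 100 per pod
```

The same intuition explains why long-lived connections defeat both modes: balancing happens at connection setup, not per request.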
Advanced K8s patterns
  1. How does the kube-scheduler plugin architecture (filter / score) extend? What's the typical path for deploying a custom scheduler?
  2. How do Pod priority and preemption work? What common traps appear at large scale?
  3. What are the core differences between StatefulSet and Deployment? Why do stateful services usually still need an Operator?
  4. How do you write an idempotent reconcile loop in the Operator pattern? What's the trade-off between level-triggered and edge-triggered designs?
  5. What does the typical CRD + admission webhook extension path look like? At what stage do mutating vs validating webhooks intercept requests?
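The level-triggered reconcile loop at the heart of question 4 can be sketched in a few lines: recompute actions from the full desired/observed diff on every pass, so rerunning it against converged state is a no-op. The dict-based state model below is a simplification for illustration, not controller-runtime's API:

```python
def reconcile(desired: dict, observed: dict) -> list:
    """One pass of a level-triggered reconcile: diff full desired state
    against full observed state. Edge-triggered designs react to single
    events and must never miss one; a level-triggered loop can always
    recover by recomputing from scratch."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def apply(observed: dict, actions: list) -> dict:
    for action in actions:
        if action[0] == "delete":
            observed.pop(action[1])
        else:
            observed[action[1]] = action[2]
    return observed

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
observed = apply(observed, reconcile(desired, observed))
# Idempotence: a second pass over converged state emits no actions.
print(reconcile(desired, observed))  # []
```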
Releases & GitOps
  1. How does GitOps (ArgoCD / Flux) reconcile 'cluster drifted from Git'? When does auto-sync fit, and when manual?
  2. What are the differences between canary, blue/green, and rolling releases? What does Istio traffic shifting add on top of plain Service + Deployment?
  3. How does the metric-driven rollback loop work in progressive delivery (Flagger / Argo Rollouts)?
  4. In multi-cluster, multi-region deployments, how do ArgoCD ApplicationSet and Flux Kustomization differ as engineering models?
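The metric-driven rollback loop from question 3 reduces to a small control loop: shift a traffic weight, query a metric, promote or roll back. A sketch under simplified assumptions — the step weights and the `error_rate_at` callback are hypothetical stand-ins for an analysis run against Prometheus:

```python
def progressive_rollout(weights, error_rate_at, budget=0.01):
    """Sketch of a Flagger / Argo Rollouts style loop: at each traffic
    step, gate on a metric; any breach aborts and shifts traffic back
    to the stable version."""
    for w in weights:
        if error_rate_at(w) > budget:
            return ("rolled_back", w)
    return ("promoted", 100)

# Healthy canary: errors stay under budget at every step.
print(progressive_rollout([10, 25, 50, 100], lambda w: 0.002))
# -> ('promoted', 100)

# Canary that only degrades once it sees real load at 50% weight --
# the case that makes stepped weights worth the extra machinery:
print(progressive_rollout([10, 25, 50, 100],
                          lambda w: 0.05 if w >= 50 else 0.002))
# -> ('rolled_back', 50)
```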
SLO / error budget
  1. You're defining an SLO for a latency-sensitive API: p99 or p999? What is the team obligated to do once the error budget is exhausted?
  2. Mapping the four golden signals (latency, traffic, errors, saturation) to Prometheus, which metric types (counter, gauge, histogram) do you pick for each?
  3. How do multi-window multi-burn-rate SLO alerts compose? Why does a single burn rate produce false positives?
  4. How does 'freeze feature work once error budget is spent' play out in engineering culture? When is an exception justified?
  5. When does histogram_quantile in Prometheus mislead you? Where does summary fit better?
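The multi-window, multi-burn-rate rule from question 3 becomes concrete with the thresholds commonly cited from the Google SRE Workbook (14.4x over 1h paired with 5m, 6x over 6h paired with 30m). A sketch assuming a 99.9% SLO and pre-computed per-window error rates:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # Burn rate 1.0 == spending the budget exactly over the SLO window.
    return error_rate / (1 - slo)

def should_page(rates: dict, slo: float = 0.999) -> bool:
    """Pair a long window (proves the problem is sustained) with a
    short one (lets the alert reset quickly once it stops)."""
    fast = burn_rate(rates["1h"], slo) > 14.4 and burn_rate(rates["5m"], slo) > 14.4
    slow = burn_rate(rates["6h"], slo) > 6 and burn_rate(rates["30m"], slo) > 6
    return fast or slow

# A short error spike that already ended: the 1h window still looks bad,
# but the 5m window has recovered -- exactly the false positive a single
# burn rate would page on.
print(should_page({"1h": 0.02, "5m": 0.0001, "6h": 0.004, "30m": 0.0001}))  # False
# A sustained breach: both windows of the fast pair are burning hard.
print(should_page({"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.004}))     # True
```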
Observability / tracing
  1. How are trace, metric, and log contexts correlated in OpenTelemetry? Given an error log, how do you jump to the matching trace?
  2. What does a production OpenTelemetry Collector pipeline (receivers / processors / exporters) typically look like?
  3. How do you choose between head-based and tail-based distributed tracing sampling? Where does the engineering complexity of tail sampling live?
  4. When logs land in Loki, traces in Tempo, and metrics in Prometheus, which label/attribute conventions make "jump from trace id to logs and metrics" work?
  5. Why is continuous profiling (pprof / Parca / Pyroscope) considered the fourth observability pillar? How does it complement metrics/logs/traces?
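The trace/log correlation in question 1 hinges on propagating the W3C `traceparent` header and stamping its trace id onto every log record. A minimal sketch using only the standard library — the header value is the example from the W3C Trace Context spec, and the log format is illustrative:

```python
import logging

def parse_traceparent(header: str):
    """Split a W3C traceparent header: version-traceid-spanid-flags,
    with a 32-hex-char trace id and a 16-hex-char span id."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id

trace_id, span_id = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

# Stamping the id onto every log line is what makes "jump from an error
# log to the matching trace" work in Grafana's Loki/Tempo pairing.
logging.basicConfig(format="%(levelname)s trace_id=%(trace_id)s %(message)s")
logging.error("payment failed", extra={"trace_id": trace_id})
```

In practice the OpenTelemetry SDK does this injection for you; the point is that correlation is just a shared id, not magic.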

Core concepts

  • Process isolation via namespaces and cgroups; images are stacks of read-only layers. First, accept that a container is not a VM.

  • The scheduler binds Pods to Nodes using resources, affinity, and taints/tolerations. Unavoidable in both interviews and incident response.

  • Requests drive scheduling and QoS class; limits set the cgroup ceiling. Most OOMKilled incidents trace back to mistakes here.
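The QoS classification behind that bullet can be written down directly. A simplified sketch covering only cpu/memory — the real kubelet logic also handles hugepages and API-server defaulting, so treat this as the shape of the rules, not the implementation:

```python
def qos_class(containers: list) -> str:
    """Simplified Kubernetes QoS rules:
    - BestEffort: no container sets any request or limit.
    - Guaranteed: every container sets cpu and memory limits, and its
      requests (defaulted to the limits when unset) equal the limits.
    - Burstable: everything in between.
    """
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        requests = c.get("requests", {})
        for r in ("cpu", "memory"):
            if r not in limits:
                return "Burstable"
            if requests.get(r, limits[r]) != limits[r]:
                return "Burstable"
    return "Guaranteed"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Burstable
print(qos_class([{}]))                                              # BestEffort
```

The class matters under memory pressure: BestEffort Pods are evicted first, Guaranteed last — which is why "just set limits" is not a complete answer.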

  • Service gives you a stable in-cluster VIP with load balancing; Ingress handles L7 entry. You can't debug cluster networking without understanding kube-proxy.

  • The three observability pillars. Metrics say 'is something wrong', logs say 'what happened', traces say 'where in the call chain'.

  • SLI / SLO / SLA — indicator, objective, agreement. The error budget is the shared language engineering and product use to balance speed and reliability.
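The error budget is just arithmetic on the SLO window, and the numbers are worth memorizing; a quick sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (or full-outage-equivalent error volume) an availability
    SLO permits over the window -- the concrete number behind the budget."""
    return window_days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%}: {error_budget_minutes(slo):.1f} min / 30 days")
# 99%   -> 432.0 min
# 99.9% -> 43.2 min  (why "three nines" still allows a bad afternoon)
# 99.99% -> 4.3 min  (why "four nines" rules out manual response)
```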

  • Deployments handle rolling updates and rollbacks. Combined with readiness probes and PDBs, that's what actually makes releases safe.

  • Latency, traffic, errors, saturation. Every service dashboard should start from these four.

Lab

Reading

  • Kubernetes Patterns — abstracts the recurring designs of K8s into named patterns: Sidecar, Ambassador, Init Container, and friends.

  • Site Reliability Engineering (the Google SRE book) — the methodological foundation: SLOs, error budgets, on-call, and post-mortems all originate here.

  • The Kubernetes documentation — unusually good official docs. Read the Concepts section end to end at least once.

  • The Prometheus documentation — data model, PromQL, recording rules, Alertmanager. Internalize the model before you build monitoring.

  • The OpenTelemetry specification — the unified standard for traces, metrics, and logs. Keep the API / SDK / Collector layers straight.

Tools

  • kubectl — the Swiss Army knife of K8s. describe, logs, exec, port-forward, debug: five subcommands you'll reach for daily.

  • Helm — the package manager for K8s. Almost no one deploys raw YAML in production.

  • Prometheus — the de-facto metrics system. Pull model + labels + PromQL is the foundation of cloud-native monitoring.

  • Grafana — the dashboard and alerting frontend. Not just for Prometheus: it also fronts Loki, Tempo, and assorted databases.

  • OpenTelemetry — a vendor-neutral standard for observability data. SDK instrumentation plus Collector forwarding is the recommended path today.

  • Jaeger — open-source distributed tracing backend. The workhorse for navigating trace trees and pinpointing slow cross-service calls.