A Tour of NVIDIA's GPU Programming Stack — From PTX to CuTe DSL
Looking at NVIDIA’s GPU programming ecosystem from the outside in 2026, it feels increasingly chaotic: PTX, CUDA C++, CUTLASS, CuTe, Triton, CuTe DSL, CuTile — names keep showing up in papers, blogs, and PR descriptions, and new ones are still being added. But place them on a continuous ladder from hardware to humans and the whole landscape clicks into focus — the lower you go, the closer to the machine, the higher the achievable performance, the more painful to write; the higher you go, the more it feels like plain Python, the faster the iteration, but the less control you have over the hardware.
This article uses that ladder as the spine to make the relationships clear: which tool lives at which layer, how they actually relate, how real production systems (vLLM and SGLang) mix them together, and — when you actually need to write a piece of GPU code — how to choose between them.
The Abstraction Ladder — five layers from assembly to Python DSL
Let’s first sort the pile of names by abstraction level from bottom to top. The diagram below is the cheat sheet for everything that follows — each layer’s role, representative tools, and place in the ecosystem; remember this skeleton first.
Walking down the ladder layer by layer:
Layer 5 · Python DSL (Triton / CuTile / CuTe DSL)
The highest abstraction. The shared trait: written in Python, with their own independent compilers that lower the high-level code directly to PTX — no CUDA C++ in between. This is the most important paradigm shift since 2019 — it pulled “writing GPU kernels” out of the C++ world.
- Triton (OpenAI, 2019) — by far the dominant entry today. The core idea is block/tile-level programming rather than thread-level: you schedule in tile units, and memory coalescing, shared-memory synchronization, warp allocation are all handed off to the compiler.
- CuTe DSL (NVIDIA, 2025) — a Python frontend over the underlying CuTe layout algebra; the goal is “the developer experience of Python with performance approaching CUTLASS C++.”
- CuTile / CUDA Tile IR (NVIDIA, 2025) — the official tile programming model introduced in CUDA 13.1, NVIDIA’s “we’ll do Triton ourselves” response.
Layer 4 · C++ template performance layer (CUTLASS / CuTe / ThunderKittens)
The motivation here is: writing main-event operators like matmul / attention in raw CUDA C++ is too painful and too error-prone, so high-performance patterns are packaged into reusable abstractions via C++ templates.
- CUTLASS (NVIDIA): CUDA Templates for Linear Algebra Subroutines; the name is a nod to the older closed-source library cuBLAS, and you can think of it as “an open-source, customizable BLAS.”
- CuTe (NVIDIA, from CUTLASS 3.0): CUDA Tensors, the layout algebra inside CUTLASS that describes “how data is laid out + how threads map to data.” It’s the foundation of CUTLASS, not a parallel library.
- ThunderKittens (Stanford Hazy Research, 2024): takes the opposite tack, asking how far a small set of “opinionated” abstractions can go. On H100 GEMM and attention kernels, it reportedly matches or beats strong baselines such as the CUTLASS-based FlashAttention-3, with far less code.
Layer 3 · CUDA C++ (.cu)
The foundation. Released in 2007 and for more than a decade the only practical option, it is still the root of the entire ecosystem. The model is SIMT: you organize parallelism explicitly across grid → block (CTA) → warp → thread, and you manage registers, shared memory, tiling, and synchronization yourself. Every upper-layer tool ultimately targets this execution model, and most GPU libraries are written in it.
In everyday usage “CUDA” usually refers to this language, although the word more broadly covers NVIDIA’s entire GPU computing platform (language + compiler + driver + libraries). Most application developers touch it only indirectly, through precompiled libraries like cuBLAS / cuDNN, rather than writing it themselves.
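To make the grid → block → warp → thread hierarchy concrete, here is a minimal plain-Python sketch of the index arithmetic a single thread performs in a 2D launch. The helper name `thread_coords` is hypothetical; the tuples stand in for CUDA's `blockIdx`, `blockDim`, and `threadIdx`, and the warp size of 32 is the value on current NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def thread_coords(block_idx, block_dim, thread_idx):
    """Return (row, col, warp_id, lane) for one thread of a 2D launch.

    Each argument is an (x, y) tuple mirroring blockIdx, blockDim, threadIdx.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    # Threads inside a block are linearized x-fastest, then split into warps.
    linear = thread_idx[1] * block_dim[0] + thread_idx[0]
    return row, col, linear // WARP_SIZE, linear % WARP_SIZE


# Thread (5, 3) of block (2, 1) in a launch with 16 x 16 threads per block.
print(thread_coords((2, 1), (16, 16), (5, 3)))  # (19, 37, 1, 21)
```

The point of the sketch is only that every one of these numbers is the programmer's responsibility at this layer; the higher layers exist largely to take this bookkeeping away.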
Layer 2 · PTX
Parallel Thread eXecution: NVIDIA’s virtual ISA, which serves as the intermediate representation. All upper-layer entries (CUDA C++, Triton, CuTe DSL) eventually become PTX; ptxas (run ahead of time, or its JIT counterpart in the driver at load time) then compiles that down to SASS for the specific architecture.
Essentially no one writes a full kernel from PTX. Its real use is: inside a .cu file, drop in a few instructions via asm volatile to access hardware features the C++ layer doesn’t expose — a specific async copy, a particular cache hint, or a new instruction the compiler hasn’t caught up to yet. PTX is a patch tool, not a development language.
Layer 1 · SASS
Streaming ASSembly — the real machine code for a specific architecture. You don’t write it; you usually only look at it for extreme profiling or reverse engineering.
With these five layers laid out mentally, the ladder is in place. Two more pieces of context make it usable: how compilation funnels these upper-layer entries into PTX, and a parallel set of NVIDIA-shipped closed-source libraries that sit alongside the ladder.
Compilation paths — three frontends, one funnel into PTX
The top three layers (3, 4, 5) offer three different source-language entries, but all paths funnel into PTX, the same virtual ISA, which ptxas (or the driver’s JIT compiler) then turns into SASS:
- CUDA C++ (.cu): including everything written with CUTLASS / CuTe templates, going through nvcc, the classical compilation chain.
- Triton: has its own MLIR-based compiler; lowers Python directly to PTX.
- CuTe DSL: also MLIR-based; lowers Python-written CuTe layout algebra to PTX.
Neither Triton nor CuTe DSL produces an intermediate .cu file — they are parallel source-language entries to CUDA C++, not built on top of it. This is also why Triton can coexist with cuBLAS / cuDNN and CUTLASS inside PyTorch without conflict: they all end up as the same GPU machine code; only the source differs.
Off the ladder — NVIDIA’s parallel closed-source libraries
The five layers above all live in the world of “source languages” — you write code, then compile it. But NVIDIA also maintains a completely different family of things you use via closed-source, precompiled, single-line API calls. The two most common:
- cuBLAS (CUDA Basic Linear Algebra Subroutines): general linear algebra, GEMM / BLAS.
- cuDNN (CUDA Deep Neural Network library): deep-learning-specific ops such as convolutions, pooling, normalization, attention.
When you use PyTorch, matmul defaults to cuBLAS and convolutions default to cuDNN — these have been the foundation of PyTorch’s performance for years. Distinct from the tools on the ladder: these libraries are black boxes; you can’t see inside or change them, but as long as your parameters are right, you get NVIDIA’s years of tuning for free.
CUTLASS is NVIDIA’s “open-source building-block version” of that closed-source family; the name itself nods to cuBLAS (CUDA Templates for Linear Algebra Subroutines). The functional domains overlap (both can do GEMM), but the usage is entirely different: cuBLAS is a precompiled API you call, while CUTLASS is template source code you instantiate and compile into your own kernel.
CUTLASS / CuTe earn a layer of their own because they package not “nicer-looking C++ templates” but a body of NVIDIA’s accumulated algorithmic assets: how to avoid bank conflicts, how to use TMA, how to arrange warps, how to do pipelining, all expressed and composable through CuTe’s layout algebra and CUTLASS’s template parameters. That is precisely why the main-event operators (GEMM, attention) cannot bypass this layer when squeezing for peak. A fitting analogy: CuTe / CUTLASS are to CUDA C++ what NumPy is to hand-written C loops. They change how quickly you can reach a given performance level and how much you need to understand to get there.
Code Comparison — five ways to write the same GEMM
The most direct way to feel this is to write the same task (a GEMM, C = A × B) five ways. The point isn’t to follow each line, but to feel how dramatically the amount of stuff you have to care about differs.
Way 1 · cuBLAS — black-box call, you just say “what to do”
// Create a handle, call one function, done. You don't care how it computes internally.
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f, beta = 0.0f;
// C = alpha * A * B + beta * C
// cuBLAS assumes column-major storage, so the leading dimensions are lda = M, ldb = K, ldc = M.
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,  // A, B not transposed
            M, N, K,                   // matrix dims
            &alpha, dA, M,             // input A
            dB, K,                     // input B
            &beta, dC, M);             // output C
cublasDestroy(handle);
What you care about: matrix dims, transpose flags, which pointer to use. That’s it. How to tile, whether to use Tensor Core, how threads are partitioned — NVIDIA decides for you, and you can’t see or change it. That’s “black-box.” cuDNN looks essentially the same, just with cudnnConvolutionForward(...) etc.
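One detail the call above does rely on is storage order: cuBLAS treats matrices as column-major, which is why the leading dimensions passed are M, K, and M. The following plain-Python sketch illustrates what that means for the flat buffer; the helper names `col_major_offset` and `row_major_offset` are illustrative and not part of any library.

```python
def col_major_offset(i, j, ld):
    """Offset of element (i, j) in a column-major buffer with leading dimension ld."""
    return i + j * ld


def row_major_offset(i, j, ld):
    """Offset of element (i, j) in a row-major buffer with leading dimension ld."""
    return i * ld + j


M, K = 2, 4
A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

# Pack A the way the cuBLAS call expects it: column-major with lda = M.
flat_A = [0] * (M * K)
for i in range(M):
    for j in range(K):
        flat_A[col_major_offset(i, j, M)] = A[i][j]
print(flat_A)  # [1, 5, 2, 6, 3, 7, 4, 8]: each column is contiguous

# The same buffer read as a row-major K x M matrix is the transpose of A.
A_T = [[flat_A[row_major_offset(j, i, M)] for i in range(M)] for j in range(K)]
print(A_T)     # [[1, 5], [2, 6], [3, 7], [4, 8]]
```

This is also why row-major callers often swap operands or set transpose flags when calling cuBLAS: a row-major buffer is, from the library's point of view, the transposed matrix.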
Way 2 · Plain CUDA C++ — you write it yourself, but naively
// Each thread computes one element of C (A, B, C are row-major here).
__global__ void gemm_naive(const float* A, const float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)  // write the dot-product loop yourself
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
// Launch, e.g.: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);
// gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
What you care about: how threads map to elements, how to write the loop. Note: this runs, but it’s terribly slow — no shared memory, no Tensor Core, no memory-access optimization. To make it fast, you’d manually add tiling, SMEM staging, bank-conflict handling — hundreds to thousands of lines, and you rewrite it for every new architecture. That’s why CUTLASS / CuTe exist.
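To see why tiling matters, it helps to count operand reads. The sketch below is a plain-Python model, not GPU code: `naive_gemm` mimics one thread per output element, `tiled_gemm` mimics one block per T × T output tile with operands staged once per tile step (a stand-in for shared memory). The `loads` counter is a deliberately crude proxy for global-memory traffic that ignores caches, and the function names are hypothetical.

```python
def naive_gemm(A, B, M, N, K):
    """One 'thread' per output element; every operand is re-read from global memory."""
    C = [[0.0] * N for _ in range(M)]
    loads = 0
    for row in range(M):
        for col in range(N):
            s = 0.0
            for k in range(K):
                s += A[row][k] * B[k][col]
                loads += 2  # one element of A, one of B
            C[row][col] = s
    return C, loads


def tiled_gemm(A, B, M, N, K, T):
    """One 'block' per T x T output tile; operands are staged once per tile step."""
    assert M % T == 0 and N % T == 0 and K % T == 0  # sketch: exact multiples only
    C = [[0.0] * N for _ in range(M)]
    loads = 0
    for m0 in range(0, M, T):
        for n0 in range(0, N, T):
            for k0 in range(0, K, T):
                # Stage a T x T tile of A and of B (stand-in for shared memory).
                a_tile = [[A[m0 + i][k0 + k] for k in range(T)] for i in range(T)]
                b_tile = [[B[k0 + k][n0 + j] for j in range(T)] for k in range(T)]
                loads += 2 * T * T
                for i in range(T):
                    for j in range(T):
                        for k in range(T):
                            C[m0 + i][n0 + j] += a_tile[i][k] * b_tile[k][j]
    return C, loads


M = N = K = 8
A = [[float(i + k) for k in range(K)] for i in range(M)]
B = [[float(k - j) for j in range(N)] for k in range(K)]
C1, naive_loads = naive_gemm(A, B, M, N, K)
C2, tiled_loads = tiled_gemm(A, B, M, N, K, T=4)
print(C1 == C2, naive_loads, tiled_loads)  # True 1024 256
```

In this model the naive version reads 2·M·N·K operands while the tiled version reads 2·M·N·K / T, so a tile width of T cuts traffic by a factor of T. Real kernels add bank-conflict handling, pipelining, and Tensor Core instructions on top, which is where the line count explodes.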
Way 3 · CuTe — describe “how data is laid out + who computes what” via layout algebra
CuTe’s essence: it gives you a vocabulary for declaratively describing data layout and thread mapping instead of hand-writing a pile of index math. Core concepts: Layout (shape + stride) and Tensor (data + Layout).
using namespace cute;
// Wrap raw memory as a layout-bearing Tensor.
// make_shape(M,K) is the shape; make_stride describes how it's laid out in memory.
Tensor mA = make_tensor(make_gmem_ptr(A), make_shape(M, K), make_stride(_1{}, M));
Tensor mB = make_tensor(make_gmem_ptr(B), make_shape(N, K), make_stride(_1{}, N));
Tensor mC = make_tensor(make_gmem_ptr(C), make_shape(M, N), make_stride(_1{}, M));
// Declare how to tile: each block handles a 128×128×8 slice.
auto block_tile = make_shape(Int<128>{}, Int<128>{}, Int<8>{});
// Use layouts to "slice" the global matrix into the chunk this block owns.
Tensor gA = local_tile(mA, block_tile, ...);
Tensor gB = local_tile(mB, block_tile, ...);
Tensor gC = local_tile(mC, block_tile, ...);
// Declare a "tiled MMA": which Tensor Core instruction, how warps are arranged.
TiledMMA mma = make_tiled_mma(SM80_16x8x8_F32F16F16F32_TN{}, ...);
// Then do multiply-accumulate over the layout-described tiles.
cute::gemm(mma, gA, gB, gC);
See the difference? No hand-written A[row*K+k] indices; instead, make_shape / make_stride / local_tile declare “what the data looks like and how it’s sliced”, and cute::gemm follows that declaration. That’s layout algebra: it orchestrates how data is laid out and partitioned, rather than calling a stock function for you.
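The core idea, a Layout as shape plus stride that maps a logical coordinate to a memory offset, can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: the `Layout` class and `tile_origin` helper are hypothetical names, real CuTe layouts are hierarchical and resolved at compile time, and the real local_tile returns a tensor view rather than a single offset.

```python
class Layout:
    """Toy stand-in for a CuTe Layout: a shape plus one stride per dimension."""

    def __init__(self, shape, stride):
        assert len(shape) == len(stride)
        self.shape, self.stride = tuple(shape), tuple(stride)

    def __call__(self, *coord):
        """Map a logical coordinate to a linear offset: sum(coord[d] * stride[d])."""
        assert all(0 <= c < s for c, s in zip(coord, self.shape))
        return sum(c * d for c, d in zip(coord, self.stride))


def tile_origin(layout, tile_shape, tile_coord):
    """Offset of the first element of tile `tile_coord` (rough analogue of local_tile)."""
    origin = tuple(t * c for t, c in zip(tile_shape, tile_coord))
    return layout(*origin)


M, K = 8, 4
col_major = Layout((M, K), (1, M))  # same shape/stride pattern as mA in the snippet
row_major = Layout((M, K), (K, 1))  # same logical matrix, different memory order

print(col_major(2, 3), row_major(2, 3))        # 26 11
print(tile_origin(col_major, (4, 2), (1, 1)))  # 20
```

The same logical coordinate (2, 3) lands at different offsets depending only on the stride, and the code that consumes the layout never changes. That separation between "what element" and "where in memory" is what lets one algorithm be retargeted across layouts and architectures.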
Way 4 · CUTLASS — assemble a full kernel by filling template parameters
CUTLASS sits on top of CuTe and lets you assemble a production-grade kernel by filling in template parameters, without building from CuTe atoms yourself.
using namespace cutlass::gemm;
// Don't write a loop; "declare" what each dimension of this GEMM should use:
using Gemm = device::GemmUniversal<
    cutlass::half_t, cutlass::layout::RowMajor,     // A: dtype + layout
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    float, cutlass::layout::RowMajor,               // C
    float,                                          // accumulate in float
    cutlass::arch::OpClassTensorOp,                 // use Tensor Core
    cutlass::arch::Sm90,                            // target arch: Hopper
    Shape<_128,_128,_64>,                           // block tile size
    Shape<_64, _64, _64>                            // warp tile size
    /* plus epilogue, pipeline stages, and a pile of other knobs */ >;
Gemm gemm_op;
gemm_op({M, N, K}, {dA, lda}, {dB, ldb}, {dC, ldc}, {alpha, beta});  // launch
What you care about: dtype, layout, target arch, tile size, warp partitioning, Tensor Core or not — all performance knobs, you set them. CUTLASS compiles those choices into a highly optimized kernel. A minimal Hopper WGMMA + TMA GEMM example is about 100 lines.
Way 5 · Triton — Python tile-level; hide the hardware details
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(A, B, C, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # This block owns a [BLOCK_M, BLOCK_N] slice of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Index vectors inside the tile (not threads: a slice of elements per block).
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Iterate along K, load a BLOCK_K-wide tile each step, do tile-level matmul.
    for k in range(0, K, BLOCK_K):
        k_idx = k + offs_k
        # Masks keep edge tiles in bounds when M, N, K are not multiples of the block sizes.
        a = tl.load(A + offs_m[:, None] * stride_am + k_idx[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (k_idx[None, :] < K), other=0.0)
        b = tl.load(B + k_idx[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(k_idx[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # compiler auto-lowers to Tensor Core MMA
    tl.store(C + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc,
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
The key insight: the whole code has no threadIdx, no __shared__, no synchronization primitives. You think in tiles — load a tile, do a tile-level matmul, accumulate — and the Triton compiler decides automatically: how threads map, how shared memory is staged, which Tensor Core instructions to emit, how memory coalescing is done. This is its biggest paradigm difference from CUDA C++.
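What remains for the programmer is the tile-level bookkeeping: how many program instances to launch, and which offsets in an edge tile are actually in bounds. Here is a plain-Python sketch of that arithmetic, assuming a 100 × 70 output and 32 × 32 tiles; `cdiv` and `tile_offsets` are illustrative helpers written for this example (Triton itself provides `triton.cdiv` and the `mask=` argument of `tl.load` / `tl.store` for the same purpose).

```python
def cdiv(a, b):
    """Ceiling division: number of tiles of size b needed to cover a elements."""
    return (a + b - 1) // b


def tile_offsets(pid, block, limit):
    """Offsets one program instance touches along an axis, plus an in-bounds mask."""
    offs = [pid * block + i for i in range(block)]
    mask = [o < limit for o in offs]
    return offs, mask


M, N = 100, 70
BLOCK_M, BLOCK_N = 32, 32

# One program instance per output tile, as with tl.program_id(0) / tl.program_id(1).
grid = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))
print(grid)  # (4, 3)

# The last tile along M starts at row 96 and has only 4 valid rows out of 32.
offs_m, mask_m = tile_offsets(grid[0] - 1, BLOCK_M, M)
print(offs_m[0], sum(mask_m))  # 96 4
```

Tile sizes and masks are roughly the full extent of what you reason about at this layer; thread mapping and shared-memory staging stay inside the compiler.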
Putting all five side by side:
| Style | What you write at the core | What you need to understand | One-liner |
|---|---|---|---|
| cuBLAS / cuDNN | Fill params, call 1 function | Matrix dims | Order from the menu, never enter the kitchen |
| Plain CUDA C++ | Hand-written thread indices + loops | Thread model (and everything else, for speed) | Start from flour, but with the dumbest recipe |
| CuTe | Layout algebra to declare data layout + tile + MMA | Data layout, Tensor Core instructions, warp mapping | A kitchen with proper professional tools |
| CUTLASS | Fill template params to assemble a kernel | Same as above, but with a pre-built scaffold | A high-performance meal kit you just season |
| Triton | Python tile-level, no threadIdx / SMEM sync | Tile size, memory-access patterns | Let the compiler lower to thread level for you |
The key intuition: with cuBLAS / cuDNN you say “I want a matmul,” and what’s inside is none of your business; with CuTe / CUTLASS / Triton you describe how the kernel should work internally, just using their high-level vocabulary (layout algebra / template params / tile abstractions) instead of writing raw indices.
Real Cases — FlashAttention and vLLM / SGLang
Two real projects to walk the ladder. FlashAttention shows the “vertical” evolution — the same operator descending and partially re-ascending the ladder as hardware generations roll over, each generation perched on “the highest-performance layer at the time.” vLLM and SGLang show the “horizontal” mix — within a single moment and a single system, different operators are placed on different layers, main operators routed to the best library and gaps filled by Triton. Put the two together and you essentially have the actual shape of industrial GPU programming today.
FlashAttention — a living fossil of NVIDIA’s programming evolution
If we have to pick one project to understand how this whole ecosystem evolved, it has to be FlashAttention. Each generation it changes hardware and changes the programming paradigm — it has walked through every major path in the stack. Read it as the “living fossil.”
The FA4 step deserves a callout: writing Python while still hitting C++-level performance is exactly the design goal of CuTe DSL. It gives main-event operators that live in the Python ecosystem, target bleeding-edge hardware, and must extract 95%+ of peak a path that doesn’t require falling back to C++. CuTe DSL and Triton fill different niches inside the Python world. Triton hides the hardware details (layout / TMA / WGMMA), so you don’t need to understand them to write code, with the trade-off that performance typically tops out at roughly 80-95% of a hand-tuned kernel. CuTe DSL exposes those details and requires you to understand them, in exchange for the control needed to reach 95%+. One serves everyday custom operators; the other serves the cases where the last 5-15% matters. FA4 picked the latter because attention on Hopper / Blackwell has to be pushed to peak: Triton wasn’t enough, and pure C++ templates were too costly.
vLLM and SGLang — the whole ladder, simultaneously, inside one system
vLLM and SGLang’s core strategy is to be a dispatch layer: hand work to a bunch of best-of-breed backends, and only fill in their own kernels in the gaps where no stock library reaches. They’re more like conductors than performers.
vLLM’s own README kernel list says it all: optimized attention kernels include FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, and Triton; optimized GEMM / MoE kernels use CUTLASS, TRTLLM-GEN, CuTeDSL; plus torch.compile for automatic kernel generation and graph-level transforms. That is seven distinct kernel sources, plus a compiler, in a single sentence.
SGLang’s strategy is essentially the same; its built-in attention backends are even more varied, with MLA (DeepSeek-style attention) alone offering FlashInfer MLA, FlashMLA, Cutlass MLA, TRTLLM MLA each with a different page_size. The auto-select logic is telling too — Hopper defaults to fa3, Blackwell defaults to trtllm_mha, other architectures default to flashinfer with triton as fallback. The fallback chain spells out the policy: specialized libraries first → general libraries next → Triton as the safety net.
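The fallback policy described above can be sketched as a small dispatcher. This is a hypothetical illustration, not SGLang's actual code: the names `DEFAULT_BY_ARCH`, `pick_attention_backend`, and `available` are invented here, the backend strings are the ones quoted in the text, and falling through from an architecture default to flashinfer and then triton is an assumption about how such a chain might be ordered.

```python
# Hypothetical sketch; not SGLang's actual implementation.
DEFAULT_BY_ARCH = {
    "hopper": "fa3",
    "blackwell": "trtllm_mha",
}


def pick_attention_backend(arch, available):
    """Return the first usable backend: arch default, then flashinfer, then triton."""
    chain = []
    if arch in DEFAULT_BY_ARCH:
        chain.append(DEFAULT_BY_ARCH[arch])  # specialized library first
    chain += ["flashinfer", "triton"]        # general library, then the safety net
    for name in chain:
        if name in available:
            return name
    raise RuntimeError(f"no attention backend available for {arch!r}")


print(pick_attention_backend("hopper", {"fa3", "flashinfer", "triton"}))  # fa3
print(pick_attention_backend("ampere", {"flashinfer", "triton"}))         # flashinfer
print(pick_attention_backend("blackwell", {"triton"}))                    # triton
```

The design point is that the ordering of the chain is the policy: the engine's value lies in knowing which backend to prefer where, not in owning the kernels.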
What did these two engines actually write themselves? Three categories:
- Dispatch / abstraction layer (their real proprietary value) — a pluggable attention/GEMM backend abstraction + the runtime logic that auto-picks the best one. That’s the real engineering moat, not the kernels themselves.
- Self-written Triton kernels — concentrated in two places: (a) operators that wrap their unique data structures (PagedAttention and KV cache fetch for paged KV cache), which no external library will write for you; and (b) cross-hardware fallbacks that guarantee it runs anywhere.
- Main high-performance operators — almost entirely outsourced to FlashAttention, FlashInfer, CUTLASS, CuTe DSL, TRT-LLM; they never rebuild these themselves.
This strongly confirms the earlier mental model: main-event operators (attention peak, GEMM) → CUTLASS / CuTe (in C++ or through CuTe DSL) and specialized libraries, because peak performance demands that layer; peripheral / adapter / fallback operators → Triton (write fast in Python, good enough, portable). Neither top-tier production system uses a single tool; both mix tools across layers, by operator importance and scenario. That’s what real-world GPU programming looks like: not picking one path, but using the entire ladder simultaneously inside one system.
Decision Map — how to pick a technical path
Compress everything into one decision map you can use. A single question chains the layers — at each level, ask: “Did the layer above really fall short?” Only descend if it did, because every step down compounds development cost and required hardware knowledge.
Three rules to actually use this map:
- Default upward, not downward. Always try the most convenient layer first; only descend a notch after measurement shows it isn’t enough. Don’t write CUTLASS just because you heard it’s fast — most of the time Triton, or even a stock library, suffices. Premature optimization wastes lives.
- It’s not “pick one path” — it’s “mix inside one system.” This is the biggest lesson from vLLM / SGLang: main operators (attention / GEMM) use specialized libraries and CUTLASS, peripheral and adapter operators use Triton, the fallback is also Triton. Real systems use the entire ladder simultaneously; the goal is to put each operator at its correct layer.
- Distinguish “call” vs “write,” “black box” vs “building blocks.” cuBLAS / cuDNN are black boxes (API calls, uncustomizable); CUTLASS / CuTe / Triton are building blocks (write code, customizable). FlashAttention and the like are finished products others built with the blocks and that you in turn use as a black box. Be clear about whether you want “use what exists” or “build my own,” and the tool choice stops being confusing.
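The "default upward, descend only after measurement" rule can be written down as a tiny decision procedure. This is a sketch under stated assumptions: `LADDER` and `choose_layer` are hypothetical names, the layer ordering follows the article's ladder, and the fractions of peak are made-up numbers standing in for your own benchmark results.

```python
# Layers ordered from most convenient to most demanding, following the ladder above.
LADDER = ["stock library", "Triton", "CUTLASS / CuTe (or CuTe DSL)", "inline PTX patch"]


def choose_layer(measured, target):
    """Return the layer to work at next.

    `measured` maps layer name -> achieved fraction of peak from your own
    benchmarks; layers that have not been tried yet are simply absent.
    """
    for layer in LADDER:
        if layer not in measured:
            return layer  # not tried yet: try it before going any lower
        if measured[layer] >= target:
            return layer  # good enough: stop descending
    return LADDER[-1]     # everything fell short: patch at the lowest level


print(choose_layer({}, 0.8))                                        # stock library
print(choose_layer({"stock library": 0.5}, 0.8))                    # Triton
print(choose_layer({"stock library": 0.5, "Triton": 0.85}, 0.8))    # Triton
print(choose_layer({"stock library": 0.5, "Triton": 0.7}, 0.9))     # CUTLASS / CuTe (or CuTe DSL)
```

In a real system this decision is made per operator rather than once per project, which is how a single codebase ends up spanning several layers at the same time.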
Closing — one-line summary
NVIDIA’s GPU programming landscape looks like a pile of names — PTX, CUDA C++, CUTLASS, CuTe, Triton, CuTe DSL, CuTile… But fundamentally it is a continuous ladder from hardware to humans: the bottom (PTX / SASS) is the exit, the middle (CUDA C++ + CUTLASS / CuTe) is the foundation and the extreme-performance tool, the top (Triton / CuTe DSL) is the Python-era fast entry. All paths converge on the same exit (PTX → SASS); the only difference is which layer you enter from.
Use stock libraries if you can. If you must write a kernel, default to Triton on the Python side and to CUTLASS on the C++ side; for extreme performance use CUTLASS / CuTe (or its Python frontend CuTe DSL); PTX is just a patch wherever a patch is needed.
Descend layer by layer, driven by measurement, mixing by operator — once you load this ladder into your mental model, you can read any project’s kernel directory structure and see exactly what they put at each layer and why.
References — official docs · representative projects · community discussion
NVIDIA official documentation
- CUDA Toolkit — CUDA documentation, PTX ISA reference, CUDA C++ Programming Guide
- CUTLASS — NVIDIA/cutlass on GitHub, CUTLASS 3.x docs, CuTe quick-start guide
- CUDA Tile IR / CuTile — CUDA 13.1 release notes, the CuTile programming model intro
Triton and PyTorch integration
- Triton project — triton-lang/triton on GitHub, Triton programming model paper (Tillet et al. 2019)
- TorchInductor — torch.compile docs, design notes for Triton as PyTorch 2.x’s default codegen backend
- Gluon — the “one level lower” DSL released by the Triton team, exposing tile layout / memory allocation and other low-level details
FlashAttention evolution
- FlashAttention 1 — Dao et al. (2022), “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”
- FlashAttention 2 — Dao (2023), “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”
- FlashAttention 3 — Shah et al. (2024), “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision”; uses Hopper WGMMA + TMA, reaches ~740 TFLOPS
- FlashAttention 4 — the CuTe DSL implementation in Dao-AILab/flash-attention, covering both SM90 and SM100
Representative projects
- vLLM — vllm-project/vllm, multi attention-backend abstraction and auto-select
- SGLang — sgl-project/sglang, the hardware-driven backend fallback chain
- FlashInfer — flashinfer-ai/flashinfer, the shared kernel router for inference engines
- Unsloth — unslothai/unsloth, Triton-rewritten training / LoRA kernels for popular LLMs
- Liger Kernel — linkedin/Liger-Kernel, LinkedIn’s Triton kernel set for training
- ThunderKittens — HazyResearch/ThunderKittens, Stanford Hazy Research’s “warp-centric” C++ embedded library
- llm.c — karpathy/llm.c, Karpathy’s pure C / CUDA training implementation, extremely valuable for teaching
Industry overviews and blogs
- 35 Modern GPU Kernel Frameworks — a survey blog that lays out the lineage of GPU kernel frameworks; the evolution arc CUDA C++ → Triton → CuTe → ThunderKittens → CuTile / CuTe DSL is drawn out clearly
- Tri Dao’s blog and talks — the FlashAttention author’s first-person account of “why I switched tools”
- GPU MODE community — the Discord community with the densest high-quality Triton / CUTLASS / CuTe discussion
Related on this blog
- A Decade of GPU Architecture Evolution and the Parallel Bloat of the CUDA Programming Model — the Pascal → Rubin hardware path, and how the CUDA programming model expanded from a single thread layer to five (grid → cluster → block → warp → thread)
- The AI Inference Chip Spectrum — Seven Gradients from General GPU to Model-Etched Silicon — a deep dive on the NVIDIA path at the spectrum’s leftmost gradient
- A Layer-by-Layer Walkthrough of LLM Inference — why attention / GEMM are “main-event operators” and where they fit in a single forward pass