GPU & CUDA — The Ultimate Guide · Architecture · Inference

Section 01 · The Machine

GPU & CUDA Architecture, from first principles.

Start with the silicon. Understand why a GPU is shaped the way it is, what every box in the die diagram actually does, and how the CUDA programming model lets a single C function ignite tens of thousands of threads in a few microseconds.

01. CPU vs GPU — the two philosophies of compute

To understand a GPU, you first have to understand what it is not. A CPU and a GPU both process data, but they are built around opposite design philosophies. They optimize for completely different things, and that is the single most important fact in this entire guide.

Analogy

Picture a CPU as a brilliant senior surgeon — it can perform any operation in the hospital, makes complex sequential decisions, but there's only a handful of them. Now picture a GPU as a stadium full of trained interns — each one can only do simple, identical steps, but ten thousand of them moving in unison can stitch up a city. The CPU optimizes per-thread latency; the GPU optimizes aggregate throughput.

CPU

Latency machine

4–128 powerful cores. Huge caches (often more than 50% of the die). Massive branch predictors and out-of-order execution. Goal: finish any single task as fast as possible, including unpredictable, branchy, sequential workloads.

GPU

Throughput machine

10,000–25,000 simple cores. Tiny caches per core. No branch prediction worth mentioning. Goal: finish millions of identical tasks per second by running them in massive parallel batches. Latency on any single thread is awful — aggregate throughput is staggering.

Why this matters for AI and graphics

Training a neural network, rendering a frame of a game, simulating fluid dynamics — these workloads all share the same DNA: do the same arithmetic on millions of data elements. Matrix multiplications, pixel shading, ray-triangle intersections — they decompose naturally into "perform operation X on element Y, where Y is independent of every other element". This is the only kind of work where you actually need ten thousand interns instead of one surgeon.

Key insight

The CPU spends most of its transistor budget on control logic — figuring out what to do next. The GPU spends most of its transistor budget on arithmetic units — actually doing the work. On a modern H100, roughly 90% of the die area is compute, not control. That ratio is the GPU's superpower.

02. The physical die — GPU → GPC → TPC → SM

A modern NVIDIA GPU is not one monolithic chip. It is a hierarchical fractal of smaller and smaller compute units, each level grouping the level below. Once you internalize this hierarchy, every spec sheet suddenly makes sense.

FIG 02.1 · Physical die hierarchy

From whole die down to a single Streaming Multiprocessor. Each level groups roughly an order of magnitude more compute.

The 5-level hierarchy

GPU DIE

The whole chip. A single physical piece of silicon (or, on Blackwell, two dies stitched together with NV-HBI). Contains every GPC, all caches, memory controllers, NVLink/PCIe interfaces, and dedicated engines.

GPC · Graphics Processing Cluster

A self-contained "mini-GPU" with its own raster engine. The H100 die has 8 GPCs. Conceptually you can think of a GPC as an autonomous worker team that can handle a slice of any workload independently.

TPC · Texture Processing Cluster

A pair of SMs that share a texture unit and some load/store hardware. The grouping comes from the GPU's graphics heritage — in compute workloads you mostly think one level above (GPC) and one level below (SM) and ignore the TPC.

SM · Streaming Multiprocessor

The fundamental compute building block. This is where your CUDA threads actually run. An SM contains 128 CUDA cores, 4 Tensor Cores, a register file, L1/shared memory, warp schedulers, and (on RTX parts) an RT core. This is the brain of the GPU.

CUDA Core

The smallest functional unit. Roughly equivalent to a single scalar arithmetic pipeline (FP32 ALU + INT32 ALU). Modern GPUs have between 14,000 and 25,000 of these. They are organized into groups of 32 that execute in lockstep — a warp.

Specialized engines

Sitting alongside the SMs: Tensor Cores (matrix math), RT Cores (ray tracing), TMA / Tensor Memory Accelerator (Hopper+), Decompression Engine (Blackwell), NVENC/NVDEC (video), and copy engines for async data movement.

A single H100 SXM5 chip contains 132 SMs × 128 CUDA cores = 16,896 cores running in parallel. The "144 SMs / 18,432 cores" number you sometimes see refers to the full GH100 silicon, but NVIDIA ships H100 with some SMs disabled for yield reasons.

03. Inside a Streaming Multiprocessor — the compute engine

If you only memorize one diagram from this entire guide, make it the SM. Everything — every kernel, every CUDA call, every Tensor Core operation — ultimately runs inside this little block of silicon. An SM is itself divided into 4 processing partitions, each operating semi-independently. The structure has been remarkably stable since Volta (2017) and only gets refined each generation.

FIG 03.1 · SM block diagram (Hopper / Blackwell SM)

Each partition is a near-independent vector machine. The four partitions share one 228 KB L1/Shared bank and (on RTX) one RT core.

The four partitions, explained

Why four partitions? Because one Tensor Core operation produces enough work to keep 32 CUDA cores busy for many cycles, and you want four such streams running concurrently inside each SM to maximize utilization. Each partition is a self-contained pipeline:

Warp Scheduler — the brain of the partition. Picks a ready warp (32 threads) every clock cycle and issues an instruction.
Dispatch Unit — sends the chosen instruction to the appropriate execution units.
Register File — 64 KB of registers (16,384 32-bit registers per partition, 65,536 per SM). Threads keep all their working variables here.
32 FP32 CUDA cores — the bread-and-butter math units.
16 FP64 cores + 32 INT32 cores — doubles and integers.
1 Tensor Core — the matrix-multiply monster (more on this later).
LD/ST — load/store units for memory operations.
SFU · Special Function Units — transcendentals like sin, cos, exp, sqrt, 1/x.

SM = the unit of scheduling

When you launch a CUDA kernel, the GPU's scheduler hands out thread blocks to SMs. Each SM can hold multiple blocks resident at once (up to 32 on Hopper, depending on resource usage). When one block stalls on memory, the SM instantly switches to another. This is the foundation of latency hiding.

04. The CUDA Core — what it actually is

The term "CUDA core" is one of the most misunderstood in all of GPU computing. NVIDIA uses it as a marketing-friendly proxy for "scalar arithmetic pipeline", but a CUDA core is not a complete processor in the way a CPU core is. It's much narrower.

The honest definition

A CUDA core is a single FP32 arithmetic pipeline — one lane in a 32-wide SIMT execution unit. It can execute one floating-point multiply-add per clock. It has no independent program counter (32 of them share one), no branch prediction, no instruction decoder. It exists to do math, fast, alongside 31 of its identical siblings.

What an FP32 CUDA core can do

1 FP32 FMA (fused multiply-add) per cycle — counts as 2 floating-point operations
1 FP32 addition or multiplication per cycle
Integer ops on Volta+ (when paired with its INT32 sibling unit, can run integer math simultaneously with FP32)
Logical/bitwise operations

What it cannot do alone: matrix multiplications (that's the Tensor Core), ray-triangle intersections (RT Core), reading textures (Texture Unit), evaluating sin/cos (SFU). When you write __expf(x) in your kernel, the work goes to the SFU, not the CUDA core.

Why we still count them

Despite the limitations, CUDA core count remains the single best first-order indicator of a GPU's general-purpose compute throughput. The math is simple:

// Theoretical FP32 throughput (TFLOPS)
peak_FP32 = num_CUDA_cores × boost_clock_GHz × 2
                                                ^ 2 FLOPs per FMA

// Example: H100 SXM5
peak_FP32 = 16896 × 1.98 × 2 = 66.9 TFLOPS

// Example: B200
peak_FP32 = ~18944 × ~2.45 × 2 ≈ 80 TFLOPS
            throughput math
          

A CUDA core is not equivalent to a CPU core. Roughly speaking, one CPU core ≈ one warp scheduler + one set of vector pipes, which is closer to half an SM. So an H100 has 132 × 4 = 528 "CPU-equivalent" execution contexts — not 16,896.

05. Thread → Warp → Block → Grid — the CUDA execution model

Now we step from hardware into software. CUDA gives you a programming abstraction that maps cleanly onto the physical hierarchy you just saw. Master this 4-level model and you can write any GPU kernel.

FIG 05.1 · CUDA execution hierarchy (software side)

CUDA software hierarchy. Hopper added an optional 5th level — the Thread Block Cluster — that groups blocks across SMs.

The model in one paragraph

A CUDA kernel launch creates a grid of blocks. Each block holds up to 1,024 threads. The GPU's block scheduler distributes blocks across the SMs; once a block lands on an SM, it stays there until done. Inside the SM, the threads in each block are partitioned into warps of exactly 32 threads. The warp is the actual unit of hardware execution — all 32 threads in a warp execute the same instruction at the same time, just on different data.

Level by level

Thread

The smallest unit. Has its own program counter (effectively), its own slice of the register file, and its own slot in shared memory if you ask for one. Inside your kernel, a thread can ask "who am I?" via threadIdx.x, and that's how you assign each thread its slice of the data:

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // every thread computes 1 element
  if (i < N) C[i] = A[i] + B[i];
}

// host side
vectorAdd<<<4096, 256>>>(d_A, d_B, d_C, N);
                ^^^^   ^^^
                grid   block
                (in blocks)  (in threads)
            vector-add kernel
          

Warp

32 consecutive threads inside a block, glued together by the hardware. The warp size has been 32 since the original Tesla architecture in 2006 — do not design code around any other warp size. The first 32 threads in your block (threadIdx 0–31) form warp 0; the next 32 form warp 1; and so on. A warp is the unit of scheduling and execution — the GPU never schedules an individual thread; it schedules a warp.

Golden rule

Always choose blockDim as a multiple of 32. If you use 33 threads per block, the hardware still allocates a full second warp with 31 lanes masked off — you pay the cost of 64 threads to do the work of 33.

Block (a.k.a. Cooperative Thread Array, CTA)

A group of threads guaranteed to run on the same SM at the same time. This co-residency is what enables the two superpowers of a block:

Shared memory — a tiny (~228 KB on Hopper, configurable), ultra-fast scratchpad that all threads in the block can read and write. Roughly 100× faster than HBM.
Barrier sync — __syncthreads() makes every thread wait at a line until all others arrive. Impossible to do across blocks; trivial inside one.

Max block size is 1,024 threads (= 32 warps). A typical good block size is 128, 256, or 512 threads.

Grid

The complete set of blocks launched by a single kernel call. Can be 1D, 2D, or 3D — just a convenience for thinking about the index space. Blocks in the same grid have no direct way to communicate or synchronize during the kernel (other than through global memory atomics or kernel completion). This independence is the source of the GPU's scalability: if your grid has 10,000 blocks and your GPU has 132 SMs, the scheduler just keeps feeding blocks to whichever SM is free.

Thread Block Cluster Hopper+

A new 5th level introduced in 2022 with the H100. A cluster is a group of up to 16 thread blocks that the hardware guarantees will be running concurrently on neighboring SMs, so they can directly read each other's shared memory (called distributed shared memory or DSMEM). This is a big deal for kernels that need cooperation across more threads than fit in a single block — like large GEMM tiles and attention kernels.

Analogy · the office building

Think of the grid as a whole skyscraper full of project teams. Each block is one team in one room (one SM) — they can pass paper around their conference table (shared memory) and shout "wait!" to each other (barrier sync). Each warp is a 32-person sub-pod in the team marching in lockstep. The thread is one person at their desk. The cluster is a few neighboring teams that have a fast internal phone line (DSMEM) to coordinate without going to the lobby (HBM).

06. SIMT — how 32 threads share one instruction

NVIDIA's name for their execution model is SIMT — Single Instruction, Multiple Threads. It's a clever twist on the older SIMD (Single Instruction, Multiple Data) model used by CPU vector units.

CPU SIMD (AVX-512)

One program, one vector register

You write code that explicitly says "operate on this 512-bit register containing 16 floats". Branches and data divergence are your problem — the compiler / programmer must arrange the data to fit the lane structure.

GPU SIMT

32 programs, hardware fuses them

You write code as if 32 independent threads each had their own logic. The hardware discovers at runtime that they share a PC and fuses their execution into one warp instruction. If they diverge (different branches), the hardware splits them and runs each path serially with the inactive lanes masked off.

Warp divergence

When threads in a warp take different paths through a branch, the warp diverges. The hardware handles this gracefully but at a real cost:

__global__ void divergentExample(int* data) {
  int i = threadIdx.x;
  if (data[i] % 2 == 0) {
    // Path A · taken by, say, lanes 0,2,4,...,30 (16 threads)
    data[i] = data[i] / 2;
  } else {
    // Path B · taken by lanes 1,3,5,...,31 (16 threads)
    data[i] = data[i] * 3 + 1;
  }
  // Warp re-converges here
}
            warp divergence
          

The hardware runs Path A first with 16 lanes active and 16 masked off; then runs Path B with the opposite mask. Total cost: 2× a normal warp, instead of the parallelism gain you'd expect. The lesson: design data layouts so consecutive threads do similar work. If lane 0's job is fundamentally different from lane 1's, you're fighting the architecture.

Independent Thread Scheduling Volta+

Pre-Volta hardware reconverged threads at the end of a divergent block automatically. Volta added per-thread program counters, allowing threads in a warp to run truly independently across long divergent code — at the cost of programmer responsibility for explicit reconvergence via __syncwarp(). This makes complex inter-thread coordination patterns possible (mutexes inside a warp, fine-grained producer-consumer), with the trade-off that you can now write deadlocks where Pascal would have just worked.

07. Warp scheduling & latency hiding — the GPU's secret weapon

Here is the most beautiful idea in GPU architecture: the GPU is happy to ignore memory latency because it always has other work to do. A CPU sees a cache miss and stalls. A GPU sees a cache miss and just runs a different warp.

FIG 07.1 · Warp scheduler hiding HBM latency

Eight resident warps per scheduler, one issued per clock. As long as some warp is ready, the execution units never stall.

An SM holds many resident warps

Up to 64 warps (= 2,048 threads) per SM on Hopper/Blackwell. Each scheduler tracks 16 of them. Crucially, this is far more threads than there are CUDA cores in the SM — that oversubscription is the whole point.
Every clock, the scheduler picks 1 "ready" warp

A warp is ready when its next instruction has all its operands available (registers ready, no pending memory load, no barrier blocking it). The scheduler dispatches that warp's instruction to the appropriate execution unit.
Stalled warps cost nothing

When a warp issues a load from HBM and has to wait ~400 ns for the data, the scheduler just picks a different ready warp. The stalled warp's registers still occupy space in the register file, but no execution units are wasted.
High occupancy = good latency hiding

If you have 32+ warps per scheduler, the probability that at least one is ready every cycle is ~100%, and the GPU runs at peak throughput. Low occupancy (few warps per SM) is one of the most common performance problems.

Why this is genius

A CPU spends massive transistor budgets on cache hierarchies and out-of-order execution to avoid memory stalls. A GPU just runs other work during the stall. Both approaches hide latency — but the GPU's solution is far cheaper per transistor, which is why a GPU can pack so many more arithmetic units onto a die.

08. The memory hierarchy — from registers to HBM

Almost every GPU performance problem is, ultimately, a memory problem. The arithmetic units are insanely fast; the memory has to keep up. Understanding the hierarchy — six levels, each ~5–30× slower than the one above — is essential.

FIG 08.1 · Memory hierarchy on a Blackwell SM

Latency numbers are approximate cycle counts on Hopper/Blackwell. Bandwidth/capacity grows as you go down; latency grows even faster.

Level by level

Registers · ~1 cycle

Per-thread storage. ~256 KB total per SM (65,536 × 32-bit), partitioned across all resident threads. The fastest memory there is. The compiler decides what lives here.

L1 / Shared · ~30 cycles

Combined 228 KB scratchpad per SM. The split between "L1 cache" (hardware-managed) and "Shared memory" (programmer-managed) is configurable. Shared memory is the GPU's killer feature for cooperative algorithms.

TMEM Blackwell

256 KB of Tensor Memory per SM, brand new in Blackwell. A dedicated buffer the Tensor Cores read accumulator state from. Frees up shared memory for other uses during heavy matmul kernels.

L2 Cache · ~200 cycles

Shared across the entire GPU. 50 MB on H100, 126 MB on B200, >100 MB on RTX PRO 6000. Massive last-line cache that helps amortize HBM bandwidth across SMs. Hopper added L2 residency hints.

HBM3 / HBM3e · ~500 cycles

The GPU's main RAM. High Bandwidth Memory: stacks of DRAM dies bonded directly to the GPU package via silicon interposer. 3.35 TB/s on H100, 4.8 TB/s on H200, 8 TB/s on B200. The bandwidth here determines how fast your kernels can really go.

Host RAM · 10,000+ cycles

Regular DDR5 attached to the CPU. Crossing PCIe Gen5 (~64 GB/s) or NVLink-C2C (Grace Hopper, ~450 GB/s) to get to it. Avoid touching during a kernel; transfer data upfront and keep it on the GPU.

Coalesced access — the #1 optimization

When a warp issues a global memory load, the hardware coalesces the 32 thread requests into the smallest set of 32-, 64-, or 128-byte transactions possible. The best case is when the 32 threads read 32 consecutive 4-byte elements: one 128-byte transaction serves the whole warp.

// COALESCED · ideal
float x = data[threadIdx.x];          // 32 threads read 32 contiguous floats → 1 transaction

// NON-COALESCED · 32× more transactions
float x = data[threadIdx.x * 128];    // 32 threads stride by 128 → 32 separate transactions
            memory access patterns
          

The bandwidth difference between these two patterns can easily be 30×. Most "my GPU kernel is slow" problems are non-coalesced loads in disguise.

09. Tensor Cores — the matrix-multiply specialists

Introduced with Volta in 2017, the Tensor Core is arguably the single most important hardware unit of the AI era. Where a CUDA core does one multiply-add per cycle, a Tensor Core does an entire small matrix multiply per cycle. The throughput-per-transistor difference is enormous, which is why AI on GPUs took off the moment Tensor Cores arrived.

FIG 09.1 · The Tensor Core MMA operation

D = A·B + C, where A and B are low-precision matrices, C and D are higher-precision accumulators. One Tensor Core instruction = thousands of multiply-adds.

Tensor Core generations

Gen	First arch	Year	New formats	Notable feature
1st	Volta (V100)	2017	FP16 · FP32 accum	The original; 4×4×4 MMA
2nd	Turing (T4 / RTX 20)	2018	+ INT8, INT4	Consumer Tensor Cores
3rd	Ampere (A100 / RTX 30)	2020	+ TF32, BF16, sparsity	2:4 structured sparsity 2×
4th	Hopper (H100 / H200)	2022	+ FP8 (E4M3, E5M2)	Transformer Engine
5th	Blackwell (B200 / B300 / RTX PRO)	2024–25	+ FP6, FP4 (NVFP4)	2nd-gen Transformer Engine, TMEM

The precision menu

FP32 (single) accum

8-bit exponent, 23-bit mantissa. The classic "float". Used as the high-precision accumulator for Tensor Core outputs.

TF32 Ampere+

8-bit exponent (FP32 range), 10-bit mantissa (FP16 precision). Drops 13 bits of precision for free 2× throughput. Default for training on A100+.

BF16

8-bit exponent, 7-bit mantissa. FP32-like range, ~3-digit precision. Almost universally the format used for training large LLMs since 2022.

FP16

5-bit exponent, 10-bit mantissa. Smaller range than BF16 — can overflow during training without care. Still widely used for inference.

FP8 Hopper+

Two variants: E4M3 (better precision, narrower range) and E5M2 (wider range, less precision). 2× throughput over FP16. Standard for state-of-the-art inference.

FP4 / NVFP4 Blackwell

4 bits per number. Two-level scaling (per-block FP8 scale + per-tensor FP32 scale) preserves dynamic range. 2× throughput over FP8. Blackwell's headline inference format.

Throughput in numbers

Format	H100 SXM5 (dense)	B200 (dense)	RTX PRO 6000 (dense)
FP32 (CUDA cores)	67 TFLOPS	~80 TFLOPS	~125 TFLOPS
TF32	989 TFLOPS	2.2 PFLOPS	~500 TFLOPS
BF16 / FP16	1.98 PFLOPS	4.5 PFLOPS	~1 PFLOPS
FP8	3.96 PFLOPS	9 PFLOPS	~2 PFLOPS
FP4 / NVFP4	—	18 PFLOPS	~4 PFLOPS

Transformer Engine

Hopper introduced an NVIDIA software library + hardware support that dynamically chooses FP8 vs FP16 per layer during training. It tracks tensor statistics and decides which precision keeps quality without overflow. The 2nd-gen Transformer Engine on Blackwell extends this to FP4/NVFP4. This is one big reason FP8 training "just works" on Hopper while older hardware needed manual loss scaling.

10. RT Cores — ray tracing in silicon

Introduced with Turing in 2018, RT Cores are dedicated hardware for two operations: ray/box intersection (BVH traversal) and ray/triangle intersection. Without them, ray tracing requires hundreds of CUDA-core ops per ray; with them, it's a handful of cycles per intersection test.

Generations

1st gen (Turing, 2018) — the original. Real-time ray tracing in games becomes feasible.
2nd gen (Ampere, 2020) — 2× ray-triangle throughput, motion blur acceleration.
3rd gen (Ada, 2022) — another 2× ray-triangle; opacity micromaps; displaced micro-meshes.
4th gen (Blackwell, 2024) — 2× again on triangle intersect; RTX Mega Geometry; linear swept spheres for hair/foliage.

RT Cores are only present in RTX-branded parts (consumer RTX, Workstation RTX PRO). Data-center cards like A100, H100, B200 have no RT Cores — they're not designed for graphics.

A full discussion of how rays interact with the BVH and what the RT Core actually computes lives in Section 03 on the graphics pipeline. For now, just remember: RT Cores are a third type of execution unit alongside CUDA Cores and Tensor Cores, each specialized for a different category of work.

11. Eighteen years of NVIDIA architectures

The story of NVIDIA's architectural evolution is the story of modern parallel computing. Each generation has typically delivered ~2× throughput per dollar, while adding one major new capability that reshaped what GPUs could do. Here's the full timeline.

Arch	Year	Flagship	Process	What it introduced
Tesla	2006	G80 / 8800 GTX	90nm	The original CUDA-capable GPU. Unified shaders.
Fermi	2010	GF100 / GTX 480	40nm	True caches (L1/L2), ECC, double-precision math, IEEE-754.
Kepler	2012	GK110 / Tesla K20	28nm	Dynamic parallelism, Hyper-Q, GPUDirect RDMA.
Maxwell	2014	GM200 / Titan X	28nm	Massive efficiency gains. Unified L1/Shared.
Pascal	2016	GP100 / P100	16nm	HBM2, NVLink 1.0, FP16 throughput, unified memory.
Volta	2017	GV100 / V100	12nm	Tensor Cores (1st gen). Per-thread program counters. The AI era begins.
Turing	2018	TU102 / RTX 2080 Ti	12nm	RT Cores, 2nd-gen Tensor Cores. DLSS 1.0.
Ampere	2020	GA100 / A100	7nm	3rd-gen Tensor (TF32, BF16, 2:4 sparsity), MIG partitioning, 2nd-gen RT.
Ada Lovelace	2022	AD102 / RTX 4090	4N	3rd-gen RT, DLSS 3 Frame Generation. Consumer-only.
Hopper	2022	GH100 / H100	4N	4th-gen Tensor (FP8), Transformer Engine, TMA, DPX, Thread Block Clusters.
Hopper (refresh)	2024	GH200 / H200	4N	Same SM as H100; HBM3e: 141 GB · 4.8 TB/s. Inference-tuned.
Blackwell	2024	GB200 / B200	4NP	5th-gen Tensor (FP4/FP6), TMEM, dual-die NV-HBI, NVLink 5.0, Decompression Engine.
Blackwell Ultra	2025	GB300 / B300	4NP	288 GB HBM3e, 15 PFLOPS FP4 dense, 1100 W TDP.
Blackwell (consumer/pro)	2025	GB202 / RTX PRO 6000	4NP	4th-gen RT, neural shaders, DLSS 4, 96 GB GDDR7. RTX 50-series silicon.

"4N" and "4NP" are TSMC custom processes — not the same as TSMC N4 / N4P. Marketing names; both physically derive from the 5nm class. Process names are not transistor sizes; the geometry hasn't matched the name in over a decade.

12. H100 deep dive — the workhorse of the AI boom

Announced March 2022. From early 2023 through 2025 the H100 was, by revenue and shipments, the most successful AI accelerator in history. Every frontier lab trained their flagship models on tens of thousands of H100s. If you've used ChatGPT, Claude, or Gemini in the last two years, you've talked to an H100.

Architecture

The GH100 die has 144 SMs across 8 GPCs, fabricated on TSMC's 4N process — 80 billion transistors on a 814 mm² die. The shipping H100 SXM5 SKU enables 132 SMs for yield, giving 16,896 CUDA cores and 528 fourth-generation Tensor Cores. PCIe-form-factor H100 is more cut down at 114 SMs.

Each SM follows the standard four-partition layout we covered in Block 03, with 128 CUDA cores, 4 Tensor Cores, 228 KB of combined L1/Shared, and the new TMA (Tensor Memory Accelerator) for async global-to-shared transfers. The Tensor Cores are 4th gen, adding native FP8 (E4M3 and E5M2) which delivers 2× the throughput of FP16 with surprisingly graceful accuracy.

Memory & Interconnect

80 GB of HBM3 at 3.35 TB/s across five active stacks of a 5,120-bit memory bus. 50 MB of L2 cache. 18 NVLink 4.0 links delivering 900 GB/s of GPU-to-GPU bandwidth — the foundation of every 8-GPU DGX node and HGX server.

The features that matter

FP8 Tensor Cores + Transformer Engine — doubled training throughput overnight.
Thread Block Clusters & DSMEM — cross-SM shared memory for huge GEMM/attention tiles.
TMA — async DMA between HBM and shared memory; frees CUDA cores from address arithmetic.
DPX instructions — hardware acceleration for dynamic programming (genomics, route planning).
Confidential Computing — trusted execution environment with attestation.

The H200 refresh (2024) keeps the exact same GH100 die but pairs it with 141 GB of HBM3e at 4.8 TB/s. Same compute; 1.4× the bandwidth and 1.76× the capacity. A targeted inference upgrade.

13. B200 deep dive — the dual-die jump

Announced March 2024, shipping in volume through 2025. Blackwell B200 is the most ambitious single-package GPU NVIDIA has ever built — two reticle-limit dies bonded into one logical GPU, with a new die-to-die fabric (NV-HBI) delivering 10 TB/s between them. From software's perspective it presents as one device.

Architecture

Each die is fabricated on TSMC 4NP and holds 104 B transistors. The complete B200 package is 208 B transistors — 2.6× more than H100. The shipping configuration enables 148 SMs across 8 GPCs, holding ~18,944 CUDA cores and 592 fifth-generation Tensor Cores.

The SM has been heavily reworked: the 5th-gen Tensor Core adds FP4 and FP6 formats, including NVFP4 with its two-level micro-block scaling. A new Tensor Memory (TMEM) — 256 KB per SM — serves as the dedicated accumulator buffer for Tensor Core output, freeing up shared memory. Async copy paths through the Tensor Memory Accelerator are also faster.

Memory & Interconnect

192 GB of HBM3e across 8 stacks delivering a staggering 8 TB/s. The L2 cache grows to 126 MB. NVLink 5.0 doubles bandwidth to 1.8 TB/s per GPU, and the new NVLink Switch System lets racks scale to 72 fully-connected GPUs (GB200 NVL72) where the entire 72-GPU domain looks like one giant accelerator.

New engines

2nd-gen Transformer Engine — native FP4 / NVFP4 path for both training and inference.
Decompression Engine — hardware decompression (LZ4, Snappy, Deflate) for ETL/analytics workloads.
RAS Engine — predictive reliability monitoring across the 208 B transistors.
Confidential Computing v2 — extended trust model with TEE-encrypted NVLink.

The Blackwell Ultra refresh (B300, 2025) pushes further: 288 GB HBM3e, 15 PFLOPS dense FP4, 1100 W TDP. Same architecture, more of everything.

14. RTX PRO 6000 Blackwell — the workstation flagship

The RTX PRO 6000 Blackwell Workstation Edition (2025) is the highest-end professional graphics card NVIDIA ships — the spiritual successor to the Quadro RTX line. It's built on the same GB202 silicon as the consumer RTX 5090, but with full chip enablement, 4× the VRAM, ECC memory, and certified workstation drivers.

Architecture

The GB202 die is single-die (unlike the dual-die B200) and shares the SM architecture: 5th-gen Tensor Cores with full FP4 support, but unlike B200 it also includes 4th-gen RT Cores — data-center Blackwell (B200) has none. The RTX PRO 6000 enables the full die: 188 SMs · 24,064 CUDA cores · 752 Tensor Cores · 188 RT Cores.

Memory & Interconnect

96 GB of GDDR7 with ECC on a 512-bit bus delivering 1.8 TB/s. No HBM — GDDR7 is cheaper and still very fast. No NVLink either — this is a single-GPU workstation card. PCIe Gen5 x16 (~64 GB/s) to the host.

Who it's for

The 96 GB capacity is the killer feature: you can fit Llama 3 70B at FP4, or fine-tune mid-sized models, or run heavy 3D / simulation / video production tools, on a single workstation. List price around USD 8,500 — one-tenth of an H100, with comparable inference performance for smaller models. For solo AI developers and creative pros, it's currently the most capable single workstation GPU on Earth.

15. Side-by-side — the five GPUs in one table

Pin this table. Most of the time, choosing the right NVIDIA GPU for a given workload is just a matter of reading down one column.

Spec	H100 SXM5	H200 SXM	B200	B300 (Ultra)	RTX PRO 6000 BW
Architecture	Hopper	Hopper	Blackwell	Blackwell Ultra	Blackwell (cons)
Release	2022	2024	2024	2025	2025
Process	TSMC 4N	TSMC 4N	TSMC 4NP	TSMC 4NP	TSMC 4NP
Transistors	80 B	80 B	208 B (2-die)	208 B (2-die)	~92 B
SMs	132	132	148	148	188
CUDA cores	16,896	16,896	~18,944	~18,944	24,064
Tensor Cores	528 · 4th	528 · 4th	592 · 5th	592 · 5th	752 · 5th
RT Cores	—	—	—	—	188 · 4th
FP32 peak	67 TFLOPS	67 TFLOPS	~80 TFLOPS	~85 TFLOPS	~125 TFLOPS
BF16/FP16 dense	1.98 PF	1.98 PF	4.5 PF	5 PF	~1 PF
FP8 dense	3.96 PF	3.96 PF	9 PF	10 PF	~2 PF
FP4 / NVFP4 dense	—	—	18 PF	15 PF*	~4 PF
Memory capacity	80 GB HBM3	141 GB HBM3e	192 GB HBM3e	288 GB HBM3e	96 GB GDDR7 ECC
Memory bandwidth	3.35 TB/s	4.8 TB/s	8 TB/s	8 TB/s	1.8 TB/s
L2 cache	50 MB	50 MB	126 MB	126 MB	~96 MB
NVLink	900 GB/s (4.0)	900 GB/s (4.0)	1.8 TB/s (5.0)	1.8 TB/s (5.0)	None
Host interconnect	PCIe Gen5 x16	PCIe Gen5 x16	PCIe Gen5 x16	PCIe Gen5 x16	PCIe Gen5 x16
TDP	700 W	700 W	1000 W	1100 W	600 W
Form factor	SXM5	SXM	SXM (HGX)	SXM (HGX)	Dual-slot PCIe
Cooling	Liquid (DGX)	Liquid	Liquid	Liquid	Air or liquid

* B300 specs reflect NVIDIA's announced "Blackwell Ultra" figures; some sources state higher per-GPU FP4 (up to 15 PFLOPS dense / 30 PFLOPS sparse). The 18 PFLOPS for B200 is the dense figure NVIDIA quotes for the standard SXM SKU. Final SKU numbers may vary slightly by configuration.

16. NVLink & interconnect — scaling beyond one GPU

For models that exceed one GPU's memory, or training runs that need thousands of GPUs cooperating tightly, the interconnect matters as much as the GPU itself. NVIDIA's bet has been to treat the interconnect as part of the architecture, not an afterthought. NVLink, NVSwitch, and the NVLink Switch System are the result.

PCIe Gen5 x16

~64 GB/s. The standard way the GPU talks to the host CPU. Adequate for batched data movement and storage I/O; far too slow for fine-grained collective ops across GPUs.

NVLink 4.0 Hopper

900 GB/s per GPU, ~18 links of 50 GB/s each. Connects the 8 GPUs in an HGX H100 baseboard through NVSwitches into an all-to-all topology.

NVLink 5.0 Blackwell

1.8 TB/s per GPU — exactly 2× NVLink 4.0. 18 links × 100 GB/s. Foundation of the NVL72 rack-scale system.

NVSwitch

A dedicated networking ASIC that creates a non-blocking fabric between 8 (Hopper) or up to 72 (Blackwell) GPUs. Every GPU sees every other GPU at full NVLink bandwidth.

NVLink Switch System NVL72

Blackwell-era rack architecture. 72 B200 GPUs in one liquid-cooled rack, fully NVLink-connected at 1.8 TB/s each. Looks like a single 13.8 TB GPU. Designed for trillion-parameter training and inference.

InfiniBand / Spectrum-X

~400–800 Gb/s per port. Connects racks together into a SuperPOD — 10,000+ GPUs cooperating on a single training job. RDMA enables direct GPU-to-GPU writes across nodes.

Why NVLink is a moat

Modern training relies on heavy collective communication (all-reduce, all-gather) every step. At trillion-parameter scale, every gradient update is gigabytes of cross-GPU traffic. AMD and Intel have competitive raw FLOPS, but no one ships an interconnect like NVLink Switch + NVSwitch at NVIDIA's scale. This is why NVIDIA still owns the data center: it's not just the GPU, it's the whole rack.

17. CUDA programming — tying it all together

We've covered the hardware, the execution model, the memory hierarchy, and the specialty units. Let's end with a concrete look at the software side: what a CUDA kernel actually looks like, and how the layered software stack maps your Python down to the SASS instructions an SM executes.

A real kernel: SAXPY

SAXPY (Single-precision A·X Plus Y) is the "hello, world" of CUDA: Y = a·X + Y where X and Y are large arrays and a is a scalar. It's the simplest non-trivial kernel that exercises the full system.

__global__ void saxpy(int N, float a, const float* X, float* Y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) Y[i] = a * X[i] + Y[i];
}

int main() {
  const int N = 1 << 20;             // 1,048,576 elements
  float *d_X, *d_Y;
  cudaMalloc(&d_X, N * sizeof(float));
  cudaMalloc(&d_Y, N * sizeof(float));

  // ... fill X, Y ...

  int threads = 256;
  int blocks  = (N + threads - 1) / threads;   // 4,096 blocks
  saxpy<<<blocks, threads>>>(N, 2.0f, d_X, d_Y);
  cudaDeviceSynchronize();

  cudaFree(d_X);
  cudaFree(d_Y);
}
            saxpy.cu — the simplest real CUDA program
          

What the GPU does with this

Host submits a launch

The <<<blocks, threads>>> syntax compiles to a runtime call that queues a kernel launch on the GPU's command stream. The CPU returns immediately — the GPU runs asynchronously.
Block scheduler distributes work

The 4,096 blocks are dispatched to the 132 SMs of an H100. Each SM holds multiple blocks resident at once based on resource usage; the rest wait in queue and stream in as earlier blocks finish.
Warps form and execute

Within each block, threads 0–31 form warp 0, 32–63 form warp 1, and so on. The hardware schedules and executes warps, not individual threads.
Blocks land on SMs · warps run SIMT

The grid scheduler distributes blocks across the SMs. Inside each SM, a warp scheduler picks ready warps each cycle and issues them to the CUDA cores. With ~1 million threads and 132 SMs, every SM gets thousands of threads — plenty for latency hiding.
Memory access is coalesced

Because consecutive threads read consecutive array indices, each warp's loads collapse into a single 128-byte HBM transaction. This kernel runs near peak memory bandwidth.

CUDA software layers

CUDA Runtime API

The high-level C/C++ API (cudaMalloc, cudaMemcpy, <<<...>>>). What 99% of users actually write. Comes with a hot-loaded driver underneath.

CUDA Driver API

Lower-level. Lets you load PTX/cubin modules dynamically, manage contexts, do things runtime can't. Used by frameworks like PyTorch internally.

PTX

"Parallel Thread Execution" — NVIDIA's virtual ISA. Like LLVM IR for GPUs. NVCC emits PTX; the driver JITs it down to actual SASS instructions for the specific architecture.

SASS

The actual binary instructions the GPU executes. Architecture-specific (different for Hopper vs Blackwell). What you see in cuobjdump --dump-sass.

cuBLAS / cuDNN / CUTLASS

NVIDIA's hand-tuned libraries. cuBLAS = linear algebra; cuDNN = deep-learning primitives (convolutions, attention); CUTLASS = templatized GEMM kernels you can compose. PyTorch, JAX, and TensorRT call these internally.

Triton · CUTLASS Python

Modern compiler-driven layers that let you write GPU kernels in Python-ish DSLs and get near-cuBLAS performance. Triton was the breakthrough — FlashAttention was first written in it.

The takeaway

CUDA is not just an API — it's a complete ecosystem: hardware (SMs, Tensor Cores) + compiler (NVCC, PTX) + libraries (cuDNN, cuBLAS, NCCL) + frameworks (PyTorch, JAX, TensorRT) + cluster software (Magnum IO, NCCL). The hardware is impressive, but the moat NVIDIA has built around the hardware is even bigger.

Section 02 · The Brain

LLM Inference — what actually happens between your prompt and the first token.

You type a question. Two hundred milliseconds later, words start streaming back. Inside that gap, ten thousand Tensor Core operations have fired, eighty gigabytes of weights have been read out of HBM, and a key-value cache has started growing in memory. Let's walk through every step.

01. Training vs Inference — two very different jobs

Before we dive into how inference works, you have to separate it sharply from training. They share the same model architecture but stress the GPU in completely opposite ways.

TRAINING

Compute-bound · long-running

Run forward pass → compute loss → backpropagate gradients → update weights. Each step processes a batch of hundreds or thousands of sequences in parallel. Storage requirement is huge (optimizer states + gradients + activations), but Tensor Cores stay saturated. A GPT-4-class run takes thousands of GPUs for months.

INFERENCE

Memory-bound · interactive

Just the forward pass. No gradients, no backward. But you do it one token at a time in a loop, and most of the time you're waiting on memory rather than compute. Latency, not throughput, is what users feel. A single user's request can run on 1–8 GPUs for milliseconds-to-seconds.

Why this distinction matters

The bottleneck in training is FLOPS. The bottleneck in inference is memory bandwidth. Two different optimization problems, two different sets of tricks, two different ideal GPU profiles — which is why the H200 (more bandwidth, same compute) was built specifically for inference, and why FP4 (Blackwell) is mostly about inference economics.

02. Tokenization — from your text to numbers

The first thing that happens to your prompt: it gets chopped into tokens. A token is a small chunk of text — usually a sub-word. Modern LLMs use a vocabulary of typically 32K–200K tokens. Each token becomes a single integer ID.

# Example with a real tokenizer
prompt = "The mitochondria is the powerhouse of the cell."

# tiktoken (GPT-4 tokenizer) produces:
token_ids = [791, 93578, 21436, 374, 279, 2410, 7830, 315, 279, 2849, 13]
tokens    = ["The", " mitochondria", " is", " the", " power", "house", " of", " the", " cell", "."]

# Rule of thumb: 1 token ≈ 0.75 English words ≈ 4 characters
            tokenization
          

The tokenizer is built using BPE (Byte-Pair Encoding) or Unigram algorithms during model training — it learns the most efficient splits to compress the training corpus. Once trained, it's a fixed lookup table. Every inference run uses the exact same tokenizer.

"Mitochondria" is one token in GPT-4's tokenizer because it appears often enough in training to earn its own slot. Rarer compound words get split into 2–5 tokens. Non-English text usually packs less densely (more tokens per word), which is why translations to French or Tamil "cost more" per query.

03. Inside a transformer block — the layer that's repeated 80 times

A modern LLM is, structurally, surprisingly simple: an embedding lookup, a stack of identical transformer blocks (often called layers), and an output projection. Llama 3 70B has 80 of these blocks. GPT-3 has 96. They are all variations on the same template.

FIG 03.1 · One transformer block

Pre-norm transformer block with multi-head attention + feed-forward, both wrapped in residual connections. Llama, GPT, Claude all use variants of this exact structure.

The two halves

ATTENTION

"Look at the other tokens in the sequence and decide which ones matter for me." Mixes information across positions. This is where context comprehension happens. ~30–40% of weights.

FEED-FORWARD (MLP)

"Independently process each token." A 2-layer MLP applied to every token's hidden vector. Inflates 4–5× wider in the middle then projects back. ~60–70% of weights. Where most knowledge lives.

RMSNorm

Replaces the older LayerNorm. Normalizes the activations to roughly unit variance. Cheap, no learnable mean — just a per-feature scale.

Residual (skip)

The dashed line. Adds the block's input back to its output. Critical for training deep networks — lets gradients flow back 80+ layers without vanishing.

04. The attention mechanism — in detail

Attention is the single most important algorithm in modern AI. Once you understand its three vectors — Q (query), K (key), V (value) — the rest of inference falls into place.

Analogy · the library

Each token in your sequence walks into a library. It carries a query ("here's what I want to know"). On every shelf is a book labeled with a key ("here's what this book is about") and containing a value ("here's the actual information inside"). The token's query is compared against every key; the keys that score high get their values mixed in. The output for each token is a weighted sum of values, with the weights coming from query-key similarity.

The math, slowly

Each token's hidden vector (~4,096 dimensions in a 70B model) gets projected into three vectors via learned weight matrices W_Q, W_K, W_V:

Q = x · W_Q     # shape: [seq, d_head]    "what am I looking for?"
K = x · W_K     # shape: [seq, d_head]    "what do I represent?"
V = x · W_V     # shape: [seq, d_head]    "what info do I carry?"

# Attention scores: dot every query with every key
scores = Q @ K.T / sqrt(d_head)        # shape: [seq, seq]

# Causal mask: a token can only attend to past tokens (autoregressive)
scores = mask_future_positions(scores)

# Softmax: turn scores into a probability distribution per query
weights = softmax(scores, dim=-1)        # shape: [seq, seq]

# Weighted sum of values
output = weights @ V                    # shape: [seq, d_head]
            scaled dot-product attention
          

Multi-head attention

One set of Q/K/V projections is one "head" of attention — it learns one type of relationship. Real models use multiple heads in parallel, each looking for different patterns (one head might track grammatical agreement, another long-range coreference, another factual recall). Llama 3 70B has 64 attention heads per layer. After all heads run independently, their outputs concatenate and pass through an output projection W_O.

GQA — Grouped-Query Attention

A 2023 optimization that's now standard: instead of having N separate K/V projections for N heads, you share each K/V across a group of heads. Llama 3 70B uses 64 query heads but only 8 KV heads (group size 8). This shrinks the KV cache by 8× with almost no quality loss — an enormous deal for inference, as you're about to see.

Why O(n²)?

Look at the scores = Q @ K.T line. Q and K are both seq × d matrices, so their product is a seq × seq matrix. Every token attends to every other token. Double the sequence length → quadruple the compute and memory. This quadratic scaling is why long contexts are expensive — and why prefill takes so much longer than decode.

05. Prefill vs Decode — the two phases of LLM inference

Inference is split into two phases with completely different performance characteristics. Understanding this distinction is the key to understanding everything about LLM serving.

FIG 05.1 · Prefill vs Decode timeline

Prefill is a single huge GEMM. Decode is hundreds of skinny GEMVs that mostly wait on memory.

Phase 1 · Prefill

The user submits a prompt of, say, 500 tokens. The model needs to compute the hidden representations of all 500 tokens in parallel — because every token's attention output depends on all the previous tokens, and we want every K and V vector available for the upcoming decode steps. This is one massive forward pass that touches all 80 layers.

From the GPU's perspective: huge matrix multiplications, big sequence dimension, plenty of work for Tensor Cores. Compute-bound. Throughput is excellent — you process ~10,000–50,000 prompt tokens per second on a single H100.

At the end of prefill: the model has the first output token, plus a KV cache filled with the K and V tensors of all 500 prompt tokens.

Phase 2 · Decode

Now the model generates one token at a time, in a loop. For each new token:

Take only the last token (just generated) and compute its hidden vector.
Project to Q, K, V for that one token.
Append the new K and V to the cache.
Compute attention: this token's Q against all K vectors in the cache.
Continue through the remaining layers, produce one output token.
Sample (e.g. top-p) to pick the next token. Loop.

The compute per step is tiny — you're doing matrix-times-vector instead of matrix-times-matrix. But you have to read all 80 layers of weights from HBM for every single output token. For a 70B model in FP16, that's ~140 GB of weights to read per token. On a B200 with 8 TB/s, the theoretical floor is ~17.5 ms per token just from memory bandwidth — before any compute. Memory-bound.

The two metrics users actually feel

TTFT (Time To First Token) = how long they wait staring at "..." before words start appearing. Dominated by prefill cost. Long prompts → big TTFT.
TPOT (Time Per Output Token) = how fast words stream. Dominated by decode memory bandwidth. The reason a 70B model on an H100 streams at ~30 tok/s, while the same model on a B200 streams at ~80 tok/s — the B200 has 2.4× more bandwidth.

06. The KV Cache — inference's most important data structure

Of all the engineering choices that make LLM inference practical, the KV cache is the most important. Without it, generating each new token would require reprocessing the entire conversation from scratch — making inference quadratic in output length and effectively unusable for long generations.

Why it exists

Recall: to compute attention for the new token, we need its Q vector dotted against the K vectors of every previous token. But the K and V vectors of previous tokens never change — the model weights don't change, the previous tokens don't change, so their K and V will always be the same. So compute them once at prefill, then keep them around. That's the cache.

# Per layer · per attention head:
KV_cache[layer][head] = {
    K: tensor([seq_len, d_head]),   # grows by 1 row each decode step
    V: tensor([seq_len, d_head])    # grows by 1 row each decode step
}

# Size in bytes for the whole cache:
kv_bytes = 2 × num_layers × num_kv_heads × d_head × seq_len × bytes_per_element

# Example: Llama 3 70B, FP16, 8K context
#   = 2 × 80 × 8 × 128 × 8192 × 2 = 2.6 GB per sequence

# Example: Llama 3 70B, FP16, 128K context
#   = 2 × 80 × 8 × 128 × 131072 × 2 = 42 GB per sequence (!)
            KV cache size math
          

The pain it causes

It's enormous. For a 70B model at 128K context, the KV cache for one user can take 40+ GB. On an 80 GB H100, that leaves only 0–5 GB for batching other requests.
It must stay in HBM. You can't page it out — every decode step reads the entire cache.
It's per-sequence. 100 concurrent users? 100 separate caches.
It grows over time. Every decode step appends one row to every layer's K and V.

The 80% problem

On a busy inference server, the KV cache often consumes 80% or more of the GPU's HBM. Squeezing it — through GQA, sliding-window attention, quantization, or paged allocation — is one of the most active areas of LLM serving research.

Three big optimizations

GQA · Grouped-Query Attention

Use fewer K/V heads than Q heads, sharing across groups. Llama 3 has 8 KV heads vs 64 Q heads = 8× cache reduction. Standard in nearly every modern model.

PagedAttention vLLM

Allocate the cache in small fixed-size pages, like virtual memory in an OS. No more wasted contiguous reservations. Increased throughput by 2–4× when introduced.

KV quantization

Store K and V at FP8 or INT8 instead of FP16. Halves cache memory; minor quality loss. Especially valuable for long-context inference.

07. Why decode is memory-bound — arithmetic intensity

Here's the most counterintuitive fact in LLM serving: decode wastes most of your Tensor Cores. A B200 can do 9 PFLOPS of FP8 math. During decode, you might use 1–5% of that. Why? Arithmetic intensity.

Definition

Arithmetic intensity = FLOPs performed ÷ bytes read from memory. It tells you whether a kernel is compute-bound (high intensity) or memory-bound (low intensity).

For a matrix-times-matrix (M @ N where both are large): high intensity. Each weight byte gets reused dozens of times. Compute-bound. Tensor Cores happy.

For a matrix-times-vector (a single decode step): low intensity. Each weight byte gets used exactly once, then thrown away. Memory-bound. Tensor Cores starve.

The math (and why batching saves us)

To compute one decode step at batch size 1:

You read ~140 GB of weights (Llama 3 70B in FP16).
You do ~140 GFLOPS of math (2 × 70B params).
Arithmetic intensity ≈ 1 FLOP per byte. Terrible.

Now batch 32 requests together:

You still read ~140 GB of weights once (they're shared!).
You do ~140 × 32 = 4,480 GFLOPS of math.
Arithmetic intensity ≈ 32 FLOPS per byte. Much better.

This is the entire economic motivation for inference serving: batch requests aggressively to amortize the weight-load cost. On a busy inference server, the same H100 that produces 30 tok/s for one user produces 2,000 tok/s aggregate across 100 users — almost the same total work, just distributed across the batch.

The B200's FP4 advantage

At FP4, the weights of a 70B model occupy ~35 GB instead of 140 GB. On B200's 8 TB/s of bandwidth, that's just ~4.4 ms to read — meaning you can theoretically produce 200+ tokens/sec from a single 70B model. This is why Blackwell FP4 is being called "transformative for inference economics."

08. End-to-end flow — from your keystroke to streaming tokens

Let's trace one complete inference request from the moment you hit "Send" through the moment the model writes "." and stops.

Request arrives at the inference server

An HTTP POST hits your vLLM / TensorRT-LLM / SGLang server. The body has your prompt and sampling parameters (temperature, top-p, max_tokens).
Tokenizer converts text to integer IDs

"Tell me a joke" → [37286, 757, 264, 22380]. Done on CPU; takes a fraction of a millisecond.
Embedding lookup

Each token ID indexes into a (vocab_size × hidden_dim) embedding matrix. Pulls out a hidden vector for each token. This whole table lives in GPU HBM.
Prefill — forward pass through all 80 layers

For each layer: RMSNorm → multi-head attention (with the entire prompt visible to itself) → residual → RMSNorm → MLP → residual. Tensor Cores fully engaged. KV cache is being filled with the K and V tensors of every prompt token.
Final RMSNorm and LM-head projection

Take the hidden vector of the last prompt token, normalize, multiply by a (hidden_dim × vocab_size) projection matrix. This produces logits — one score for every token in the vocabulary.
Sampling produces the first output token

Apply temperature scaling, top-p truncation, sample. Now you have token #1 of the response. This is the TTFT moment — it streams back to the user.
Decode loop begins

Take that just-generated token. Embed it. Forward pass — but now only one position. At each layer, append its K and V to the cache, then compute attention of this one Q vector against the entire cached K matrix. Produce next logits → sample → next token.
Stream the new token to the user

Through WebSocket, server-sent-events, or HTTP/2. The user sees a new word arrive every ~30 ms on H100, ~12 ms on B200.
Stop condition

Loop until the model emits an end-of-sequence token, or you hit max_tokens, or the user disconnects. Then release the KV cache slots and the GPU is ready for the next request (or has been serving 99 others all along).

09. FlashAttention — the kernel that changed inference

In 2022 a Stanford PhD student named Tri Dao published a paper that, more than any other single piece of software, made long-context LLMs possible. The trick was to rewrite the attention kernel to never materialize the seq×seq attention matrix in HBM.

The problem with naive attention

Standard attention has this structure:

S = Q @ K.T         # [seq, seq] · writes HUGE matrix to HBM
P = softmax(S)      # [seq, seq] · reads + writes again
O = P @ V           # reads [seq, seq] one more time
            naive attention · O(n²) memory
          

For seq=8192, that intermediate S matrix is 8192² × 2 bytes = 134 MB per head, per layer. Multiplied across heads and layers, you're moving gigabytes of attention scores in and out of HBM — even though those scores are never used after softmax.

The FlashAttention trick

FlashAttention tiles the computation: load a block of Q, a block of K, a block of V into SRAM (the SM's L1/shared memory), compute attention for that tile only, accumulate partial softmax statistics, and never write the full S matrix to HBM at all. With careful online softmax math (running max and running sum), you can stream through the K and V dimensions and produce the correct output with O(n) memory instead of O(n²).

Why it works on the GPU

SRAM is ~50× faster than HBM. By keeping intermediate work in SRAM and only writing the final output to HBM, FlashAttention is both faster (2–4×) and uses less memory (linear instead of quadratic in seq_len). It's now the default attention kernel in PyTorch (via F.scaled_dot_product_attention), TRT-LLM, vLLM, and basically every modern inference engine.

FlashAttention v2 (2023) improved warp scheduling. FlashAttention v3 (2024) added native FP8 support for Hopper, using the new asynchronous TMA engine and warp-group MMA. By 2025 it had been ported to Blackwell with TMEM-optimized variants.

10. PagedAttention & vLLM — treating the KV cache like virtual memory

The 2023 paper from UC Berkeley that introduced PagedAttention — and the inference engine vLLM built around it — changed how every serious inference server allocates memory.

The problem before

Old inference engines pre-allocated a contiguous chunk of KV cache for each request based on its maximum possible length. If you asked for 4096 max_tokens but the response stopped at 50 tokens, you wasted ~99% of that reservation — for the entire duration of the request. Fragmentation made batch sizes much smaller than the GPU could theoretically support.

The PagedAttention idea

Borrow virtual memory from operating systems. Allocate the KV cache in small fixed-size blocks (typically 16 tokens each). Maintain a per-sequence block table that says "logical position 0–15 → physical block #42, position 16–31 → block #105...". When a sequence finishes, its blocks return to the pool; when a new request arrives, it grabs blocks as needed.

No reservation waste — you only consume memory for tokens you actually generated.
Near-zero fragmentation — all blocks are the same size.
Prefix sharing — if 100 requests share the same system prompt, they can point to the same physical KV blocks for that prefix. This is enormous for chat apps with long system prompts.
2–4× throughput over the previous state of the art.

vLLM, SGLang, and TensorRT-LLM all now use page-based KV cache management. If you serve LLMs at scale, you are almost certainly using PagedAttention — even if you don't realize it.

11. Continuous batching — the throughput multiplier

In a traditional batched system, you collect N requests, run them in lockstep until all finish, then start the next batch. Some requests finish in 10 tokens; others in 1,000. The slow ones hold the batch hostage — the GPU is mostly idle during the tail.

Continuous batching (also called "in-flight batching" or "iteration-level scheduling") fixes this. The scheduler operates one decoder step at a time:

Look at all in-progress requests. They're at different KV-cache lengths.
Form a single batched decode step that includes every active sequence's current token.
Run one forward pass; produce next tokens for all.
For any sequence that emitted EOS, evict it. For any new request waiting in the queue, slot it in.
Repeat.

The batch composition changes every step. The GPU stays maximally utilized. This combined with PagedAttention is what unlocks the 1,000+ tok/sec aggregate throughput numbers you see in vLLM/SGLang/TRT-LLM benchmarks.

12. Quantization — how 70B fits on one GPU

The Llama 3 70B model, in its native FP16 format, weighs ~140 GB. That doesn't fit on an H100's 80 GB. So how does anyone serve it on one GPU? Quantization.

The general idea

Trained model weights have specific statistical structure: most are small, mean-zero numbers in a narrow range. You don't need 16 bits of precision to represent them — 8 bits, or even 4 bits, can preserve almost all of the model's behavior if you're careful about how you map the numeric range.

FP16 / BF16

2 bytes per weight. The native training format. Llama 3 70B = 140 GB. Where every model starts.

INT8 · FP8

1 byte per weight. 2× compression. INT8 uses fixed-point math; FP8 uses NVIDIA's E4M3/E5M2 floating formats. Negligible quality loss with modern calibration. Llama 3 70B → ~70 GB. Fits on one H100.

INT4 · NF4 · NVFP4

0.5 bytes per weight. 4× compression. Requires more sophisticated techniques (GPTQ, AWQ, NVFP4 two-level scale). Llama 3 70B → ~35 GB — fits on a single RTX PRO 6000 or even an RTX 4090 (24 GB) for smaller models.

GPTQ · AWQ

Post-training quantization algorithms. Use a small calibration dataset to choose per-channel scales that minimize quantization error. AWQ also identifies "salient" weights to keep higher-precision.

What's actually quantized

The biggest gains come from quantizing the linear layers' weight matrices, which is where 99% of the model's parameters live. Other components — LayerNorm, embeddings, attention scores — are typically left in higher precision because they're a small fraction of memory and very sensitive.

NVFP4 — the new gold standard

NVIDIA's NVFP4 format (Blackwell) is two-level: every block of 16 weights gets its own FP8 scale, plus a per-tensor FP32 scale on top. This preserves the dynamic range that vanilla FP4 loses, getting within ~1% of FP8 accuracy at 4× the throughput. Expected to become the default inference format for 2026 frontier models.

13. Speculative decoding — getting tokens for free

An elegant idea from 2023 that's now everywhere: use a small fast model to guess, use the big model to verify in parallel.

Draft phase

A small "draft" model (e.g. a 1B model that runs at 200+ tok/s) generates the next 4–8 tokens cheaply and quickly.
Verify phase

The big "target" model (70B) runs one forward pass treating all draft tokens as if they were the input. This is essentially a prefill on a tiny extra sequence — almost free since the big model is memory-bound anyway. It produces logits for the next-token distribution at every drafted position.
Accept / reject

For each drafted token, compare the big model's probability for that token vs the draft model's. Use a rejection-sampling rule that mathematically guarantees the resulting distribution is identical to what the big model would have produced alone. Accept the longest prefix that passes; restart drafting from after the first rejection.

Net effect: when the draft is accurate (most of the time on common text), you get 2–4 tokens per big-model forward pass instead of 1. Same quality as the big model, ~2–3× faster end-to-end. This is the technique behind many modern fast-mode inference APIs.

14. Tensor parallelism & pipeline parallelism — when one GPU isn't enough

For models larger than a single GPU's memory (the Llama 3 405B, GPT-4-scale models, etc.), you have to split the model across GPUs. Two main strategies:

Tensor Parallelism (TP)

Split each layer across GPUs

Each layer's weight matrix is sharded column-wise (or row-wise) across N GPUs. Every GPU does 1/N of the matmul, then they all-reduce the partial results via NVLink. Requires very fast interconnect (this is why NVLink exists). Used inside a single 8-GPU node.

Pipeline Parallelism (PP)

Split layers across GPUs

GPU 0 has layers 0–19, GPU 1 has 20–39, etc. Activations flow from one to the next like an assembly line. Higher latency but tolerates slower interconnect (PCIe, InfiniBand). Used across 8-GPU nodes when one node isn't enough.

Real-world large-scale serving combines both: TP within a node (across 8 GPUs over NVLink) + PP between nodes (over InfiniBand). On a GB200 NVL72 rack, NVLink stretches across all 72 GPUs so you can do TP at much larger scale than ever before — up to TP=72 inside a single rack.

For inference, there's also expert parallelism for mixture-of-experts models: different experts of the MoE layer live on different GPUs, and tokens get routed across the network to the experts they need.

15. Throughput vs latency — the central tradeoff

Every inference serving decision ultimately routes through this dichotomy.

Throughput-Optimized

Large batches · slower per user

Batch 128+ requests together. Aggregate throughput: 5,000+ tok/s on one H100. Per-user latency: ~80 ms/token. Great for offline batch jobs, RAG document processing, fine-tuning data generation, evaluation runs.

Latency-Optimized

Small batches · fast per user

Batch 1–8 requests. Aggregate: 200–800 tok/s. Per-user latency: ~15–30 ms/token. Required for chat UX, voice agents, anything where humans are waiting.

The metrics that actually matter

TTFT

Time To First Token. How long the user waits before seeing anything. Dominated by prefill. Long prompts → big TTFT. Target: <500 ms.

TPOT / ITL

Time Per Output Token / Inter-Token Latency. The streaming "speed". Below ~50 ms (= 20 tok/s) feels human-like for reading; above 200 ms feels sluggish.

Throughput (tok/s)

Aggregate tokens served per second across all users. The number you care about for cost-per-million-tokens.

Goodput

Throughput under a latency SLA. "How many tokens/sec can I serve while keeping TPOT under 50 ms?" The real-world metric — raw throughput is meaningless if everyone times out.

2026 inference stack (where the field landed)

Pick one of vLLM, SGLang, or TensorRT-LLM. All three use PagedAttention + continuous batching + FlashAttention internally. On H100/H200 they deliver 2,000–3,000 tok/s aggregate for 70B-class models. On B200 with FP4, frontier numbers approach 10,000 tok/s. The bottleneck has moved firmly from "can we serve this model" to "can we serve it within latency SLAs at a profit".

Section 03 · The Eye

Graphics Pipeline — from triangle to pixel, at sixty per second.

Before CUDA, before Tensor Cores, before the AI boom — the GPU existed to draw pictures. The path from "here are 5 million triangles" to "here is a 4K image" is one of the most beautiful pieces of engineering in modern computing. Let's walk every stage, then see how ray tracing and neural rendering fit alongside it.

01. The big picture — what we're even doing

The graphics pipeline is a function: it takes a scene description (a list of triangles, materials, lights, and a camera) and produces a 2D image. That's it. The complexity comes from doing it 60+ times per second, with millions of triangles, billions of pixels' worth of work, and physically plausible lighting.

FIG 01.1 · The classic rasterization pipeline

The classic real-time rasterization pipeline. Each colored stage runs on the GPU; the programmable ones execute shaders you write.

Two crucial facts about the pipeline

It is massively parallel. Every vertex is independent of every other vertex. Every pixel is (mostly) independent of every other pixel. The GPU spawns one CUDA-thread-per-vertex during the vertex stage and one CUDA-thread-per-pixel during the fragment stage — thousands of threads running the same shader simultaneously.
The "shaders" you write are CUDA kernels in disguise. A vertex shader, a fragment shader, a compute shader — they all compile down to PTX/SASS and run on the same SMs and CUDA cores. The graphics API (DirectX, Vulkan, Metal) is just a different front-end to the same hardware.

02. Vertex stage — from object space to clip space

Each model in the scene starts life as a list of vertices — 3D positions, possibly with extra data attached (normal, texture coordinates, vertex colors). Those positions are in the model's own coordinate system — a chair model has its origin at the chair's feet; a character model has its origin between the feet. We need to put everything into one common coordinate system, project it onto the screen, and tell the GPU where on the 2D image each vertex ends up.

The four coordinate spaces

Model space

The coordinates baked into the asset file. The chair's leg is at (0.2, 0, 0.5). Local to that mesh.
World space

Multiply by the model matrix — this is where the chair sits in the room. The leg might now be at (12.4, 0, -3.1) in world coordinates.
View (camera) space

Multiply by the view matrix. The world is now expressed relative to the camera — "the chair leg is 4 meters in front of me and 2 meters to my right." The camera is always at the origin, looking down -Z.
Clip space → screen space

Multiply by the projection matrix. This handles the perspective foreshortening — far things shrink. Then divide by W (perspective divide) to get normalized device coordinates from -1 to +1, then scale to pixel coordinates of the framebuffer.

// A typical vertex shader (HLSL)
struct VS_OUTPUT {
    float4 position : SV_POSITION;   // in clip space
    float3 worldNormal : NORMAL;
    float2 uv : TEXCOORD;
};

VS_OUTPUT VS_main(float3 inPos : POSITION, float3 inNormal : NORMAL, float2 inUV : UV) {
    VS_OUTPUT o;
    float4 world = mul(modelMatrix, float4(inPos, 1));
    float4 view  = mul(viewMatrix, world);
    o.position = mul(projMatrix, view);    // final clip-space position
    o.worldNormal = mul((float3x3)modelMatrix, inNormal);
    o.uv = inUV;
    return o;
}
            vertex shader · HLSL
          

The vertex shader runs once per vertex. On a model with 100K vertices, that's 100K threads spawned in groups of 32 (warps), spread across the SMs. Each vertex's work is independent, so they all run in parallel and finish almost instantly.

03. Tessellation & geometry shaders — making more triangles

Optional stages that happen after the vertex shader and before rasterization. Both have the power to create new geometry on the GPU — rather than just transforming what was sent in.

Tessellation

Takes a coarse patch (a quad or triangle) and subdivides it into many small triangles, with per-patch control over how much subdivision. Used for terrain LOD, smooth curved surfaces from low-poly cages, and adding fine detail only where the camera is close.

Geometry shader

Runs once per primitive (triangle/line) and can emit zero or more new primitives. Used for things like generating fur fins, particle billboards, shadow volumes. Historically slow on most hardware — largely replaced by mesh shaders.

Modern engines like Unreal 5 use these stages less and less, replacing them with mesh shaders and Nanite-style virtualized geometry. We'll get to those.

04. Rasterization — triangles become pixel-fragments

Now we have transformed triangles in screen space. Rasterization is the fixed-function stage that figures out which pixels each triangle covers and generates a "fragment" for each covered pixel. A fragment is a potential pixel — not yet a final color, but a position plus interpolated attributes (UV, normal, depth) inherited from the vertices.

What rasterization actually does

Triangle setup

Given the 3 screen-space vertices, compute the triangle's bounding box and edge equations. The edge equation tells you, given any pixel center (x, y), which side of each edge that pixel is on.
Coverage testing

For every pixel inside the bounding box, evaluate the three edge equations. If the pixel is on the inside of all three edges, the triangle covers it. Modern hardware does this for 16–64 pixels per cycle per raster engine.
Attribute interpolation

For every covered pixel, compute barycentric coordinates — weights for how much each of the 3 vertices contributes. Use those to interpolate UV, normal, color, depth, etc. The fragment shader sees these per-pixel interpolated values, not the raw vertex data.
Early-Z (depth test)

Modern GPUs try to reject fragments before running the fragment shader by checking depth: if the depth buffer already has a closer surface at that pixel, this fragment will lose — skip it. Saves enormous amounts of fragment shading work, especially for crowded scenes.

Each GPC has a raster engine

Rasterization is one of the few things on a modern GPU that isn't a CUDA core operation — it's done by dedicated hardware called a raster engine, one per GPC. The H100 has 8 raster engines; the RTX PRO 6000 has more. Each can rasterize triangles independently in parallel.

05. The fragment shader — where lighting happens

The fragment shader (called "pixel shader" in DirectX, "fragment shader" in OpenGL/Vulkan) runs once per fragment — that is, roughly once per pixel covered by a triangle, though pixels can be covered by multiple triangles (overdraw). This is where the majority of a frame's GPU time is spent in modern games.

What it computes

Given the interpolated per-fragment data — world-space position, surface normal, texture coordinates — the fragment shader produces a final RGBA color. This is where:

Texture sampling happens (look up the base color, normal map, roughness, metallic maps from textures).
Lighting is computed (sum contributions from all light sources, accounting for shadows, occlusion, indirect light).
Material BRDF is evaluated (how the surface reflects light — matte plastic vs polished metal vs subsurface skin).
Post-shader effects for that pixel run (transparency, fog, screen-space reflections).

// Simplified PBR fragment shader (HLSL)
float4 FS_main(VS_OUTPUT input) : SV_TARGET {
    float3 baseColor = baseColorTex.Sample(samp, input.uv).rgb;
    float  metallic   = roughnessTex.Sample(samp, input.uv).b;
    float  roughness  = roughnessTex.Sample(samp, input.uv).g;
    float3 N = normalize(input.worldNormal);
    float3 V = normalize(cameraPos - input.worldPos);

    float3 Lo = float3(0,0,0);
    for (int i = 0; i < numLights; ++i) {
        float3 L = normalize(lights[i].pos - input.worldPos);
        float3 H = normalize(L + V);
        float  NdotL = max(dot(N, L), 0);

        // Cook-Torrance BRDF: D · F · G / (4 · NdotL · NdotV)
        float3 brdf = CookTorrance(N, V, L, H, baseColor, metallic, roughness);
        Lo += brdf * lights[i].color * NdotL;
    }
    return float4(Lo, 1.0);
}
            fragment shader · PBR lighting
          

How it maps to the GPU

Fragment shaders execute in 2×2 pixel quads — the smallest unit. Each warp of 32 threads handles 8 such quads (= 32 fragments). The hardware groups quads in a way that maximizes the chance that consecutive threads in a warp are working on adjacent screen pixels — which is exactly what you want for cache locality on texture lookups.

Why 2×2 quads?

Texture sampling needs derivatives (ddx, ddy) to pick mip levels for filtering. The hardware computes derivatives via finite differences across the quad — if the right pixel's UV is 0.51 and the left is 0.50, the derivative is 0.01. That's why fragments always come in 2×2 groups even at the edges of triangles (where some lanes are "helper" lanes that get masked off at the end).

06. Output Merger & ROPs — writing pixels into the framebuffer

The fragment shader produces a color, but that's not the end of the story. The Raster Operation Pipeline (ROP) is the fixed-function unit that does the final mixing and writing of pixels into the framebuffer.

ROP responsibilities:

Depth test — if not already done by Early-Z. Compare this fragment's depth against the depth buffer; discard if behind.
Stencil test — arbitrary masking. Used for shadow volumes, decals, UI clipping.
Alpha blending — mix this fragment's color with whatever is already in the framebuffer (for transparent objects).
MSAA resolution — if multi-sample anti-aliasing is on, combine multiple sub-pixel samples.
Write to render target — finally update the framebuffer with the new pixel color.

Once every fragment from every triangle has gone through ROPs, the render target contains the final image. The GPU does a final pass to copy/blit it to the display, the monitor scans it out, and your eye sees a frame.

07. Textures & sampling — the hidden hot path

Textures are the data backbone of modern graphics: they hold base colors, normal maps, roughness, displacement, environment lighting, and a hundred other channels. A modern game scene easily has 5–20 GB of textures referenced. Sampling them well is a huge fraction of fragment-shader cost.

The texture pipeline

Texture units

Dedicated hardware inside each SM. Performs filtering (bilinear, trilinear, anisotropic), boundary handling, mipmap selection, and decoding compressed texture formats — all for free per cycle.

Mipmaps

Pre-computed smaller copies of each texture (½, ¼, ⅛...). The hardware picks the right mip level based on how zoomed-out the surface appears, avoiding aliasing on distant detail.

Compression

BC1–BC7 / ASTC / Blackwell's Neural Texture Compression. Textures live in HBM in a 4–8× compressed form and decompress on read in dedicated hardware. Essential for fitting modern asset budgets.

Anisotropic filtering

When a surface is viewed at a steep angle, take more samples along the elongated axis to keep distant details crisp. 16× anisotropic is now essentially free on modern GPUs and one of the simplest quality wins.

Blackwell's Neural Texture Compression (NTC) uses tiny neural networks (a few hundred bytes) to reconstruct texture data, getting ~7× compression versus BC7 with comparable quality. It's a glimpse of where rendering is heading: neural networks inside the rendering pipeline, not just as post-processing.

08. Rasterization vs ray tracing — two opposite philosophies

Everything we've covered so far is rasterization: project triangles onto pixels and shade them. It's fast, it's been optimized for 30 years, but it can't directly answer questions like "what color is the reflection of this surface in the puddle?" or "what does this ceiling look like if the only light comes from a window I can't see from this pixel?"

Ray tracing answers exactly those questions. It works backward: for each pixel, shoot a ray from the camera into the scene; see what it hits; possibly bounce more rays to gather lighting info; assemble the final color.

RASTERIZATION

For each triangle, find the pixels

Local shading. Sees the surface and the lights, doesn't see the rest of the world. Cheap, deterministic, decades of optimization. Tricks like shadow maps, screen-space reflections, environment maps approximate global effects. Limited fidelity, but blazing fast.

RAY TRACING

For each pixel, find what it sees

Global shading. Each ray can interact with anything in the scene. Naturally handles shadows, reflections, refractions, global illumination, caustics. Physically correct. Computationally expensive — until RT cores and AI denoising made it real-time-feasible around 2018.

Today's reality: hybrid rendering

Modern games (Cyberpunk 2077, Alan Wake 2, Indiana Jones, etc.) use hybrid rendering: rasterize most of the frame, then trace rays selectively for the parts that benefit most — reflections, soft shadows, ambient occlusion, global illumination. The best of both worlds.

09. The ray tracing pipeline — five new shader types

Just as rasterization has vertex/fragment shaders, ray tracing has its own programmable pipeline. NVIDIA's DXR (DirectX Raytracing) / Vulkan KHR Ray Tracing exposes five shader types:

Ray generation

The entry point. Runs once per pixel. Calls TraceRay() to launch primary rays into the scene. You typically loop here to fire multiple rays per pixel for anti-aliasing.

Intersection

Runs when a ray potentially hits a custom primitive (not a triangle). For built-in triangle geometry, the RT core handles this in fixed hardware. Custom intersection lets you ray-trace spheres, voxels, etc.

Any-hit

Runs for every potential hit along the ray (not just closest). Used for transparency and stochastic clipping — "this hit is on an alpha-masked leaf, ignore it and keep going".

Closest-hit

Runs after the closest intersection is determined. Has access to hit position, normal, material. Computes shading there — including possibly firing more rays for reflections, indirect light. This is where most lighting logic lives.

Miss

Runs when the ray hits nothing. Typically samples a sky / environment map and returns that color — "the ray escaped to the sky, here's the sky color in that direction."

// Ray generation shader pseudo-HLSL
[shader("raygeneration")]
void RayGen() {
    uint2 pixel = DispatchRaysIndex().xy;
    RayDesc ray = cameraRayFor(pixel);

    Payload payload = (Payload)0;
    TraceRay(sceneBVH, RAY_FLAG_NONE, 0xFF, 0, 0, 0, ray, payload);

    outputImage[pixel] = payload.color;
}

// Closest-hit: surface shading + bounce
[shader("closesthit")]
void ClosestHit(inout Payload payload, BuiltInTriangleAttribs attribs) {
    float3 hitPos = WorldRayOrigin() + RayTCurrent() * WorldRayDirection();
    float3 N = interpolateNormal(attribs);

    payload.color = directLighting(hitPos, N);

    // Fire a reflection ray
    if (payload.depth < 2) {
        RayDesc reflectRay = {hitPos, 0.001, reflect(WorldRayDirection(), N), 1e30};
        Payload reflPayload;  reflPayload.depth = payload.depth + 1;
        TraceRay(sceneBVH, RAY_FLAG_NONE, 0xFF, 0, 0, 0, reflectRay, reflPayload);
        payload.color += 0.5 * reflPayload.color;
    }
}
            DXR ray tracing shaders
          

10. BVH and RT cores — why hardware ray tracing is fast

The core challenge of ray tracing is the intersection test: given a ray, find the closest triangle (out of millions) it hits. A naive linear search through every triangle would be hopelessly slow. The answer is a tree of bounding boxes — the Bounding Volume Hierarchy.

What a BVH is

A binary tree where each node is an axis-aligned bounding box (AABB) that contains all geometry beneath it. The root box contains the entire scene. Each level subdivides: left child holds half the geometry, right child holds the other half. Leaf nodes contain individual triangles (typically a handful per leaf).

How a ray traverses it

Test ray against root AABB. If miss, the ray hits nothing — done.
If hit, descend into both children. Test ray against each child's AABB.
For any child whose box is hit, recurse. Skip subtrees whose boxes are missed.
At a leaf, test the ray against the actual triangles inside.
Keep track of the closest hit so far; if a new candidate's box starts further than the current closest hit, skip it (early termination).

For typical scenes, ray traversal visits O(log N) boxes instead of touching N triangles — turning a million-triangle query into ~20 box tests plus a handful of triangle tests.

RT cores: this traversal in silicon

Each RT core has dedicated hardware for two operations:

Box test — given a ray and an AABB, determine intersection. Done in ~1 cycle.
Triangle test — given a ray and a triangle, compute intersection point. Also fast in fixed hardware.

Crucially, the RT core handles BVH descent asynchronously from the SM's CUDA cores. When you call TraceRay(), the RT core does the heavy lifting while your CUDA threads can sit idle (or, with shader-execution-reordering, do other work). The SM only re-engages when the hit point and material are known, to run your closest-hit shader.

Blackwell's 4th-gen RT core

Doubles the ray-triangle intersect rate of Ada. Adds RTX Mega Geometry: BVH compression and clustering tricks that allow up to 100× more ray-traced triangles in scene — matching the geometric density of film VFX assets in real-time games.

11. Hybrid rendering — the practical compromise

Even with RT cores, full path tracing at 4K at 60 FPS would require ~100+ rays per pixel per frame and crush even a B200. So shipping games make pragmatic choices.

Typical hybrid pipeline (Unreal 5 / Frostbite / Decima)

G-buffer pass (rasterization)

Rasterize the whole scene to a multi-channel buffer holding per-pixel position, normal, base color, roughness, metallic. Fast.
Direct lighting (rasterization + shadow maps)

For each light, render a shadow map and use traditional shadow filtering. Or for soft sun shadows, fire 1–4 rays per pixel from the G-buffer.
Ray-traced reflections

For mirror-like surfaces, fire 1 ray per pixel from the G-buffer along the reflection direction. Shade what it hits with a simplified BRDF. Denoise heavily.
Ray-traced global illumination

A few diffuse rays per pixel, accumulated temporally across frames and spatially across neighbors. Denoised. This is the heart of "RTX-on" in modern engines.
Composite, tone map, post-process

Bloom, motion blur, depth of field, color grading, then output.
AI upscaling (DLSS / FSR / XeSS)

If rendering at 1080p internal, upscale to 4K using a trained network running on the Tensor Cores. The single biggest performance lever in modern games.

12. DLSS & neural rendering — AI inside the pipeline

The last few years have seen neural networks creep into nearly every stage of rendering, running on the GPU's Tensor Cores alongside the traditional graphics work. The most visible example: DLSS (Deep Learning Super Sampling).

DLSS Super Resolution

Render the game internally at 1080p or 1440p. A small CNN/transformer running on Tensor Cores upscales each frame to 4K, using motion vectors and the previous frame for temporal coherence. Looks indistinguishable from native 4K, runs 2–3× faster.

DLSS Frame Generation

Generates entirely new frames between rendered ones. The game renders at, say, 60 FPS; DLSS interpolates 2–3 intermediate frames per real frame, displaying at 180+ FPS. Introduced with Ada Lovelace; refined in Blackwell.

Ray Reconstruction

Replaces traditional denoising filters with a neural denoiser. Critical for clean path-traced output at low samples-per-pixel. Especially powerful for Path Traced games like Cyberpunk's "Overdrive" mode.

Neural shaders Blackwell

Blackwell SMs allow neural networks to run inside programmable shaders — not just as a post-process. Tiny networks can replace specific subroutines: BRDF evaluation, light field interpolation, importance sampling. The future of rendering will likely be a hybrid of analytic and neural code paths.

Where this is heading

Five years ago, the Tensor Cores on a gaming GPU were "wasted" silicon when you weren't using DLSS. Today they're constantly active. By 2028, "rendering a frame" will probably mean: rasterize a sparse set of pixels, ray trace some hard cases, and let a neural model fill in the rest. The hard line between "graphics" and "AI" silicon is dissolving.

13. Mesh shaders — the new geometry pipeline

The traditional vertex → tessellation → geometry → rasterization pipeline has a fundamental scaling limitation: every vertex must be processed individually, and the GPU has limited control over which geometry to load. Mesh shaders, introduced with Turing (2018) and refined since, replace those early stages with a more flexible compute-shader-like model.

Instead of one vertex shader per vertex, you write:

An amplification (task) shader that decides which "meshlets" to process based on view-dependent culling.
A mesh shader that takes a meshlet (a small triangle cluster, ~64–128 triangles) and emits its vertices and primitives in batch — with full programmable control over LOD, culling, and topology generation.

This is the foundation of Unreal Engine 5's Nanite — a system that streams pixel-resolution geometry, dynamically choosing how detailed each meshlet should be based on the camera. Without mesh shaders, Nanite would be impossible.

14. One frame, end-to-end — the whole journey

Let's tie everything together. You press W in a game. Your character walks forward. Sixteen milliseconds later, your monitor displays a new frame showing the world from a slightly different angle. Here is everything that happened on the GPU:

CPU prepares the frame's command list

Game logic updates physics, AI, animation. The engine walks the scene graph, frustum-culls, builds draw calls. All of this on the CPU. Submits a command buffer to the GPU through DirectX 12 / Vulkan.
GPU executes the command buffer

The GPU's command processor (a tiny CPU on the GPU) reads commands and orchestrates: bind these textures, launch this draw call, set this depth state. Work is dispatched to the SMs.
Shadow map passes

For each light that casts shadows, rasterize the scene from the light's point of view into a depth-only texture. Done with the standard pipeline but with a trivial fragment shader (just outputs depth).
G-buffer (deferred shading) pass

Rasterize everything visible from the camera into multiple render targets: position, normal, base color, roughness/metallic. Vertex shaders transform; rasterizers cover pixels; fragment shaders sample textures and write to the G-buffer.
Ray-traced lighting passes

Launch RT pipelines for reflections, GI, soft shadows. RT cores walk BVHs; closest-hit shaders compute shading; results accumulate temporally.
Compute-shader denoising

Run AI denoisers (Ray Reconstruction) on the noisy RT outputs. Tensor Cores spin up.
Composite + post-processing

Combine G-buffer + lighting buffers + transparent objects. Run bloom, depth of field, motion blur, tone mapping, color grading — all as compute shaders.
DLSS upscale + frame generation

Take the rendered 1440p frame; upscale to 4K with a Tensor Core network. Optionally interpolate intermediate frames.
Present

Hand the final 4K image to the display engine. Wait for vsync; scan out to the monitor.

From W keypress to photon hitting your eye: ~30 ms on a modern PC. Inside that 30 ms, every system we've discussed in this guide — CUDA cores running shaders, Tensor Cores running DLSS, RT cores walking BVHs, ROPs blending pixels, the memory hierarchy feeding all of them — runs in tight orchestrated parallel. The whole guide collapses into one rendered frame.

The unification

Notice how AI and graphics have merged. A modern frame uses CUDA cores (shading), Tensor Cores (DLSS, ray reconstruction, neural texture decompression), and RT cores (ray-triangle traversal) all simultaneously. The GPU is no longer a "graphics card" or an "AI accelerator" — it's a unified parallel processor that wields all three classes of compute. Which is exactly what the original CUDA pitch in 2006 promised, twenty years ahead of schedule.