01. CPU vs GPU — the two philosophies of compute
To understand a GPU, you first have to understand what it is not. A CPU and a GPU both process data, but they are built around opposite design philosophies. They optimize for completely different things, and that is the single most important fact in this entire guide.
Picture a CPU as a small team of brilliant senior surgeons: each can perform any operation in the hospital and make complex sequential decisions, but there are only a handful of them. Now picture a GPU as a stadium full of trained interns: each one can only do simple, identical steps, but ten thousand of them moving in unison can stitch up a city. The CPU optimizes per-thread latency; the GPU optimizes aggregate throughput.
CPU: latency machine
4–128 powerful cores. Huge caches (often more than 50% of the die). Massive branch predictors and out-of-order execution. Goal: finish any single task as fast as possible, including unpredictable, branchy, sequential workloads.
GPU: throughput machine
10,000–25,000 simple cores. Tiny caches per core. No branch prediction worth mentioning. Goal: finish millions of identical tasks per second by running them in massive parallel batches. Latency on any single thread is awful — aggregate throughput is staggering.
Why this matters for AI and graphics
Training a neural network, rendering a frame of a game, simulating fluid dynamics — these workloads all share the same DNA: do the same arithmetic on millions of data elements. Matrix multiplications, pixel shading, ray-triangle intersections — they decompose naturally into "perform operation X on element Y, where Y is independent of every other element". This is the only kind of work where you actually need ten thousand interns instead of one surgeon.
The CPU spends most of its transistor budget on control logic and caches: figuring out what to do next and keeping one thread fed. The GPU spends far more of its budget on arithmetic units and the register files that feed them, actually doing the work. On a modern GPU such as the H100, the SMs dominate the die in a way that execution units never do on a CPU. That ratio is the GPU's superpower.
02. The physical die — GPU → GPC → TPC → SM
A modern NVIDIA GPU is not one monolithic chip. It is a hierarchical fractal of smaller and smaller compute units, each level grouping the level below. Once you internalize this hierarchy, every spec sheet suddenly makes sense.
The 5-level hierarchy
GPU
The whole chip. A single physical piece of silicon (or, on Blackwell, two dies stitched together with NV-HBI). Contains every GPC, all caches, memory controllers, NVLink/PCIe interfaces, and dedicated engines.
GPC (Graphics Processing Cluster)
A self-contained "mini-GPU" with its own raster engine. The H100 die has 8 GPCs. Conceptually you can think of a GPC as an autonomous worker team that can handle a slice of any workload independently.
TPC (Texture Processing Cluster)
A pair of SMs that share a texture unit and some load/store hardware. The grouping comes from the GPU's graphics heritage. In compute workloads you mostly think one level above (GPC) and one level below (SM) and ignore the TPC.
SM (Streaming Multiprocessor)
The fundamental compute building block. This is where your CUDA threads actually run. An SM contains 128 CUDA cores, 4 Tensor Cores, a register file, L1/shared memory, warp schedulers, and (on RTX parts) an RT core. This is the brain of the GPU.
CUDA core
The smallest functional unit. Roughly equivalent to a single scalar arithmetic pipeline (FP32 ALU + INT32 ALU). Modern GPUs have between 14,000 and 25,000 of these. They operate in groups of 32 lanes that execute in lockstep, one lane per thread of a warp.
Specialty units
Sitting alongside the SMs: Tensor Cores (matrix math), RT Cores (ray tracing), TMA / Tensor Memory Accelerator (Hopper+), Decompression Engine (Blackwell), NVENC/NVDEC (video), and copy engines for async data movement.
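As a quick sanity check on the hierarchy, the headline core counts on a spec sheet fall out of the SM count. The sketch below is a minimal illustration that assumes the per-SM figures quoted above (128 CUDA cores and 4 Tensor Cores per SM); the function name `chip_totals` is made up for this example.

```python
# Per-SM unit counts as stated in the text for recent NVIDIA SMs.
CUDA_CORES_PER_SM = 128
TENSOR_CORES_PER_SM = 4


def chip_totals(enabled_sms: int) -> dict:
    """Scale the per-SM counts up to the whole chip (hypothetical helper)."""
    return {
        "cuda_cores": enabled_sms * CUDA_CORES_PER_SM,
        "tensor_cores": enabled_sms * TENSOR_CORES_PER_SM,
    }


# Enabled SM counts taken from the spec figures quoted in this guide.
for name, sms in [("H100 SXM5", 132), ("B200", 148), ("RTX PRO 6000", 188)]:
    print(name, chip_totals(sms))
```

Reading a spec sheet this way also shows why the SM count is the number to compare first: everything else is a multiple of it.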
03. Inside a Streaming Multiprocessor — the compute engine
If you only memorize one diagram from this entire guide, make it the SM. Everything — every kernel, every CUDA call, every Tensor Core operation — ultimately runs inside this little block of silicon. An SM is itself divided into 4 processing partitions, each operating semi-independently. The structure has been remarkably stable since Volta (2017) and only gets refined each generation.
The four partitions, explained
Why four partitions? Because one Tensor Core operation produces enough work to keep 32 CUDA cores busy for many cycles, and you want four such streams running concurrently inside each SM to maximize utilization. Each partition is a self-contained pipeline:
- Warp Scheduler — the brain of the partition. Picks a ready warp (32 threads) every clock cycle and issues an instruction.
- Dispatch Unit — sends the chosen instruction to the appropriate execution units.
- Register File — 64 KB of registers per partition (16,384 32-bit registers), which adds up to 65,536 registers or 256 KB per SM. Threads keep all their working variables here.
- 32 FP32 CUDA cores — the bread-and-butter math units.
- 16 FP64 cores + 16 INT32 cores (H100 figures; counts vary by architecture) — doubles and integers.
- 1 Tensor Core — the matrix-multiply monster (more on this later).
- LD/ST — load/store units for memory operations.
- SFU · Special Function Units — transcendentals like sin, cos, exp, sqrt, 1/x.
When you launch a CUDA kernel, the GPU's scheduler hands out thread blocks to SMs. Each SM can hold multiple blocks resident at once (up to 32 on Hopper, depending on resource usage). When a warp from one block stalls on memory, the SM's schedulers switch to another ready warp, from the same block or a different one, at no cost. This is the foundation of latency hiding.
04. The CUDA Core — what it actually is
The term "CUDA core" is one of the most misunderstood in all of GPU computing. NVIDIA uses it as a marketing-friendly proxy for "scalar arithmetic pipeline", but a CUDA core is not a complete processor in the way a CPU core is. It's much narrower.
A CUDA core is a single FP32 arithmetic pipeline — one lane in a 32-wide SIMT execution unit. It can execute one floating-point multiply-add per clock. It has no independent program counter (32 of them share one), no branch prediction, no instruction decoder. It exists to do math, fast, alongside 31 of its identical siblings.
What an FP32 CUDA core can do
- 1 FP32 FMA (fused multiply-add) per cycle — counts as 2 floating-point operations
- 1 FP32 addition or multiplication per cycle
- Integer ops on Volta+ (when paired with its INT32 sibling unit, can run integer math simultaneously with FP32)
- Logical/bitwise operations
What it cannot do alone: matrix multiplications (that's the Tensor Core), ray-triangle intersections (RT Core), reading textures (Texture Unit), evaluating sin/cos (SFU). When you write __expf(x) in your kernel, the work goes to the SFU, not the CUDA core.
Why we still count them
Despite the limitations, CUDA core count remains the single best first-order indicator of a GPU's general-purpose compute throughput. The math is simple: peak FP32 FLOPS = CUDA cores × 2 (one FMA counts as two operations) × clock frequency.
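Here is that arithmetic as a small sketch. The core count comes from this guide; the 1.98 GHz boost clock is an assumption chosen for illustration, and real sustained clocks vary with power and thermals.

```python
def peak_fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak FP32 throughput: each core retires one FMA (2 FLOPs) per clock."""
    flops_per_core_per_clock = 2  # one fused multiply-add = 2 floating-point ops
    gflops = cuda_cores * flops_per_core_per_clock * clock_ghz
    return gflops / 1000.0  # GFLOPS -> TFLOPS


# H100 SXM5: 16,896 CUDA cores; the 1.98 GHz clock is an assumed boost value.
h100_tflops = peak_fp32_tflops(16_896, 1.98)
print(f"H100 SXM5 peak FP32 ~ {h100_tflops:.1f} TFLOPS")
```

The result lines up with the roughly 67 TFLOPS FP32 figure quoted for the H100 later in this guide, which is a useful cross-check that the spec-sheet number is a simple product and not a measured result.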
05. Thread → Warp → Block → Grid — the CUDA execution model
Now we step from hardware into software. CUDA gives you a programming abstraction that maps cleanly onto the physical hierarchy you just saw. Master this 4-level model and you can write any GPU kernel.
The model in one paragraph
A CUDA kernel launch creates a grid of blocks. Each block holds up to 1,024 threads. The GPU's block scheduler distributes blocks across the SMs; once a block lands on an SM, it stays there until done. Inside the SM, the threads in each block are partitioned into warps of exactly 32 threads. The warp is the actual unit of hardware execution — all 32 threads in a warp execute the same instruction at the same time, just on different data.
Level by level
Thread
The smallest unit. Has its own program counter (effectively), its own slice of the register file, and its own slot in shared memory if you ask for one. Inside your kernel, a thread can ask "who am I?" via threadIdx.x, and that's how you assign each thread its slice of the data: the usual global index is blockIdx.x * blockDim.x + threadIdx.x.
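To make the indexing concrete, here is a plain-Python emulation of how a 1D launch maps threads to array elements. It is only a sketch: the function below mirrors the CUDA built-ins blockIdx.x, blockDim.x, and threadIdx.x as ordinary arguments, and nothing here runs on a GPU.

```python
def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Emulates blockIdx.x * blockDim.x + threadIdx.x for a 1D launch."""
    return block_idx * block_dim + thread_idx


# A toy launch of 3 blocks with 4 threads each covers elements 0..11 exactly once.
covered = [
    global_thread_index(b, 4, t)
    for b in range(3)   # blockIdx.x
    for t in range(4)   # threadIdx.x
]
print(covered)

# With 256 threads per block, thread 5 of block 2 handles element 517.
print(global_thread_index(2, 256, 5))
```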
Warp
32 consecutive threads inside a block, glued together by the hardware. The warp size has been 32 since the original Tesla architecture in 2006 — do not design code around any other warp size. The first 32 threads in your block (threadIdx 0–31) form warp 0; the next 32 form warp 1; and so on. A warp is the unit of scheduling and execution — the GPU never schedules an individual thread; it schedules a warp.
Always choose blockDim as a multiple of 32. If you use 33 threads per block, the hardware still allocates a full second warp with 31 lanes masked off — you pay the cost of 64 threads to do the work of 33.
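The cost of an awkward block size is easy to quantify. This sketch counts the warps a block occupies and the lanes that sit masked off; `warp_usage` is an illustrative helper, not a CUDA API.

```python
import math

WARP_SIZE = 32


def warp_usage(block_dim: int) -> tuple:
    """Return (warps allocated, lanes allocated, idle lanes) for one block."""
    warps = math.ceil(block_dim / WARP_SIZE)
    lanes = warps * WARP_SIZE
    return warps, lanes, lanes - block_dim


print(warp_usage(33))   # a second warp is allocated for a single extra thread
print(warp_usage(256))  # a multiple of 32 wastes nothing
```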
Block (a.k.a. Cooperative Thread Array, CTA)
A group of threads guaranteed to run on the same SM at the same time. This co-residency is what enables the two superpowers of a block:
- Shared memory — a small (up to ~228 KB per SM on Hopper, configurable), very fast on-chip scratchpad that all threads in the block can read and write. Access latency is tens of cycles, versus hundreds for HBM.
- Barrier sync — __syncthreads() makes every thread in the block wait at that line until all the others arrive. Impossible to do across blocks; trivial inside one.
Max block size is 1,024 threads (= 32 warps). A typical good block size is 128, 256, or 512 threads.
Grid
The complete set of blocks launched by a single kernel call. Can be 1D, 2D, or 3D — just a convenience for thinking about the index space. Blocks in the same grid have no direct way to communicate or synchronize during the kernel (other than through global memory atomics or kernel completion). This independence is the source of the GPU's scalability: if your grid has 10,000 blocks and your GPU has 132 SMs, the scheduler just keeps feeding blocks to whichever SM is free.
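One way to picture the scheduler feeding blocks to free SMs is to count "waves": how many rounds it takes to drain the grid if every SM holds a fixed number of blocks at once. This is a simplified model, and the 8 resident blocks per SM used below is an assumption for illustration; the real figure depends on each kernel's register and shared-memory usage.

```python
import math


def waves(grid_blocks: int, sms: int, blocks_per_sm: int) -> int:
    """Rounds needed to run the whole grid if every SM is kept full."""
    blocks_in_flight = sms * blocks_per_sm
    return math.ceil(grid_blocks / blocks_in_flight)


def last_wave_fill(grid_blocks: int, sms: int, blocks_per_sm: int) -> float:
    """Fraction of the GPU that the final, partial wave keeps busy."""
    blocks_in_flight = sms * blocks_per_sm
    remainder = grid_blocks % blocks_in_flight
    return 1.0 if remainder == 0 else remainder / blocks_in_flight


# 10,000 blocks on a 132-SM GPU, assuming 8 resident blocks per SM.
print(waves(10_000, 132, 8))
print(round(last_wave_fill(10_000, 132, 8), 2))
```

A partially filled last wave (the "tail effect") matters most when the grid is only a few waves long; with thousands of blocks it is usually negligible.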
Thread Block Cluster (Hopper+)
A new 5th level introduced in 2022 with the H100. A cluster is a group of thread blocks (up to 8 portably, 16 with a non-portable opt-in on H100) that the hardware guarantees will be running concurrently on SMs within the same GPC, so they can directly read each other's shared memory (called distributed shared memory or DSMEM). This is a big deal for kernels that need cooperation across more threads than fit in a single block, like large GEMM tiles and attention kernels.
Think of the grid as a whole skyscraper full of project teams. Each block is one team in one room (one SM) — they can pass paper around their conference table (shared memory) and shout "wait!" to each other (barrier sync). Each warp is a 32-person sub-pod in the team marching in lockstep. The thread is one person at their desk. The cluster is a few neighboring teams that have a fast internal phone line (DSMEM) to coordinate without going to the lobby (HBM).
06. SIMT — how 32 threads share one instruction
NVIDIA's name for their execution model is SIMT — Single Instruction, Multiple Threads. It's a clever twist on the older SIMD (Single Instruction, Multiple Data) model used by CPU vector units.
SIMD: one program, one vector register
You write code that explicitly says "operate on this 512-bit register containing 16 floats". Branches and data divergence are your problem — the compiler / programmer must arrange the data to fit the lane structure.
SIMT: 32 programs, hardware fuses them
You write code as if 32 independent threads each had their own logic. The hardware discovers at runtime that they share a PC and fuses their execution into one warp instruction. If they diverge (different branches), the hardware splits them and runs each path serially with the inactive lanes masked off.
Warp divergence
When threads in a warp take different paths through a branch, the warp diverges. The hardware handles this gracefully but at a real cost. Suppose an if/else sends half of the 32 lanes down Path A and the other half down Path B.
The hardware runs Path A first with 16 lanes active and 16 masked off; then runs Path B with the opposite mask. Total cost: the sum of both paths, so up to 2× the time of a non-divergent warp, with half the lanes idle throughout. The lesson: design data layouts so consecutive threads do similar work. If lane 0's job is fundamentally different from lane 1's, you're fighting the architecture.
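The cost model described above can be written down in a few lines. This is a deliberately simplified sketch: it assumes each distinct path taken by any lane is executed once, back to back, and it ignores scheduling details. The helper names are made up for the example.

```python
def divergent_issue_slots(lane_paths: list, path_cost: dict) -> int:
    """Instruction slots the warp spends: every distinct path taken runs serially."""
    return sum(path_cost[path] for path in set(lane_paths))


def lane_utilization(lane_paths: list, path_cost: dict) -> float:
    """Useful lane-instructions divided by lane-instructions paid for."""
    useful = sum(path_cost[path] for path in lane_paths)
    paid = len(lane_paths) * divergent_issue_slots(lane_paths, path_cost)
    return useful / paid


cost = {"A": 10, "B": 10}           # assume both paths are 10 instructions long
uniform = ["A"] * 32                # every lane takes the same branch
split = ["A"] * 16 + ["B"] * 16     # half the warp takes each branch

print(divergent_issue_slots(uniform, cost), lane_utilization(uniform, cost))
print(divergent_issue_slots(split, cost), lane_utilization(split, cost))
```

The split case spends twice the issue slots at half the lane utilization, which is the 2× penalty from the example.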
Pre-Volta hardware kept one program counter per warp and reconverged threads automatically at the end of a divergent region. Volta added per-thread program counters (independent thread scheduling), allowing threads in a warp to make progress independently across long divergent code. The cost is that code can no longer assume the lanes of a warp are in lockstep: wherever it needs them together, it must say so explicitly with __syncwarp() or the *_sync warp intrinsics. This makes complex inter-thread coordination patterns possible (mutexes inside a warp, fine-grained producer-consumer), with the trade-off that warp-synchronous code that silently relied on lockstep execution on Pascal can now race or hang.
07. Warp scheduling & latency hiding — the GPU's secret weapon
Here is the most beautiful idea in GPU architecture: the GPU is happy to ignore memory latency because it always has other work to do. A CPU sees a cache miss and stalls. A GPU sees a cache miss and just runs a different warp.
- An SM holds many resident warps. Up to 64 warps (= 2,048 threads) per SM on Hopper/Blackwell. Each scheduler tracks 16 of them. Crucially, this is far more threads than there are CUDA cores in the SM; that oversubscription is the whole point.
- Every clock, the scheduler picks 1 "ready" warp. A warp is ready when its next instruction has all its operands available (registers ready, no pending memory load, no barrier blocking it). The scheduler dispatches that warp's instruction to the appropriate execution unit.
- Stalled warps cost nothing in execution terms. When a warp issues a load from HBM and has to wait ~400 ns for the data, the scheduler just picks a different ready warp. The stalled warp's registers still occupy space in the register file, but no execution units are wasted.
- High occupancy = good latency hiding. If each scheduler has plenty of resident warps to choose from (say 32 or more per SM, which is at least half of each scheduler's 16 slots), the chance that at least one is ready on any given cycle is high, and the GPU can run close to peak throughput. Low occupancy (few warps per SM) is one of the most common performance problems.
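Occupancy is just a resource-fitting exercise, so it can be estimated by hand. The sketch below is a rough model using the Hopper limits quoted in this guide (64 warps, 32 blocks, 65,536 registers, and 228 KB of shared memory per SM). It is an assumption-laden simplification: real hardware allocates registers and shared memory in granules and reserves a little shared memory per block, so NVIDIA's own occupancy tools can report slightly lower numbers.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64          # Hopper limit quoted in the text
MAX_BLOCKS_PER_SM = 32         # Hopper limit quoted in the text
REGISTERS_PER_SM = 65_536      # 32-bit registers
SHARED_MEM_PER_SM = 228 * 1024  # bytes; allocation granularity is ignored here


def occupancy(threads_per_block: int, regs_per_thread: int, smem_per_block: int):
    """Return (resident blocks, resident warps, fraction of max warps) per SM."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = (SHARED_MEM_PER_SM // smem_per_block) if smem_per_block else MAX_BLOCKS_PER_SM
    blocks = min(by_warps, by_regs, by_smem, MAX_BLOCKS_PER_SM)
    warps = blocks * warps_per_block
    return blocks, warps, warps / MAX_WARPS_PER_SM


print(occupancy(256, 32, 0))          # light kernel: the warp limit is the cap
print(occupancy(256, 128, 0))         # register-hungry kernel
print(occupancy(256, 32, 64 * 1024))  # shared-memory-hungry kernel
```

The second and third cases show the two usual culprits behind low occupancy: too many registers per thread, or too much shared memory per block.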
A CPU spends massive transistor budgets on cache hierarchies and out-of-order execution to avoid memory stalls. A GPU just runs other work during the stall. Both approaches hide latency — but the GPU's solution is far cheaper per transistor, which is why a GPU can pack so many more arithmetic units onto a die.
08. The memory hierarchy — from registers to HBM
Almost every GPU performance problem is, ultimately, a memory problem. The arithmetic units are insanely fast; the memory has to keep up. Understanding the hierarchy — six levels, each ~5–30× slower than the one above — is essential.
Level by level
Registers
Per-thread storage. 256 KB total per SM (65,536 × 32-bit), partitioned across all resident threads. The fastest memory there is. The compiler decides what lives here.
L1 cache / shared memory
One combined on-chip memory per SM: 256 KB on H100, of which up to 228 KB can be configured as shared memory. The split between "L1 cache" (hardware-managed) and "shared memory" (programmer-managed) is configurable. Shared memory is the GPU's killer feature for cooperative algorithms.
Tensor Memory (TMEM)
256 KB of Tensor Memory per SM, brand new in Blackwell. A dedicated buffer that holds Tensor Core accumulator state. Frees up shared memory for other uses during heavy matmul kernels.
L2 cache
Shared across the entire GPU. 50 MB on H100, 126 MB on B200, 128 MB on RTX PRO 6000. Massive last-level cache that helps amortize HBM bandwidth across SMs. Hopper added L2 residency hints.
HBM (device memory)
The GPU's main RAM. High Bandwidth Memory: stacks of DRAM dies placed next to the GPU die on a silicon interposer inside the same package. 3.35 TB/s on H100, 4.8 TB/s on H200, 8 TB/s on B200. The bandwidth here determines how fast your kernels can really go.
Host memory
Regular DDR5 attached to the CPU. Reaching it means crossing PCIe Gen5 (~64 GB/s) or NVLink-C2C (Grace Hopper, ~450 GB/s per direction). Avoid touching it during a kernel; transfer data upfront and keep it on the GPU.
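To get a feel for why the bottom of the hierarchy hurts so much, it helps to turn the bandwidth figures into time. This sketch uses the peak numbers quoted above and ignores latency and protocol overhead, so treat the results as best-case lower bounds rather than measurements.

```python
# Peak bandwidths in GB/s, taken from the figures quoted in the text.
BANDWIDTH_GB_S = {
    "HBM3 (H100)": 3350,
    "HBM3e (B200)": 8000,
    "NVLink-C2C": 450,
    "PCIe Gen5 x16": 64,
}


def transfer_ms(gigabytes: float, link: str) -> float:
    """Best-case time in milliseconds to move `gigabytes` over `link`."""
    return gigabytes / BANDWIDTH_GB_S[link] * 1000.0


for link in BANDWIDTH_GB_S:
    print(f"1 GB over {link}: {transfer_ms(1, link):.3f} ms")
```

Streaming the same gigabyte takes roughly fifty times longer over PCIe than from HBM on an H100, which is the arithmetic behind "transfer upfront and keep it on the GPU".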
Coalesced access — the #1 optimization
When a warp issues a global memory load, the hardware coalesces the 32 thread requests into the smallest set of 32-, 64-, or 128-byte transactions possible. The best case is when the 32 threads read 32 consecutive 4-byte elements: one 128-byte transaction serves the whole warp.
The bandwidth difference between this coalesced pattern and a strided or scattered one, where every thread lands in a different 128-byte segment, can easily be 30×. Most "my GPU kernel is slow" problems are non-coalesced loads in disguise.
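The effect of the access pattern can be estimated by counting how many 128-byte segments one warp touches. The model below is a simplification (real hardware works in 32-byte sectors and caches can soften the penalty), and `segments_touched` is a made-up helper, but it captures why stride matters.

```python
def segments_touched(stride_elems: int, elem_bytes: int = 4,
                     segment_bytes: int = 128, base: int = 0) -> int:
    """Distinct aligned segments hit when 32 lanes read with a fixed stride."""
    addresses = [base + lane * stride_elems * elem_bytes for lane in range(32)]
    return len({addr // segment_bytes for addr in addresses})


print(segments_touched(1))    # consecutive floats: the ideal case
print(segments_touched(2))    # every other float
print(segments_touched(32))   # e.g. walking down a column of a 32-wide row-major matrix
```

With a stride of 32 floats the warp needs 32 separate transactions to fetch the same 128 useful bytes, which is where a slowdown in the region of 30× comes from.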
09. Tensor Cores — the matrix-multiply specialists
Introduced with Volta in 2017, the Tensor Core is arguably the single most important hardware unit of the AI era. Where a CUDA core does one multiply-add per cycle, a Tensor Core does an entire small matrix multiply per cycle. The throughput-per-transistor difference is enormous, which is why AI on GPUs took off the moment Tensor Cores arrived.
Tensor Core generations
| Gen | First arch | Year | New formats | Notable feature |
|---|---|---|---|---|
| 1st | Volta (V100) | 2017 | FP16 · FP32 accum | The original; 4×4×4 MMA |
| 2nd | Turing (T4 / RTX 20) | 2018 | + INT8, INT4 | Consumer Tensor Cores |
| 3rd | Ampere (A100 / RTX 30) | 2020 | + TF32, BF16, sparsity | 2:4 structured sparsity 2× |
| 4th | Hopper (H100 / H200) | 2022 | + FP8 (E4M3, E5M2) | Transformer Engine |
| 5th | Blackwell (B200 / B300 / RTX PRO) | 2024–25 | + FP6, FP4 (NVFP4) | 2nd-gen Transformer Engine, TMEM |
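The 4×4×4 MMA listed for the first-generation Tensor Core computes D = A·B + C on 4×4 tiles. Spelling that out in plain Python shows how much scalar work one such operation replaces. This is only a reference loop for counting multiply-adds; it says nothing about how the hardware is actually wired.

```python
def mma_4x4x4(a, b, c):
    """Reference D = A @ B + C on 4x4 tiles, counting scalar multiply-adds."""
    n = 4
    d = [[c[i][j] for j in range(n)] for i in range(n)]  # start from the accumulator
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # one fused multiply-add
                fma_count += 1
    return d, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
tile = [[float(4 * i + j) for j in range(4)] for i in range(4)]

result, fmas = mma_4x4x4(identity, tile, zeros)
print(fmas)  # scalar FMAs folded into a single tile operation
```

A CUDA core would need 64 separate FMA instructions for the same tile, which is the gap the Tensor Core closes.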
The precision menu
FP32
8-bit exponent, 23-bit mantissa. The classic "float". Used as the high-precision accumulator for Tensor Core outputs.
TF32
8-bit exponent (FP32 range), 10-bit mantissa (FP16 precision). Trades 13 bits of mantissa for much higher Tensor Core throughput on FP32 data, with no code changes. Commonly enabled for FP32 training on A100 and later.
BF16
8-bit exponent, 7-bit mantissa. FP32-like range, ~3-digit precision. Almost universally the format used for training large LLMs since 2022.
FP16
5-bit exponent, 10-bit mantissa. Smaller range than BF16, so it can overflow during training without care. Still widely used for inference.
FP8
Two variants: E4M3 (better precision, narrower range) and E5M2 (wider range, less precision). 2× throughput over FP16. Standard for state-of-the-art inference.
FP4 / NVFP4
4 bits per number. Two-level scaling (per-block FP8 scale + per-tensor FP32 scale) preserves dynamic range. 2× throughput over FP8. Blackwell's headline inference format.
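The range differences between these formats follow directly from the exponent and mantissa widths. The sketch below computes the storage width and largest finite value for formats that follow the IEEE-style convention (top exponent code reserved for infinity and NaN). FP8 E4M3 is left out on purpose because it deviates from that convention to reach a maximum of 448, and the 4-bit formats depend on their scale factors.

```python
# (exponent bits, mantissa bits) as listed in the precision menu.
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "BF16": (8, 7),
    "FP16": (5, 10),
    "FP8 E5M2": (5, 2),
}


def total_bits(exp_bits: int, man_bits: int) -> int:
    """Sign bit + exponent + mantissa."""
    return 1 + exp_bits + man_bits


def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value, assuming the all-ones exponent is reserved (IEEE style)."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


for name, (e, m) in FORMATS.items():
    print(f"{name:9s} {total_bits(e, m):2d} bits  max ~ {max_finite(e, m):.4g}")
```

FP16 tops out at 65,504 while BF16 reaches roughly the same range as FP32, which is why FP16 training needs care with overflow and BF16 mostly does not.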
Throughput in numbers
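| Format | H100 SXM5 | B200 | RTX PRO 6000 |
|---|---|---|---|
| FP32 (CUDA cores) | 67 TFLOPS | ~80 TFLOPS | ~125 TFLOPS |
| TF32 (sparse) | 989 TFLOPS | 2.2 PFLOPS | ~500 TFLOPS |
| BF16 / FP16 (sparse) | 1.98 PFLOPS | 4.5 PFLOPS | ~1 PFLOPS |
| FP8 (sparse) | 3.96 PFLOPS | 9 PFLOPS | ~2 PFLOPS |
| FP4 / NVFP4 (sparse) | — | 18 PFLOPS | ~4 PFLOPS |

The Tensor Core rows are NVIDIA's headline peak figures, which assume 2:4 structured sparsity; dense throughput is roughly half of each value. The FP32 row is plain CUDA-core throughput, where sparsity does not apply.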
Transformer Engine: introduced with Hopper, this is an NVIDIA software library plus hardware support that dynamically chooses FP8 vs FP16 per layer during training. It tracks tensor statistics and decides which precision and scaling keep quality without overflow. The 2nd-gen Transformer Engine on Blackwell extends this to FP4/NVFP4. This is one big reason FP8 training "just works" on Hopper while older hardware needed manual loss scaling.
10. RT Cores — ray tracing in silicon
Introduced with Turing in 2018, RT Cores are dedicated hardware for two operations: ray/box intersection (BVH traversal) and ray/triangle intersection. Without them, ray tracing requires hundreds of CUDA-core ops per ray; with them, it's a handful of cycles per intersection test.
Generations
- 1st gen (Turing, 2018) — the original. Real-time ray tracing in games becomes feasible.
- 2nd gen (Ampere, 2020) — 2× ray-triangle throughput, motion blur acceleration.
- 3rd gen (Ada, 2022) — another 2× ray-triangle; opacity micromaps; displaced micro-meshes.
- 4th gen (Blackwell, 2024) — 2× again on triangle intersect; RTX Mega Geometry; linear swept spheres for hair/foliage.
RT Cores are only present in RTX-branded parts (consumer RTX, Workstation RTX PRO). Data-center cards like A100, H100, B200 have no RT Cores — they're not designed for graphics.
11. Eighteen years of NVIDIA architectures
The story of NVIDIA's architectural evolution is the story of modern parallel computing. Each generation has typically delivered ~2× throughput per dollar, while adding one major new capability that reshaped what GPUs could do. Here's the full timeline.
| Arch | Year | Flagship | Process | What it introduced |
|---|---|---|---|---|
| Tesla | 2006 | G80 / 8800 GTX | 90nm | The original CUDA-capable GPU. Unified shaders. |
| Fermi | 2010 | GF100 / GTX 480 | 40nm | True caches (L1/L2), ECC, double-precision math, IEEE-754. |
| Kepler | 2012 | GK110 / Tesla K20 | 28nm | Dynamic parallelism, Hyper-Q, GPUDirect RDMA. |
| Maxwell | 2014 | GM200 / Titan X | 28nm | Massive efficiency gains. Dedicated shared memory; L1 merged with the texture cache. |
| Pascal | 2016 | GP100 / P100 | 16nm | HBM2, NVLink 1.0, FP16 throughput, unified memory. |
| Volta | 2017 | GV100 / V100 | 12nm | Tensor Cores (1st gen). Per-thread program counters. The AI era begins. |
| Turing | 2018 | TU102 / RTX 2080 Ti | 12nm | RT Cores, 2nd-gen Tensor Cores. DLSS 1.0. |
| Ampere | 2020 | GA100 / A100 | 7nm | 3rd-gen Tensor (TF32, BF16, 2:4 sparsity), MIG partitioning, 2nd-gen RT. |
| Ada Lovelace | 2022 | AD102 / RTX 4090 | 4N | 3rd-gen RT, DLSS 3 Frame Generation. Consumer, workstation, and inference cards; no HBM training flagship. |
| Hopper | 2022 | GH100 / H100 | 4N | 4th-gen Tensor (FP8), Transformer Engine, TMA, DPX, Thread Block Clusters. |
| Hopper (refresh) | 2024 | GH100 / H200 | 4N | Same die and SM as H100; HBM3e: 141 GB · 4.8 TB/s. Inference-tuned. |
| Blackwell | 2024 | GB200 / B200 | 4NP | 5th-gen Tensor (FP4/FP6), TMEM, dual-die NV-HBI, NVLink 5.0, Decompression Engine. |
| Blackwell Ultra | 2025 | GB300 / B300 | 4NP | 288 GB HBM3e, 15 PFLOPS FP4 dense, 1100 W TDP. |
| Blackwell (consumer/pro) | 2025 | GB202 / RTX PRO 6000 | 4NP | 4th-gen RT, neural shaders, DLSS 4, 96 GB GDDR7. RTX 50-series silicon. |
12. H100 deep dive — the workhorse of the AI boom
Announced March 2022. From early 2023 through 2025 the H100 was, by revenue and shipments, the most successful AI accelerator in history. Many frontier labs trained flagship models on clusters of tens of thousands of H100s. If you've used a large language model in the last two years, there's a good chance an H100 was involved in training or serving it.
Architecture
The GH100 die has 144 SMs across 8 GPCs, fabricated on TSMC's 4N process: 80 billion transistors on an 814 mm² die. The shipping H100 SXM5 SKU enables 132 SMs for yield, giving 16,896 CUDA cores and 528 fourth-generation Tensor Cores. PCIe-form-factor H100 is more cut down at 114 SMs.
Each SM follows the standard four-partition layout we covered in Block 03, with 128 CUDA cores, 4 Tensor Cores, 256 KB of combined L1/shared memory (up to 228 KB usable as shared memory), and the new TMA (Tensor Memory Accelerator) for async global-to-shared transfers. The Tensor Cores are 4th gen, adding native FP8 (E4M3 and E5M2), which delivers 2× the throughput of FP16 with surprisingly graceful accuracy.
Memory & Interconnect
80 GB of HBM3 at 3.35 TB/s from five active stacks on a 5,120-bit memory bus. 50 MB of L2 cache. 18 NVLink 4.0 links delivering 900 GB/s of GPU-to-GPU bandwidth, the foundation of every 8-GPU DGX node and HGX server.
The features that matter
- FP8 Tensor Cores + Transformer Engine — up to 2× the Tensor Core throughput of FP16, usable for training as well as inference.
- Thread Block Clusters & DSMEM — cross-SM shared memory for huge GEMM/attention tiles.
- TMA — async DMA between HBM and shared memory; frees CUDA cores from address arithmetic.
- DPX instructions — hardware acceleration for dynamic programming (genomics, route planning).
- Confidential Computing — trusted execution environment with attestation.
The H200 refresh (2024) keeps the exact same GH100 die but pairs it with 141 GB of HBM3e at 4.8 TB/s. Same compute; 1.4× the bandwidth and 1.76× the capacity. A targeted inference upgrade.
13. B200 deep dive — the dual-die jump
Announced March 2024, shipping in volume through 2025. Blackwell B200 is the most ambitious single-package GPU NVIDIA has ever built — two reticle-limit dies bonded into one logical GPU, with a new die-to-die fabric (NV-HBI) delivering 10 TB/s between them. From software's perspective it presents as one device.
Architecture
Each die is fabricated on TSMC 4NP and holds 104 B transistors. The complete B200 package is 208 B transistors, 2.6× as many as H100. The shipping configuration enables 148 SMs across 8 GPCs, holding 18,944 CUDA cores and 592 fifth-generation Tensor Cores.
The SM has been heavily reworked: the 5th-gen Tensor Core adds FP4 and FP6 formats, including NVFP4 with its two-level micro-block scaling. A new Tensor Memory (TMEM) — 256 KB per SM — serves as the dedicated accumulator buffer for Tensor Core output, freeing up shared memory. Async copy paths through the Tensor Memory Accelerator are also faster.
Memory & Interconnect
192 GB of HBM3e across 8 stacks delivering a staggering 8 TB/s. The L2 cache grows to 126 MB. NVLink 5.0 doubles bandwidth to 1.8 TB/s per GPU, and the new NVLink Switch System lets racks scale to 72 fully-connected GPUs (GB200 NVL72) where the entire 72-GPU domain looks like one giant accelerator.
New engines
- 2nd-gen Transformer Engine — native FP4 / NVFP4 path for both training and inference.
- Decompression Engine — hardware decompression (LZ4, Snappy, Deflate) for ETL/analytics workloads.
- RAS Engine — predictive reliability monitoring across the 208 B transistors.
- Confidential Computing v2 — extended trust model with TEE-encrypted NVLink.
The Blackwell Ultra refresh (B300, 2025) pushes further: 288 GB HBM3e, 15 PFLOPS dense FP4, 1100 W TDP. Same architecture, more of everything.
14. RTX PRO 6000 Blackwell — the workstation flagship
The RTX PRO 6000 Blackwell Workstation Edition (2025) is the highest-end professional graphics card NVIDIA ships, the spiritual successor to the Quadro RTX line. It's built on the same GB202 silicon as the consumer RTX 5090, but with nearly the full chip enabled, 3× the VRAM (96 GB versus 32 GB), ECC memory, and certified workstation drivers.
Architecture
GB202 is a single die (unlike the dual-die B200) and carries Blackwell's 5th-gen Tensor Cores with full FP4 support. Unlike B200 it also includes 4th-gen RT Cores; data-center Blackwell (B200) has none. The RTX PRO 6000 enables almost the full die (188 of GB202's 192 SMs): 188 SMs · 24,064 CUDA cores · 752 Tensor Cores · 188 RT Cores.
Memory & Interconnect
96 GB of GDDR7 with ECC on a 512-bit bus delivering 1.8 TB/s. No HBM — GDDR7 is cheaper and still very fast. No NVLink either — this is a single-GPU workstation card. PCIe Gen5 x16 (~64 GB/s) to the host.
Who it's for
The 96 GB capacity is the killer feature: you can fit Llama 3 70B at FP4, or fine-tune mid-sized models, or run heavy 3D / simulation / video production tools, on a single workstation. List price is around USD 8,500, a fraction of an H100's price, with comparable inference performance for smaller models. For solo AI developers and creative pros, it's currently the most capable single workstation GPU on Earth.
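Whether a model "fits" is mostly a bytes-per-parameter calculation. The sketch below counts weights only; it deliberately ignores the KV cache, activations, and the small overhead of FP4 scale factors, so treat a result close to the limit as not fitting in practice. The helper names are illustrative.

```python
# Storage cost per parameter for each weight format, in bytes.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weights_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB (1 billion params x bytes per param = GB)."""
    return params_billion * BYTES_PER_PARAM[fmt]


def fits(params_billion: float, fmt: str, vram_gb: float = 96.0) -> bool:
    """True if the weights alone fit in the given VRAM (weights-only estimate)."""
    return weights_gb(params_billion, fmt) <= vram_gb


for fmt in BYTES_PER_PARAM:
    print(f"70B at {fmt}: {weights_gb(70, fmt):.0f} GB, fits in 96 GB: {fits(70, fmt)}")
```

At FP4 a 70B-parameter model needs about 35 GB for weights, leaving most of the 96 GB for the KV cache and activations; at FP16 the same model does not fit at all.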
15. Side-by-side — the five GPUs in one table
Pin this table. Most of the time, choosing the right NVIDIA GPU for a given workload is just a matter of reading down one column.
| Spec | H100 SXM5 | H200 SXM | B200 | B300 (Ultra) | RTX PRO 6000 BW |
|---|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra | Blackwell (workstation) |
| Release | 2022 | 2024 | 2024 | 2025 | 2025 |
| Process | TSMC 4N | TSMC 4N | TSMC 4NP | TSMC 4NP | TSMC 4NP |
| Transistors | 80 B | 80 B | 208 B (2-die) | 208 B (2-die) | ~92 B |
| SMs | 132 | 132 | 148 | 148 | 188 |
| CUDA cores | 16,896 | 16,896 | ~18,944 | ~18,944 | 24,064 |
| Tensor Cores | 528 · 4th | 528 · 4th | 592 · 5th | 592 · 5th | 752 · 5th |
| RT Cores | — | — | — | — | 188 · 4th |
| FP32 peak | 67 TFLOPS | 67 TFLOPS | ~80 TFLOPS | ~85 TFLOPS | ~125 TFLOPS |
| BF16/FP16 (sparse) | 1.98 PF | 1.98 PF | 4.5 PF | 5 PF | ~1 PF |
| FP8 (sparse) | 3.96 PF | 3.96 PF | 9 PF | 10 PF | ~2 PF |
| FP4 / NVFP4 (sparse) | — | — | 18 PF | 15 PF (dense) | ~4 PF |
| Memory capacity | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e | 96 GB GDDR7 ECC |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | 1.8 TB/s |
| L2 cache | 50 MB | 50 MB | 126 MB | 126 MB | 128 MB |
| NVLink | 900 GB/s (4.0) | 900 GB/s (4.0) | 1.8 TB/s (5.0) | 1.8 TB/s (5.0) | None |
| Host interconnect | PCIe Gen5 x16 | PCIe Gen5 x16 | PCIe Gen5 x16 | PCIe Gen5 x16 | PCIe Gen5 x16 |
| TDP | 700 W | 700 W | 1000 W | 1100 W | 600 W |
| Form factor | SXM5 | SXM | SXM (HGX) | SXM (HGX) | Dual-slot PCIe |
| Cooling | Air or liquid | Air or liquid | Liquid | Liquid | Air or liquid |
16. NVLink & interconnect — scaling beyond one GPU
For models that exceed one GPU's memory, or training runs that need thousands of GPUs cooperating tightly, the interconnect matters as much as the GPU itself. NVIDIA's bet has been to treat the interconnect as part of the architecture, not an afterthought. NVLink, NVSwitch, and the NVLink Switch System are the result.
PCIe Gen5 x16
~64 GB/s. The standard way the GPU talks to the host CPU. Adequate for batched data movement and storage I/O; far too slow for fine-grained collective ops across GPUs.
NVLink 4.0 (Hopper)
900 GB/s per GPU, 18 links of 50 GB/s each. Connects the 8 GPUs in an HGX H100 baseboard through NVSwitches into an all-to-all topology.
NVLink 5.0 (Blackwell)
1.8 TB/s per GPU, exactly 2× NVLink 4.0. 18 links × 100 GB/s. Foundation of the NVL72 rack-scale system.
NVSwitch
A dedicated networking ASIC that creates a non-blocking fabric between 8 (Hopper) or up to 72 (Blackwell) GPUs. Every GPU sees every other GPU at full NVLink bandwidth.
GB200 NVL72
Blackwell-era rack architecture. 72 B200 GPUs in one liquid-cooled rack, fully NVLink-connected at 1.8 TB/s each. Looks like a single 13.8 TB GPU. Designed for trillion-parameter training and inference.
InfiniBand / Ethernet
~400–800 Gb/s per port. Connects racks together into a SuperPOD: 10,000+ GPUs cooperating on a single training job. RDMA enables direct GPU-to-GPU writes across nodes.
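The headline interconnect numbers are simple products, and checking them is a good habit when reading vendor material. This sketch recomputes per-GPU NVLink bandwidth and the pooled NVL72 memory from the figures above, then compares the best-case time to move a payload over NVLink and PCIe. The 10 GB payload is an arbitrary illustrative size.

```python
def nvlink_gb_s(links: int, gb_s_per_link: int) -> int:
    """Aggregate NVLink bandwidth per GPU."""
    return links * gb_s_per_link


nvlink4 = nvlink_gb_s(18, 50)    # Hopper
nvlink5 = nvlink_gb_s(18, 100)   # Blackwell

# Pooled HBM across an NVL72 rack: 72 GPUs x 192 GB each.
nvl72_memory_tb = 72 * 192 / 1000


def best_case_ms(gigabytes: float, gb_s: float) -> float:
    """Transfer time ignoring latency and protocol overhead."""
    return gigabytes / gb_s * 1000.0


payload_gb = 10  # illustrative size only
print(nvlink4, nvlink5, nvl72_memory_tb)
print(best_case_ms(payload_gb, nvlink5), best_case_ms(payload_gb, 64))
```

Moving the same payload takes about 28 times longer over PCIe Gen5 than over NVLink 5.0, which is why collectives across GPUs stay on NVLink whenever the topology allows it.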
Modern training relies on heavy collective communication (all-reduce, all-gather) every step. At trillion-parameter scale, every gradient update is gigabytes of cross-GPU traffic. AMD and Intel have competitive raw FLOPS, but no one ships an interconnect like NVLink Switch + NVSwitch at NVIDIA's scale. This is why NVIDIA still owns the data center: it's not just the GPU, it's the whole rack.
17. CUDA programming — tying it all together
We've covered the hardware, the execution model, the memory hierarchy, and the specialty units. Let's end with a concrete look at the software side: what a CUDA kernel actually looks like, and how the layered software stack maps your Python down to the SASS instructions an SM executes.
A real kernel: SAXPY
SAXPY (Single-precision A·X Plus Y) is the "hello, world" of CUDA: Y = a·X + Y where X and Y are large arrays and a is a scalar. It's the simplest non-trivial kernel that exercises the full system.
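To show the shape of the computation without needing a GPU, here is a plain-Python emulation of a SAXPY launch. It is a sketch, not CUDA: the two nested loops stand in for the grid of blocks and the threads inside each block, and the 256 threads per block is an assumption chosen so that roughly one million elements need 4,096 blocks.

```python
import math


def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Body that each emulated thread runs; mirrors a typical CUDA SAXPY kernel."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < n:                               # guard: the last block may overshoot n
        y[i] = a * x[i] + y[i]


def launch_saxpy(n, a, x, y, threads_per_block=256):
    """Emulates kernel<<<blocks, threads>>>; returns the number of blocks launched."""
    blocks = math.ceil(n / threads_per_block)
    for block_idx in range(blocks):                  # the GPU runs these in parallel
        for thread_idx in range(threads_per_block):  # and these in lockstep warps
            saxpy_kernel(block_idx, threads_per_block, thread_idx, n, a, x, y)
    return blocks


n = 1000
x = [1.0] * n
y = [2.0] * n
blocks_used = launch_saxpy(n, 3.0, x, y)
print(blocks_used, y[0], y[-1])

# Grid size for about one million elements at 256 threads per block.
print(math.ceil((1 << 20) / 256))
```

The bounds check matters: with n = 1000 the launch rounds up to 4 blocks of 256 threads, so the last 24 emulated threads must do nothing.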
What the GPU does with this
- Host submits a launch. The <<<blocks, threads>>> syntax compiles to a runtime call that queues a kernel launch on the GPU's command stream. The CPU returns immediately; the GPU runs asynchronously.
- Block scheduler distributes work. The 4,096 blocks are dispatched to the 132 SMs of an H100. Each SM holds multiple blocks resident at once based on resource usage; the rest wait in queue and stream in as earlier blocks finish.
- Warps form and execute. Within each block, threads 0–31 form warp 0, 32–63 form warp 1, and so on. The hardware schedules and executes warps, not individual threads.
- Warps run SIMT. Inside each SM, a warp scheduler picks ready warps each cycle and issues them to the CUDA cores. With ~1 million threads and 132 SMs, every SM gets thousands of threads, which is plenty for latency hiding.
- Memory access is coalesced. Because consecutive threads read consecutive array indices, each warp's load from an array collapses into a single 128-byte transaction. The kernel does very little arithmetic per byte, so it is limited by memory and runs near peak HBM bandwidth.
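A back-of-the-envelope roofline estimate shows why SAXPY is bound by memory and not by arithmetic. The sketch assumes each element costs two 4-byte reads and one 4-byte write, and uses the H100 peak figures quoted in this guide (3.35 TB/s of HBM bandwidth, 67 TFLOPS FP32); real kernels land somewhat below both peaks.

```python
def saxpy_roofline(n: int, hbm_tb_s: float = 3.35, peak_fp32_tflops: float = 67.0):
    """Return (FLOPs per byte, best-case memory time, best-case compute time)."""
    bytes_moved = 3 * 4 * n   # read x[i], read y[i], write y[i]; 4 bytes each
    flops = 2 * n             # one multiply and one add per element
    intensity = flops / bytes_moved
    t_memory = bytes_moved / (hbm_tb_s * 1e12)
    t_compute = flops / (peak_fp32_tflops * 1e12)
    return intensity, t_memory, t_compute


intensity, t_memory, t_compute = saxpy_roofline(1 << 20)
print(f"arithmetic intensity: {intensity:.3f} FLOP/byte")
print(f"memory-bound time:  {t_memory * 1e6:.2f} us")
print(f"compute-bound time: {t_compute * 1e6:.3f} us")
```

The memory time is more than a hundred times the compute time, so the CUDA cores spend almost the whole kernel waiting on HBM. Coalescing is what lets the kernel approach that memory bound.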
CUDA software layers
CUDA Runtime API
The high-level C/C++ API (cudaMalloc, cudaMemcpy, <<<...>>>). What most users actually write. It sits on top of the driver API and sets up the driver context for you.
CUDA Driver API
Lower-level. Lets you load PTX/cubin modules dynamically, manage contexts, and do things the runtime can't. Used by frameworks like PyTorch internally.
PTX
"Parallel Thread Execution", NVIDIA's virtual ISA. Like LLVM IR for GPUs. NVCC emits PTX; the driver JITs it down to actual SASS instructions for the specific architecture.
SASS
The actual binary instructions the GPU executes. Architecture-specific (different for Hopper vs Blackwell). What you see in cuobjdump --dump-sass.
cuBLAS / cuDNN / CUTLASS
NVIDIA's hand-tuned libraries. cuBLAS = linear algebra; cuDNN = deep-learning primitives (convolutions, attention); CUTLASS = templatized GEMM kernels you can compose. PyTorch, JAX, and TensorRT call these internally.
Triton and other kernel DSLs
Modern compiler-driven layers that let you write GPU kernels in Python-ish DSLs and get near-cuBLAS performance. Triton was the breakthrough, and widely used FlashAttention implementations are written in it.
CUDA is not just an API. It's a complete ecosystem: hardware (SMs, Tensor Cores) + compiler (NVCC, PTX) + libraries (cuDNN, cuBLAS, CUTLASS) + frameworks (PyTorch, JAX, TensorRT) + cluster software (Magnum IO, NCCL). The hardware is impressive, but the moat NVIDIA has built around the hardware is even bigger.