Why We Shard Models Across GPUs and Nodes
Why we can think of multiple GPUs as one big resource pool
Think of a GPU’s VRAM as a really fast, really tiny hard drive. A 70-billion-parameter model in half precision needs ~140 GB just to store the weights, plus hundreds of gigabytes more for gradients and optimizer states. That doesn’t fit on any single consumer card (or even most data-center cards). Sharding spreads those pieces across many GPUs and, when needed, across many nodes, so the whole model can live in memory at once.
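To make the arithmetic concrete, here's a quick back-of-the-envelope sketch, assuming Adam in mixed precision (roughly 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for the fp32 master weights plus Adam's momentum and variance); the exact totals depend on your optimizer and precision setup.

```python
# Back-of-the-envelope memory for a 70B-parameter model trained with Adam
# in mixed precision: 2 bytes/param for fp16 weights, 2 for fp16 gradients,
# and 12 for optimizer state (fp32 master weights + Adam momentum + variance).
PARAMS = 70e9

bytes_per_param = {
    "weights (fp16)": 2,
    "gradients (fp16)": 2,
    "optimizer states (fp32 master + momentum + variance)": 12,
}

total_gb = 0.0
for name, nbytes in bytes_per_param.items():
    gb = PARAMS * nbytes / 1e9
    total_gb += gb
    print(f"{name:<55s} ~{gb:6.0f} GB")
print(f"{'total':<55s} ~{total_gb:6.0f} GB")  # over 1 TB -- far beyond any single GPU
```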
What Is ZeRO
ZeRO (the Zero Redundancy Optimizer) is the trick DeepSpeed and PyTorch's FSDP use to make that sharding automatic and cheap. Instead of every GPU holding a full copy of the model weights, optimizer states, and gradients, ZeRO slices each of those tensors across the GPUs, and each device stores only the slice it actually needs for its part of the forward or backward pass.
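Here's a minimal sketch of what that looks like with PyTorch's FSDP, assuming a multi-GPU job launched with torchrun and a toy stand-in model; DeepSpeed exposes the same choice through the ZeRO stage settings in its JSON config.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes a multi-GPU launch, e.g. `torchrun --nproc_per_node=8 train.py`.
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A small stand-in model; in practice this would be your transformer.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# FULL_SHARD splits parameters, gradients, and optimizer states across ranks
# (ZeRO stage-3 behavior); SHARD_GRAD_OP keeps full parameters but shards
# gradients and optimizer states (stage-2-like).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD, device_id=local_rank)

# The optimizer only ever sees this rank's shard of the parameters,
# so its states are sharded too.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```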
The General Idea
ZeRO’s magic is “just-in-time assembly.” During training, every GPU keeps only a small fraction of the total data. When a layer’s weights are needed, ZeRO gathers the missing shards from the other GPUs (or from CPU/NVMe if you’re really pushed for space), assembles the full layer just long enough to run the math, then frees the assembled copy and, on the backward pass, scatters the resulting gradients back to their owners. No GPU ever holds a persistent full copy, so the effective model size scales with the number of devices instead of being capped by any single device.
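Here's a toy, single-process illustration of that gather-compute-free cycle, using NumPy arrays to stand in for GPU shards; real implementations do the gathering with all-gather collectives over NCCL and return gradients with reduce-scatter.

```python
import numpy as np

NUM_GPUS = 4
LAYERS = ["layer_0", "layer_1", "layer_2"]
FULL_SIZE = 1024  # weights per layer in this toy

# Each "GPU" permanently holds only 1/NUM_GPUS of every layer's weights.
shards = [
    {layer: np.random.randn(FULL_SIZE // NUM_GPUS).astype(np.float32) for layer in LAYERS}
    for _ in range(NUM_GPUS)
]

def forward_layer(layer: str, activations: np.ndarray) -> np.ndarray:
    # 1. All-gather: pull the other ranks' shards to rebuild the full layer.
    full_weights = np.concatenate([shards[rank][layer] for rank in range(NUM_GPUS)])
    # 2. Compute with the temporarily assembled weights (a stand-in op here).
    out = activations * full_weights.std()
    # 3. Free the assembled copy; only the local shard stays resident.
    del full_weights
    return out

x = np.ones(8, dtype=np.float32)
for layer in LAYERS:
    x = forward_layer(layer, x)
```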
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort starting September 1st. Sign up here for 35% off.