CPU offloading is a memory optimization technique used in distributed training to manage GPU memory constraints by temporarily transferring (or "swapping") model states (parameters, gradients, or optimizer states) or activations from GPU memory to host CPU RAM. It is particularly useful when GPU memory is insufficient to hold all the data needed for training, even after applying other optimizations like sharding or activation checkpointing.
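As a concrete illustration, here is a minimal sketch of parameter offloading using PyTorch FSDP's `CPUOffload` flag. The tiny `Sequential` model, the batch size, and the `torchrun` launch are assumptions made for the example; in practice you would wrap your own large model the same way.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes a torchrun launch so RANK/LOCAL_RANK/WORLD_SIZE are set.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Stand-in model; in practice this would be your large network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# offload_params=True keeps each rank's parameter shard (and its
# gradients) in CPU memory, copying it to the GPU only when that
# shard is actually needed for computation.
model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

Under this setting, FSDP keeps each shard resident in CPU memory between uses and pages it onto the GPU just-in-time for the forward and backward passes.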
By offloading less frequently accessed data to the larger, albeit slower, CPU memory, the GPU can accommodate more critical data or larger batch sizes. The efficiency of CPU offloading heavily depends on overlapping the GPU-CPU data transfers with computation on the GPU, so that the communication latency is hidden and does not become a significant bottleneck. DeepSpeed's ZeRO-Infinity, for example, extends ZeRO Stage 3 by enabling offloading of both model states and activations to CPU memory (and even NVMe storage), allowing extremely large models to be trained on limited GPU resources. This technique leverages the fact that CPU RAM is often much larger (hundreds of GBs to TBs) than GPU memory (tens of GBs).
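Staying with the DeepSpeed example, here is a sketch of what a ZeRO Stage 3 configuration with CPU offloading can look like, using DeepSpeed's documented `offload_param` and `offload_optimizer` config keys; the model, batch size, and learning rate are placeholders, not recommendations.

```python
import torch
import deepspeed

# Stand-in model; in practice this would be your large network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# ZeRO Stage 3 with parameters and optimizer states offloaded to CPU.
# pin_memory=True requests page-locked host buffers so the GPU<->CPU
# copies can run asynchronously and overlap with GPU computation.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
}

# Assumes launch via the deepspeed (or torchrun) launcher.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

The `pin_memory` flags are what make the overlap described above possible, since only page-locked host memory supports asynchronous copies; swapping `"device": "cpu"` for `"device": "nvme"` (plus an `nvme_path`) is how ZeRO-Infinity extends offloading from CPU RAM to disk.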
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort starting September 1st. Sign up here for 35% off.