Mixed Precision Training (FP16, BF16, and FP8)
Why training in lower precision matters
Mixed precision training is a computational technique that accelerates deep learning model training and reduces memory footprint by using lower-precision numerical formats for parts of the computation, typically model parameters, activations, and gradients. The most common lower-precision formats are the 16-bit formats FP16 and Brain Floating-Point (BF16), and more recently 8-bit (FP8). BF16 has a dynamic range similar to FP32 but lower precision. The primary benefits of mixed precision training are reduced memory usage (16-bit values occupy half the memory of 32-bit values) and faster computation (many modern GPUs have specialized hardware units that perform arithmetic on 16-bit values much faster than on 32-bit values). However, simply using lower precision can lead to numerical instability, such as underflow (small gradient values becoming zero) or overflow (large values exceeding the representable range), which can harm model convergence; this is especially true with FP8 kernels.
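To make the range/precision trade-off concrete, here is a minimal sketch (assuming PyTorch is installed) that prints the numeric limits of FP32, BF16, and FP16, and shows a gradient-sized value underflowing to zero in FP16 but surviving in BF16 thanks to its wider dynamic range:

```python
import torch

# Compare the numeric limits of the formats discussed above.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# Underflow example: a small gradient-like value is representable in FP32 and BF16
# (8-bit exponent), but falls below FP16's range and flushes to zero.
small = torch.tensor(1e-8)
print(small.to(torch.bfloat16))  # nonzero, though with reduced precision
print(small.to(torch.float16))   # 0.0 -- underflow
```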
To mitigate these issues, mixed precision training typically involves techniques like loss scaling, where the loss value is multiplied by a factor before backpropagation and the gradients are divided by the same factor before the optimizer step. This helps keep gradient values within the representable range of FP16. Frameworks like PyTorch (via `torch.cuda.amp` for Automatic Mixed Precision) provide built-in support for mixed precision training, making these optimizations easy to apply. It should also be noted that as FP8 training becomes more common (given H100 hardware and above), unlike BF16 training, where you will essentially always see wins, FP8 training does not lower memory usage; it is strictly about increasing FLOPS, and it only pays off once a certain threshold of tokens and model size is reached (e.g., processing around 2 million tokens per batch as a heuristic).
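As a rough sketch of what loss scaling looks like in practice with PyTorch's `torch.cuda.amp` API (the model, data, and hyperparameters below are placeholders for illustration):

```python
import torch

# Placeholder model, optimizer, and data for illustration only.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # handles loss scaling for FP16

for step in range(10):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    # Eligible ops run in FP16 inside autocast; with dtype=torch.bfloat16
    # the GradScaler is typically unnecessary.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # multiply the loss by the scale factor, then backprop
    scaler.step(optimizer)         # unscale gradients; skip the step if inf/NaN is found
    scaler.update()                # adjust the scale factor for the next iteration
```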
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects covered here, feel free to reach out. I'm also building a course around this subject, with the first cohort starting September 1st. Sign up here for 35% off.


