Gradient accumulation is a technique used when training deep learning models to reach an effective batch size larger than what fits in available GPU memory. Instead of updating the model parameters after every mini-batch, gradient accumulation performs multiple forward and backward passes on smaller micro-batches and accumulates the computed gradients without taking an optimizer step.
The model parameters are then updated only after gradients from a specified number of micro-batches have been accumulated, effectively simulating a larger batch size. For example, if the desired global batch size is 1024 but only a batch size of 256 fits in memory, one can process four micro-batches of 256, accumulate their gradients (typically scaling each micro-batch's loss by 1/4 so the accumulated gradient matches the mean over all 1024 examples), and then perform a single optimizer step.
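Here is a minimal PyTorch sketch of that loop. The toy model, random data, and accumulation count are illustrative assumptions, not code from this post; the key pieces are the loss scaling and stepping the optimizer only every `accumulation_steps` micro-batches.

```python
import torch
from torch import nn

# Illustrative setup: a toy model and synthetic data.
model = nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batch_size = 256
accumulation_steps = 4  # 4 x 256 = effective batch of 1024

optimizer.zero_grad()
for step in range(accumulation_steps * 2):  # two full optimizer steps
    inputs = torch.randn(micro_batch_size, 32)
    targets = torch.randint(0, 10, (micro_batch_size,))

    loss = nn.functional.cross_entropy(model(inputs), targets)

    # Scale the loss so the accumulated gradient equals the mean over the
    # full effective batch (cross_entropy uses mean reduction by default).
    (loss / accumulation_steps).backward()

    # Step the optimizer only after `accumulation_steps` micro-batches.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```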
This technique is particularly useful in distributed training, especially when combined with pipeline parallelism: the smaller micro-batches keep the pipeline stages busy and reduce per-device memory pressure, while the model still gets the statistical benefits of a larger effective batch size.
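As a back-of-the-envelope sketch, the effective global batch size in such setups is commonly computed as the product below; the specific numbers here are assumptions for illustration.

```python
# Common convention: global batch = micro-batch size x micro-batches per
# optimizer step x number of data-parallel replicas averaging gradients.
micro_batch_size = 256      # what fits in memory per device
accumulation_steps = 4      # micro-batches accumulated per optimizer step
data_parallel_replicas = 8  # model replicas syncing gradients each step

effective_global_batch = micro_batch_size * accumulation_steps * data_parallel_replicas
print(effective_global_batch)  # 8192
```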
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort happening September 1st. Sign up here for 35% off.