How do we describe "batch size" in distributed training?
Why reading "batch size" in papers can sometimes get confusing
The amount of data seen in a single iteration of training is called the "batch size." For example, with a batch size of 8, every iteration through the data uses 8 items from the dataset.
How does that change when we talk about distributed training?
Instead of just having a "batch size," we now have two versions of it: the global batch size (GBS) and the micro batch size (MBS).

Micro batch size is our original definition of batch size, applied per device: the number of items each accelerator sees in one iteration.

Global batch size is the total amount of data seen across all GPUs during a single iteration, or: micro batch size times the number of accelerators used. (So for 8 GPUs and an MBS of 8, our GBS is 64.)
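A minimal sketch of that relationship in Python (the function name is illustrative, not from any particular framework):

```python
def global_batch_size(micro_batch_size: int, num_accelerators: int) -> int:
    """GBS = MBS * number of accelerators (GPUs/TPUs)."""
    return micro_batch_size * num_accelerators

# 8 GPUs, each seeing 8 samples per iteration:
print(global_batch_size(micro_batch_size=8, num_accelerators=8))  # 64
```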
Typically, if a paper on distributed training mentions "batch size," it is describing the GBS, not the MBS.
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort happening September 1st. Sign up here for 25% off.