Welcome to the inaugural issue of the Mueller Minute! In each post I'll take a concept from distributed training, a common bug I've seen, or an interesting paper I've been reading, and distill it into a one-minute read. As an introduction, today's post is on distributed training itself.
Just what is "distributed training?"
Distributed training is when we scale a machine learning script from running on one accelerator (hardware that runs matrix multiplications fast, such as a CUDA GPU or an NPU) to running on many.
This can take the form of multi-GPU training (multiple GPUs in a single computer) or multi-node training (multiple computers, each with multiple GPUs).
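To make that concrete, here is a minimal sketch of the multi-GPU case using PyTorch's DistributedDataParallel (one common approach, not the only one). The tiny model, random data, and hyperparameters are placeholders, and the script assumes it is launched with `torchrun --nproc_per_node=<num_gpus>`.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# Placeholder model, wrapped so gradients stay in sync across GPUs.
model = torch.nn.Linear(128, 10).to(device)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):  # placeholder training loop on random data
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()  # DDP averages gradients across processes during backward
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```

Each GPU gets its own process (and, in a real script, its own slice of the data), and DDP keeps the model replicas identical by averaging gradients after every backward pass.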
The general reasons for scaling up model training are:
1. The model (or batch) uses too much VRAM (we need more memory)
2. Training takes too long (we need to get through batches faster)
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort starting September 1st. Sign up here