The Mueller Minute

October 2025

Distributed Operations: The Reduce Op
Getting data from all processes to one
Oct 17 • Zach Mueller
torch.distributed.gather
Getting data from the other GPUs and centralizing it
Oct 12 • Zach Mueller
The send/recv pattern
Our first introduction to a distributed operation
Oct 8 • Zach Mueller
Batch Sampler Sharding
The third (and final) way to shard your data
Oct 7 • Zach Mueller
DataLoader Dispatching
A memory-efficient way to build your dataloaders, at the cost of extra communication
Oct 6 • Zach Mueller
Dataset Sharding
Transfer-efficient datasets during distributed training
Oct 5 • Zach Mueller

August 2025

Synchronous Training
Full communication, 24/7
Aug 15 • Zach Mueller
Smart Parameter Sharding
Making sure your communications are as efficient as possible
Aug 14 • Zach Mueller
Sparsity Optimization
Also known as: Pruning
Aug 13 • Zach Mueller
Overlapping computations and communications
A quick way to reduce the most expensive component of distributed training
Aug 12 • Zach Mueller
Quantization Aware Training
Making it easier for your model to work with low-precision inference
Aug 11 • Zach Mueller
The Warm Start Problem
Resuming training runs the right way
Aug 10 • Zach Mueller

© 2025 Zach Mueller