The Mueller Minute

October 2025

Distributed Operations: The Reduce Op
Getting data from all processes to one
Oct 17 • Zach Mueller
torch.distributed.gather
Getting data from the other GPUs and centralizing it
Oct 12 • Zach Mueller
The send/recv pattern
Our first introduction to a distributed operation
Oct 8 • Zach Mueller
Batch Sampler Sharding
The third (and final) way to shard your data
Oct 7 • Zach Mueller
DataLoader Dispatching
A memory-efficient way to build your dataloaders, at the cost of extra communication
Oct 6 • Zach Mueller
Dataset Sharding
Transfer-efficient datasets during distributed training
Oct 5 • Zach Mueller

August 2025

Synchronous Training
Full communication, 24/7
Aug 15 • Zach Mueller
Smart Parameter Sharding
Making sure your communications are as efficient as possible
Aug 14 • Zach Mueller
Sparsity Optimization
Also known as: Pruning
Aug 13 • Zach Mueller
Overlapping computations and communications
A quick way to reduce the most expensive component of distributed training
Aug 12 • Zach Mueller
Quantization Aware Training
Making it easier for your model to work with low-precision inference
Aug 11 • Zach Mueller
The Warm Start Problem
Resuming training runs the right way
Aug 10 • Zach Mueller

© 2025 Zach Mueller