Synchronous Training
Full communication, 24/7
Synchronous training is a distributed training paradigm in which all worker processes (e.g., one per GPU) perform forward and backward passes on their respective data shards and then synchronize their results, typically gradients, before updating the model parameters.
In this approach, all workers move in lockstep; each training iteration is completed only when all workers have finished their computations and the gradients have been aggregated (e.g., averaged via an all-reduce operation).
The key characteristic is that all replicas of the model are updated with the same aggregated gradients, ensuring that all workers have a consistent view of the model parameters at the beginning of each new iteration. This consistency generally leads to more stable convergence and predictable training behavior compared to asynchronous methods.
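To make the aggregation step concrete, here is a minimal sketch of one synchronous iteration using `torch.distributed`. The all-reduce-and-average pattern is illustrative rather than a production implementation; it assumes the process group has already been initialized (e.g., via `dist.init_process_group`) and that the model, loss function, and optimizer are created elsewhere.

```python
import torch
import torch.distributed as dist

def synchronous_step(model, loss_fn, inputs, targets, optimizer):
    """One synchronous training iteration: local backward pass,
    then gradient averaging across all workers via all-reduce."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # each worker computes gradients on its own data shard

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients from all workers, then divide to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Every worker applies the same averaged gradients, so all replicas
    # start the next iteration with identical parameters.
    optimizer.step()
    return loss.item()
```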
Most common implementations of data parallelism, such as PyTorch's `DistributedDataParallel` (DDP), use synchronous training. While synchronous training is robust, it can be susceptible to stragglers (slower workers) because the overall iteration time is determined by the slowest worker in the group.
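In practice you rarely write the all-reduce yourself: DDP hooks into the backward pass and overlaps gradient communication with computation. The sketch below shows a simplified DDP training loop under a few assumptions: the tiny linear model, synthetic dataset, and hyperparameters are placeholders, and the script is assumed to be launched with `torchrun` on NCCL-capable GPUs so that the rank environment variables are set.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset, purely for illustration.
    model = torch.nn.Linear(32, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each worker its own shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients during backward
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because DDP averages gradients during `backward()`, every replica applies an identical update each step, which is exactly the lockstep behavior described above; the cost is that each iteration waits for the slowest worker.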
We're officially two weeks out. In one week the preorder discount will vanish for good, and these 35% off coupons will be worth exactly 35% off. Come join the best distributed training course out there, get free compute, and join a group of lifelong learners today.


