Pipeline parallelism splits the model by depth (layers 0–15 on GPU-0, layers 16–31 on GPU-1, and so on). Tensor parallelism, in contrast, slices within each layer, so every GPU works on every layer at once.
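To make the pipeline split concrete, here is a minimal sketch of carving a model into contiguous stages. The even 32-layer / 2-GPU split mirrors the example above and is purely illustrative; real frameworks often balance stages by compute rather than layer count.

```python
# Minimal sketch: assign contiguous blocks of layers to pipeline stages.
# The 32-layer / 2-GPU split matches the example above; values are illustrative.
num_layers, num_gpus = 32, 2
layers_per_stage = num_layers // num_gpus

for gpu in range(num_gpus):
    start = gpu * layers_per_stage
    end = start + layers_per_stage - 1
    print(f"GPU-{gpu}: layers {start}-{end}")
# GPU-0: layers 0-15
# GPU-1: layers 16-31
```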
Take a multi-head attention layer. In pipeline parallelism, the entire layer lives on one GPU; the other GPUs sit idle until its activations are handed off. With tensor parallelism, each GPU holds only a slice of every weight matrix:
• GPU-0 owns heads 0–3, GPU-1 owns heads 4–7, etc.
• Every GPU performs the same matrix multiplications on its slice of the query, key, value weights.
• A fast all-reduce across GPUs sums the partial results into the full attention output before the next sub-layer starts (see the sketch below).
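To see how those three steps fit together, here is a minimal NumPy sketch of head-sharded attention across two simulated GPUs. It is a toy, single-process stand-in: the sizes are illustrative, and the final sum plays the role of the NCCL all-reduce that a real framework (e.g. Megatron-LM via torch.distributed) would issue.

```python
# Toy single-process sketch of tensor-parallel (head-sharded) attention.
# The final sum over "GPUs" stands in for the all-reduce; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, n_heads, tp = 8, 64, 8, 2
d_head = d_model // n_heads
heads_per_gpu = n_heads // tp

x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def sharded_attention(x, Wq, Wk, Wv, Wo_rows):
    """Attention over whichever heads these weight slices cover, followed by a
    *partial* output projection (row-sharded Wo), so per-GPU results sum correctly."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outputs = []
    for h in range(q.shape[-1] // d_head):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(d_head))
        outputs.append(scores @ v[:, sl])
    return np.concatenate(outputs, axis=-1) @ Wo_rows

# Each "GPU" holds the Wq/Wk/Wv *columns* and the Wo *rows* for its own heads.
partials = []
for gpu in range(tp):
    cols = slice(gpu * heads_per_gpu * d_head, (gpu + 1) * heads_per_gpu * d_head)
    partials.append(sharded_attention(x, Wq[:, cols], Wk[:, cols], Wv[:, cols], Wo[cols, :]))

tp_out = sum(partials)                             # stands in for the all-reduce
reference = sharded_attention(x, Wq, Wk, Wv, Wo)   # unsharded baseline
print(np.allclose(tp_out, reference))              # True
```

The only cross-GPU traffic in this block is that one sum: the softmax is computed per head, so each GPU can run its heads from start to finish without talking to the others.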
The same trick is repeated for the feed-forward network: the first linear is column-sharded (each GPU holds a fraction of the output features), and the second is row-sharded, so a single all-reduce at the end yields the correct activations; because the nonlinearity between them is applied element-wise, no communication is needed in between.
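The same single-process check works for the feed-forward block. The sketch below (again with illustrative sizes, and a plain sum standing in for the all-reduce) column-shards the first linear and row-shards the second, so the two "GPUs" communicate exactly once.

```python
# Toy sketch of a tensor-parallel MLP: column-shard W1, row-shard W2,
# one sum (all-reduce in a real setup) at the end. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, d_ff, tp = 8, 64, 256, 2
x = rng.standard_normal((seq, d_model))
W1 = rng.standard_normal((d_model, d_ff))   # first linear: column-sharded
W2 = rng.standard_normal((d_ff, d_model))   # second linear: row-sharded

def gelu(a):                                # tanh approximation of GELU
    return 0.5 * a * (1 + np.tanh(np.sqrt(2 / np.pi) * (a + 0.044715 * a**3)))

shard = d_ff // tp
partials = []
for gpu in range(tp):
    cols = slice(gpu * shard, (gpu + 1) * shard)
    hidden = gelu(x @ W1[:, cols])          # each GPU sees only its slice of the hidden features
    partials.append(hidden @ W2[cols, :])   # partial sum over its rows of W2

tp_out = sum(partials)                      # stands in for the all-reduce
reference = gelu(x @ W1) @ W2               # unsharded baseline
print(np.allclose(tp_out, reference))       # True
```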
Pipeline parallelism keeps each GPU’s memory footprint small by storing fewer layers, but introduces bubble time while later stages wait for activations from earlier ones. Tensor parallelism keeps compute utilization high (every GPU is busy on every layer) but demands higher bandwidth for the intra-layer all-reduces. In practice you’ll often stack both: pipeline across nodes for memory, tensor within each node for speed.
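For a back-of-the-envelope sense of the bubble cost: in a simple GPipe-style schedule with p stages and m micro-batches per batch, the fraction of the schedule spent idle works out to (p − 1) / (m + p − 1). The stage and micro-batch counts below are just illustrative.

```python
# Back-of-the-envelope pipeline bubble estimate for a simple GPipe-style schedule:
# idle fraction = (p - 1) / (m + p - 1) for p stages and m micro-batches per batch.
def bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    return (p_stages - 1) / (m_microbatches + p_stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches -> {bubble_fraction(4, m):.0%} idle")
# Splitting the batch into more micro-batches is what shrinks the bubble.
```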
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects covered here, feel free to reach out. I'm also building a course around this subject, with the first cohort starting September 1st. Sign up here for 35% off.