When running Python scripts during distributed training, `python myscript.py` won't cut it. This is because (especially with multi-node) the same script needs to:
1. Set up PyTorch so that distributed training works (`torch.distributed` has the information it needs, like the number of GPUs and nodes; see the sketch after this list)
2. Be run across all computers at once (every computer needs the same script at the same time)
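Here's a minimal sketch of what step 1 looks like in code. It assumes a launcher (such as `torchrun` or the `deepspeed` CLI) has already set the standard environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, `LOCAL_RANK`) for each process:

```python
import os

import torch
import torch.distributed as dist

# By default, init_process_group reads RANK, WORLD_SIZE, MASTER_ADDR,
# and MASTER_PORT from the environment, which the launcher populates
# for every process on every node.
dist.init_process_group(backend="nccl")

# LOCAL_RANK tells each process which GPU on its own node to use.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

print(f"Rank {dist.get_rank()} of {dist.get_world_size()} is ready")
```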
Typically, this is accomplished through:
1. Setting up network file storage (all of the computers mount a shared drive that's accessible over the network, so each node reads from a single source of truth)
2. Configuring passwordless SSH between each node and calling `deepspeed myscript.py` on a single node to facilitate the training run (see the sketch after this list)
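To make that second option concrete, here's a rough sketch of a multi-node DeepSpeed launch. The launcher reads a hostfile listing each node and how many GPUs ("slots") it has; the hostnames and GPU counts below are hypothetical:

```
# hostfile
node1 slots=8
node2 slots=8
```

```bash
# Run from a single node; the DeepSpeed launcher SSHes into every host
# in the hostfile and starts the same script on all of them.
deepspeed --hostfile=hostfile myscript.py
```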
Thanks for reading the Mueller Minute. If you have further questions on any of the subjects written here, feel free to reach out. I'm also building a course around this subject, with the first cohort happening September 1st. Sign up here for 25% off.