Add code for DDP tutorial series [PR 2 / 3] #1068
Conversation
## 7. Download training code and install requirements
```
cd /shared
git clone https://github.com/suraj813/distributed-pytorch.git
```
Is this still the repo we want to link to?
oops
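For context, the "filtered-clone examples, update script path" commit in this PR suggests the step was redirected from the personal fork to pytorch/examples via a partial clone. A hedged sketch of what the revised step might look like (the sparse-checkout path is a placeholder, not confirmed by this thread):

```
# Hypothetical revision: blobless, sparse clone of pytorch/examples
# instead of the personal fork; the checkout path is a guess.
cd /shared
git clone --filter=blob:none --sparse https://github.com/pytorch/examples.git
cd examples
git sparse-checkout set distributed/ddp-tutorial-series
```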
## 3. Create a cluster config file
```
pcluster configure --config config.yaml
```
I'm reusing the torchrecipes repo for Paved Path recipes and plan to make it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and depend mainly on PyTorch. How about moving these tutorials to torchrecipes eventually?
```
#!/bin/bash

#SBATCH --job-name=multinode-example
```
Turn these into command line arguments
Not a slurm power user, but I've only seen these hardcoded in the scripts (https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/scheduler-examples/)
Yeah, that was my understanding with sbatch as well. Not sure if we can make them cmdline args.
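One note that may resolve this thread: sbatch accepts the same options on the command line, and those take precedence over the `#SBATCH` directives in the script. So the directives can stay as defaults while per-run values are overridden at submit time. A minimal sketch (the script path is illustrative, not from this PR):

```
# #SBATCH lines in the script act as defaults; flags passed to sbatch
# override them, so job name and node count need not be hardcoded per run.
sbatch --job-name=multinode-example --nodes=4 slurm/multinode.sh
```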
```
pcluster configure --config config.yaml
```
See config.yaml.template for an example.
Does pcluster support command line arguments as well so we can parametrize the YAML?
Not sure what you mean
See this example https://github.com/pytorch/data/tree/main/benchmarks/cloud
Hm, looks like the only other arg it takes is region. Ref: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.configure.html
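Since `pcluster configure` itself only takes the config path and region, one workaround would be to parametrize the YAML outside of pcluster. A sketch using envsubst (the template placeholders and variable names are assumptions, not part of this PR):

```
# config.yaml.template would hold placeholders like ${INSTANCE_TYPE};
# envsubst substitutes the exported values before pcluster reads the file.
export INSTANCE_TYPE=g4dn.xlarge
export MAX_NODES=4
envsubst < config.yaml.template > config.yaml
pcluster create-cluster --cluster-name ddp-tutorial \
  --cluster-configuration config.yaml
```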
```
sudo apt-get update
sudo apt-get install -y python3-venv
python3.7 -m venv /shared/venv/
```
Is python 3.7 a hard dependency for this script to work?
No, I'll update it to python3. Thanks!
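For reference, the agreed fix would presumably make the snippet look like this (a sketch of the change discussed above; the activation line is an assumption about how the venv is used afterwards):

```
sudo apt-get update
sudo apt-get install -y python3-venv
# Use the distro's default python3 instead of pinning 3.7, per the thread.
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
```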
Looks good, please address @msaroufim's comments.
```
srun torchrun \
  --nnodes 4 \
  --nproc_per_node 1 \
  --rdzv_id $RANDOM \
```
Did we confirm this works by running the script? Not 100% sure $RANDOM will be the same across nodes, but we'd need that to be the case, right?
Yes, it works, confirmed on video :D My understanding is that srun runs the same command on all nodes (rather than computing $RANDOM on each node separately).
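To spell out the reasoning: sbatch executes the batch script once, on the first allocated node; the shell expands `$RANDOM` there, and srun then launches the already-expanded command on every node, so all of them share one rendezvous id. A fuller sketch of such a script (the rendezvous backend/endpoint flags and the training-script name are assumptions, not quoted from this diff):

```
#!/bin/bash
#SBATCH --job-name=multinode-example
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# $RANDOM and $(hostname) are expanded once, here, before srun fans the
# command out, so every node receives the same rendezvous id and endpoint.
srun torchrun \
  --nnodes 4 \
  --nproc_per_node 1 \
  --rdzv_id $RANDOM \
  --rdzv_backend c10d \
  --rdzv_endpoint $(hostname):29500 \
  multinode.py
```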
* Add code for multinode training on slurm
* filtered-clone examples, update script path
* python3.7 -> python3
Second PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049
This PR includes code for slurm cluster setup and multinode training on slurm.