Add code for DDP tutorial series [PR 2 / 3] #1068

Merged: msaroufim merged 3 commits into main from ddp-tutorial-code-2 on Sep 22, 2022

Conversation

subramen
Contributor

Second PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049

This PR includes code for Slurm cluster setup and multinode training on Slurm.

@netlify

netlify bot commented Sep 21, 2022

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit a83b97d
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/632c8247bf186c00092da4a3

## 7. Download training code and install requirements
```
cd /shared
git clone https://github.com/suraj813/distributed-pytorch.git
```

Member

Is this still the repo we want to link to?

Contributor Author

oops


## 3. Create a cluster config file
```
pcluster configure --config config.yaml
```

Member

@hudeven @shaojing @d4l3k when this gets merged it'll be the first recipe we have in examples. Pointing it out because it'd be interesting to do more of this with some longer-term goals in mind.

Contributor

I'm reusing the torchrecipes repo for Paved Path recipes and plan to give it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and depend mainly on pytorch. How about moving these tutorials to torchrecipes eventually?

@@ -0,0 +1,25 @@
#!/bin/bash

#SBATCH --job-name=multinode-example

Member

Turn these into command line arguments

Contributor Author

Member

Yeah, that was my understanding with sbatch as well. Not sure if we can make them cmdline args.
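
For context on this thread: options passed to `sbatch` on the command line take precedence over the matching `#SBATCH` lines in the script, and anything after the script name is handed to the script as positional parameters. A minimal sketch of the second approach follows. The helper `build_launch_cmd` is hypothetical and not part of this PR; its defaults are the values from the script under review.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: builds the launch line from
# positional arguments, falling back to the values in the reviewed script.
build_launch_cmd() {
  local nnodes="${1:-4}"          # default matches --nnodes 4
  local nproc_per_node="${2:-1}"  # default matches --nproc_per_node 1
  echo "srun torchrun --nnodes ${nnodes} --nproc_per_node ${nproc_per_node}"
}

# With no arguments the defaults apply; with arguments they are overridden.
build_launch_cmd
build_launch_cmd 2 8
```

Inside a batch script the same pattern would read `"${1:-4}"` directly, so the values could be supplied after the script name when submitting, assuming the rest of the script is adapted accordingly.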

```
pcluster configure --config config.yaml
```
See config.yaml.template for an example

Member

Does pcluster support command line arguments as well so we can parametrize the YAML?

Contributor Author

Not sure what you mean

Member

Contributor Author

Hm, looks like the only other arg it takes is region. Ref: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.configure.html
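
One way to parametrize the YAML outside of `pcluster` would be to render it from a template before use. The sketch below is only an illustration of that idea: the helper `render_config`, the placeholder names `__REGION__` and `__MAX_NODES__`, and the two keys are all invented here and are not taken from config.yaml.template.

```shell
#!/bin/bash
# Hypothetical helper: fills placeholders in a template read from stdin.
# The placeholder names are assumptions made for this sketch only.
render_config() {
  local region="$1"
  local max_nodes="$2"
  sed -e "s/__REGION__/${region}/g" -e "s/__MAX_NODES__/${max_nodes}/g"
}

# A two-line stand-in template; a real one would be read from a file.
rendered=$(printf 'region: __REGION__\nmax_nodes: __MAX_NODES__\n' \
  | render_config us-east-1 4)
echo "$rendered"
```

The rendered text could then be written to a file and edited or reviewed before the cluster is created.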

```
sudo apt-get update
sudo apt-get install -y python3-venv
python3.7 -m venv /shared/venv/
```

Member

Is python 3.7 a hard dependency for this script to work?

Contributor Author

No, I'll update it to python3. Thanks!

Member

@rohan-varma rohan-varma left a comment

Looks good, please address @msaroufim's comments

srun torchrun \
--nnodes 4 \
--nproc_per_node 1 \
--rdzv_id $RANDOM \

Member

Did we confirm this works by running the script? I'm not 100% sure $RANDOM will be the same across nodes, but we'd need that to be the case, right?

Contributor Author

Yes, it works, confirmed on video :D My understanding is that srun runs the same command on all nodes, so $RANDOM is not computed separately on each node.
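
The behaviour described here follows from shell expansion order: the shell running the batch script expands `$RANDOM` once while building the `srun` command line, so every task receives the same literal value. A minimal local sketch of that follows. `fake_srun` is a hypothetical stand-in that just repeats its arguments for four simulated nodes; it does not call Slurm.

```shell
#!/bin/bash
# Hypothetical stand-in for srun: prints the same argument list once per "node".
fake_srun() {
  local node
  for node in 0 1 2 3; do
    echo "node=${node} args=$*"
  done
}

# $RANDOM is expanded by the calling shell before fake_srun starts,
# so all four simulated nodes see the same rendezvous id.
output=$(fake_srun --rdzv_id $RANDOM)
echo "$output"

# Count the distinct ids that the simulated nodes received.
unique_ids=$(echo "$output" | sed 's/.*--rdzv_id //' | sort -u | wc -l | tr -d ' ')
echo "distinct rdzv ids: ${unique_ids}"
```

If the variable were instead evaluated inside a command that each task runs in its own shell, each node could get a different value, which is the case the reviewer was worried about.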

@msaroufim msaroufim self-requested a review September 21, 2022 23:52
@msaroufim msaroufim merged commit 84b7588 into main Sep 22, 2022
@msaroufim msaroufim deleted the ddp-tutorial-code-2 branch September 22, 2022 15:48
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
* Add code for multinode training on slurm

* filtered-clone examples, update script path

* python3.7 -> python3
5 participants