Add code for DDP tutorial series [PR 2 / 3] #1068

Merged 3 commits on Sep 22, 2022
33 changes: 33 additions & 0 deletions distributed/ddp-tutorial-series/slurm/config.yaml.template
@@ -0,0 +1,33 @@
Region: us-east-1

Image:
  Os: ubuntu1804

SharedStorage:
  - MountDir: /shared
    Name: shared-fs
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_1
      StorageType: SSD

HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-xxxxxxx
  Ssh:
    KeyName: your-keyname-file

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: train
      ComputeResources:
        - Name: p32xlarge
          InstanceType: p3.2xlarge
          MinCount: 0
          MaxCount: 5
      Networking:
        SubnetIds:
          - subnet-xxxxxxx
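
The two placeholders above (SubnetId and KeyName) have to be replaced with real values before the template is usable. As a quick check before provisioning anything, recent ParallelCluster 3 CLIs can validate a filled-in config with a dry run; this is an editor's sketch rather than part of the PR, the cluster name is arbitrary, and the --dryrun flag is an assumption worth confirming against `pcluster create-cluster --help`:

```
# Validate the filled-in config only; with --dryrun true no AWS resources are created
pcluster create-cluster \
    --cluster-name dist-ml-test \
    --cluster-configuration config.yaml \
    --dryrun true
```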
23 changes: 23 additions & 0 deletions distributed/ddp-tutorial-series/slurm/sbatch_run.sh
@@ -0,0 +1,23 @@
#!/bin/bash

#SBATCH --job-name=multinode-example
Member: Turn these into command line arguments

Member: Yeah, that was my understanding with sbatch as well. Not sure if we can make them command-line args.

#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4

# Get the list of nodes allocated to this job; the first one hosts the rendezvous endpoint
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO

srun torchrun \
--nnodes 4 \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
Member: Did we confirm this works by running the script? Not 100% sure if $RANDOM will be the same across nodes, but we'd need that to be the case, right?

Contributor Author: Yes, it works, confirmed on video :D My understanding is that srun runs the same command on all nodes (rather than evaluating $RANDOM separately on each node).

--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
/shared/examples/multinode_torchrun.py 50 10
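
Following up on the two review threads above: sbatch options passed on the command line take precedence over the #SBATCH directives embedded in a script, and Slurm exports SLURM_JOB_ID (and SLURM_NNODES) to the batch environment, which gives a rendezvous id that is identical on every node of the job. A rough sketch of both ideas, as an editor's illustration rather than part of the PR:

```
# Override the embedded #SBATCH directives at submission time; command-line
# options win over the ones inside the script
sbatch --nodes=2 --ntasks=2 --gpus-per-task=1 --cpus-per-task=4 sbatch_run.sh

# Inside the script, SLURM_JOB_ID is the same on every node of the job, so it
# could stand in for $RANDOM, and SLURM_NNODES tracks whatever node count was
# actually requested
srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node 1 \
    --rdzv_id $SLURM_JOB_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29500 \
    /shared/examples/multinode_torchrun.py 50 10
```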
51 changes: 51 additions & 0 deletions distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md
@@ -0,0 +1,51 @@
# Set up an AWS cluster with pcluster

## 1. Sign in to an AWS instance

## 2. Install pcluster
```
pip3 install awscli -U --user
pip3 install "aws-parallelcluster" --upgrade --user
```
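
Because both installs use --user, the pcluster binary lands in ~/.local/bin, which is not always on PATH. A quick sanity check after installing (editor's addition, not part of the PR):

```
# Make sure the user-level install location is on PATH, then confirm the CLI responds
export PATH="$HOME/.local/bin:$PATH"
pcluster version
```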

## 3. Create a cluster config file
```
pcluster configure --config config.yaml
```

See config.yaml.template for an example.

Member: @hudeven @shaojing @d4l3k when this gets merged it'll be the first recipe we have in examples. Pointing it out because it'd be interesting to do more of this with some longer-term goals in mind.

Contributor: I'm reusing the torchrecipes repo for Paved Path recipes and plan to give it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible about dependencies, while examples are basic and mainly depend on PyTorch. How about moving these tutorials to torchrecipes eventually?
Member: Does pcluster support command-line arguments as well, so we can parametrize the YAML?

Contributor Author: Not sure what you mean

Contributor Author: Hm, looks like the only other arg it takes is region. Ref: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.configure.html
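
Since the configure command only accepts the config path and a region, one practical way to parametrize the template is plain shell substitution before calling pcluster. A minimal sketch, assuming the placeholder strings from config.yaml.template above; the subnet and key-pair values below are made-up examples (editor's addition, not part of the PR):

```
# Fill the template placeholders from shell variables, then write config.yaml
SUBNET_ID=subnet-0123456789abcdef0
KEY_NAME=my-keypair
sed -e "s/subnet-xxxxxxx/${SUBNET_ID}/g" \
    -e "s/your-keyname-file/${KEY_NAME}/g" \
    config.yaml.template > config.yaml
```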



## 4. Create the cluster
```
pcluster create-cluster --cluster-name dist-ml --cluster-configuration config.yaml
```

### 4a. Track progress
```
pcluster list-clusters
```
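
list-clusters shows every cluster in the region; to watch just this one until provisioning finishes, describe-cluster reports a clusterStatus field. A small helper loop, assuming a ParallelCluster 3 CLI and the dist-ml name used above (editor's sketch, not part of the PR):

```
# Poll until the cluster leaves the CREATE_IN_PROGRESS state
while pcluster describe-cluster --cluster-name dist-ml | grep -q CREATE_IN_PROGRESS; do
    echo "still creating..."
    sleep 30
done
pcluster describe-cluster --cluster-name dist-ml
```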

## 5. Log in to the cluster head node
```
pcluster ssh --cluster-name dist-ml -i your-keyname-file
```
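
Once on the head node, it is worth confirming that Slurm sees the compute queue defined in the config (the train partition backed by up to 5 p3.2xlarge instances). A quick check (editor's addition, not part of the PR):

```
# The train partition from the config should be listed; its nodes may show an
# idle~ state because the queue scales up from MinCount: 0
sinfo
```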

## 6. Install dependencies
```
sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
pip install wheel
echo 'source /shared/venv/bin/activate' >> ~/.bashrc
```

## 7. Download training code and install requirements
```
cd /shared
git clone --depth 1 https://github.com/pytorch/examples;
cd /shared/examples
git filter-branch --prune-empty --subdirectory-filter distributed/ddp-tutorial-series
python3 -m pip install setuptools==59.5.0
pip install -r requirements.txt
```