# Add code for DDP tutorial series [PR 2 / 3] #1068
**config.yaml.template** — new file (`@@ -0,0 +1,33 @@`)

```
Region: us-east-1

Image:
  Os: ubuntu1804

SharedStorage:
  - MountDir: /shared
    Name: shared-fs
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_1
      StorageType: SSD

HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-xxxxxxx
  Ssh:
    KeyName: your-keyname-file

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: train
      ComputeResources:
        - Name: p32xlarge
          InstanceType: p3.2xlarge
          MinCount: 0
          MaxCount: 5
      Networking:
        SubnetIds:
          - subnet-xxxxxxx
```
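The config above can be validated before anything is provisioned. A minimal sketch, assuming the ParallelCluster v3 CLI (the `--dryrun` flag here is our assumption; check `pcluster create-cluster --help`):

```
# Validate the config only; with --dryrun true no resources are created.
pcluster create-cluster \
  --cluster-name dist-ml \
  --cluster-configuration config.yaml \
  --dryrun true
```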
**Slurm batch script** — new file (`@@ -0,0 +1,23 @@`)

```
#!/bin/bash

#SBATCH --job-name=multinode-example
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO

srun torchrun \
--nnodes 4 \
--nproc_per_node 1 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
/shared/examples/multinode_torchrun.py 50 10
```

Review discussion on `--rdzv_id $RANDOM`:

> Did we confirm this works by running the script? Not 100% sure if $RANDOM will be the same across nodes, but we'd need that to be the case, right?

> Yes, it works, confirmed on video :D My understanding is …
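A quick way to check the expansion-order question raised above (a sketch, not part of the PR): `$RANDOM` in the batch script is expanded once, by the single shell executing the script, before `srun` ships the command line to the nodes, so every task receives the same literal value. Deferring expansion with single quotes shows the contrast:

```
# Expanded once by the local shell: all tasks print the same number.
srun --ntasks=4 echo "rdzv_id candidate: $RANDOM"

# Single quotes defer expansion to each task's own shell: the tasks
# print (almost certainly) different numbers.
srun --ntasks=4 bash -c 'echo "per-task value: $RANDOM"'
```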
**README** — new file (`@@ -0,0 +1,51 @@`)

# Setup AWS cluster with pcluster

## 1. Sign in to an AWS instance

## 2. Install pcluster
```
pip3 install awscli -U --user
pip3 install "aws-parallelcluster" --upgrade --user
```
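With `--user` installs, pip places the `pcluster` entry point under `~/.local/bin` on Linux; if the command isn't found afterwards, putting that directory on `PATH` is the usual fix (a sketch, not from the PR):

```
# Make user-installed entry points visible, then sanity-check the install.
export PATH="$HOME/.local/bin:$PATH"
pcluster version
```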
## 3. Create a cluster config file
```
pcluster configure --config config.yaml
```
See config.yaml.template for an example.

Review discussion on this step:

> I'm reusing the torchrecipes repo for Paved Path recipes and plan to make it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and mainly depend on pytorch. How about moving these tutorials to torchrecipes eventually?

> Does pcluster support command line arguments as well, so we can parametrize the YAML?

> Not sure what you mean.

> See this example: https://github.com/pytorch/data/tree/main/benchmarks/cloud

> Hm, looks like the only other arg it takes is …
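One workaround for parametrizing the YAML, since the CLI takes the config as a file (a hypothetical sketch, not from this PR; the `${SUBNET_ID}`/`${KEY_NAME}` placeholders are our assumption and would have to be added to the template):

```
# Render placeholders in the template with envsubst, then create the cluster.
export SUBNET_ID=subnet-0123456789abcdef0
export KEY_NAME=your-keyname-file
envsubst < config.yaml.template > config.yaml
pcluster create-cluster --cluster-name dist-ml --cluster-configuration config.yaml
```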
## 4. Create the cluster
```
pcluster create-cluster --cluster-name dist-ml --cluster-configuration config.yaml
```

### 4a. Track progress
```
pcluster list-clusters
```
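Cluster creation takes several minutes. For the status of this specific cluster, `pcluster describe-cluster` (a ParallelCluster v3 command) can be polled until it reports `CREATE_COMPLETE`, for example:

```
pcluster describe-cluster --cluster-name dist-ml
```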
## 5. Log in to the cluster head node
```
pcluster ssh --cluster-name dist-ml -i your-keyname-file
```

## 6. Install dependencies
```
sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
pip install wheel
echo 'source /shared/venv/bin/activate' >> ~/.bashrc
```

## 7. Download training code and install requirements
```
cd /shared
git clone --depth 1 https://github.com/pytorch/examples
cd /shared/examples
git filter-branch --prune-empty --subdirectory-filter distributed/ddp-tutorial-series
python3 -m pip install setuptools==59.5.0
pip install -r requirements.txt
```
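At this point the job can be submitted from the head node. A sketch, assuming the Slurm batch script from this PR is saved as `sbatch_run.sh` (that file name is our assumption):

```
sbatch sbatch_run.sh      # submit the 4-node job
squeue -u "$USER"         # PD (pending) while p3.2xlarge nodes scale up from 0
tail -f slurm-*.out       # follow the job's output once it is running
```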
Review discussion on the `#SBATCH` directives:

> Turn these into command line arguments.

> Not a slurm power user, but I've only seen these hardcoded in the scripts (https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/scheduler-examples/).

> Yeah, that was my understanding with sbatch as well. Not sure if we can make them cmdline args.
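For what it's worth, `sbatch` does accept the same options on the command line, and command-line options override the corresponding `#SBATCH` directives in the script, so the in-script values can serve as defaults (again assuming the script is saved as `sbatch_run.sh`):

```
# Run the same script on 2 nodes / 2 tasks instead of 4. Note that the
# hardcoded `--nnodes 4` passed to torchrun inside the script would still
# need to be changed to match.
sbatch --nodes=2 --ntasks=2 sbatch_run.sh
```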