Add code for DDP tutorial series [PR 1 / 3] #1067


Status: Merged (6 commits, Sep 22, 2022)

Conversation

subramen (Contributor):

First PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049.

This PR includes code for single-GPU, multi-GPU, and multi-node training. Each training script builds on the previous one, allowing users to identify what changes when moving from one paradigm to another.

netlify bot commented Sep 21, 2022:

Deploy Preview for pytorch-examples-preview canceled. Latest commit: 730fdbb. Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/632b723cdaf9400008fe8126

subramen requested a review from msaroufim on September 21, 2022 at 13:30
subramen changed the title from "Add code for DDP tutorial series [PR 1/N]" to "Add code for DDP tutorial series [PR 1 / 3]" on Sep 21, 2022
msaroufim (Member) left a comment:

Looks good, some minor feedback:

  • Use argparse where applicable so it's clear to users which knobs they can turn
  • Remove code duplication where possible

I really enjoyed reading the examples; the type annotations made them feel super easy to follow.

@malfet @rohan-varma would it be worth also adding distributed tests to examples? The older examples work on an older version of PyTorch, and I'm worried the same will happen to these scripts.

rank: Unique identifier of each process
world_size: Total number of processes
"""
os.environ["MASTER_ADDR"] = "localhost"
Member:

Should these be passed in with argparse?

subramen (author):

Not for this tutorial. For single-node training, localhost is effectively the only value MASTER_ADDR takes. I talk about this in the video too.
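
For context, a minimal sketch of what a single-node ddp_setup along these lines typically looks like. Only the MASTER_ADDR line appears in the quoted hunk, so the port value, the backend choice, and the set_device call are assumptions:

    import os
    import torch
    from torch.distributed import init_process_group

    def ddp_setup(rank: int, world_size: int) -> None:
        """
        rank: Unique identifier of each process
        world_size: Total number of processes
        """
        os.environ["MASTER_ADDR"] = "localhost"  # single node: the coordinating process runs on this machine
        os.environ["MASTER_PORT"] = "12355"      # any free port works; this value is an assumption
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)              # bind this process to its own GPU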

def main(rank: int, world_size: int, save_every: int, total_epochs: int):
ddp_setup(rank, world_size)
dataset, model, optimizer = load_train_objs()
train_data = prepare_dataloader(dataset, batch_size=32)
Member:

make batch size an argument to script
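
The body of prepare_dataloader is not part of the quoted hunk; a hedged sketch of a DDP-ready version with batch_size exposed as a parameter, so it can be wired to a CLI flag as suggested:

    from torch.utils.data import DataLoader, Dataset
    from torch.utils.data.distributed import DistributedSampler

    def prepare_dataloader(dataset: Dataset, batch_size: int) -> DataLoader:
        return DataLoader(
            dataset,
            batch_size=batch_size,                # taken from the command line instead of hard-coding 32
            pin_memory=True,
            shuffle=False,                        # the sampler handles shuffling across processes
            sampler=DistributedSampler(dataset),  # each rank sees a distinct shard of the data
        )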


if __name__ == "__main__":
import sys
total_epochs = int(sys.argv[1])
Member:

Replace with argparse
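
A minimal argparse-based entry point along the lines of this suggestion; the flag names, the batch_size knob, and the extended main signature are illustrative assumptions, not necessarily what the PR ended up with:

    import argparse
    import torch
    import torch.multiprocessing as mp

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="simple distributed training job")
        parser.add_argument("total_epochs", type=int, help="total epochs to train the model")
        parser.add_argument("save_every", type=int, help="how often to save a checkpoint")
        parser.add_argument("--batch_size", type=int, default=32, help="input batch size on each device")
        args = parser.parse_args()

        world_size = torch.cuda.device_count()
        # assumes main is extended to accept batch_size as its last parameter
        mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)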

save_every: int,
) -> None:
self.gpu_id = gpu_id
self.model = model.to(gpu_id)
Member:

TIL to(1) is the same as to(torch.device("cuda:1"))
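
A quick illustration of the equivalence noted here (needs at least two visible GPUs to run):

    import torch

    model = torch.nn.Linear(8, 2)
    a = model.to(1)                        # an int is interpreted as a CUDA device index
    b = model.to(torch.device("cuda:1"))   # explicit device object; same placement
    assert next(a.parameters()).device == next(b.parameters()).device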

@@ -0,0 +1,101 @@
import torch
Member:

There seems to be quite a bit of code duplication between this script and the ones before it. Not a dealbreaker for a tutorial per se, but I think factoring out the shared pieces would make it clearer to readers what changes from one script to another: you'd have the base utils and then build each example on top of the last.

subramen (author):

This is by design. The tutorial is structured to show the exact diff when moving from one script to another (see https://github.com/pytorch/tutorials/blob/5fb19241ada89db8ace17faea2371447de28146b/beginner_source/ddp_multigpu.rst#diff-for-single_gpupy-vs-multigpupy for example). The pages on pytorch/tutorials and the videos that walk through these scripts explain the diff.

@@ -0,0 +1,26 @@
import torch
Member:

Add data to name of script

rohan-varma (Member) left a comment:

LGTM overall, some minor questions.

Also curious whether we want to add bash / SLURM scripts and clear instructions on how to launch multi-node training on AWS clusters.
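
For reference, a hedged sketch of the torchrun-style setup a multi-node script would rely on; the launch itself would be something like torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<host>:<port> multinode.py ..., with the actual SLURM/bash scripts landing in a follow-up PR. The function body below is an assumption based on how torchrun populates the environment:

    import os
    import torch
    from torch.distributed import init_process_group

    def ddp_setup() -> None:
        # torchrun sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK
        # on every worker, so no explicit rank/world_size arguments are needed here
        init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))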

def _run_batch(self, source, targets):
self.optimizer.zero_grad()
output = self.model(source)
loss = torch.nn.CrossEntropyLoss()(output, targets)
Member:

nit: use F.cross_entropy since this is being used in a functional style
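
A sketch of the suggested change; the backward/step lines are the natural continuation of the method and are not part of the quoted hunk:

    import torch.nn.functional as F

    def _run_batch(self, source, targets):
        self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)  # same computation as torch.nn.CrossEntropyLoss()(output, targets)
        loss.backward()
        self.optimizer.step()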


def _save_checkpoint(self, epoch):
ckp = self.model.module.state_dict()
torch.save(ckp, "checkpoint.pt")
Member:

PATH = "checkpoint.pt"
torch.save(ckp, PATH)
print(f"Epoch {epoch} | Training checkpoint saved at {PATH}")

def train(self, max_epochs: int):
for epoch in range(max_epochs):
self._run_epoch(epoch)
if self.gpu_id == 0 and epoch % self.save_every == 0:
Member:

for multi-node on N nodes, this would result in N checkpoints being saved, is that the desired behavior?

subramen (author):

IIUC, after a restart, if global rank 0 happens to land on a different node, it would not be able to find the checkpoint. That's why I'm conditioning on local_rank == 0 instead of global_rank == 0.
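
A small sketch of the distinction being discussed, assuming a torchrun-style environment; the helper name is illustrative:

    import os

    def should_save_checkpoint(epoch: int, save_every: int) -> bool:
        local_rank = int(os.environ["LOCAL_RANK"])  # rank of this process within its node
        # Saving when local_rank == 0 writes one checkpoint per node, so after a restart
        # the file is present no matter which node ends up hosting global rank 0.
        # Conditioning on the global rank (os.environ["RANK"]) would save on a single node only.
        return local_rank == 0 and epoch % save_every == 0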

b_sz = len(next(iter(self.train_data))[0])
print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
for source, targets in self.train_data:
source = source.to(self.gpu_id)
Member:

nit: this is not needed; DDP can move the inputs, and the way it does so could potentially achieve some overlap and be more performant

Member:

actually, I guess that's needed for targets, since targets are not input into the DDP model

subramen (author):

"DDP can move the inputs"

Can you share more? I can add this to the tutorial notes.

def _run_batch(self, source, targets):
self.optimizer.zero_grad()
output = self.model(source)
loss = torch.nn.CrossEntropyLoss()(output, targets)
Member:

same, F.cross_entropy

snapshot = {}
snapshot["MODEL_STATE"] = self.model.module.state_dict()
snapshot["EPOCHS_RUN"] = epoch
torch.save(snapshot, "snapshot.pt")
Member:

snapshot_path?

subramen (author):

Hah, I had this instance var but forgot to use it here.
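
Roughly the fix being agreed on here; the method name and the print line are assumptions, with snapshot_path being the instance variable mentioned above:

    def _save_snapshot(self, epoch):
        snapshot = {
            "MODEL_STATE": self.model.module.state_dict(),
            "EPOCHS_RUN": epoch,
        }
        torch.save(snapshot, self.snapshot_path)  # use the instance var instead of the hard-coded "snapshot.pt"
        print(f"Epoch {epoch} | Training snapshot saved at {self.snapshot_path}")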

@@ -0,0 +1 @@
torch>=1.11.0
Member:

since 1.12 is out, should we just use that?

subramen (author):

I wouldn't want to limit the audience / force an upgrade just for the tutorial... that being said, I'm probably not using anything specific to 1.11 either.

Member:

We can upgrade to the highest version; that's better.

import sys
total_epochs = int(sys.argv[1])
save_every = int(sys.argv[2])
main(save_every, total_epochs)
Member:

is there an accompanying launcher / SLURM script?

subramen (author):

yep, in PR #1068

return self.data[index]


class MyRandomDataset(Dataset):
Member:

this class appears unused?

subramen (author):

Ah yes, thanks for catching this

msaroufim merged commit f45e418 into main on Sep 22, 2022.