Add code for DDP tutorial series [PR 3 / 3] #1069

Merged — 7 commits merged into main from ddp-tutorial-code-3 on Sep 26, 2022
Conversation

subramen (Contributor):
Third (and final) PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049

This PR includes code for training GPT-like models with DDP, adapting code from https://github.com/karpathy/mingpt

netlify bot commented Sep 21, 2022:

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: 4fe95ae
Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/632dd252cf5a2600098545b0

@@ -0,0 +1,150 @@
# minGPT-DDP
Member:

As a general note, this whole PR doesn't quite match what we currently do in examples; it feels more like a blog post or tutorial. @hudeven @svekars, what do you think? Should we change our policy?

Contributor Author:

It is a tutorial, and not quite an "example" like the others in this repo. I initially had these on a personal fork, but we don't want to use personal repos. If including this requires changing/diluting the policy, perhaps we can identify another repository to house these? cc @malfet

Contributor:

I'm reusing the torchrecipes repo for Paved Path recipes and plan to give it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and depend mainly on PyTorch. How about moving these tutorials to torchrecipes eventually?

Contributor Author:

From the comment below, I think it makes sense to move this to torchrecipes eventually.

@@ -0,0 +1,25 @@
#!/bin/bash
Member:

Do the template, README, and sbatch script need to be different for this PR and PR #2?

Contributor Author:

  • The sbatch script differs only in the path of the script to run.
  • The yaml.template and the cluster setup instructions are identical and independent of the script being run.
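As a rough illustration of the kind of sbatch script described above, a minimal Slurm + torchrun template might look like the following. The job name, partition, node and GPU counts, rendezvous port, and the final script path (`main.py`) are all hypothetical; the actual script in the PR may differ, and only the script path would change between this PR and PR #2.

```bash
#!/bin/bash
# Hypothetical Slurm submission template for multi-node DDP training.
#SBATCH --job-name=mingpt-ddp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Pick the first allocated node as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# torchrun spawns one process per GPU and handles DDP rendezvous.
srun torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  main.py
```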


def _load_snapshot(self):
    try:
        snapshot = fsspec.open(self.config.snapshot_path)
Member:

Since there's a lot of snapshot-related code here, curious if you've tried https://github.com/pytorch/torchsnapshot and what you thought. @yifuwang @ananthsub

Contributor Author:

I haven't! I'd like to update the code with torchsnapshot once this gets merged in.
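The snapshot-loading pattern under discussion (try to open a saved snapshot, fall back to a fresh start when none exists) can be sketched with the standard library alone. This is a simplified illustration, not the PR's actual code: the real implementation uses fsspec and torch serialization rather than pickle, and the function names and state keys here are hypothetical.

```python
import pickle
from pathlib import Path


def load_snapshot(path):
    """Try to load training state from a snapshot file.

    Returns (state, resumed): resumed is False when no usable snapshot
    exists, so the caller should start training from scratch.
    """
    snapshot_path = Path(path)
    try:
        with snapshot_path.open("rb") as f:
            state = pickle.load(f)
        return state, True
    except (FileNotFoundError, pickle.UnpicklingError):
        # First run (no file yet) or a corrupt snapshot: start fresh.
        return {"epochs_run": 0, "model_state": None}, False


def save_snapshot(path, state):
    """Persist training state so a restarted job can resume."""
    with Path(path).open("wb") as f:
        pickle.dump(state, f)
```

In the DDP setting, this load runs on every rank at startup, which is what makes training fault-tolerant: a restarted job resumes from the last saved epoch instead of epoch 0.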


    return model, optimizer, train_set, test_set

@hydra.main(version_base=None, config_path=".", config_name="gpt2_train_cfg")
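For context, `@hydra.main` loads a YAML config (here named `gpt2_train_cfg`) and passes it to the decorated entrypoint as a structured object. A config of that shape might look like the following sketch; every key and value here is hypothetical, and the actual gpt2_train_cfg.yaml in the PR may be organized differently.

```yaml
# Hypothetical sketch of a Hydra config like gpt2_train_cfg.yaml.
gpt_config:
  n_layer: 12
  n_head: 12
  n_embd: 768
trainer_config:
  max_epochs: 10
  batch_size: 64
  snapshot_path: snapshot.pt
optimizer_config:
  learning_rate: 3e-4
  weight_decay: 0.1
```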
Member:

Another example of how this feels more like a recipe than an example. @hudeven

Contributor Author:

For my understanding, how do we define recipes and examples?

Member:

I tried to explain this at the beginning of the main README.md but I can't say I'm super convinced that this delineation makes sense anymore. But I'd like @hudeven to make this call.

Contributor Author:

I am hesitant to call these "recipes" per the definition there, since these examples don't make use of torchx or the other tools we have for production-first users.

Contributor:

@suraj813 Here are my definitions.
pytorch/examples: basic examples showcasing how to use PyTorch; mainly depends on PyTorch.
recipes: end-to-end examples showcasing the PyTorch ecosystem (torchx, torchdata, domains, cloud setup, etc.).

Recipes are flexible on dependencies and topics, including hosting some hot applications: e.g. fine-tuning a Stable Diffusion model with a checkpoint from HuggingFace, deploying a model with TorchServe or PyTorch Mobile, or distributed training on AWS with various clusters/schedulers.

For docs in pytorch/tutorials, the source code can be hosted in either repo: the basic ones go to the former and the complicated ones go to the latter. What do you think?

cc: @shaojing

Contributor Author:

In that case, I'd reckon this is closer to a recipe because

  • it contains files about distributed training on an AWS cluster
  • there's potential to include PyTorch ecosystem projects here too (e.g. torchsnapshot, as in one of the comments above, and perhaps more)

hudeven (Contributor) commented Sep 23, 2022:

I think it's okay to put it in examples for now. I plan to clean up the torchrecipes repo soon, and we can move these tutorials there after that, because the torchrecipes repo is not ready for more public visits in its current state. @msaroufim what do you think?

hudeven (Contributor) left a comment:

LGTM

@msaroufim msaroufim self-requested a review September 23, 2022 01:06
msaroufim (Member) left a comment:

Almost there:

  1. Please remove the data/input.txt file; it doesn't make sense to check in a 40K-line file. You can put it on your personal GitHub and then wget it from the raw link.
  2. Please mention your series in https://github.com/pytorch/examples/blob/main/docs/source/index.rst so it renders on the website.

rohan-varma (Member) left a comment:

DDP-side changes LGTM

@msaroufim msaroufim merged commit d91085d into main Sep 26, 2022
@msaroufim msaroufim deleted the ddp-tutorial-code-3 branch September 26, 2022 03:01
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
* Adds files for minGPT training with DDP

* filtered-clone, update script path, update readme

* add refs to karpathy's repo

* add training data

* add AMP training

* delete raw data file, update index.rst

* Update gpt2_train_cfg.yaml
5 participants