Add code for DDP tutorial series [PR 3 / 3] #1069

Merged — 7 commits merged into main from ddp-tutorial-code-3 on Sep 26, 2022
Conversation

subramen (Contributor):
Third (and final) PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049

This PR includes code for training GPT-like models with DDP, adapting code from https://github.com/karpathy/mingpt

netlify bot commented Sep 21, 2022:

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: 4fe95ae
Latest deploy log: https://app.netlify.com/sites/pytorch-examples-preview/deploys/632dd252cf5a2600098545b0

@@ -0,0 +1,150 @@
# minGPT-DDP
Member:

As a general note, this whole PR doesn't quite match what we currently do in examples; it feels more like a blog post or tutorial. @hudeven @svekars, what do you think? Should we change our policy?

Contributor Author:

It is a tutorial, and not quite an "example" like the others in this repo. I initially had these on a personal fork, but we don't want to use personal repos. If including this requires changing/diluting the policy, perhaps we can identify another repository to house these? cc @malfet

Contributor:

I'm reusing the torchrecipes repo for Paved Path recipes and plan to give it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and depend mainly on PyTorch. How about moving these tutorials to torchrecipes eventually?

Contributor Author:

From the comment below, I think it makes sense to move this to torchrecipes eventually.

@@ -0,0 +1,25 @@
#!/bin/bash
Member:

Do the template, README, and sbatch script need to be different for this PR and PR #2?

Contributor Author:

  • The sbatch script differs only in the path of the script to run.
  • The yaml.template and the cluster setup instructions are identical and independent of the script being run.
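As a rough illustration of the kind of sbatch script described above, a minimal Slurm + torchrun template might look like the following. The job name, partition, node and GPU counts, rendezvous port, and the final script path (`main.py`) are all hypothetical; the actual script in the PR may differ, and only the script path would change between this PR and PR #2.

```bash
#!/bin/bash
# Hypothetical Slurm submission template for multi-node DDP training.
#SBATCH --job-name=mingpt-ddp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Pick the first allocated node as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# torchrun spawns one process per GPU and handles DDP rendezvous.
srun torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  main.py
```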


def _load_snapshot(self):
    try:
        snapshot = fsspec.open(self.config.snapshot_path)
Member:

Since there's a lot of snapshot-related code here, curious if you've tried https://github.com/pytorch/torchsnapshot and what you thought. @yifuwang @ananthsub

Contributor Author:

I haven't! I'd like to update the code with torchsnapshot once this gets merged in.
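The snapshot-loading pattern under discussion (try to open a saved snapshot, fall back to a fresh start when none exists) can be sketched with the standard library alone. This is a simplified illustration, not the PR's actual code: the real implementation uses fsspec and torch serialization rather than pickle, and the function names and state keys here are hypothetical.

```python
import pickle
from pathlib import Path


def load_snapshot(path):
    """Try to load training state from a snapshot file.

    Returns (state, resumed): resumed is False when no usable snapshot
    exists, so the caller should start training from scratch.
    """
    snapshot_path = Path(path)
    try:
        with snapshot_path.open("rb") as f:
            state = pickle.load(f)
        return state, True
    except (FileNotFoundError, pickle.UnpicklingError):
        # First run (no file yet) or a corrupt snapshot: start fresh.
        return {"epochs_run": 0, "model_state": None}, False


def save_snapshot(path, state):
    """Persist training state so a restarted job can resume."""
    with Path(path).open("wb") as f:
        pickle.dump(state, f)
```

In the DDP setting, this load runs on every rank at startup, which is what makes training fault-tolerant: a restarted job resumes from the last saved epoch instead of epoch 0.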


    return model, optimizer, train_set, test_set

@hydra.main(version_base=None, config_path=".", config_name="gpt2_train_cfg")
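For context, `@hydra.main` loads a YAML config (here named `gpt2_train_cfg`) and passes it to the decorated entrypoint as a structured object. A config of that shape might look like the following sketch; every key and value here is hypothetical, and the actual gpt2_train_cfg.yaml in the PR may be organized differently.

```yaml
# Hypothetical sketch of a Hydra config like gpt2_train_cfg.yaml.
gpt_config:
  n_layer: 12
  n_head: 12
  n_embd: 768
trainer_config:
  max_epochs: 10
  batch_size: 64
  snapshot_path: snapshot.pt
optimizer_config:
  learning_rate: 3e-4
  weight_decay: 0.1
```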
Member:

Another example of how this feels more like a recipe than an example. @hudeven

Contributor Author:

For my understanding, how do we define recipes and examples?

Member:

I tried to explain this at the beginning of the main README.md but I can't say I'm super convinced that this delineation makes sense anymore. But I'd like @hudeven to make this call.

Contributor Author:

I am hesitant to call these "recipes" per the definition there, since these examples don't make use of torchx or the other tools we have for production-first users.

Contributor:

@suraj813 Here are my definitions.
pytorch/examples: basic examples showcasing how to use PyTorch; mainly depends on PyTorch.
recipes: end-to-end examples showcasing the PyTorch ecosystem (torchx, torchdata, domains, cloud setup, etc.).

Recipes are flexible on dependencies and topics, including hosting some hot applications: e.g. fine-tuning a Stable Diffusion model with a checkpoint from HuggingFace, deploying a model with TorchServe or PyTorch Mobile, or distributed training on AWS with various clusters/schedulers.

For docs in pytorch/tutorials, the source code can be hosted in either repo: the basic ones go to the former and the complicated ones go to the latter. What do you think?

cc: @shaojing

Contributor Author:

In that case, I'd reckon this is closer to a recipe because

  • it contains files about distributed training on an AWS cluster
  • there's potential to include PyTorch ecosystem projects here too (e.g. torchsnapshot, as in one of the comments above, and perhaps more)

hudeven (Contributor) commented Sep 23, 2022:

I think it's okay to put it in examples for now. I plan to clean up the torchrecipes repo soon, and we can move these tutorials there after that, because the torchrecipes repo is not ready for more public visits in its current state. @msaroufim what do you think?

hudeven (Contributor) left a comment:

LGTM

@msaroufim msaroufim self-requested a review September 23, 2022 01:06
msaroufim (Member) left a comment:

Almost there:

  1. Please remove the data/input.txt file; it doesn't make sense to check in a 40K-line file. You can put it on your personal GitHub and then wget it from the raw link.
  2. Please mention your series in https://github.com/pytorch/examples/blob/main/docs/source/index.rst so it renders on the website.

rohan-varma (Member) left a comment:

DDP-side changes LGTM

@msaroufim msaroufim merged commit d91085d into main Sep 26, 2022
@msaroufim msaroufim deleted the ddp-tutorial-code-3 branch September 26, 2022 03:01
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
* Adds files for minGPT training with DDP

* filtered-clone, update script path, update readme

* add refs to karpathy's repo

* add training data

* add AMP training

* delete raw data file, update index.rst

* Update gpt2_train_cfg.yaml
5 participants