Add code for DDP tutorial series [PR 3 / 3] #1069
Conversation
@@ -0,0 +1,150 @@
# minGPT-DDP
It is a tutorial, not quite an "example" like the others in this repo. I initially had these on a personal fork, but we don't want to use personal repos. If including this requires changing/diluting the policy, perhaps we can identify another repository to house these? cc @malfet
I'm reusing the torchrecipes repo for Paved Path recipes and plan to give it a flat structure like pytorch/examples. The main difference between the two repos would be that recipes in torchrecipes are more end-to-end and flexible on dependencies, while examples are basic and mainly depend on PyTorch. How about moving these tutorials to torchrecipes eventually?
From the comment below, I think it makes sense to move this to torchrecipes eventually
@@ -0,0 +1,25 @@
#!/bin/bash
Do the template, README, and sbatch script need to be different for this PR and PR #2?
- The sbatch script differs only in the path of the script it runs.
- The yaml.template and the cluster setup instructions are identical and independent of the script we're running.
def _load_snapshot(self):
    try:
        snapshot = fsspec.open(self.config.snapshot_path)
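(For context: the excerpt above opens the snapshot via fsspec. A fuller version of such a loader might look like the sketch below; the state-dict keys, the fallback behavior, and the surrounding Trainer attributes are assumptions for illustration, not necessarily what this PR does.)

```python
import fsspec
import torch

def _load_snapshot(self):
    """Illustrative sketch: resume from a snapshot if one exists (local or remote path)."""
    try:
        # fsspec.open defaults to binary read mode and also handles remote URLs (e.g. s3://...).
        snapshot = fsspec.open(self.config.snapshot_path)
        with snapshot as f:
            snapshot_data = torch.load(f, map_location="cpu")
    except FileNotFoundError:
        print("Snapshot not found. Training model from scratch.")
        return
    # The keys below are hypothetical; the real snapshot layout may differ.
    self.model.load_state_dict(snapshot_data["model_state"])
    self.optimizer.load_state_dict(snapshot_data["optimizer_state"])
    self.epochs_run = snapshot_data["finished_epoch"]
    print(f"Resuming training from snapshot at epoch {self.epochs_run}")
```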
Since there's a lot of snapshot-related code here, I'm curious if you've tried https://github.com/pytorch/torchsnapshot and what you thought. @yifuwang @ananthsub
I haven't! I'd like to update the code with torchsnapshot once this gets merged in
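(For reference, a minimal sketch of what swapping the hand-rolled snapshot logic for torchsnapshot could look like, based on its documented Snapshot.take / restore API; the path and app_state keys here are placeholders, not what the final code would necessarily use.)

```python
import torch
import torchsnapshot

# Placeholder model/optimizer just to make the sketch self-contained.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters())
app_state = {"model": model, "optimizer": optimizer}

# Take a snapshot; torchsnapshot accepts local paths as well as fsspec-style URLs (e.g. s3://...).
snapshot = torchsnapshot.Snapshot.take(path="/tmp/mingpt_snapshot", app_state=app_state)

# On restart, restore the same app_state in place.
torchsnapshot.Snapshot(path="/tmp/mingpt_snapshot").restore(app_state=app_state)
```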
return model, optimizer, train_set, test_set
@hydra.main(version_base=None, config_path=".", config_name="gpt2_train_cfg")
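(For readers unfamiliar with Hydra: an entry point decorated like the line above typically looks roughly like this minimal sketch; the config fields referenced in the comments are invented for illustration, not the PR's actual schema.)

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path=".", config_name="gpt2_train_cfg")
def main(cfg: DictConfig):
    # Hydra loads gpt2_train_cfg.yaml from config_path and passes it in as a DictConfig.
    print(OmegaConf.to_yaml(cfg))
    # Hypothetical usage; the real config keys may differ:
    # trainer = Trainer(cfg.trainer_config, model, optimizer, train_set, test_set)
    # trainer.train()

if __name__ == "__main__":
    main()
```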
Another example of how this feels more like a recipe instead of an example @hudeven
For my understanding, how do we define recipes and examples?
I tried to explain this at the beginning of the main README.md, but I can't say I'm super convinced that this delineation makes sense anymore. I'd like @hudeven to make this call.
I am hesitant to call these "recipes" as per the definition there, since these examples don't make use of torchx or other tools we have for production-first users
@suraj813 Here are my definitions.
- pytorch/examples: basic examples showcasing how to use PyTorch; they mainly depend on PyTorch.
- recipes: end-to-end examples showcasing the PyTorch ecosystem (torchx, torchdata, domain libraries, cloud setup, etc.).
Recipes are flexible on dependencies and topics, such as hosting some popular applications, e.g. fine-tuning a Stable Diffusion model with a checkpoint from Hugging Face, deploying a model with TorchServe or PyTorch Mobile, or distributed training on AWS with various clusters/schedulers.
For docs in pytorch/tutorials, the source code can be hosted in either repo: the basic ones go to the former and the complicated ones go to the latter. What do you think?
cc: @shaojing
In that case, I'd reckon this is closer to a recipe because:
- it contains files about distributed training on an AWS cluster
- there's potential to include PyTorch ecosystem projects here too (e.g. torchsnapshot, as in one of the comments above, and perhaps more)
I think it's okay to put it in examples for now. I plan to clean up the torchrecipes repo soon, and we can move these tutorials there after that, because the torchrecipes repo is not ready for more public traffic in its current state. @msaroufim what do you think?
LGTM
Almost there
- Please remove the data/input.txt file; it doesn't make sense to check in a 40K-line file. You can put it on your personal GitHub and then wget it from the raw link (one way to fetch it at runtime is sketched below).
- Please mention your series here https://github.com/pytorch/examples/blob/main/docs/source/index.rst so it renders on the website
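(A sketch of the first suggestion in Python; the URL is a placeholder for wherever the raw file ends up being hosted.)

```python
import os
import urllib.request

# Placeholder raw link; point this at wherever input.txt is actually hosted.
DATA_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/input.txt"
DATA_PATH = "data/input.txt"

# Download the training data on first use instead of checking it into the repo.
os.makedirs(os.path.dirname(DATA_PATH), exist_ok=True)
if not os.path.exists(DATA_PATH):
    urllib.request.urlretrieve(DATA_URL, DATA_PATH)
```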
DDP side changes lgtm
* Adds files for minGPT training with DDP
* filtered-clone, update script path, update readme
* add refs to karpathy's repo
* add training data
* add AMP training
* delete raw data file, update index.rst
* Update gpt2_train_cfg.yaml
Third (and final) PR for the DDP tutorial series. This code accompanies the tutorials staged at pytorch/tutorials#2049
This PR includes code for training GPT-like models with DDP. It adapts code from https://github.com/karpathy/mingpt.