[pull] main from pytorch:main #3

Open
wants to merge 128 commits into
base: main

@pull pull bot commented Sep 13, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Sep 13, 2022
vfdev-5 and others added 28 commits September 14, 2022 10:19
Revert "Add mps device (#1018)"

This reverts commit 616caed.
* Add mps device

* Add --mps to run_python_examples.sh

* Update imagenet with mps device

* Use curl in run_python_examples.sh to accommodate macOS devices

* Fix for https://github.com/pytorch/examples/issues/1060
* Add code for multinode training on slurm

* filtered-clone examples, update script path

* python3.7 -> python3
* Adds files for minGPT training with DDP

* filtered-clone, update script path, update readme

* add refs to karpathy's repo

* add training data

* add AMP training

* delete raw data file, update index.rst

* Update gpt2_train_cfg.yaml
After training a model on a Mac with the mps option, running the generate script raises the runtime error "Placeholder storage has not been allocated on MPS device". This change avoids that error.
* fix device mismatch issue #1071
Add set_epoch for shuffling inputs, fix arg order
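Why `set_epoch` matters for shuffling can be sketched without torch. This assumes the commit refers to `DistributedSampler.set_epoch`, which derives its shuffle seed from the sampler seed plus the current epoch; if the epoch is never updated, every epoch sees the same order. The helper `epoch_order` below is a hypothetical stand-in, not the sampler's actual code.

```python
import random


def epoch_order(num_samples, epoch, seed=0, shuffle=True):
    """Return the sample order for one epoch (illustrative stand-in).

    The permutation is seeded with seed + epoch, so the order only
    changes when the caller passes a new epoch value.
    """
    indices = list(range(num_samples))
    if shuffle:
        # A private RNG keeps the result reproducible per (seed, epoch).
        random.Random(seed + epoch).shuffle(indices)
    return indices


# Forgetting to advance the epoch reproduces the same order every time.
print(epoch_order(8, epoch=0) == epoch_order(8, epoch=0))
# Validation-style usage: no shuffling, so the epoch has no effect.
print(epoch_order(8, epoch=3, shuffle=False))
```

In a real training loop the equivalent step would be calling `sampler.set_epoch(epoch)` at the start of each training epoch, before iterating the data loader.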
…ORCE (#1083)

Replace list with deque to obtain O(1) time complexity of insertion at the beginning of the list of returns
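A minimal sketch of the idea, assuming returns are accumulated backwards through the episode as in a typical REINFORCE loop. The function name `discounted_returns` and the `gamma` default are illustrative, not taken from the example's source.

```python
from collections import deque


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of an episode."""
    returns = deque()
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        # appendleft is O(1); list.insert(0, ...) shifts every element, O(n).
        returns.appendleft(running)
    return list(returns)


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With a plain list, inserting at index 0 for each of n steps costs O(n^2) overall; the deque keeps the whole pass linear.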
* Example of MNIST using RNN

* Example of MNIST using RNN: Changed RNN type to LSTM and changed variable names

* Example of MNIST using RNN: Resolving review comments

* Example of MNIST using RNN: Removing unintentional new line
* fix device mismatch issue #1071

* fix device mismatch issue #1071

* add mnist_rnn to test script for CI

* support dry_run in test()
val data should not shuffle
…PI and deprecate the old one (#1099)

* [PT-D][Tensor Parallel] Update the example for TP to use DTensor and new TP API
* word language model on Jetson NX

When running the word language model on Jetson NX, the original main.py fails because the torch build in NVIDIA's official PyTorch Docker image (`l4t-pytorch:r35.1.0-pth1.11-py3`) does not have the `mps` backend. This modification fixes the problem.

* Update word_language_model/main.py

That's better!

Co-authored-by: Steven Liu <[email protected]>

Co-authored-by: Steven Liu <[email protected]>
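One way to guard against torch builds that lack the `mps` backend, as in the Jetson NX case above, is to probe for the attribute before calling it. This is a sketch under assumptions, not the change that was merged: `mps_is_usable` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a torch module so the snippet runs without torch installed.

```python
from types import SimpleNamespace


def mps_is_usable(torch_module):
    """Return True only if this torch build exposes an available MPS backend."""
    backends = getattr(torch_module, "backends", None)
    mps = getattr(backends, "mps", None)
    if mps is None:
        # Older builds have no torch.backends.mps attribute at all.
        return False
    return bool(mps.is_available())


# Stand-ins for two kinds of torch builds (hypothetical, for illustration).
build_without_mps = SimpleNamespace(backends=SimpleNamespace())
build_with_mps = SimpleNamespace(
    backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: True))
)

print(mps_is_usable(build_without_mps))  # False
print(mps_is_usable(build_with_mps))     # True
```

In a real script the call would be `mps_is_usable(torch)` after `import torch`.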
c-p-i-o and others added 30 commits November 5, 2024 13:23
Summary:
1. Pick specific version of torchvision to fix dependency errors
2. Pin numpy to be below version 2.
3. Update Python version in python tests.

Test Plan:
Tested locally.
Summary:
Fix up the FSDP tutorial to get it functional again.
1. Add missing import for load_dataset.
2. Use `checkpoint` instead of `_shard.checkpoint` to get rid of a
   warning.
3. Add nlp to requirements.txt
4. Get rid of `load_metric` as this function does not exist in new
   `datasets` module.
5. Add `legacy=False` to get rid of tokenizer warnings.

Test Plan:
Ran the tutorial as follows and ensured that it ran successfully:
```
torchrun --nnodes=1 --nproc_per_node=2 T5_training.py
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] Setting
OMP_NUM_THREADS environment variable for each process to be 1 in
default, to avoid your system being overloaded, please further tune the
variable for optimal performance in your application as needed.
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
bFloat16 enabled for mixed precision - using bfSixteen policy
```
correct `model.train` description to be 'Put model into training mode' as opposed to 'Put model into inference mode'
* Add requirements.txt to examples which miss them

Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Update numpy requirement for reinforcement_learning to be <2

The current version of the example requires `numpy<2`; otherwise the following
error can be seen:
```
AttributeError: module 'numpy' has no attribute 'bool8'. Did you mean: 'bool'?
```

Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Update torch requirement for time and word examples to be <2.6

The current versions of these examples require `torch<2.6`; otherwise the following
error can be seen:
```
  File "/pytorch/examples/time_sequence_prediction/train.py", line 47, in <module>
    data = torch.load('traindata.pt')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pytorch/examples/time_sequence_prediction/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1524, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
```

Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Respect each example requirements and use uv

This commit introduces a few changes to CI by modifying `run_*_examples.sh`
and the respective GitHub workflows:

* Switched to uv
* Added tearup and teardown stages for tests (`start()` and `stop()` methods
  wrapping up test bodies - these are called automatically)
* Tearup (`start()`) installs example dependencies and, optionally (if `VIRTUAL_ENV=.venv`
  is passed), creates uv virtual environment
* Teardown (`stop()`) removes uv virtual environment if it was created (to
  save space)
* If `VIRTUAL_ENV` is not set, the scripts expect to be executed in an existing
  virtual environment. This can be one created with `python -m venv`, `uv venv` or `conda env`.
  In this case example dependencies will be installed in this environment,
  potentially reinstalling existing packages (including `torch`!).
* Dropped automated detection of CUDA platform. Now scripts require `USE_CUDA=True`
  to be passed explicitly
* Added `PIP_INSTALL_ARGS` environment variable to be passed to `uv pip install` calls
  for each example's dependencies. This allows adjusting torch indices and other options.

Execute all tests in current virtual environment (might rewrite packages):
```
./run_distributed_examples.sh
```

Execute all tests creating separate environment for each example:
```
VIRTUAL_ENV=.venv ./run_distributed_examples.sh
```

Run with CUDA:
```
USE_CUDA=True ./run_distributed_examples.sh
```

Adjust index:
```
PIP_INSTALL_ARGS="--pre -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html" \
   ./run_distributed_examples.sh
```

Signed-off-by: Dmitry Rogozhkin <[email protected]>

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Update GitHub Actions workflow and requirements for documentation build

- Upgrade actions/checkout from v2 to v4
- Refactor dependencies installation in the workflow
- Pin sphinx version to 5.3.0 in requirements.txt with descriptions

Signed-off-by: jafraustro <[email protected]>
Refactor GAT example to utilize `torch.accelerator` API

The `torch.accelerator` API abstracts some of the accelerator specifics in user scripts. By leveraging this API, the code becomes more adaptable to various hardware accelerators.

Signed-off-by: jafraustro <[email protected]>
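The device selection this kind of refactor enables can be sketched as below. It relies on `torch.accelerator.is_available()` and `torch.accelerator.current_accelerator()`, which exist in recent torch releases; the helper `pick_device`, the CPU fallback string, and the `SimpleNamespace` stand-ins are assumptions made so the snippet runs without torch, not the GAT example's actual code.

```python
from types import SimpleNamespace


def pick_device(torch_module):
    """Prefer whatever accelerator torch reports, else fall back to CPU."""
    accel = getattr(torch_module, "accelerator", None)
    if accel is not None and accel.is_available():
        # With real torch this returns a torch.device such as cuda, xpu or mps.
        return accel.current_accelerator()
    # Older builds (no torch.accelerator) or machines without an accelerator.
    return "cpu"


# Stand-ins for torch builds (hypothetical, for illustration only).
cpu_only = SimpleNamespace(
    accelerator=SimpleNamespace(is_available=lambda: False)
)
with_gpu = SimpleNamespace(
    accelerator=SimpleNamespace(
        is_available=lambda: True, current_accelerator=lambda: "cuda"
    )
)

print(pick_device(cpu_only))  # cpu
print(pick_device(with_gpu))  # cuda
```

The script then moves the model and tensors to the returned device without naming any specific backend.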
* Use torch.accelerator API in VAE example

* Use torch.accelerator API in VAE examples, fix README
* FSDP2 example

* update README

* fix typo in README

* fix README

* Fix documentation build: GitHub Actions, Sphinx 5.3.0, and theme compatibility

* simplified workflow and remove redundant configs

* removed venv activation line

* removed unnecessary line
Update the example usage of `torch.load()` with required safe globals.

Signed-off-by: Dmitry Rogozhkin <[email protected]>
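The idea behind "safe globals" can be illustrated with the standard library alone. `torch.load` with `weights_only=True` only unpickles an allowlisted set of globals, and `torch.serialization.add_safe_globals` extends that list. The sketch below is a stdlib analogy using `pickle.Unpickler.find_class`, not torch's implementation; `AllowlistUnpickler` and `safe_loads` are hypothetical names.

```python
import io
import pickle
from collections import OrderedDict


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only explicitly allowlisted globals."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        # Called whenever the pickle stream references a class or function.
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is not allowlisted"
        )


def safe_loads(data, allowed=()):
    """Deserialize bytes, permitting only the given (module, name) globals."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(OrderedDict(a=1))
try:
    safe_loads(payload)  # OrderedDict is not allowlisted yet
except pickle.UnpicklingError as err:
    print("rejected:", err)

print(safe_loads(payload, allowed=[("collections", "OrderedDict")]))
```

Plain containers and numbers need no allowlist entry because they do not reference any global when pickled.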
* save model in global rank 0 in multinode

* set_epoch only when training
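The distinction between local and global rank can be sketched as follows, assuming the job is launched with `torchrun`, which sets the `RANK` and `LOCAL_RANK` environment variables. `is_global_rank_zero` is a hypothetical helper; the multinode example may read the rank differently.

```python
import os


def is_global_rank_zero(env=None):
    """Return True only for the single process with global rank 0."""
    env = os.environ if env is None else env
    # RANK is unique across all nodes; LOCAL_RANK restarts at 0 on each node.
    return int(env.get("RANK", "0")) == 0


# First process on the second node of a 2x4 job: local rank 0, global rank 4.
second_node = {"RANK": "4", "LOCAL_RANK": "0"}
first_node = {"RANK": "0", "LOCAL_RANK": "0"}

print(is_global_rank_zero(second_node))  # False
print(is_global_rank_zero(first_node))   # True
```

Checking `LOCAL_RANK == 0` instead would make one process per node write the checkpoint, which is what saving on global rank 0 avoids.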
…ccelerator API (#1342)

* Restore default CI configuration for VAE and Siamese examples using Accelerator API

* Update Siamese Readme for consistency with accelerator argument

* Update siamese_network/README.md

Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update siamese_network/README.md

Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update siamese_network/main.py

Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update vae/main.py

Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Improve Readme files for clearer descriptions

* Update Readme file structure to enhance organization

---------

Co-authored-by: Dmitry Rogozhkin <[email protected]>
Signed-off-by: Edgar Romo Montiel <[email protected]>
…after each call to word_language_model/main.py
Update super_resolution example to support accelerator API
* Add Differentiable Physics: Mass-Spring System example

* Add differentiable_physics to run_all() in test script

* Add visualization and update training code in mass_spring.py

* Finalize differentiable_physics with visualization and CI integration

* Finalize differentiable_physics with visualization and CI integration

* Finalize differentiable_physics with the updates

* Update requirements.txt for differentiable_physics

* Update run_python_examples.sh to test differentiable_physics in CI

* Add mass spring example and update requirements

* Add mass spring example and update requirements

* Updated README and visualization from corporate ID (abhitorch81)

* Update readme.md

---------

Co-authored-by: Abhishek Nandy <[email protected]>
Revert "Add Differentiable Physics: Mass-Spring System example (#1359)"

This reverts commit 7c35995.