[pull] main from pytorch:main #3

pull · 2022-09-13T22:58:31Z

See Commits and Changes for more details.

* word language model on Jetson NX

  When running the word language model on Jetson NX, the original main.py fails because the torch build in NVIDIA's official PyTorch Docker image (`l4t-pytorch:r35.1.0-pth1.11-py3`) does not have the `mps` backend. This modification fixes the problem.

* Update word_language_model/main.py

  That's better!

Co-authored-by: Steven Liu <[email protected]>
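The failure described above comes from touching `torch.backends.mps` on a build that does not ship it. A minimal sketch of a defensive check follows; the helper names `mps_is_usable` and `pick_device` are hypothetical and are not taken from `word_language_model/main.py`.

```python
import torch


def mps_is_usable() -> bool:
    """Return True only if this torch build has an MPS backend and it is available."""
    # Some builds do not expose torch.backends.mps at all, so look it up
    # defensively instead of accessing the attribute directly.
    mps = getattr(torch.backends, "mps", None)
    return mps is not None and bool(mps.is_available())


def pick_device(use_cuda: bool = False, use_mps: bool = False) -> torch.device:
    """Choose a device from user flags, falling back to CPU when a backend is missing."""
    if use_cuda and torch.cuda.is_available():
        return torch.device("cuda")
    if use_mps and mps_is_usable():
        return torch.device("mps")
    return torch.device("cpu")
```

With both flags left at their defaults, `pick_device()` returns the CPU device, which is the safe fallback on images without `mps` support.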

Summary: Fix up the FSDP tutorial to get it functional again.

1. Add missing import for load_dataset.
2. Use `checkpoint` instead of `_shard.checkpoint` to get rid of a warning.
3. Add nlp to requirements.txt
4. Get rid of `load_metric` as this function does not exist in new `datasets` module.
5. Add `legacy=False` to get rid of tokenizer warnings.

Test Plan: Ran the tutorial as follows and ensured that it ran successfully:

```
torchrun --nnodes=1 --nproc_per_node=2 T5_training.py
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] *****************************************
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] *****************************************
dict_keys(['train', 'validation', 'test'])
Size of train dataset: (157252, 3)
Size of Validation dataset: (5599, 3)
dict_keys(['train', 'validation', 'test'])
Size of train dataset: (157252, 3)
Size of Validation dataset: (5599, 3)
bFloat16 enabled for mixed precision - using bfSixteen policy
```

* Add requirements.txt to examples which miss them

  Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Update numpy requirement for reinforcement_learning to be <2

  Current version of the example requires `numpy<2` otherwise the following error can be seen:

  ```
  AttributeError: module 'numpy' has no attribute 'bool8'. Did you mean: 'bool'?
  ```

  Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Update torch requirement for time and word examples to be <2.6

  Current version of examples require `torch<2.6` otherwise the following error can be seen:

  ```
    File "/pytorch/examples/time_sequence_prediction/train.py", line 47, in <module>
      data = torch.load('traindata.pt')
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/pytorch/examples/time_sequence_prediction/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1524, in load
      raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
  ```

  Signed-off-by: Dmitry Rogozhkin <[email protected]>

* Respect each example requirements and use uv

  This commit introduces few changes to CI by modifying `run_*_examples.sh` and respective github workflows:

  * Switched to uv
  * Added tearup and teardown stages for tests (`start()` and `stop()` methods wrapping up test bodies - these are called automatically)
  * Tearup (`start()`) installs example dependencies and, optionally (if `VIRTUAL_ENV=.venv` is passed), creates uv virtual environment
  * Teardown (`stop()`) removes uv virtual environment if it was created (to save space)
  * If no `VIRTUAL_ENV` set, then scripts expect to be executed in the existing virtual environment. These can be `python -m venv`, `uv env` or `conda env`. In this case example dependencies will be installed in this environment potentially reinstalling existing packages (including `torch`!).
  * Dropped automated detection of CUDA platform. Now scripts require `USE_CUDA=True` to be passed explicitly
  * Added `PIP_INSTALL_ARGS` environment variable to be passed to `uv pip install` calls for each example dependencies. This allows to adjust torch indices and other options.

  Execute all tests in current virtual environment (might rewrite packages):

  ```
  ./run_distributed_examples.sh
  ```

  Execute all tests creating separate environment for each example:

  ```
  VIRTUAL_ENV=.venv ./run_distributed_examples.sh
  ```

  Run with CUDA:

  ```
  USE_CUDA=True ./run_distributed_examples.sh
  ```

  Adjust index:

  ```
  PIP_INSTALL_ARGS="--pre -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html" \
    ./run_distributed_examples.sh
  ```

  Signed-off-by: Dmitry Rogozhkin <[email protected]>

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>

Update GitHub Actions workflow and requirements for documentation build

- Upgrade actions/checkout from v2 to v4
- Refactor dependencies installation in the workflow
- Pin sphinx version to 5.3.0 in requirements.txt with descriptions

Signed-off-by: jafraustro <[email protected]>

Refactor GAT example to utilize `torch.accelerator` API

The `torch.accelerator` API makes it possible to abstract some of the accelerator specifics in user scripts. By leveraging this API, the code becomes more adaptable to various hardware accelerators.

Signed-off-by: jafraustro <[email protected]>
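As an illustration of the abstraction described above, here is a minimal sketch of device selection through `torch.accelerator`. The function name `select_device` is hypothetical, and the `getattr` guard is an assumption added so the snippet also runs on torch releases that predate the API; it is not taken from the GAT example.

```python
import torch


def select_device() -> torch.device:
    """Return the current accelerator device if one is available, else CPU."""
    # torch.accelerator only exists in recent torch releases, so look it up
    # defensively and fall back to CPU when it is missing.
    accel = getattr(torch, "accelerator", None)
    if accel is not None and accel.is_available():
        device = accel.current_accelerator()
        if device is not None:
            return device
    return torch.device("cpu")


device = select_device()
# The same line works unchanged whichever backend was selected.
x = torch.ones(2, 2, device=device)
```

The script never names a specific backend, so the same code path covers any accelerator the installed torch build reports.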

* FSDP2 example
* update README
* fix typo in README
* fix README

…ccelerator API (#1342)

* Restore default CI configuration for VAE and Siamese examples using Accelerator API
* Update Siamese Readme for consistency with accelerator argument
* Update siamese_network/README.md

  Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update siamese_network/README.md

  Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update siamese_network/main.py

  Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Update vae/main.py

  Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Improve Readme files for clearer descriptions
* Update Readme file structure to enhance organization

---------

Co-authored-by: Dmitry Rogozhkin <[email protected]>

* Add Differentiable Physics: Mass-Spring System example
* Add differentiable_physics to run_all() in test script
* Add visualization and update training code in mass_spring.py
* Finalize differentiable_physics with visualization and CI integration
* Finalize differentiable_physics with visualization and CI integration
* Finalize differentiable_physics with the updates
* Update requirements.txt for differentiable_physics
* Update run_python_examples.sh to test differentiable_physics in CI
* Add mass spring example and update requirements
* Add mass spring example and update requirements
* Updated README and visualization from corporate ID (abhitorch81)
* Update readme.md

---------

Co-authored-by: Abhishek Nandy <[email protected]>

Update (#1058)

dc51eb1

pull bot added the ⤵️ pull label Sep 13, 2022

vfdev-5 and others added 28 commits September 14, 2022 10:19

Fixed RL examples to work with new gym API (#1051)

8428996

Update actor_critic.py typo (#1048)

d5d9de6

typo infinitely

fix imagenet nondeterminism when seed is set (#1056) (#1057)

32c15f2

Fix the wrong link in issues-template (#1045)

71a545a

Add mps device (#1018)

616caed

Revert "Add mps device" (#1061)

5a06e9c

Revert "Add mps device (#1018)" This reverts commit 616caed.

Add mps device (#1064)

f82f562

* Add mps device
* Add --mps to run_python_examples.sh
* Update imagenet with mps device
* Use curl in run_python_examples.sh to accommodate macOS devices
* Fix for https://github.com/pytorch/examples/issues/1060

Mac enhancement .gitignore to ignore .DS_* (#1066)

517eb80

* Update .gitignore

Add code for DDP tutorial series [PR 1 / 3] (#1067)

f45e418

Add code for DDP tutorial series [PR 2 / 3] (#1068)

84b7588

* Add code for multinode training on slurm
* filtered-clone examples, update script path
* python3.7 -> python3

Add code for DDP tutorial series [PR 3 / 3] (#1069)

d91085d

* Adds files for minGPT training with DDP
* filtered-clone, update script path, update readme
* add refs to karpathy's repo
* add training data
* add AMP training
* delete raw data file, update index.rst
* Update gpt2_train_cfg.yaml

Added mps option in generate.py (#1070)

35eb814

After training a model on a Mac with the mps option, running the generate script raises the runtime error "Placeholder storage has not been allocated on MPS device". This change adds the mps option to generate.py to avoid that error.
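An error like the one quoted above is typically seen when a model and its input tensor live on different devices. The sketch below illustrates keeping both on one device. It uses a stand-in `nn.Embedding` model and a fixed CPU device, both assumptions made for illustration; it is not the code from generate.py.

```python
import torch
import torch.nn as nn

# Stand-in for the device chosen from command-line flags such as --mps or --cuda.
device = torch.device("cpu")

# Hypothetical tiny model standing in for the trained language model.
ntokens = 10
model = nn.Embedding(ntokens, 3).to(device)
model.eval()

# The seed token is created and then moved to the SAME device as the model.
tokens = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

with torch.no_grad():
    out = model(tokens)
```

The key point is that one `device` object is applied to both the model and every tensor passed to it.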

Fix device mismatch issue in #1071 (#1073)

f5bb60f

* fix device mismatch issue #1071

Add set_epoch and fix args in DDP-series example (#1076)

2ee8d43

Add set_epoch for shuffling inputs, fix arg order
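To illustrate why `set_epoch` matters for shuffling, here is a minimal sketch using `DistributedSampler` with an explicit `num_replicas` and `rank`, so no process group is needed. The dataset and the helper `epoch_order` are hypothetical and are not taken from the DDP tutorial code.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; passing num_replicas and rank explicitly avoids needing
# an initialized process group for this illustration.
dataset = TensorDataset(torch.arange(100))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)


def epoch_order(epoch: int) -> list:
    """Return the indices this rank would see in the given epoch."""
    # The sampler derives its shuffle from (seed + epoch); without calling
    # set_epoch, every epoch reuses the same permutation.
    sampler.set_epoch(epoch)
    return list(sampler)


order_e0 = epoch_order(0)
order_e0_again = epoch_order(0)
order_e1 = epoch_order(1)
```

In a training loop, the call goes at the top of each epoch, before iterating over the `DataLoader` that wraps the sampler.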

Add map_location parameter in DDP-series example (#1078) (#1079)

ca1bd91

Speed-up to O(1) from O(N) of the computation of each return in REINF…

74a70e1

…ORCE (#1083) Replace list with deque to obtain O(1) time complexity of insertion at the beginning of the list of returns
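The complexity argument above can be sketched as follows. The function name `discounted_returns` and its signature are hypothetical and are not the code from the REINFORCE example.

```python
from collections import deque


def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return for every time step, earliest step first."""
    returns = deque()
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        # appendleft is O(1); list.insert(0, ...) would shift every element, O(N).
        returns.appendleft(running)
    return list(returns)


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # prints [1.75, 1.5, 1.0]
```

Over an episode of N steps, this brings the total cost of building the list of returns from O(N^2) down to O(N).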

Add map_location when loading model checkpoint (#1084)

a8cf0b8

Fixes #1078
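A minimal sketch of loading a checkpoint with `map_location` follows. It uses an in-memory buffer and a CPU target purely for illustration; in a DDP script the target would usually be the device of the current process, and the model here is a hypothetical stand-in.

```python
import io

import torch
import torch.nn as nn

# Save a small model's state_dict to an in-memory buffer (stand-in for a file).
model = nn.Linear(3, 1)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps every stored tensor onto the requested device, so the
# checkpoint loads even if it was saved from a different device.
state = torch.load(buffer, map_location=torch.device("cpu"))

restored = nn.Linear(3, 1)
restored.load_state_dict(state)
```

Without `map_location`, tensors are restored to the device they were saved from, which can fail or pile every process onto one GPU.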

Example of MNIST using RNN (#752)

5d4b584

* Example of MNIST using RNN
* Example of MNIST using RNN: Changed RNN type to LSTM and changed variable names
* Example of MNIST using RNN: Resolving review comments
* Example of MNIST using RNN: Removing unintentional new line

add mnist_rnn to test script for CI (#1086)

9aad148

* fix device mismatch issue #1071
* fix device mismatch issue #1071
* add mnist_rnn to test script for CI
* support dry_run in test()

[fix CI] replace assert_allclose with assert_close (#1091)

d304b0d

Remove unused code from /cpp/transfer-learning/classify.cpp (#1092)

e6cba0a

Update trainer.py (#1098)

1b36393

fixes #1093

Update main.py (#1095)

387ce7b

val data should not shuffle

[PT-D][Tensor Parallel] Update the example for TP to use the new TP A…

50f5570

…PI and deprecate the old one (#1099) * [PT-D][Tensor Parallel] Update the example for TP to use DTensor and new TP API

Fix CI error in example (#1100)

63fc276

Fix exception causes in word_language_model/model.py (#1102)

f8401e9

set type of batch_size argument to int in ddp-tutorial-series (#1104)

244e4ee

set type of batch_size argument to int
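A short sketch of why the argument type matters: without `type=int`, `argparse` hands back the command-line value as a string. The parser below is hypothetical and is not the argument list of the ddp-tutorial-series scripts.

```python
import argparse

parser = argparse.ArgumentParser(description="hypothetical training script")
# type=int converts the command-line string; the default is already an int.
parser.add_argument("--batch_size", default=32, type=int,
                    help="Input batch size on each device")

args = parser.parse_args(["--batch_size", "64"])
untyped = argparse.ArgumentParser()
untyped.add_argument("--batch_size", default=32)
raw = untyped.parse_args(["--batch_size", "64"])

print(type(args.batch_size).__name__, type(raw.batch_size).__name__)  # prints int str
```

A string batch size only fails later, when it reaches code such as a `DataLoader` that expects an integer.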

c-p-i-o and others added 30 commits November 5, 2024 13:23

Fix python failing tests (#1299)

47d0c2e

Summary: 1. Pick specific version of torchvision to fix dependency errors 2. Pin numpy to be below version 2. 3. Update Python version in python tests. Test Plan: Tested locally.

fix typo (#1272)

5dfeb46

correct `model.train` description to be 'Put model into training mode' as opposed to 'Put model into inference mode'
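For reference, a minimal sketch of the two modes using a hypothetical model: `model.train()` switches the module into training mode and `model.eval()` switches it into inference (evaluation) mode.

```python
import torch.nn as nn

# Hypothetical model; Dropout is one of the layers whose behavior
# differs between training and evaluation mode.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

model.train()                    # put model into training mode
flag_after_train = model.training

model.eval()                     # put model into inference mode
flag_after_eval = model.training
```

The `training` attribute is `True` after `train()` and `False` after `eval()`, and it propagates to every submodule.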

Add support for Intel GPU to Fast Neural Style example (#1318)

8393ceb

Use torch.accelerator API in Fast Neural Style example (#1327)

65afde6

Use torch.accelerator API in mnist examples (#1334)

65722fe

Update tensor_parallel_example.py (#1324)

b7aebb5

Use torch.accelerator API in Siamese Network example (#1337)

12dc18e

Use torch.accelerator API in VAE example (#1338)

54e132e

* Use torch.accelerator API in VAE example
* Use torch.accelerator API in VAE examples, fix README

This PR Improve docs build ci (#1336)

7028a2e

* Fix documentation build: GitHub Actions, Sphinx 5.3.0, and theme compatibility
* simplified workflow and remove redundant configs
* removed venv activation line
* removed unnecessary line

Use torch.accelerator in DCGAN example (#1344)

cc8e404

Signed-off-by: dggaytan <[email protected]>

Support torch>=2.6 in word_language_model example (#1347)

ac7e960

Update the example usage of `torch.load()` with required safe globals. Signed-off-by: Dmitry Rogozhkin <[email protected]>

Fix super_resolution example for torch>=2.6 (#1350)

abfa4f9

save model in global rank 0 in multinode (#1357)

5cc81aa

* save model in global rank 0 in multinode
* set_epoch only when training
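The distinction between global and local rank can be sketched as below. It assumes the `RANK` and `LOCAL_RANK` environment variables that `torchrun` sets for each worker; the helper `is_global_rank_zero` is hypothetical and is not the code from the example.

```python
import os


def is_global_rank_zero(env=None) -> bool:
    """Return True only for the single process whose global rank is 0."""
    env = os.environ if env is None else env
    # RANK is unique across all nodes, while LOCAL_RANK restarts at 0 on each
    # node, so gating on LOCAL_RANK would save one checkpoint per node.
    return int(env.get("RANK", "0")) == 0


# Two nodes with four workers each: the first worker on the second node has
# LOCAL_RANK 0 but global RANK 4, so it must not write the checkpoint.
node0_worker0 = {"RANK": "0", "LOCAL_RANK": "0"}
node1_worker0 = {"RANK": "4", "LOCAL_RANK": "0"}
print(is_global_rank_zero(node0_worker0), is_global_rank_zero(node1_worker0))  # prints True False
```

A training script would wrap its checkpoint save in this check so exactly one file is written per save interval.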

fix typo which make tgt tensor data wrong (#1356)

3ddcc89

Add Accelerator Api to Imagenet Example

dcc2474

Signed-off-by: eromomon <[email protected]>

Removing unneeded print

37986af

Signed-off-by: Edgar Romo Montiel <[email protected]>

Fix torch>= 2.6 generate.py compatibility in Word_language_example

a86bda0

Update word_language_model/generate.py to remove duplicates, use abc …

6595d7b

…order Co-authored-by: Dmitry Rogozhkin <[email protected]>

Change run_python_examples.sh to run word_language_model/generate.py …

2944a9d

…after each call to word_language_model/main.py

Add accelerate API support for Super Resolution example (#1358)

6f61614

Update super_resolution example to support accelerate API

Revert "Add Differentiable Physics: Mass-Spring System example" (#1360)

8d408d2

Revert "Add Differentiable Physics: Mass-Spring System example (#1359)" This reverts commit 7c35995.

imagenet: fix typo addressing args.gpu (#1361)

16554e5

Fixed use_accel not defined issue (#1363)

4206858
