Bumps torch version to >=2.4.0 to minimize support surface for distributed applications #906
Merged
Conversation
…sions (NVIDIA#905) from compatibility issues.
…buted applications.
I don't want to make more work for ourselves, but it's worth checking that this works in 2.4.0:
^ Definitely good to double-check.

Python 3.10.16 (main, Mar 17 2025, 20:54:03) [MSC v.1943 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.36.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch; torch.__version__
Out[1]: '2.4.0+cpu'

In [2]: import torch.distributed as dist
   ...: mesh_type = dist.DeviceMesh
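The session above confirms that `torch.distributed.DeviceMesh` is importable on torch 2.4.0. A minimal sketch of the kind of runtime version guard this enables (the helper name `meets_min_version` is illustrative, not part of this PR):

```python
def meets_min_version(version: str, minimum: tuple) -> bool:
    """Compare a torch-style version string (e.g. "2.4.0+cpu") against a
    minimum (major, minor, patch) tuple. Pre-release suffixes like "a0"
    are not handled by this simple sketch."""
    base = version.split("+")[0]  # drop the local build tag, e.g. "+cpu"
    parts = tuple(int(p) for p in base.split(".")[:3])
    return parts >= minimum

# Hypothetical usage guarding the DeviceMesh import:
# import torch
# if not meets_min_version(torch.__version__, (2, 4, 0)):
#     raise ImportError("torch>=2.4.0 is required for DeviceMesh support")
# from torch.distributed import DeviceMesh
```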
coreyjadams
approved these changes
May 22, 2025
LGTM - thanks.
NVIDIA#901)
* mult-gpu training supported corrdiff optimization
* enable mixed precision for val
* clean codebase for opt
* add amp_mode aware model architecture
* add None checking for params
* revise datatype casting schema
* Add test cases for corrdiff optimizations — Signed-off-by: Neal Pan <[email protected]>
* revised from_checkpoint, update tests and CHANGELOG — Signed-off-by: jialusui1102 <[email protected]>
* Lint and format code properly — Signed-off-by: Neal Pan <[email protected]>
* add multi-gpu optimization
* rebase changes and update tests and configs — Signed-off-by: jialusui1102 <[email protected]>
* merge ResidualLoss and refactored layer and Unet init based on PR review — Signed-off-by: jialusui1102 <[email protected]>
* Update layers.py with robust apex import
* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile — Signed-off-by: jialusui1102 <[email protected]>
* update tests — Signed-off-by: jialusui1102 <[email protected]>
* update changelog — Signed-off-by: jialusui1102 <[email protected]>
* initialize global_index directly on device — Signed-off-by: jialusui1102 <[email protected]>
* formatting — Signed-off-by: jialusui1102 <[email protected]>
* fix loss arguments in train.py — Signed-off-by: jialusui1102 <[email protected]>
* merge songunetposembd with songuneyposltembd with index slicing (recompile issue persists) — Signed-off-by: jialusui1102 <[email protected]>
* fix small errors in songunet — Signed-off-by: jialusui1102 <[email protected]>
* revise positional_embedding_indexing to avoid recompile/graph break and with faster bw comparing to old version — Signed-off-by: jialusui1102 <[email protected]>
* update changelog — Signed-off-by: jialusui1102 <[email protected]>
* add back SongUNetPosLtEmbd class for better ckp loading — Signed-off-by: jialusui1102 <[email protected]>
* add forward in SongUnetLtPosEmbd and update train.py — Signed-off-by: jialusui1102 <[email protected]>
* update test for lt model — Signed-off-by: jialusui1102 <[email protected]>
* update comments for embedding_selector test for lt model — Signed-off-by: jialusui1102 <[email protected]>
* update doctest — Signed-off-by: jialusui1102 <[email protected]>
* Added tiny detail in corrdiff readme — Signed-off-by: Charlelie Laurent <[email protected]>
* minor update to arguments and docstring — Signed-off-by: jialusui1102 <[email protected]>

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: Neal Pan <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
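One of the commits above adds an "amp_mode aware model architecture" for mixed-precision validation. A minimal sketch of that pattern (assuming torch is installed; `AmpAwareBlock` is an illustrative name, not the actual CorrDiff code, which routes `amp_mode` through its real layers):

```python
import torch
import torch.nn as nn


class AmpAwareBlock(nn.Module):
    """Toy module whose forward respects an amp_mode flag: when enabled,
    compute runs under CPU autocast in bfloat16; otherwise in full precision."""

    def __init__(self, amp_mode: bool = False):
        super().__init__()
        self.amp_mode = amp_mode
        self.linear = nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.amp_mode:
            # autocast casts the linear's matmul to bfloat16 on CPU
            with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                return self.linear(x)
        return self.linear(x)


x = torch.randn(4, 8)
full = AmpAwareBlock(amp_mode=False)(x)   # float32 output
mixed = AmpAwareBlock(amp_mode=True)(x)   # bfloat16 output
```

Keeping the flag on the module (rather than wrapping call sites in autocast) is what lets validation opt in to mixed precision without touching training code.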
/blossom-ci
/blossom-ci
PhysicsNeMo Pull Request
Description
Addresses discussion here: #904 (comment)
@coreyjadams @ktangsali
Checklist
Dependencies
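For the dependency change itself, a pin like the PR title describes would look roughly like this in a pyproject-style file (an illustrative fragment; the actual file name and dependency list in the PhysicsNeMo repo may differ):

```toml
# pyproject.toml (illustrative fragment, not the repo's actual file)
[project]
dependencies = [
    "torch>=2.4.0",
]
```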