
Bumps torch version to >=2.4.0 to minimize support surface for distributed applications #906


Merged
merged 7 commits into from
May 29, 2025

Conversation

peterdsharpe
Collaborator

PhysicsNeMo Pull Request

Description

Addresses discussion here: #904 (comment)
@coreyjadams @ktangsali
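
Not part of the original PR text: a minimal sketch of what the version floor in the title could look like as a `pyproject.toml` dependency entry. The file name and table layout here are assumptions; the authoritative change is in the PR's diff.

```toml
[project]
dependencies = [
    # Floor at 2.4.0 so distributed features (e.g. torch.distributed.DeviceMesh)
    # can be assumed present, minimizing the support surface.
    "torch>=2.4.0",
]
```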

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@peterdsharpe peterdsharpe requested a review from ktangsali May 22, 2025 19:49
@peterdsharpe peterdsharpe marked this pull request as ready for review May 22, 2025 19:49
@coreyjadams
Collaborator

I don't want to make more work for ourselves, but it's worth checking that this works in 2.4.0:

import torch.distributed as dist
mesh_type = dist.DeviceMesh

@peterdsharpe
Collaborator Author

^ Definitely good to double-check. DeviceMesh looks available in 2.4.0 on my machine:

Python 3.10.16 (main, Mar 17 2025, 20:54:03) [MSC v.1943 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.36.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch; torch.__version__
Out[1]: '2.4.0+cpu'

In [2]: import torch.distributed as dist
   ...: mesh_type = dist.DeviceMesh

In [3]:
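
The interactive check above can be wrapped into a small guard, handy for a CI smoke test. This is a minimal sketch, not PhysicsNeMo code, and the helper name `device_mesh_available` is made up here:

```python
def device_mesh_available() -> bool:
    """Return True if torch is installed and torch.distributed exposes DeviceMesh.

    DeviceMesh is re-exported from torch.distributed in recent torch releases
    (confirmed above on 2.4.0+cpu); on builds without distributed support the
    attribute may be missing, so hasattr is the safe probe.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # torch itself is not installed
        return False
    return hasattr(dist, "DeviceMesh")


if __name__ == "__main__":
    print(device_mesh_available())
```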

@coreyjadams coreyjadams self-requested a review May 22, 2025 20:50
Collaborator

@coreyjadams coreyjadams left a comment


LGTM - thanks.

jialusui1102 and others added 2 commits May 22, 2025 17:53
NVIDIA#901)

* multi-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

* fix loss arguments in train.py

Signed-off-by: jialusui1102 <[email protected]>

* merge songunetposembd with songunetposltembd with index slicing (recompile issue persists)

Signed-off-by: jialusui1102 <[email protected]>

* fix small errors in songunet

Signed-off-by: jialusui1102 <[email protected]>

* revise positional_embedding_indexing to avoid recompile/graph break and with faster bw comparing to old version

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* add back SongUNetPosLtEmbd class for better ckp loading

Signed-off-by: jialusui1102 <[email protected]>

* add forward in SongUnetLtPosEmbd and update train.py

Signed-off-by: jialusui1102 <[email protected]>

* update test for lt model

Signed-off-by: jialusui1102 <[email protected]>

* update comments for embedding_selector test for lt model

Signed-off-by: jialusui1102 <[email protected]>

* update doctest

Signed-off-by: jialusui1102 <[email protected]>

* Added tiny detail in corrdiff readme

Signed-off-by: Charlelie Laurent <[email protected]>

* minor update to arguments and docstring

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: Neal Pan <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
@peterdsharpe
Collaborator Author

/blossom-ci

@coreyjadams coreyjadams changed the base branch from main to 1.1.0-rc May 29, 2025 14:35
@coreyjadams
Collaborator

/blossom-ci

@coreyjadams coreyjadams merged commit 774c6e9 into NVIDIA:1.1.0-rc May 29, 2025
1 check passed