
Bumps torch version to >=2.4.0 to minimize support surface for distributed applications #906


Merged
merged 7 commits into from
May 29, 2025

Conversation

peterdsharpe
Collaborator

PhysicsNeMo Pull Request

Description

Addresses discussion here: #904 (comment)
@coreyjadams @ktangsali
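
Not part of the original PR text: a minimal sketch of what the version floor in the title could look like as a `pyproject.toml` dependency entry. The file name and table layout here are assumptions; the authoritative change is in the PR's diff.

```toml
[project]
dependencies = [
    # Floor at 2.4.0 so distributed features (e.g. torch.distributed.DeviceMesh)
    # can be assumed present, minimizing the support surface.
    "torch>=2.4.0",
]
```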

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@peterdsharpe peterdsharpe requested a review from ktangsali May 22, 2025 19:49
@peterdsharpe peterdsharpe marked this pull request as ready for review May 22, 2025 19:49
@coreyjadams
Collaborator

I don't want to make more work for ourselves, but it's worth checking that this works in 2.4.0:

import torch.distributed as dist
mesh_type = dist.DeviceMesh

@peterdsharpe
Collaborator Author

^ Definitely good to double-check. DeviceMesh looks available in 2.4.0 on my machine:

Python 3.10.16 (main, Mar 17 2025, 20:54:03) [MSC v.1943 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.36.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch; torch.__version__
Out[1]: '2.4.0+cpu'

In [2]: import torch.distributed as dist
   ...: mesh_type = dist.DeviceMesh

In [3]:
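
The interactive check above can be wrapped into a small guard, handy for a CI smoke test. This is a minimal sketch, not PhysicsNeMo code, and the helper name `device_mesh_available` is made up here:

```python
def device_mesh_available() -> bool:
    """Return True if torch is installed and torch.distributed exposes DeviceMesh.

    DeviceMesh is re-exported from torch.distributed in recent torch releases
    (confirmed above on 2.4.0+cpu); on builds without distributed support the
    attribute may be missing, so hasattr is the safe probe.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # torch itself is not installed
        return False
    return hasattr(dist, "DeviceMesh")


if __name__ == "__main__":
    print(device_mesh_available())
```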

@coreyjadams coreyjadams self-requested a review May 22, 2025 20:50
Collaborator

@coreyjadams coreyjadams left a comment


LGTM - thanks.

jialusui1102 and others added 2 commits May 22, 2025 17:53
NVIDIA#901)

* multi-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

* fix loss arguments in train.py

Signed-off-by: jialusui1102 <[email protected]>

* merge songunetposembd with songunetposltembd with index slicing (recompile issue persists)

Signed-off-by: jialusui1102 <[email protected]>

* fix small errors in songunet

Signed-off-by: jialusui1102 <[email protected]>

* revise positional_embedding_indexing to avoid recompile/graph break and with faster bw comparing to old version

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* add back SongUNetPosLtEmbd class for better ckp loading

Signed-off-by: jialusui1102 <[email protected]>

* add forward in SongUnetLtPosEmbd and update train.py

Signed-off-by: jialusui1102 <[email protected]>

* update test for lt model

Signed-off-by: jialusui1102 <[email protected]>

* update comments for embedding_selector test for lt model

Signed-off-by: jialusui1102 <[email protected]>

* update doctest

Signed-off-by: jialusui1102 <[email protected]>

* Added tiny detail in corrdiff readme

Signed-off-by: Charlelie Laurent <[email protected]>

* minor update to arguments and docstring

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: Neal Pan <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
@peterdsharpe
Collaborator Author

/blossom-ci

@coreyjadams coreyjadams changed the base branch from main to 1.1.0-rc May 29, 2025 14:35
@coreyjadams
Collaborator

/blossom-ci

@coreyjadams coreyjadams merged commit 774c6e9 into NVIDIA:1.1.0-rc May 29, 2025
1 check passed