
DLWP Indexing and memory consumption fix #859

Open
wants to merge 45 commits into main

Conversation

daviddpruitt
Collaborator

PhysicsNeMo Pull Request

Description

Fixes indexing issues in the couplers and alleviates an issue with large CPU memory consumption in the dataloader.
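
The PR does not spell out the dataloader change here. As background, a common cause of high CPU memory with multi-worker PyTorch dataloaders is eagerly loading or caching the full dataset in the `Dataset` object, which each worker process then duplicates. Below is a minimal sketch of the usual lazy-open mitigation; the class name, file path, and xarray backend are assumptions for illustration, not taken from this PR.

```python
import torch
import xarray as xr  # assumed backend; the actual DLWP datapipe may use a different reader
from torch.utils.data import Dataset, DataLoader


class LazyWeatherDataset(Dataset):
    """Hypothetical dataset that defers opening the data file until first access,
    so each dataloader worker holds a lazy handle instead of a full in-memory copy."""

    def __init__(self, path: str):
        self.path = path
        self.ds = None  # opened lazily, once per worker process
        # Read only lightweight metadata eagerly.
        with xr.open_dataset(path) as ds:
            self.length = ds.sizes["time"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.ds is None:
            # chunks={} keeps the data dask-backed on disk until it is sliced
            self.ds = xr.open_dataset(self.path, chunks={})
        sample = self.ds.isel(time=idx).to_array().values
        return torch.as_tensor(sample)


# Illustrative usage; "era5_sample.nc" is a placeholder path.
loader = DataLoader(LazyWeatherDataset("era5_sample.nc"), batch_size=4, num_workers=4)
```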

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

nathanielcresswellclay and others added 30 commits November 22, 2024 18:13
add 'Multi_SymmetricConvNeXtBlock'
…ence_fix

Fix the training and inference problem in nvidia modulus
@daviddpruitt daviddpruitt requested a review from mnabian April 28, 2025 16:23
@loliverhennigh
Collaborator

/blossom-ci

@loliverhennigh
Collaborator

Don't have much to comment on with this. If it passes CI, I would say merge unless there is something specific you want me to look at, @daviddpruitt

* Add CELU activation function (NVIDIA#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (NVIDIA#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (NVIDIA#809)

* multi-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (NVIDIA#868)

* Update train.py to catch improper use of patch grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
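
As an aside, one of the merged commits above, "initialize global_index directly on device", names a standard PyTorch pattern: allocate an index tensor on the target device rather than building it on the CPU and copying it over. A minimal sketch, assuming an integer index tensor; the function name and sizes below are illustrative, not taken from the PR.

```python
import torch


def make_global_index(num_elements: int, device: torch.device) -> torch.Tensor:
    # Allocating on the target device avoids a CPU allocation followed by a
    # host-to-device copy, i.e. torch.arange(num_elements).to(device).
    return torch.arange(num_elements, dtype=torch.long, device=device)


# Illustrative usage:
# global_index = make_global_index(4096, torch.device("cuda:0"))
```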
@daviddpruitt daviddpruitt reopened this May 7, 2025
@pzharrington
Collaborator

@daviddpruitt has this been covered by #879?

8 participants