
DLWP Indexing and memory consumption fix #859

Open
wants to merge 45 commits into main

Conversation

daviddpruitt
Collaborator

PhysicsNeMo Pull Request

Description

Fixes indexing issues in the couplers and alleviates an issue with large CPU memory consumption in the dataloader.
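
The PR does not spell out the dataloader change here. As background, a common cause of high CPU memory with multi-worker PyTorch dataloaders is eagerly loading or caching the full dataset in the `Dataset` object, which each worker process then duplicates. Below is a minimal sketch of the usual lazy-open mitigation; the class name, file path, and xarray backend are assumptions for illustration, not taken from this PR.

```python
import torch
import xarray as xr  # assumed backend; the actual DLWP datapipe may use a different reader
from torch.utils.data import Dataset, DataLoader


class LazyWeatherDataset(Dataset):
    """Hypothetical dataset that defers opening the data file until first access,
    so each dataloader worker holds a lazy handle instead of a full in-memory copy."""

    def __init__(self, path: str):
        self.path = path
        self.ds = None  # opened lazily, once per worker process
        # Read only lightweight metadata eagerly.
        with xr.open_dataset(path) as ds:
            self.length = ds.sizes["time"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.ds is None:
            # chunks={} keeps the data dask-backed on disk until it is sliced
            self.ds = xr.open_dataset(self.path, chunks={})
        sample = self.ds.isel(time=idx).to_array().values
        return torch.as_tensor(sample)


# Illustrative usage; "era5_sample.nc" is a placeholder path.
loader = DataLoader(LazyWeatherDataset("era5_sample.nc"), batch_size=4, num_workers=4)
```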

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

nathanielcresswellclay and others added 30 commits November 22, 2024 18:13
add 'Multi_SymmetricConvNeXtBlock'
…ence_fix

Fix the training and inference problem in nvidia modulus
@daviddpruitt daviddpruitt requested a review from mnabian April 28, 2025 16:23
@loliverhennigh
Collaborator

/blossom-ci

@loliverhennigh
Collaborator

Don't have much to comment on with this. If it passes CI, I would say merge unless there is something specific you want me to look at, @daviddpruitt

* Add CELU activation function (NVIDIA#851)

* refactor: updating naming of a few files (modulus -> physicsnemo) (NVIDIA#850)

Co-authored-by: Oliver Hennigh <[email protected]>

* Various Corrdiff optimizations for drastic increase of training efficiency (NVIDIA#809)

* multi-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <[email protected]>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <[email protected]>

* Lint and format code properly

Signed-off-by: Neal Pan <[email protected]>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <[email protected]>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <[email protected]>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <[email protected]>

* update tests

Signed-off-by: jialusui1102 <[email protected]>

* update changelog

Signed-off-by: jialusui1102 <[email protected]>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <[email protected]>

* formatting

Signed-off-by: jialusui1102 <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>

* Catch improper use of patch gradient accumulation (NVIDIA#868)

* Update train.py to catch improper use of patch grad acc

* Update train.py

* Update train.py

* Fixes compile of regression model in train.py

* Removed unused imports

Signed-off-by: Charlelie Laurent <[email protected]>

* Changed grad patch accumulation logic

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Charlelie Laurent <[email protected]>

---------

Signed-off-by: Neal Pan <[email protected]>
Signed-off-by: jialusui1102 <[email protected]>
Signed-off-by: Charlelie Laurent <[email protected]>
Co-authored-by: Yang-yang Tan <[email protected]>
Co-authored-by: Carmelo Gonzales <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: nekobytz <[email protected]>
Co-authored-by: Alicia Sui <[email protected]>
Co-authored-by: jialusui1102 <[email protected]>
Co-authored-by: Charlelie Laurent <[email protected]>
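
As an aside, one of the merged commits above, "initialize global_index directly on device", names a standard PyTorch pattern: allocate an index tensor on the target device rather than building it on the CPU and copying it over. A minimal sketch, assuming an integer index tensor; the function name and sizes below are illustrative, not taken from the PR.

```python
import torch


def make_global_index(num_elements: int, device: torch.device) -> torch.Tensor:
    # Allocating on the target device avoids a CPU allocation followed by a
    # host-to-device copy, i.e. torch.arange(num_elements).to(device).
    return torch.arange(num_elements, dtype=torch.long, device=device)


# Illustrative usage:
# global_index = make_global_index(4096, torch.device("cuda:0"))
```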
@daviddpruitt daviddpruitt reopened this May 7, 2025
@pzharrington
Collaborator

@daviddpruitt has this been covered by #879?

8 participants