@littlebullGit littlebullGit commented Nov 27, 2025

What does this PR do?

  • Adds a CUDA-only integration test that mirrors the reporter's compiled ModelParallel setup, so the `KeyError('model.0.weight')` reproduces in CI.
  • Fixes `ModelParallelStrategy.optimizer_state` so that when `torch.compile` wraps the module, optimizer state keys are remapped through both the compiled wrapper and the original module before single-file checkpointing, preventing the `KeyError`.
  • Documents the fix in the unreleased changelog.

Fixes #21357
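The rekeying idea behind the fix can be sketched as follows. This is illustrative only: the helper name `rekey_optimizer_state` and the sample key maps are hypothetical, though the `_orig_mod.` prefix does mirror the naming convention of `torch.compile`'s `OptimizedModule` wrapper, which is what causes the original keys (e.g. `model.0.weight`) to go missing.

```python
# Minimal sketch (no GPU needed) of rekeying optimizer state through both the
# compiled wrapper and the original module. Not Lightning's actual code.

PREFIX = "_orig_mod."  # prefix torch.compile's OptimizedModule adds to FQNs


def rekey_optimizer_state(param_names: dict[int, str]) -> dict[int, str]:
    """Normalize fully qualified parameter names so that names reported by the
    compiled wrapper (e.g. '_orig_mod.model.0.weight') and by the original
    module ('model.0.weight') collapse to the same checkpoint key."""
    return {pid: name.removeprefix(PREFIX) for pid, name in param_names.items()}


# Simulated id -> name views: the compiled wrapper reports prefixed names,
# the original module reports plain names; after rekeying they agree, so a
# single-file checkpoint lookup by 'model.0.weight' no longer raises KeyError.
compiled_view = {1: "_orig_mod.model.0.weight", 2: "_orig_mod.model.0.bias"}
original_view = {1: "model.0.weight", 2: "model.0.bias"}
assert rekey_optimizer_state(compiled_view) == original_view
print(rekey_optimizer_state(compiled_view)[1])  # model.0.weight
```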

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes? (CUDA test runs in CI; CPU run skips as expected)
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Nov 27, 2025
@littlebullGit littlebullGit marked this pull request as ready for review November 27, 2025 03:23
@littlebullGit littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from 82f9a7d to db3d718 Compare November 27, 2025 03:48
@littlebullGit littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from db3d718 to d4e476f Compare November 27, 2025 04:57
@littlebullGit littlebullGit changed the title Add regression test for ModelParallel single-file checkpoint CUDA test to reproduce: ModelParallelStrategy fails with non-distributed checkpoint. #21357 Nov 27, 2025
codecov bot commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (b09e96e) to head (cd663c6).
⚠️ Report is 2 commits behind head on master.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (b09e96e) and HEAD (cd663c6).

HEAD has 642 uploads less than BASE
| Flag              | BASE (b09e96e) | HEAD (cd663c6) |
|-------------------|----------------|----------------|
| cpu               | 176            | 30             |
| lightning_fabric  | 44             | 0              |
| pytest            | 88             | 0              |
| python3.12        | 53             | 9              |
| python3.12.7      | 52             | 9              |
| lightning         | 87             | 15             |
| python3.11        | 36             | 6              |
| python3.10        | 17             | 3              |
| python            | 18             | 3              |
| pytorch_lightning | 45             | 15             |
| pytorch2.7        | 9              | 3              |
| pytest-full       | 88             | 30             |
| pytorch2.2.2      | 9              | 3              |
| pytorch2.6        | 9              | 3              |
| pytorch2.4.1      | 8              | 3              |
| pytorch2.5.1      | 8              | 3              |
| pytorch2.8        | 9              | 3              |
| pytorch2.1        | 18             | 6              |
| pytorch2.3        | 9              | 3              |
| pytorch2.9        | 9              | 3              |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21384     +/-   ##
=========================================
- Coverage      89%      79%    -11%     
=========================================
  Files         269      266      -3     
  Lines       22063    23775   +1712     
=========================================
- Hits        19737    18752    -985     
- Misses       2326     5023   +2697     

@bhimrazy bhimrazy marked this pull request as draft November 27, 2025 07:34
@littlebullGit littlebullGit changed the title CUDA test to reproduce: ModelParallelStrategy fails with non-distributed checkpoint. #21357 Fix ModelParallelStrategy fails with non-distributed checkpoint. #21384 Nov 27, 2025
@littlebullGit littlebullGit marked this pull request as ready for review November 27, 2025 16:08
@littlebullGit littlebullGit changed the title Fix ModelParallelStrategy fails with non-distributed checkpoint. #21384 Fix ModelParallelStrategy fails with non-distributed checkpoint. Nov 27, 2025
@littlebullGit littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch 2 times, most recently from 31b0976 to 646e01b Compare November 27, 2025 19:42
@littlebullGit littlebullGit force-pushed the fix/21357-modelparallel-checkpoint branch from 90af5cb to cd663c6 Compare November 28, 2025 06:47
