test allreduce failures for diloco #226

Merged: 1 commit, Jun 30, 2025

Conversation

tushar00jain (Contributor) commented Jun 24, 2025

Summary:

  • test when allreduce fails but no new nodes join
  • added another event of type AllreduceFailure
  • This new event required modifying some manager code to inject the failure
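The event-based injection described in the summary could look roughly like the sketch below. This is not the code from this PR: `EventType`, `Event`, `EventInjector`, `fail_allreduce_at`, and `check` are hypothetical names chosen for illustration (only the `AllreduceFailure` event name comes from the summary), and the training loop is reduced to a step counter.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional, Tuple


class EventType(Enum):
    # Hypothetical event kinds; only AllreduceFailure is named in this PR.
    ReplicaFailure = auto()
    AllreduceFailure = auto()


@dataclass(frozen=True)
class Event:
    type: EventType


class EventInjector:
    """Hypothetical test helper that schedules one event per (replica, step)."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[int, int], Event] = {}
        self.fired: Dict[EventType, int] = {t: 0 for t in EventType}

    def fail_allreduce_at(self, replica: int, step: int) -> "EventInjector":
        self._events[(replica, step)] = Event(EventType.AllreduceFailure)
        return self

    def check(self, replica: int, step: int) -> Optional[Event]:
        # Pop so that each scheduled event fires exactly once.
        event = self._events.pop((replica, step), None)
        if event is not None:
            self.fired[event.type] += 1
        return event


# Simulated training loop for replica 0: the allreduce fails only at step 2.
injector = EventInjector().fail_allreduce_at(replica=0, step=2)
failed_steps = []
for step in range(4):
    event = injector.check(replica=0, step=step)
    if event is not None and event.type is EventType.AllreduceFailure:
        # A real test would ask the process group to fail its next allreduce here.
        failed_steps.append(step)
```

Keeping the schedule in the test helper means the test decides when the failure happens, while the code under test only sees an allreduce that errors out.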

@facebook-github-bot added the CLA Signed label Jun 24, 2025
@tushar00jain requested review from d4l3k and H-Huang June 24, 2025 21:19
# used to artificially fail the next allreduce by tests
self._TEST_should_fail_allreduce = False

def TEST_fail_allreduce(self) -> None:
Member:

can we create a wrapped/mocked PG that injects the failure instead?

Contributor Author:

yeah let me see if we can create a wrapper so we can keep using real pg. thinking of using mocked pg for deterministic simulation

Member:

agree that we should try to keep test related functionality contained in our test files!
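The wrapper approach discussed in this thread can be sketched as below. It is a minimal illustration under stated assumptions, not the implementation merged here: `Work`, `SumProcessGroup`, `FailureInjectingWrapper`, and `fail_next_allreduce` are hypothetical stand-ins, the "allreduce" is a local sum rather than a collective, and the sketch assumes the failure surfaces when the caller waits on the returned work handle.

```python
from typing import List, Optional


class Work:
    """Stand-in for an async work handle; wait() raises if the op failed."""

    def __init__(self, result: Optional[float] = None,
                 error: Optional[Exception] = None) -> None:
        self._result = result
        self._error = error

    def wait(self) -> Optional[float]:
        if self._error is not None:
            raise self._error
        return self._result


class SumProcessGroup:
    """Stand-in for the real process group: 'allreduce' is a local sum."""

    def allreduce(self, values: List[float]) -> Work:
        return Work(result=sum(values))


class FailureInjectingWrapper:
    """Hypothetical wrapper: delegates to the wrapped group unless told to fail."""

    def __init__(self, pg: SumProcessGroup) -> None:
        self._pg = pg
        self._fail_next = False

    def fail_next_allreduce(self) -> None:
        # Test-only hook; the wrapped group and its caller need no test code.
        self._fail_next = True

    def allreduce(self, values: List[float]) -> Work:
        if self._fail_next:
            self._fail_next = False  # fail once, then recover
            return Work(error=RuntimeError("injected allreduce failure"))
        return self._pg.allreduce(values)


pg = FailureInjectingWrapper(SumProcessGroup())
before = pg.allreduce([1.0, 2.0]).wait()

pg.fail_next_allreduce()
try:
    pg.allreduce([1.0, 2.0]).wait()
    injected_error = None
except RuntimeError as exc:
    injected_error = str(exc)

after = pg.allreduce([3.0, 4.0]).wait()
```

Because the hook lives on the wrapper, the code under test keeps calling a plain `allreduce` and the wrapped group stays unmodified, which is the containment the reviewers ask for.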

@@ -1016,6 +1023,55 @@ def allreduce(self, tensors: List[torch.Tensor], opts: object) -> Work:
return _DummyWork(tensors)


class FakeProcessGroupWrapper(ProcessGroupWrapper):
Member:

nit: could maybe move this class into tests?

Contributor Author:

yeah there's a bunch of other fakes in this file. maybe take this up separately? we also want to upstream process_group to pytorch

H-Huang (Member) left a comment:

LGTM

@tushar00jain merged commit 1682257 into pytorch:main Jun 30, 2025
15 checks passed
@tushar00jain deleted the pr226 branch June 30, 2025 15:25
4 participants