test allreduce failures for diloco #226
Conversation
torchft/manager.py
Outdated
    # used to artificially fail the next allreduce by tests
    self._TEST_should_fail_allreduce = False

    def TEST_fail_allreduce(self) -> None:
can we create a wrapped/mocked PG that injects the failure instead?
Yeah, let me see if we can create a wrapper so we can keep using the real PG. I'm thinking of using a mocked PG for deterministic simulation.
Agree that we should try to keep test-related functionality contained in our test files!
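The wrapper approach discussed above can be sketched as a delegating process group with a one-shot failure flag that only tests ever arm. This is a hypothetical minimal sketch, not the PR's actual code: the names `InnerPG`, `FailureInjectingPGWrapper`, and `fail_next_allreduce` are assumptions modeled loosely on torchft's `ProcessGroupWrapper`, and the inner PG is a stand-in rather than a real `torch.distributed` group.

```python
class InnerPG:
    """Stand-in for a real process group (hypothetical, for illustration)."""

    def allreduce(self, tensors):
        # Pretend the collective succeeded and return the tensors unchanged.
        return tensors


class FailureInjectingPGWrapper:
    """Delegates to a real PG, but can fail the next allreduce on request."""

    def __init__(self, pg):
        self._pg = pg
        self._should_fail_next = False

    def fail_next_allreduce(self):
        # Called only from tests; production code never sets this flag.
        self._should_fail_next = True

    def allreduce(self, tensors):
        if self._should_fail_next:
            # One-shot: clear the flag so subsequent calls succeed.
            self._should_fail_next = False
            raise RuntimeError("injected allreduce failure")
        return self._pg.allreduce(tensors)
```

Because the wrapper keeps delegating to the real PG, the production code path stays exercised; only the single armed call is disturbed.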
Summary:
- test when allreduce fails but no new nodes join
- added another event of type `AllreduceFailure`
- this new event required modifying some manager code to inject the failure
@@ -1016,6 +1023,55 @@ def allreduce(self, tensors: List[torch.Tensor], opts: object) -> Work:
        return _DummyWork(tensors)


class FakeProcessGroupWrapper(ProcessGroupWrapper):
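The hunk above returns `_DummyWork(tensors)` on the success path, which hints at the other half of the pattern: async collectives typically surface errors through the returned Work handle's `wait()`, not by raising from `allreduce()` itself. A hedged sketch of how a fake PG might model that (the `_FailingWork` and `FakeAllreducePG` names are assumptions for illustration, and `_DummyWork` here is a simplified stand-in, not torchft's actual class):

```python
class _DummyWork:
    """Simplified stand-in for a successfully completed collective."""

    def __init__(self, tensors):
        self._tensors = tensors

    def wait(self):
        return self._tensors


class _FailingWork:
    """Work handle whose wait() reports the injected failure."""

    def wait(self):
        raise RuntimeError("injected allreduce failure")


class FakeAllreducePG:
    """Fake PG that can hand back a failing Work handle once."""

    def __init__(self):
        self._fail_next = False

    def fail_next_allreduce(self):
        self._fail_next = True

    def allreduce(self, tensors):
        if self._fail_next:
            self._fail_next = False  # one-shot failure
            return _FailingWork()
        return _DummyWork(tensors)
```

Deferring the error to `wait()` exercises the caller's error handling at the point where real collectives would actually report failure.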
nit: could maybe move this class into tests?
Yeah, there are a bunch of other fakes in this file, so maybe take this up separately? We also want to upstream process_group to PyTorch.
LGTM