test allreduce failures for diloco #226
Conversation
torchft/manager.py
Outdated
    # used to artificially fail the next allreduce by tests
    self._TEST_should_fail_allreduce = False

    def TEST_fail_allreduce(self) -> None:
can we create a wrapped/mocked PG that injects the failure instead?
Yeah, let me see if we can create a wrapper so we can keep using the real PG. I'm thinking of using a mocked PG for deterministic simulation.
Agree that we should try to keep test-related functionality contained in our test files!
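The wrapper approach discussed above can be sketched as a delegating process group with a one-shot failure flag that only tests ever arm. This is a hypothetical minimal sketch, not the PR's actual code: the names `InnerPG`, `FailureInjectingPGWrapper`, and `fail_next_allreduce` are assumptions modeled loosely on torchft's `ProcessGroupWrapper`, and the inner PG is a stand-in rather than a real `torch.distributed` group.

```python
class InnerPG:
    """Stand-in for a real process group (hypothetical, for illustration)."""

    def allreduce(self, tensors):
        # Pretend the collective succeeded and return the tensors unchanged.
        return tensors


class FailureInjectingPGWrapper:
    """Delegates to a real PG, but can fail the next allreduce on request."""

    def __init__(self, pg):
        self._pg = pg
        self._should_fail_next = False

    def fail_next_allreduce(self):
        # Called only from tests; production code never sets this flag.
        self._should_fail_next = True

    def allreduce(self, tensors):
        if self._should_fail_next:
            # One-shot: clear the flag so subsequent calls succeed.
            self._should_fail_next = False
            raise RuntimeError("injected allreduce failure")
        return self._pg.allreduce(tensors)
```

Because the wrapper keeps delegating to the real PG, the production code path stays exercised; only the single armed call is disturbed.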
Summary:
- test when allreduce fails but no new nodes join
- added another event of type `AllreduceFailure`
- this new event required modifying some manager code to inject the failure
@@ -1016,6 +1023,55 @@ def allreduce(self, tensors: List[torch.Tensor], opts: object) -> Work:
        return _DummyWork(tensors)


class FakeProcessGroupWrapper(ProcessGroupWrapper):
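The hunk above returns `_DummyWork(tensors)` on the success path, which hints at the other half of the pattern: async collectives typically surface errors through the returned Work handle's `wait()`, not by raising from `allreduce()` itself. A hedged sketch of how a fake PG might model that (the `_FailingWork` and `FakeAllreducePG` names are assumptions for illustration, and `_DummyWork` here is a simplified stand-in, not torchft's actual class):

```python
class _DummyWork:
    """Simplified stand-in for a successfully completed collective."""

    def __init__(self, tensors):
        self._tensors = tensors

    def wait(self):
        return self._tensors


class _FailingWork:
    """Work handle whose wait() reports the injected failure."""

    def wait(self):
        raise RuntimeError("injected allreduce failure")


class FakeAllreducePG:
    """Fake PG that can hand back a failing Work handle once."""

    def __init__(self):
        self._fail_next = False

    def fail_next_allreduce(self):
        self._fail_next = True

    def allreduce(self, tensors):
        if self._fail_next:
            self._fail_next = False  # one-shot failure
            return _FailingWork()
        return _DummyWork(tensors)
```

Deferring the error to `wait()` exercises the caller's error handling at the point where real collectives would actually report failure.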
nit: could maybe move this class into tests?
Yeah, there are a bunch of other fakes in this file, so maybe take this up separately? We also want to upstream process_group to PyTorch.
LGTM