Make checkpoint fail_fast feature optional #1310

Merged: 4 commits, Jun 18, 2025
@fegin (Contributor) commented Jun 17, 2025

While the fail_fast checkpointing feature is useful, it can also waste time and storage when the cluster has already been verified with TorchTitan. This PR makes the fail_fast feature optional, with a default of False.
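
To illustrate the intent, here is a minimal sketch of an opt-in flag gating an early checkpoint. It is not the code from this PR: `CheckpointConfig`, its fields, and `should_save` are hypothetical names, and the sketch assumes that fail_fast means writing a checkpoint at the first training step so that a broken checkpoint path surfaces immediately.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Hypothetical config; field names are illustrative, not TorchTitan's."""

    enable: bool = True
    interval: int = 500
    # Opt-in flag, off by default, mirroring the default described in this PR.
    fail_fast: bool = False


def should_save(step: int, cfg: CheckpointConfig) -> bool:
    """Return True when a checkpoint should be written at `step` (1-indexed)."""
    if not cfg.enable:
        return False
    # Assumed meaning of fail_fast: save at the very first step so that
    # checkpointing problems show up early instead of at the first interval.
    if cfg.fail_fast and step == 1:
        return True
    # Otherwise fall back to the regular interval-based schedule.
    return step % cfg.interval == 0
```

With the default configuration, no checkpoint is written at step 1, which avoids the extra time and storage on a cluster that is already known to work. Setting `fail_fast=True` restores the early save.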

fegin added 2 commits June 16, 2025 19:59
Summary:
While fail_fast checkpointing is useful, it can also waste time and storage when the cluster has already been verified with TorchTitan. Make the fail_fast feature optional, with a default of False.
@fegin requested review from tianyu-l and wwwjn as code owners June 17, 2025 04:26
@facebook-github-bot added the CLA Signed label Jun 17, 2025
@fegin requested a review from wwwjn June 18, 2025 00:54
@fegin merged commit f4048f8 into main Jun 18, 2025
7 checks passed
@fegin deleted the disable_fail_fast branch June 18, 2025 04:58
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Jun 26, 2025
wwwjn pushed a commit that referenced this pull request Jul 1, 2025