
[DeepSeek] Potential memory bug for noaux_tc? #1030



Open
wants to merge 1 commit into main
Conversation

EugenHotaj
Contributor

This is more of a question since I'm not sure if what I'm doing is valid. However, I noticed that `noaux_tc` uses a lot of memory for DSV3. Specifically, the first step succeeds but the run OOMs on the second step -- this is surprising since we use a constant sequence length, so if the first step succeeds, in theory the rest should as well.

Do we need some `torch.no_grad()` calls when computing the top-k indices, similar to what we do in `moe_forward()`? Adding `detach()` to the `scores` like this PR does fixed the OOM, and the training curves / grad norms look ~identical for a small run I tried, but I'm not sure what I'm doing is valid.
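
To make the question concrete, here's a rough sketch of the kind of change I mean. The function name `noaux_tc_route` and the `expert_bias` / `top_k` arguments are illustrative, not the exact torchtitan signatures. The idea is that top-k index selection is non-differentiable anyway, so it can run on detached scores; gradients still flow through the original scores that get gathered and used as routing weights:

```python
import torch

def noaux_tc_route(scores: torch.Tensor, expert_bias: torch.Tensor, top_k: int):
    """Sketch of bias-corrected top-k routing with detached index selection.

    scores:      [num_tokens, num_experts], requires grad
    expert_bias: [num_experts] buffer used only to steer expert selection
    """
    # Index selection is non-differentiable, so compute it without tracking
    # gradients. Detaching `scores` here keeps the biased scores (and anything
    # derived from them, e.g. the expert_bias update) out of the autograd
    # graph, which seems to be what causes memory to grow across steps.
    with torch.no_grad():
        biased_scores = scores.detach() + expert_bias
        _, topk_indices = torch.topk(biased_scores, k=top_k, dim=-1)
    # Gradients still flow through the original (unbiased) scores that are
    # gathered and used as routing weights downstream.
    topk_scores = scores.gather(dim=-1, index=topk_indices)
    return topk_scores, topk_indices
```

With toy inputs (e.g. `scores = torch.randn(8, 64, requires_grad=True)`, `expert_bias = torch.zeros(64)`), the returned `topk_scores` still carry a `grad_fn` while the indices hold no reference to the graph, which matches what I'd expect training-wise.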

[Screenshot: training curves / grad norms from the small test run, 2025-03-28]

cc @kwen2501 @lessw2020

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 28, 2025