[DeepSeek] Potential memory bug for noaux_tc? #1030

EugenHotaj · 2025-03-28T22:07:04Z

This is more of a question since I'm not sure if what I'm doing is valid. However, I noticed that `noaux_tc` uses a lot of memory for DSV3. Specifically, the first step succeeded but the run OOMs on the second step. This is surprising: we use a constant sequence length, so if the first step succeeds, in theory the rest should as well.

Do we need some `torch.no_grad()` calls when computing the top-k indices, similar to what we do in `moe_forward()`? Adding `detach()` to the `scores` like this PR does fixed the OOM, and the training curves / grad norms look ~identical for a small run I tried, but I'm not sure what I'm doing is valid.
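To make the question concrete, here is a minimal sketch of the idea, not the actual `noaux_tc` code from this PR. It assumes a simplified router with no expert groups; `select_experts`, `bias`, and `detach_selection` are hypothetical names used only for illustration. Since top-k indices are integers and carry no gradient, running the selection on detached scores should leave the gradients of the gathered weights unchanged, while keeping autograd from recording the selection ops.

```python
import torch


def select_experts(scores, bias, top_k, detach_selection=True):
    """Simplified bias-corrected top-k routing (illustrative, no expert groups).

    scores: (tokens, experts) gating scores that may require grad.
    bias:   (experts,) hypothetical per-expert correction bias.
    """
    # Selection only needs values: the resulting indices are integers,
    # so no gradient can flow through them either way.
    choice_src = scores.detach() if detach_selection else scores
    scores_for_choice = choice_src + bias
    topk_idx = torch.topk(scores_for_choice, k=top_k, dim=-1).indices
    # Weights are gathered from the original scores so gradients still flow.
    topk_weight = scores.gather(-1, topk_idx)
    return topk_idx, topk_weight


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(4, 8, requires_grad=True)
    bias = torch.randn(8)
    grads = []
    for detach in (True, False):
        scores = torch.sigmoid(logits)
        _, weight = select_experts(scores, bias, top_k=2, detach_selection=detach)
        (grad,) = torch.autograd.grad(weight.sum(), logits)
        grads.append(grad)
    # Compare the gradients w.r.t. the logits for both variants.
    print(torch.allclose(grads[0], grads[1]))
```

In this toy setup the two variants can be compared directly on the gradient with respect to `logits`. Whether the same holds for the full group-limited routing in DSV3, and whether it explains the OOM on the second step, is exactly what the question above is asking.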

cc @kwen2501 @lessw2020


facebook-github-bot added the CLA Signed label Mar 28, 2025

EugenHotaj closed this by deleting the head repository Sep 8, 2025
