
compile: turn off fullgraph=True to support llama4 #1182

Open

bdhirsh wants to merge 4 commits into gh/bdhirsh/3/base

Conversation

@bdhirsh commented May 12, 2025

This PR + pytorch/pytorch#153384 is enough to get torchtitan running for me with llama4 and compile:

```
CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.compile
```

Stack from ghstack (oldest at bottom):

```diff
@@ -304,7 +304,7 @@ def apply_compile(model: nn.Module):
     repeated structure. Alternatively one can compile the whole model (after applying DP).
     """
     for layer_id, transformer_block in model.layers.named_children():
-        transformer_block = torch.compile(transformer_block, fullgraph=True)
```
Contributor

Can we add a comment/TODO to remind us to turn it back on when issues are resolved?

bdhirsh (Author)

Oh, will do. There are two related things here:

(1) grouped_mm support in compile, which torchtitan uses in llama4. I added basic support in core in this PR: pytorch/pytorch#153384

(2) E2E llama4 + compile in torchtitan. The current reason this completely blows up today is that torchtitan's llama4 + FSDP2 integration requires wrapping the MoE layer, which requires installing backward hooks around the MoE layer. Compile does not support compiling backward hooks (we graph break), and so we need to do one of these options:

(a) allow the graph break (turn off fullgraph=True)

(b) tweak torchtitan so that, instead of compiling each transformer block as one graph, we compile the MoE layers separately and compile the rest of each transformer block separately as well (a rough sketch follows below).
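To make option (b) concrete, here is a rough sketch (purely illustrative, not torchtitan's actual code; the block/submodule layout is an assumption): instead of wrapping a whole transformer block in a single `torch.compile` region, each child submodule of the block gets its own compiled region, so hooks installed around the MoE module fall between compiled regions rather than inside one.

```python
# Illustrative sketch of option (b); the block/submodule layout is an assumption,
# not torchtitan's real structure.
import torch
import torch.nn as nn


def apply_compile_per_submodule(model: nn.Module) -> None:
    for _, transformer_block in model.layers.named_children():
        for name, submodule in list(transformer_block.named_children()):
            # Compile each major submodule (attention, MoE, norms, ...) on its
            # own, so that backward hooks installed *around* the MoE module stay
            # outside any single compiled region.
            transformer_block.register_module(name, torch.compile(submodule, fullgraph=True))
```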

I also mentioned this to @tianyu-l, but calling it out here: (a) is easier to do, so I'm doing it here, but it carries the risk that if any change to core increases the number of graph breaks in torchtitan, we won't error as loudly (we may just see a perf drop instead). (b) is probably better to do at some point; I'm just doing the simpler thing here.

Are folks working on torchtitan interested in running benchmarks for titan + llama4 (with compile on/off)?

bdhirsh (Author)

@fegin I actually tweaked the PR so that we still compile the "regular" transformer blocks with fullgraph=True, and only use fullgraph=False for the blocks with MoE layers. I think this should reduce the risk that we hit regressions, so it may be a reasonable long-term solution (when using FSDP2 in torchtitan), as long as we see reasonable perf numbers.
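A minimal sketch of the per-block switch being described, assuming a hypothetical `moe_enabled` attribute on each block to detect MoE (the actual attribute torchtitan checks may differ, and this is not the exact PR diff):

```python
# Sketch only: keep fullgraph=True for dense transformer blocks, and fall back
# to fullgraph=False for blocks containing a MoE layer, whose backward hooks
# force a graph break. `moe_enabled` is an assumed attribute name.
import torch
import torch.nn as nn


def apply_compile(model: nn.Module) -> None:
    for layer_id, transformer_block in model.layers.named_children():
        has_moe = getattr(transformer_block, "moe_enabled", False)
        compiled_block = torch.compile(transformer_block, fullgraph=not has_moe)
        model.layers.register_module(layer_id, compiled_block)
```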

Contributor

@bdhirsh Thanks for the thorough explanation. The 8-GPU integration test timed out. Since llama4 is not in the integration test, the timeout should not be caused by this PR. I'll relaunch the test anyway, but feel free to land it.

@tianyu-l (Contributor) left a comment

> The current reason this completely blows up today is that torchtitan's llama4 + FSDP2 integration requires wrapping the MoE layer, which requires installing backward hooks around the MoE layer.

Hmm, let's be more careful here. This is only true when EP is used, specifically dp2ep (e.g. in #732). Currently EP is not supported in Llama 4 yet, which means we apply homogeneous FSDP2 wrapping to the transformer blocks only (not to MoE modules). So I suppose full-graph compilation shouldn't be violated. If we set fullgraph=True, where would it break?

@bdhirsh (Author) commented May 13, 2025

Locally, I was seeing that when we compile each transformer block, dynamo was trying (and failing) to graph break, because someone was attempting to install backward hooks inside one of the transformer blocks. If that's surprising to you, I can try to find the code that is installing the backward hook.
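For context, a minimal standalone repro of the failure mode being described (my own sketch, not torchtitan code): compiling a module with `fullgraph=True` after a full backward hook has been installed on one of its submodules should surface the graph break as a hard error rather than silently splitting the graph.

```python
# Minimal sketch: a backward hook on a submodule forces dynamo to graph break;
# with fullgraph=True the graph break becomes a hard error instead.
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.inner = nn.Linear(8, 8)
        # Stand-in for the hooks FSDP2 / load balancing installs around the MoE module.
        self.inner.register_full_backward_hook(lambda mod, grad_in, grad_out: None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.inner(x)


compiled = torch.compile(Block(), fullgraph=True)
out = compiled(torch.randn(2, 8, requires_grad=True))  # expected to error on the graph break
out.sum().backward()
```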

@tianyu-l (Contributor)

I think the backward hooks are from the auxiliary-loss-free load balancing (#1114).

The load-balancing algorithm maintains a bias term for each expert, based on the number of tokens that expert has seen so far.

  1. The single-device algorithm needs a backward hook to update the bias term after each iteration.
  2. For multi-device, we need another backward hook to all-reduce the bias term across all DP ranks, since different DP ranks see different inputs.

Using forward, forward-pre, or backward-pre hooks would conflict with activation checkpointing.
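A rough sketch of the shape of such a hook (purely illustrative; the buffer names `expert_bias` / `tokens_per_expert`, the update rule, and folding both steps into one hook are assumptions, not the #1114 implementation):

```python
# Illustrative sketch only. Assumes the router keeps an `expert_bias` buffer
# (added to routing scores) and a `tokens_per_expert` counter; names are made up.
import torch
import torch.distributed as dist
import torch.nn as nn


def make_bias_update_hook(router: nn.Module, step_size: float = 1e-3):
    def hook(module, grad_input, grad_output):
        with torch.no_grad():
            counts = router.tokens_per_expert.float()
            if dist.is_available() and dist.is_initialized():
                # Multi-device: different DP ranks see different inputs, so
                # sum the per-expert token counts across DP ranks.
                dist.all_reduce(counts)
            # Nudge the bias toward under-loaded experts and away from
            # over-loaded ones, then reset the counter for the next iteration.
            router.expert_bias += step_size * torch.sign(counts.mean() - counts)
            router.tokens_per_expert.zero_()
    return hook


# e.g. router.register_full_backward_hook(make_bias_update_hook(router))
```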
