[WIP] Compile for dp2ep #1365

xmfan · 2025-07-03T16:30:05Z

Status

🚧 NEED pytorch/pytorch changes 🚧
- [c10d] support dynamic shapes for all_to_all_single_autograd pytorch#157521
- [inductor] use aten grouped_mm by default, move template autotuning to max-autotune pytorch#158045
Graph breaks/recompiles for MoE (tlparse: https://fburl.com/amorfehs). Still looking through the graph breaks:
- Mostly due to experts being wrapped by MoE. Since we will not be tracing the FSDP wrapper, these graph breaks are considered fundamental until we migrate to SimpleFSDP.
- staticmethod on user-defined classes cannot be supported generically, so I moved those out.
- TODO: overlap the token dispatch a2a with expert computation
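
The staticmethod item in the list above can be sketched roughly as follows. This is only an illustration of the refactor, not the code in this PR: `RouterBefore`, `RouterAfter`, and `top1` are hypothetical names, and plain Python lists stand in for tensors so the snippet runs without PyTorch. The point is that logic previously reached through a staticmethod on a user-defined class becomes a module-level function that the forward path calls directly.

```python
# Hypothetical names throughout; plain lists stand in for tensors.


class RouterBefore:
    """Before: the helper is a staticmethod on a user-defined class."""

    @staticmethod
    def top1(scores):
        # Index of the highest score in each row.
        return [max(range(len(row)), key=row.__getitem__) for row in scores]

    def forward(self, scores):
        return RouterBefore.top1(scores)


def top1(scores):
    """After: the same logic as a module-level function."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


class RouterAfter:
    def forward(self, scores):
        # Calls the free function; no staticmethod lookup on the class.
        return top1(scores)


scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]
before = RouterBefore().forward(scores)
after = RouterAfter().forward(scores)
```

Both versions compute the same result, so the change is behavior-preserving; only the call target seen by the tracer differs.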

Repro

Tested on the debug model:
NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --parallelism.data_parallel_shard_degree=2 --parallelism.expert_parallel_degree=2 --training.compile
logs: https://gist.github.com/xmfan/41b822d9f09eb07fee62d684a061cec1

memory: 2.20GiB -> 1.42GiB
speedup: no significant change on the debug model; need to check with an actual model

facebook-github-bot added the CLA Signed label Jul 3, 2025

tianyu-l force-pushed the ep branch from 08c1ff1 to b0dffa1 Compare July 8, 2025 04:27

Base automatically changed from ep to main July 8, 2025 16:47

compile, but there's a hang for grouped_mm autotuning

2ef0e26

xmfan force-pushed the ep-compile branch from 050291b to 2ef0e26 Compare July 9, 2025 23:48

clean

3cde52e

xmfan mentioned this pull request Jul 11, 2025

[inductor] grouped_mm is autotuning under torch.compile default mode pytorch/pytorch#158042

Open
