[WIP][RFC] Always flatten model state_dict #1347
Conversation
could you help verify "save seed checkpoint -> load from seed checkpoint with different parallelism" still yields identical loss, as an additional integration test?
```diff
@@ -573,8 +590,7 @@ def _states_to_load(self, model_only: bool) -> dict[str, Any]:
         """
         # For the first step, we will only load the model weights.
         if model_only:
-            sd = self.states[MODEL].state_dict()
-            return sd
+            return self.states[MODEL].state_dict()
```
Does this mean "model" still exists in state_dict as a key -- we only flatten it in the checkpoint (and its load and save)?
In Checkpointer, we still keep separate keys in `self.states`, like MODEL and OPTIMIZER. This allows us to manipulate the different state_dicts. This line uses MODEL to access only the model state_dict, but it does not wrap the model state_dict, so there will be no `model.` prefix.
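To make the prefix behavior concrete, here is a minimal, self-contained sketch. The `checkpoint_keys` helper and the plain `nn.Linear` model are illustrative stand-ins (not torchtitan or DCP code) for how a checkpoint backend derives storage keys by flattening nested dicts:

```python
# Illustrative only: a tiny stand-in for how a checkpoint backend derives
# storage keys by flattening nested dicts with "." separators.
import torch.nn as nn

def checkpoint_keys(state_dict: dict, prefix: str = "") -> list[str]:
    """Recursively flatten nested dicts into dotted storage keys."""
    keys: list[str] = []
    for name, value in state_dict.items():
        full = f"{prefix}{name}"
        if isinstance(value, dict):
            keys.extend(checkpoint_keys(value, prefix=f"{full}."))
        else:
            keys.append(full)
    return keys

model = nn.Linear(4, 4)

# Wrapped under a "model" key: storage keys carry the "model." prefix.
print(checkpoint_keys({"model": dict(model.state_dict())}))
# ['model.weight', 'model.bias']

# Flattened (what this PR does for the model state_dict): bare parameter FQNs.
print(checkpoint_keys(dict(model.state_dict())))
# ['weight', 'bias']
```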
@tianyu-l The loss curves don't match, w/ or w/o freqs_cis in the seed checkpoint. The seed checkpoint may have been broken before.
In terms of loss curve matching:
local_batch_size is 2 when TP is 2; otherwise local_batch_size is 1.
To make sure the dataloader behaves consistently across multiple runs, we need to fix the DP degree (dp_replicate * dp_shard).
So a typical comparison you may run (keeping the overall DP degree at 4) could be
- FSDP 4
- DP 2, FSDP 2, TP 2
- DP 2, FSDP 2, CP 2, PP 2

For details and more examples, please see
https://github.com/pytorch/torchtitan/blob/main/docs/converging.md
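As a quick sanity check on these configurations (assuming DP maps to dp_replicate and FSDP to dp_shard, as in the list above), each layout keeps dp_replicate * dp_shard = 4 even though the world size differs:

```python
# Pure arithmetic check: every suggested layout keeps the overall DP degree
# (dp_replicate * dp_shard) at 4, so the dataloader sharding is identical.
configs = {
    "FSDP 4":                   {"dp_replicate": 1, "dp_shard": 4, "tp": 1, "cp": 1, "pp": 1},
    "DP 2, FSDP 2, TP 2":       {"dp_replicate": 2, "dp_shard": 2, "tp": 2, "cp": 1, "pp": 1},
    "DP 2, FSDP 2, CP 2, PP 2": {"dp_replicate": 2, "dp_shard": 2, "tp": 1, "cp": 2, "pp": 2},
}

for name, c in configs.items():
    dp_degree = c["dp_replicate"] * c["dp_shard"]
    world_size = dp_degree * c["tp"] * c["cp"] * c["pp"]
    print(f"{name}: DP degree = {dp_degree}, world size = {world_size}")
    assert dp_degree == 4
```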
@tianyu-l I also tried FSDP 4 vs. FSDP 4 TP 2 before I tried this setting, and the loss curves didn't match.
SGTM!
The model state_dict is unique compared to other state dictionaries (e.g., optimizer). It's the only one that will be exported outside of TorchTitan and imported from other sources. To ensure FQN consistency, we previously removed the prefix during the first checkpoint load and the last checkpoint save. However, this approach has caused confusion among users, despite the available options to control the behavior.

This PR aims to resolve the issue by always flattening the model state dictionary, eliminating the `"MODEL."` prefix from its keys. We decided not to flatten all components due to the risk of key collisions between different components. Instead, this PR only flattens the model state_dict, which is a special case. While this solution isn't perfect, as it introduces different handling for different components, it's a good compromise given the unique nature of the model state_dict.

Also see the discussion in #1321 (comment).

This is the pseudocode for the current state:

```
if model_only:
    state_dict = model.state_dict()
else:
    state_dict = {
        "MODEL": model,
        "OPTIMIZER": optimizer,
        ...
    }
```

This is the pseudocode after this PR is landed:

```
state_dict = model.state_dict()
if not model_only:
    state_dict.update(
        {"OPTIMIZER": optimizer, ...}
    )
```

FSDP 4 vs. FSDP 4 TP 2 loss curve with seed checkpoint and `--training.seed=42`.
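For illustration, a runnable version of the "after" pseudocode could look like the sketch below. `build_checkpoint_state` is a hypothetical, simplified helper; the real Checkpointer stores stateful wrapper objects rather than raw state_dicts:

```python
# Simplified, hypothetical sketch of the post-PR layout: model FQNs are
# top-level keys; other components keep their own namespaced keys.
import torch
import torch.nn as nn

OPTIMIZER = "optimizer"

def build_checkpoint_state(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    model_only: bool,
) -> dict:
    # Model parameter FQNs ("weight", "bias", ...) become top-level keys.
    state_dict = dict(model.state_dict())
    if not model_only:
        # Non-model components stay under a namespaced key, so they
        # cannot collide with model FQNs.
        state_dict[OPTIMIZER] = optimizer.state_dict()
    return state_dict

model = nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

print(sorted(build_checkpoint_state(model, opt, model_only=True)))
# ['bias', 'weight']
print(sorted(build_checkpoint_state(model, opt, model_only=False)))
# ['bias', 'optimizer', 'weight']
```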