[DRAFT]: Prediction head architecture clean-up #481
base: develop
Conversation
- eps in layer norms to 10^-3
- bf16
As part of this, we should also revisit the target coordinate computation.
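As a minimal sketch of the numerics point, assuming a standard PyTorch LayerNorm (the module names are illustrative, not the model's actual ones): the larger eps keeps the variance term well away from bf16 underflow.

```python
import torch
import torch.nn as nn

# Illustrative only: eps=1e-3 instead of the default 1e-5 so the denominator
# sqrt(var + eps) stays representable and stable when everything runs in bf16.
norm = nn.LayerNorm(512, eps=1e-3).to(torch.bfloat16)

x = torch.randn(8, 512, dtype=torch.bfloat16)
y = norm(x)  # normalisation computed entirely in bf16
```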
Looks exciting, and I'm very curious to see the regression plots. I will need another pass eventually, but here are some first comments.
Comments should be in the standard format (where we follow Anemoi and the rest of ECMWF).
@@ -84,7 +84,7 @@ masking_rate_sampling: True
# sample a subset of all target points, useful e.g. to reduce memory requirements
# include a masking strategy here, currently only supporting "random" and "block"
masking_strategy: "random"
-sampling_rate_target: 0.25
+sampling_rate_target: 0.4
In practice, we will use sampling rates of 0.7 or above.
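For illustration, a hedged sketch of how a random target subsampling rate like 0.7 could be applied; the helper below is hypothetical, only the config keys and the "random" strategy mirror the diff above.

```python
import torch

def subsample_targets(target_coords: torch.Tensor, sampling_rate_target: float = 0.7):
    # masking_strategy "random": keep each target point independently with the given rate
    keep = torch.rand(target_coords.shape[0]) < sampling_rate_target
    return target_coords[keep], keep

coords = torch.randn(10_000, 3)  # hypothetical (lat, lon, time) target coordinates
kept, mask = subsample_targets(coords, sampling_rate_target=0.7)
```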
@@ -0,0 +1,129 @@
streams_directory: "./config/streams/streams_anemoi/"
We should use machine-specific override configs and not duplicate everything. That's probably also best done (and tested) in a separate PR.
@@ -151,7 +157,8 @@ def sparsity_mask(score, b, h, q_idx, kv_idx):

    return flex_attention(qs, ks, vs, score_mod=sparsity_mask)

-self.compiled_flex_attention = torch.compile(att, dynamic=False)
+self.compiled_flex_attention = torch.compile(att, dynamic=False, mode="max-autotune")
Did you test that this does not degrade performance? I didn't have a good experience with this option.
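One way to settle that is a quick timing comparison between compile modes. A sketch along these lines (the harness, tensor shapes, and the identity score_mod are assumptions; only the torch.compile and flex_attention usage mirror the diff):

```python
import time
import torch
from torch.nn.attention.flex_attention import flex_attention

def att(qs, ks, vs):
    def sparsity_mask(score, b, h, q_idx, kv_idx):
        return score  # stand-in for the model's actual sparsity mask
    return flex_attention(qs, ks, vs, score_mod=sparsity_mask)

q = k = v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)

for mode in ("default", "max-autotune"):
    fn = torch.compile(att, dynamic=False, mode=mode)
    fn(q, k, v)  # warm-up / trigger compilation
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        fn(q, k, v)
    torch.cuda.synchronize()
    print(mode, (time.perf_counter() - t0) / 20)
```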
nn.init.zeros_(self.adaLN_modulation[-1].weight)
nn.init.zeros_(self.adaLN_modulation[-1].bias)

def forward(self, x: torch.Tensor, c: torch.Tensor, **kwargs) -> torch.Tensor:
We need to specify what an admissible c is in terms of shape.
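To make that concrete, a hedged sketch of the DiT-style adaLN block the snippet resembles, with the shapes one would document for c spelled out (class and argument names are assumptions, not the model's API):

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-3)
        self.layer = nn.Linear(dim, dim)
        self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))
        # zero-init so shift = scale = gate = 0 and the block starts as the identity
        nn.init.zeros_(self.adaLN_modulation[-1].weight)
        nn.init.zeros_(self.adaLN_modulation[-1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        # c: (batch, cond_dim) for per-sample conditioning, or
        #    (batch, num_tokens, cond_dim) for per-token conditioning
        mod = self.adaLN_modulation(c)
        if mod.dim() == x.dim() - 1:
            mod = mod.unsqueeze(-2)  # broadcast per-sample conditioning over tokens
        shift, scale, gate = mod.chunk(3, dim=-1)
        return x + gate * self.layer(self.ln(x) * (1 + scale) + shift)
```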
src/weathergen/model/norms.py
Outdated
apply_gate(
    self.layer(modulate(self.ln(x), shift, scale, 9, self.dim), c, **kwargs),
    gate,
    9,
Where does the magic 9 come from here?
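For reference, the usual modulate/gate pair carries no such constant. A sketch of the standard formulation (purely illustrative, these are not the functions in norms.py):

```python
import torch

def modulate(x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # scale is applied around 1 so that zero-initialised conditioning is a no-op
    return x * (1 + scale) + shift

def apply_gate(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    # gate the residual branch; gate == 0 switches the branch off entirely
    return gate * x
```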
src/weathergen/model/model.py
Outdated
),
    tcs_lens,
).sum(dim=1)
/ 8
Why the `/ 8`?
src/weathergen/model/model.py
Outdated
]
),
tcs_lens,
).sum(dim=1)
Over which dimension is the sum?
norm_eps=self.cf.mlp_norm_eps,
)

def forward(self, latent, output_cond, latent_lens, output_cond_lens):
    latent = self.dim_adapter(latent)
Is `latent` here the global tokens? Is `output_cond` the conditioning through the coordinates?
with_mlp=True,
    attention_kwargs=attention_kwargs,
))
elif self.cf.decoder_type == "CrossAttentionAdaNormConditioning":
This is the existing version?
for ith, dim in enumerate(self.dims_embed[:-1]):
    next_dim = self.dims_embed[ith+1]
    if self.cf.decoder_type == "PerceiverIO":
        # a single cross attention layer as per https://arxiv.org/pdf/2107.14795
We should document the different options, at least with a few lines: what goes in and how the conditioning is applied.
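To seed that documentation, a hedged sketch of the single cross-attention read-out the PerceiverIO paper describes (class and argument names are assumptions, not the model's API): output queries, e.g. embedded target coordinates, attend once to the latent array.

```python
import torch
import torch.nn as nn

class PerceiverIOReadout(nn.Module):
    """One cross-attention decode step: output queries read from the latent array."""

    def __init__(self, latent_dim: int, query_dim: int, num_heads: int = 8):
        super().__init__()
        self.q_norm = nn.LayerNorm(query_dim, eps=1e-3)
        self.kv_norm = nn.LayerNorm(latent_dim, eps=1e-3)
        self.attn = nn.MultiheadAttention(
            query_dim, num_heads, kdim=latent_dim, vdim=latent_dim, batch_first=True
        )

    def forward(self, latent: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # latent:  (batch, num_latent_tokens, latent_dim)  -- keys/values
        # queries: (batch, num_target_points, query_dim)   -- e.g. embedded target coordinates
        lat = self.kv_norm(latent)
        out, _ = self.attn(self.q_norm(queries), lat, lat)
        return queries + out
```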
Previously this PR was treated as independent.
Description
Introduce new prediction head architectures. This is a draft and needs to be regression tested in terms of performance. This will be a breaking change because it is a new model architecture.
Type of Change
Issue Number
Code Compatibility
Code Performance and Testing
- I ran `uv run train` and (if necessary) `uv run evaluate` on at least one GPU node and it works
- `$WEATHER_GENERATOR_PRIVATE` directory
Dependencies
Documentation
Additional Notes