
Add stochastic flow matching for CorrDiff #836


Open · daviddpruitt wants to merge 25 commits into main
Conversation

daviddpruitt (Collaborator):

PhysicsNeMo Pull Request

Description

This change adds stochastic flow matching to the existing CorrDiff implementation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

No additional dependencies are required.

@daviddpruitt daviddpruitt requested a review from mnabian April 4, 2025 23:48
@daviddpruitt daviddpruitt marked this pull request as ready for review April 4, 2025 23:48
@mnabian mnabian requested a review from pzharrington April 8, 2025 18:47
Revert changes to the corrdiff train examples.
Accidentally deleted line
) ** 2

# augment for conditional generation
x_tot = torch.cat((img_clean, img_lr), dim=1)
Collaborator:

This concat has no real point, right? x_tot is immediately split back out into x_1 and x_low and then never accessed again. Also, what does the note about augmenting and conditional generation mean?
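A sketch of the simplification being suggested (variable names come from the quoted snippet; shapes are stand-ins):

```python
import torch

img_clean = torch.randn(4, 3, 64, 64)  # stand-in high-res batch
img_lr = torch.randn(4, 3, 64, 64)     # stand-in low-res conditioning batch

# Instead of x_tot = torch.cat((img_clean, img_lr), dim=1) followed by an
# immediate split, just bind the tensors directly:
x_1, x_low = img_clean, img_lr
```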

def time_weight(t):
    return 1

sfm_loss = weight * ((time_weight(time) * (D_x_t - x_1)) ** 2)
Collaborator:

Let's drop the `time_weight` if it's just hard-coded to 1.
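For illustration, with the constant weight dropped, the loss line would reduce to (a sketch reusing the names from the snippet above):

```python
# time_weight(t) == 1 for all t, so it drops out of the loss entirely:
sfm_loss = weight * (D_x_t - x_1) ** 2
```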

Collaborator:

I'm confused about this module. Can you explain the motivation for adding a separate encoders file and putting this in it, rather than just using the Conv2d in layers.py?

Collaborator:

My high-level comment on this is that functionally SFMPrecondSR has the same behavior as EDMPrecondSR (aside from the extra get_sigma_max and update_sigma_max methods). At least as far as I can tell, correct me if I'm wrong.

Can you possibly refactor so that EDMPrecondSR can be used instead? The reason is we are currently suffering from a growing number of near-identical copies of EDM utilities (from CorrDiff variants and other projects), which is becoming hard to maintain and increasingly confusing for users/developers. It seems like here we can simply add the extra methods (which are simple) to the EDMPrecondSR along with some config/arg checking in __init__ to ensure proper usage, and we avoid having an extra module with extra CI tests, etc. @CharlelieLrt for viz
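A rough sketch of that direction, using a simplified stand-in for the real EDMPrecondSR (the method names come from this discussion; the actual constructor signature in physicsnemo has more arguments and is assumed here):

```python
import torch

class EDMPrecondSR(torch.nn.Module):
    """Simplified stand-in; the real class takes more constructor args."""

    def __init__(self, img_resolution, img_channels, sigma_max=float("inf")):
        super().__init__()
        self.img_resolution = img_resolution
        self.img_channels = img_channels
        self.sigma_max = sigma_max

    # The two SFM-specific additions, folded into the existing class
    # instead of duplicating it in a separate SFMPrecondSR:
    def get_sigma_max(self):
        return self.sigma_max

    def update_sigma_max(self, sigma_max):
        self.sigma_max = sigma_max
```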

class SFMPrecondEmpty(Module):
    """
    A preconditioner that does nothing

Collaborator:

For SFMPrecondEmpty I suggest just adding a simple one-liner nn.Module (or similar) in the train script rather than putting this in the core physicsnemo package code, since the need for it is really example-specific and only supports the edge case of training an encoder-only model in that example (I assume as a point of comparison against the real SFM model, which has the encoder/denoiser trained jointly). See the sketch below.
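A minimal sketch of the kind of one-liner that could live in the train script instead (semantics assumed: the "empty" preconditioner just passes its input through):

```python
import torch

class IdentityPrecond(torch.nn.Module):
    """Pass-through preconditioner for the encoder-only baseline."""

    def forward(self, x, *args, **kwargs):
        return x  # ignore sigma/conditioning args; return the input unchanged
```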


def sigma_inv(sigma):
    return sigma

Collaborator:

Let's remove these hard-coded identity functions for simplicity


@nvtx.annotate(message="SFM_encoder_sampler", color="red")
def SFM_encoder_sampler(
    networks: Dict[str, torch.nn.Module],
pzharrington (Collaborator), Apr 11, 2025:

Similar comment as for SFMPrecondEmpty. I think this can either be stashed in a utils file local to the example rather than in the physicsnemo core, or alternatively the SFM_sampler could support an option to allow running "encoder-only" sampling. A sketch of the latter is below.
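A sketch of the second option (the SFM_sampler signature and the run_flow_matching helper are assumptions for illustration, not the actual API):

```python
from typing import Dict

import torch

def SFM_sampler(
    networks: Dict[str, torch.nn.Module],
    latents: torch.Tensor,
    encoder_only: bool = False,
):
    x = networks["encoder"](latents)
    if encoder_only:
        return x  # stop after the encoder pass; skip flow-matching refinement
    return run_flow_matching(networks["denoiser"], x)  # hypothetical helper
```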



@nvtx.annotate(message="SFM_Euler_sampler_Adaptive_Sigma", color="red")
def SFM_Euler_sampler_Adaptive_Sigma(
Collaborator:

It seems like this can be absorbed into the main SFM_Euler_sampler and behavior can be controlled with function args, no? Again, similar motivation, reducing duplication of nearly the same functionality.
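For example (a sketch; the argument names are assumptions and euler_integrate is a hypothetical stand-in for the shared integration loop):

```python
def SFM_Euler_sampler(networks, latents, adaptive_sigma=False, sigma_max=80.0):
    """One Euler sampler; adaptive-sigma behavior selected by a flag."""
    if adaptive_sigma:
        # pull the noise scale from the model instead of the fixed argument
        sigma_max = networks["model"].get_sigma_max()
    return euler_integrate(networks, latents, sigma_max)  # hypothetical shared loop
```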


Some legacy plotting scripts are also available in the `inference` directory.
You can also bring your checkpoints to [earth2studio]<https://github.com/NVIDIA/earth2studio>
Collaborator:

What is meant by legacy plotting scripts here? If they don't work with the output generated by the scoring or generation scripts, I suggest just removing them. Also, the earth2studio link here doesn't render properly, FYI.


### Preliminaries
Start by installing Modulus (if not already installed) and copying this folder (`examples/generative/corrdiff++`) to a system with a GPU available. Also download the CorrDiff++ dataset from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/modulus/resources/modulus_datasets-hrrr_mini).

Collaborator:

Modulus->PhysicsNemo. Also, this link is to the HRRR mini dataset, but it looks like the configs only support training on the Taiwan data?

- An adaptive noise scaling mechanism informed by the encoder’s RMSE, used to inject calibrated uncertainty
- A final flow matching step to refine latent samples and synthesize fine-scale physical details

AFM outperforms previous methods across both real-world (e.g., 25 → 2 km super-resolution in Taiwan) and synthetic (Kolmogorov flow) benchmarks—especially for highly stochastic output channels.
Collaborator:

*SFM

The results, including logs and checkpoints, are saved by default to `outputs/mini_generation/`. You can direct the checkpoints to be saved elsewhere by setting: `++training.io.checkpoint_dir=</path/to/checkpoints>`.

> **_Out of memory?_** CorrDiff-Mini trains by default with a batch size of 256 (set by `training.hp.total_batch_size`). If you're using a single GPU, especially one with a smaller amount of memory, you might see an out-of-memory error. If that happens, set a smaller batch size per GPU, e.g.: `++training.hp.batch_size_per_gpu=16`. CorrDiff training will then automatically use gradient accumulation to train with an effective batch size of `training.hp.total_batch_size`.
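For concreteness, the accumulation arithmetic implied here works out as follows (a sketch of the logic, not the actual training-loop code):

```python
total_batch_size = 256   # training.hp.total_batch_size
batch_size_per_gpu = 16  # ++training.hp.batch_size_per_gpu
num_gpus = 1

# number of gradient-accumulation steps per optimizer update
accum_steps = total_batch_size // (batch_size_per_gpu * num_gpus)
print(accum_steps)  # 16 -> effective batch size is still 256 on one GPU
```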

Collaborator:

Similar comment for the CorrDiff-Mini model/dataset mentioned here -- is there a plan to implement/support that or is this example just intended to cover Taiwan/CWB for now?

def get_encoder(cfg: DictConfig):
    """
    Helper that sets instantiates a

Collaborator:

Docstring typo

pzharrington (Collaborator):

I think all my main comments are in. Aside from minor things, I think the main thing to focus on is how much the additional SFM stuff can be absorbed into existing EDM functionality in PhysicsNemo. This will greatly help our efforts going forward in improving the usability/readability/extensibility of these modules.
