Description
I've noticed a potential inconsistency in how the VAE-encoded control_image is processed in the ControlNet training script for Stable Diffusion 3 compared with the corresponding inference pipeline.
In the inference pipeline (pipeline_stable_diffusion_3_controlnet.py):
The control_image latent is processed by both subtracting the vae_shift_factor and multiplying by the scaling_factor.
However, in the provided training example, the VAE-encoded controlnet_image is only multiplied by the scaling_factor, without subtracting the shift_factor.
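For reference, here is a minimal sketch of the two code paths as I understand them; the variable names, the model id, and the use of vae.config.shift_factor / vae.config.scaling_factor are assumptions meant to mirror the diffusers SD3 setup rather than exact excerpts from either file:

```python
import torch
from diffusers import AutoencoderKL

# Hypothetical setup: an SD3-style VAE whose config exposes both factors.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="vae"
)
control_image = torch.randn(1, 3, 256, 256)  # placeholder conditioning image

with torch.no_grad():
    latents = vae.encode(control_image).latent_dist.sample()

# Inference pipeline (as described above): shift, then scale.
inference_latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor

# Training example (as described above): scale only, no shift.
training_latents = latents * vae.config.scaling_factor
```

With a nonzero shift_factor, the two branches produce latents with different means, so the ControlNet would see differently distributed conditioning latents at train and test time.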
Shouldn't the training script also apply the vae_shift_factor to maintain consistency with the inference process? It seems the correct implementation in the training script should be:

controlnet_image = (controlnet_image - vae.config.shift_factor) * vae.config.scaling_factor
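For clarity, this is roughly how the proposed fix would slot into the training loop; the surrounding lines are a paraphrase of a typical ControlNet training step (the batch key and variable names are assumptions), not an exact excerpt from the example script:

```python
# Sketch of the training-side encoding with the proposed fix applied.
# `vae`, `batch`, and `weight_dtype` stand in for objects the training
# script already has in scope.
controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
controlnet_image = vae.encode(controlnet_image).latent_dist.sample()

# Apply the same shift-and-scale normalization the inference pipeline uses,
# so training and inference feed identically distributed latents to the ControlNet.
controlnet_image = (controlnet_image - vae.config.shift_factor) * vae.config.scaling_factor
```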