Description
I've noticed a potential inconsistency in how the VAE-encoded control_image is processed in the ControlNet training script for Stable Diffusion 3 compared with the corresponding inference pipeline.
In the inference pipeline (pipeline_stable_diffusion_3_controlnet.py):
The control_image latent is processed by both subtracting the vae_shift_factor and multiplying by the scaling_factor.
However, in the provided training example, the VAE-encoded controlnet_image is only multiplied by the scaling_factor, without subtracting the shift_factor.
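For reference, here is a minimal sketch of the two code paths as I understand them; the variable names, the model id, and the use of vae.config.shift_factor / vae.config.scaling_factor are assumptions meant to mirror the diffusers SD3 setup rather than exact excerpts from either file:

```python
import torch
from diffusers import AutoencoderKL

# Hypothetical setup: an SD3-style VAE whose config exposes both factors.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="vae"
)
control_image = torch.randn(1, 3, 256, 256)  # placeholder conditioning image

with torch.no_grad():
    latents = vae.encode(control_image).latent_dist.sample()

# Inference pipeline (as described above): shift, then scale.
inference_latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor

# Training example (as described above): scale only, no shift.
training_latents = latents * vae.config.scaling_factor
```

With a nonzero shift_factor, the two branches produce latents with different means, so the ControlNet would see differently distributed conditioning latents at train and test time.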
Shouldn't the training script also apply the vae_shift_factor to maintain consistency with the inference process? It seems the correct implementation in the training script should be:

controlnet_image = (controlnet_image - vae.config.shift_factor) * vae.config.scaling_factor
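For clarity, this is roughly how the proposed fix would slot into the training loop; the surrounding lines are a paraphrase of a typical ControlNet training step (the batch key and variable names are assumptions), not an exact excerpt from the example script:

```python
# Sketch of the training-side encoding with the proposed fix applied.
# `vae`, `batch`, and `weight_dtype` stand in for objects the training
# script already has in scope.
controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
controlnet_image = vae.encode(controlnet_image).latent_dist.sample()

# Apply the same shift-and-scale normalization the inference pipeline uses,
# so training and inference feed identically distributed latents to the ControlNet.
controlnet_image = (controlnet_image - vae.config.shift_factor) * vae.config.scaling_factor
```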