[SD-XL] Add new pipelines #3859
Conversation
The documentation is not available anymore as the PR was closed or merged.
        if self.config.timestep_spacing == "linspace":
            timesteps = np.linspace(0, self.config.num_train_timesteps - 1, num_inference_steps, dtype=float)[::-1].copy()
        elif self.config.timestep_spacing == "leading":
            step_ratio = self.config.num_train_timesteps // self.num_inference_steps
This new spacing doesn't give drastically better results, but better results nevertheless IMO. It's also needed to get 1-to-1 the same results as the original code.
Does the original code (XL) use this new spacing scheme, though?
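For reference, a minimal sketch of how the two spacing modes differ, using standalone values (`num_train_timesteps=1000`, `num_inference_steps=50`) instead of the scheduler config; the `"leading"` branch shown here follows the usual `arange`-with-stride pattern and omits any possible `steps_offset`:

```python
import numpy as np

num_train_timesteps = 1000
num_inference_steps = 50

# "linspace": spread the inference timesteps evenly over [0, num_train_timesteps - 1], newest first
linspace_ts = np.linspace(0, num_train_timesteps - 1, num_inference_steps, dtype=float)[::-1].copy()

# "leading": walk the training schedule with a fixed integer stride, newest first
step_ratio = num_train_timesteps // num_inference_steps
leading_ts = (np.arange(0, num_inference_steps) * step_ratio).round()[::-1].copy().astype(float)

print(linspace_ts[:3])  # [999.       978.612... 958.224...]
print(leading_ts[:3])   # [980. 960. 940.]
```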
        num_transformer_blocks (`int` or `Tuple[int]`, *optional*, defaults to 1):
            The number of transformer blocks of type [`~models.attention.BasicTransformerBlock`]. Only relevant for [`~models.unet_2d_blocks.CrossAttnDownBlock2D`], [`~models.unet_2d_blocks.CrossAttnUpBlock2D`], [`~models.unet_2d_blocks.UNetMidBlock2DCrossAttn`].
So a Transformer block can be a UNet block? I don't find the `num_transformer_blocks` name to be a good one to encompass all the blocks we're supporting here. But I can't think of a better one, either. So, okay to ignore I guess.
Yeah, good point, maybe `transformer_layers_per_block` is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better for sure.
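For illustration, a sketch of how the renamed argument could be passed when constructing a small UNet; the concrete shapes and dimensions below are made up for the example and are not taken from the PR:

```python
from diffusers import UNet2DConditionModel

# Hypothetical config: deeper transformer stacks in the lower-resolution blocks,
# one entry per down block (the entry is ignored for blocks without cross-attention).
unet = UNet2DConditionModel(
    sample_size=64,
    block_out_channels=(128, 256, 512),
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
    transformer_layers_per_block=(1, 2, 4),
    cross_attention_dim=1024,
)
```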
def convert_open_clip_checkpoint(checkpoint):
    text_model = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")
def convert_open_clip_checkpoint(checkpoint, prefix="cond_stage_model.model."):
    # text_model = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")
Are we not affecting the SD 2 conversion process with this one?
Need to double check!
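A sketch of what the new `prefix` parameter enables: filtering and re-keying only the text-encoder weights out of a full checkpoint, so the same routine can serve both SD 2.x and SD-XL checkpoints. The SD-XL prefix below is an assumption for illustration, not taken from this diff:

```python
def _strip_text_encoder_prefix(state_dict, prefix="cond_stage_model.model."):
    # Keep only the keys belonging to the OpenCLIP text encoder and drop the prefix,
    # leaving key names the downstream conversion logic already understands.
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# SD 2.x checkpoints use the default "cond_stage_model.model." prefix; an SD-XL checkpoint
# would pass something like prefix="conditioner.embedders.1.model." instead.
```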
    num_train_timesteps = original_config.model.params.timesteps or 1000
    beta_start = original_config.model.params.linear_start or 0.02
    beta_end = original_config.model.params.linear_end or 0.085
Where are these numbers coming from? I'd make a note for our future reference.
Ah, this is hacky for now and shouldn't be this way.
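For future reference, the classic SD configs set `linear_start=0.00085` and `linear_end=0.012`; a sketch of making the fallbacks explicit (not the PR's code, and the values are noted only as the usual defaults):

```python
# Fall back to the standard Stable Diffusion schedule values when the original
# config does not specify them, instead of relying on `or` with unexplained numbers.
params = original_config.model.params
num_train_timesteps = params.timesteps if "timesteps" in params else 1000
beta_start = params.linear_start if "linear_start" in params else 0.00085
beta_end = params.linear_end if "linear_end" in params else 0.012
```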
        text_encoder_lora_scale = (
            cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
        )
        (
Four tensors are returned instead of just one. The first two are the normal positive and negative prompt embeddings that are passed into cross-attention. The last two "pooled" embeds are used to additionally condition the time embedding.
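A sketch of the unpacking being described; the names are illustrative rather than the exact signature in this PR:

```python
(
    prompt_embeds,                  # per-token embeddings for the positive prompt -> cross-attention
    negative_prompt_embeds,         # per-token embeddings for the negative prompt -> cross-attention
    pooled_prompt_embeds,           # pooled positive embedding -> added to the time embedding
    negative_pooled_prompt_embeds,  # pooled negative embedding -> added to the time embedding
) = self.encode_prompt(
    prompt,
    device,
    num_images_per_prompt,
    do_classifier_free_guidance,
    negative_prompt,
    lora_scale=text_encoder_lora_scale,
)
```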
Fix embeddings for classic SD models.
@@ -107,6 +107,13 @@ class EulerDiscreteScheduler(SchedulerMixin, ConfigMixin):
            This parameter controls whether to use Karras sigmas (Karras et al. (2022) scheme) for step sizes in the
            noise schedule during the sampling process. If True, the sigmas will be determined according to a sequence
            of noise levels {σi} as defined in Equation (5) of the paper https://arxiv.org/pdf/2206.00364.pdf.
        timestep_spacing (`str`, default `"linspace"`):
These changes should also work well for other schedulers.
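As a usage sketch, the flag would presumably be switched the same way other scheduler options are, via the config; the repo id is the one from this PR's description and may require access to the 0.9 weights:

```python
from diffusers import EulerDiscreteScheduler

# Load the scheduler config shipped with the model and override the spacing mode.
scheduler = EulerDiscreteScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    subfolder="scheduler",
    timestep_spacing="leading",
)
scheduler.set_timesteps(num_inference_steps=30)
print(scheduler.timesteps[:5])
```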
src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
        guidance_rescale (`float`, *optional*, defaults to 0.7):
defaults to 0.0*
Ah yes, we should probably fix this in a follow-up PR! Sorry, just noticed the comment here. Would you like to open a PR here maybe @bghira? :-)
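Whatever the documented default ends up being, the value can be set explicitly per call; a sketch, assuming `pipe` is an SD-XL pipeline as in the usage examples further down:

```python
image = pipe(
    prompt="an astronaut riding a green horse",  # illustrative prompt
    guidance_scale=7.5,
    guidance_rescale=0.7,  # 0.0 disables the rescaling entirely
).images[0]
```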
This reverts commit 491bc9f.
He just merged the feature branch to
thanks
* Add new text encoder * add transformers depth * More * Correct conversion script * Fix more * Fix more * Correct more * correct text encoder * Finish all * proof that in works in run local xl * clean up * Get refiner to work * Add red castle * Fix batch size * Improve pipelines more * Finish text2image tests * Add img2img test * Fix more * fix import * Fix embeddings for classic models (huggingface#3888) Fix embeddings for classic SD models. * Allow multiple prompts to be passed to the refiner (huggingface#3895) * finish more * Apply suggestions from code review * add watermarker * Model offload (huggingface#3889) * Model offload. * Model offload for refiner / img2img * Hardcode encoder offload on img2img vae encode Saves some GPU RAM in img2img / refiner tasks so it remains below 8 GB. --------- Co-authored-by: Patrick von Platen <[email protected]> * correct * fix * clean print * Update install warning for `invisible-watermark` * add: missing docstrings. * fix and simplify the usage example in img2img. * fix setup for watermarking. * Revert "fix setup for watermarking." This reverts commit 491bc9f. * fix: watermarking setup. * fix: op. * run make fix-copies. * make sure tests pass * improve convert * make tests pass * make tests pass * better error message * fiinsh * finish * Fix final test --------- Co-authored-by: Pedro Cuenca <[email protected]> Co-authored-by: Sayak Paul <[email protected]>
Usage for `stabilityai/stable-diffusion-xl-base-0.9`:

In addition, make sure to install `transformers`, `safetensors`, and `accelerate`, as well as the invisible watermark. You can then use the model as follows.
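A minimal sketch of the base-model usage the description refers to (the prompt is just an example; the code assumes the `StableDiffusionXLPipeline` added in this PR):

```python
# First: pip install transformers safetensors accelerate invisible-watermark
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.to("cuda")

prompt = "An astronaut riding a green horse"
image = pipe(prompt=prompt).images[0]
```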
When using `torch >= 2.0`, you can improve the inference speed by 20-30% with `torch.compile`. Simply wrap the UNet with `torch.compile` before running the pipeline. If you are limited by GPU VRAM, you can enable CPU offloading by calling `pipe.enable_model_cpu_offload` instead of `.to("cuda")`.
Usage for `stabilityai/stable-diffusion-xl-refiner-0.9`:

When using `torch >= 2.0`, you can improve the inference speed by 20-30% with `torch.compile`. Simply wrap the UNet with `torch.compile` before running the pipeline. If you are limited by GPU VRAM, you can enable CPU offloading by calling `pipe.enable_model_cpu_offload` instead of `.to("cuda")`.
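A minimal sketch of running the refiner via the img2img pipeline added in this PR, feeding it the output of the base model; the same `torch.compile` and CPU-offload options shown above apply here too:

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
refiner.to("cuda")

# `image` is a PIL image, e.g. the output of the base pipeline above.
refined = refiner(prompt=prompt, image=image).images[0]

# Performance options, as for the base pipeline:
# refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
# refiner.enable_model_cpu_offload()
```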