Add CLIP and T5XXL for StableDiffusionV3 #1790


Merged
merged 6 commits into keras-team:keras-hub from add-sdv3 on Aug 26, 2024

Conversation

@james77777778 (Collaborator) commented Aug 22, 2024

Please refer to these Colabs for numerics checks:

CLIPTokenizer and T5XXLPreprocessor are verified using the transformers library as the reference implementation.

The reference implementations of CLIPTextEncoder and T5XXLTextEncoder come from private code.
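
A sketch of what the tokenizer side of such a check looks like (illustrative only, not the Colab's actual code; the checkpoint name and the clip_tokenizer handle are assumptions):

# Tokenize the same string with the transformers reference and with the
# new CLIPTokenizer, then check that the ids match.
from transformers import CLIPTokenizerFast

reference = CLIPTokenizerFast.from_pretrained("openai/clip-vit-large-patch14")
reference_ids = reference("a photo of a cat")["input_ids"]
print(reference_ids)
# keras_ids = clip_tokenizer("a photo of a cat")  # tokenizer from this PR
# assert reference_ids == list(keras_ids)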

Notes for reviewing:

  • CLIPTokenizer subclasses BytePairTokenizer with extensive modifications. Tests have been included.
  • CLIPPreprocessor subclasses Preprocessor and uses StartEndPacker to add the start and end tokens (see the sketch after this list).
  • CLIPTextEncoder consists of CLIPAttention and CLIPEncoderBlock and subclasses Backbone.
  • T5XXLPreprocessor subclasses Preprocessor and uses StartEndPacker to add the start token.
  • T5XXLTextEncoder is similar to T5 but excludes the decoder part.
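
A minimal sketch of the StartEndPacker usage (hypothetical; the ids and sequence length below are placeholders, not the real CLIP or T5 special-token ids):

# StartEndPacker prepends/appends special ids and pads to a fixed length.
import keras_nlp

packer = keras_nlp.layers.StartEndPacker(
    sequence_length=8,  # pad or truncate to this length
    start_value=101,    # assumed start-of-sequence id
    end_value=102,      # assumed end-of-sequence id
    pad_value=0,        # assumed padding id
)
print(packer([[3, 4, 5]]))  # -> [[101, 3, 4, 5, 102, 0, 0, 0]]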

Future work:

  • Implement CLIPPreprocessor
  • Wrap CLIPTextEncoder and T5XXLTextEncoder for use in StableDiffusionV3
  • Implement VAEImageDecoder (Add VAEImageDecoder for StableDiffusionV3 #1796)
  • Implement MMDiT
  • Implement StableDiffusionV3 (inference model)

@divyashreepathihalli @mattdangerw @SamanehSaadat

Should I finish all the implementations in this PR, or would it be better to separate them into multiple ones?

@mattdangerw (Member)

@james77777778 I think smaller PRs will probably be easier as a general strategy. So let's go with "multiple" to your question above.

General context for me, why do we need two tokenizers and text encoders?

@james77777778 (Collaborator, Author)

@mattdangerw

General context for me, why do we need two tokenizers and text encoders?

Actually, we need three pairs of them...

https://github.com/huggingface/diffusers/blob/dc07fc29da8a593f68080fbde0b9161a9a68bd36/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L131-L180

In their paper (https://arxiv.org/abs/2403.03206), they mention that:

While the main motivation for using multiple text-encoders is boosting the overall model performance

It should also be possible to have flexible text encoders:

we train our model with three text encoders, with an individual drop-out rate of 46.3%. Hence, at inference time, we can use an arbitrary subset of all three text encoders

However, the reference implementation currently uses all of them. We could implement the flexible design in a future PR.
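
To make that flexibility concrete, here is a rough sketch of dropping an encoder at inference. I am assuming, per the paper, that a dropped encoder's output is simply replaced with zeros; the names and shapes are illustrative, not the reference code.

# Stand-ins for the three text encoders' outputs (77 CLIP tokens,
# 256 T5 tokens); a dropped encoder contributes zeros instead.
import numpy as np

def encode_prompt(prompt, use_clip_l=True, use_clip_g=True, use_t5=True):
    # `prompt` is unused in this stand-in; real code would run the encoders.
    clip_l = np.random.rand(1, 77, 768) if use_clip_l else np.zeros((1, 77, 768))
    clip_g = np.random.rand(1, 77, 1280) if use_clip_g else np.zeros((1, 77, 1280))
    t5 = np.random.rand(1, 256, 4096) if use_t5 else np.zeros((1, 256, 4096))
    return clip_l, clip_g, t5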

@james77777778 marked this pull request as draft August 23, 2024 02:50
@james77777778 changed the title from "Add CLIPTokenizer, T5XXLTokenizer, CLIPTextEncoder and T5XXLTextEncoder" to "Add CLIPTokenizer, CLIPPreprocessor, T5XXLPreprocessor, CLIPTextEncoder and T5XXLTextEncoder" Aug 23, 2024
@james77777778 marked this pull request as ready for review August 23, 2024 07:23
@mattdangerw (Member) left a comment


Just some minor comments. I suspect the preprocessing might need to move around a bit as we figure out our general vision preprocessing story, but no need to block on that I think.

def call(self, x, y=None, sample_weight=None, sequence_length=None):
    x = convert_inputs_to_list_of_tensor_segments(x)
    if len(x) != 1:
        raise ValueError(
Member:

Side note: I've cleaned this up in a commit on master now; this will change slightly when I rebase the whole branch.

Collaborator (Author):

I've added a TODO for this call

)


class CLIPTextEncoder(Backbone):
Member:

I think we can land all of this in the SD3 folder for now.

But we might eventually want CLIP in its own folder here, usable with SD3. WDYT @divyashreepathihalli?

Collaborator (Author):

The implementation will be cleaner if we have a CLIP model. I can try adding it after landing SD3.

@divyashreepathihalli (Collaborator) commented Aug 26, 2024

@james77777778, CLIP is in KerasCV; it still needs to be added to keras_hub. That would make this implementation cleaner for sure.

Given that this only uses the text encoder part and not the whole CLIP model, we can land this as is now and clean up once CLIP is added.

from keras_nlp.src.models.t5.t5_transformer_layer import T5TransformerLayer


class T5XXLTextEncoder(Backbone):
Member:

Not necessarily for this PR, but I think we probably should make it possible to instantiate only a T5 encoder model from our actual T5 implementation. Then you could just use that symbol directly here.

Collaborator (Author):

I can try to decompose T5 into encoder and decoder architectures after landing SD3.

@mattdangerw (Member) commented Aug 23, 2024

@james77777778

Actually, we need three pairs of them...

So they actually just concat the encoded representations together, or something like that?

However, the reference implementation currently uses all of them. We could implement the flexible design in a future PR.

Yeah, let's do whatever is easier for now. This quote from the paper makes me wonder if only using the CLIP encoders might be the better default. Adding 5B parameters just for a slight performance boost is probably not something most of our users will want.

[attached image: excerpt from the paper]

But let's keep building whatever is simplest for now, can figure that out later.

@james77777778 (Collaborator, Author) commented Aug 24, 2024

So they actually just concat the encoded representations together, or something like that?

As far as I know, yes.
The CLIP encoding is zero-padded to the channel size of the T5XXL encoding and then concatenated.
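
A small sketch of that pad-and-concat, with shapes assumed from the diffusers pipeline linked above (77 CLIP tokens at 768 + 1280 = 2048 channels, 256 T5XXL tokens at 4096 channels):

import numpy as np

# Stand-in encoder outputs; shapes are assumptions, not the PR's code.
clip_embeds = np.random.rand(1, 77, 2048)  # CLIP-L and CLIP-G concatenated on channels
t5_embeds = np.random.rand(1, 256, 4096)   # T5XXL output

# Zero-pad the CLIP channels up to the T5XXL width, then concatenate
# the two encodings along the sequence axis.
pad = t5_embeds.shape[-1] - clip_embeds.shape[-1]
clip_padded = np.pad(clip_embeds, ((0, 0), (0, 0), (0, pad)))
context = np.concatenate([clip_padded, t5_embeds], axis=1)
print(context.shape)  # (1, 333, 4096)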

Yeah, let's do whatever is easier for now. This quote from the paper makes me wonder if only using the CLIP encoders might be the better default. Adding 5B parameters just for a slight performance boost is probably not something most of our users will want.

But let's keep building whatever is simplest for now, can figure that out later.

Sounds good! I will investigate dropping T5 after finishing SD3.
BTW, working with the T5 model is really a pain because I can't fit it on my GPU and can only run it on the CPU, which is very slow.

EDITED:
I made the implementation much clearer in CLIPEncoderBlock by using layers.MultiHeadAttention (see the sketch below). The numeric check has also been updated.
CLIPTextEncoder: https://colab.research.google.com/drive/1tHKAZYgEhu_Isgo3fd3mAK7uMTkOEW3w
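
For reference, a minimal sketch of what a MultiHeadAttention-based encoder block can look like: a pre-LayerNorm transformer block with a causal mask, as in CLIP's text transformer. This is my own illustration, not the PR's exact code; the layer sizes and the plain "gelu" activation are assumptions.

import keras
from keras import layers

class CLIPEncoderBlock(layers.Layer):
    def __init__(self, hidden_dim, num_heads, intermediate_dim, **kwargs):
        super().__init__(**kwargs)
        self.ln_1 = layers.LayerNormalization(epsilon=1e-5)
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=hidden_dim // num_heads
        )
        self.ln_2 = layers.LayerNormalization(epsilon=1e-5)
        self.dense_1 = layers.Dense(intermediate_dim, activation="gelu")
        self.dense_2 = layers.Dense(hidden_dim)

    def call(self, x):
        # Self-attention with a causal mask over the token sequence.
        residual = x
        x = self.ln_1(x)
        x = self.attention(x, x, use_causal_mask=True)  # query=x, value=x
        x = residual + x
        # Position-wise feed-forward network.
        residual = x
        x = self.dense_2(self.dense_1(self.ln_2(x)))
        return residual + x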

@james77777778 changed the title from "Add CLIPTokenizer, CLIPPreprocessor, T5XXLPreprocessor, CLIPTextEncoder and T5XXLTextEncoder" to "Add CLIP and T5XXL for StableDiffusionV3" Aug 26, 2024
@divyashreepathihalli (Collaborator) left a comment


LGTM

@divyashreepathihalli added the kokoro:force-run (Runs Tests on GPU) label Aug 26, 2024
@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label Aug 26, 2024
@mattdangerw (Member)

Let's pull this in! I suspect there will be a lot of churn on this model (given potential refactors of CLIP and T5, and the still-hazy preprocessing), but I think we should keep moving and reshuffle as we go.

@mattdangerw merged commit fdf6b6b into keras-team:keras-hub on Aug 26, 2024
12 checks passed
@james77777778 deleted the add-sdv3 branch August 27, 2024 00:53
mattdangerw pushed a commit to mattdangerw/keras-hub that referenced this pull request Sep 10, 2024
* Add `CLIPTokenizer`, `T5XXLTokenizer`, `CLIPTextEncoder` and `T5XXLTextEncoder`.

* Make CLIPTextEncoder as Backbone

* Add `T5XXLPreprocessor` and remove `T5XXLTokenizer`

Add `CLIPPreprocessor`

* Use `tf = None` at the top

* Replace manual implementation of `CLIPAttention` with `MultiHeadAttention`
mattdangerw pushed a commit that referenced this pull request Sep 11, 2024

mattdangerw pushed a commit that referenced this pull request Sep 13, 2024

mattdangerw pushed a commit that referenced this pull request Sep 17, 2024
divyashreepathihalli added a commit that referenced this pull request Sep 25, 2024
* Add VGG16 backbone (#1737)

* Agg Vgg16 backbone

* update names

* update tests

* update test

* add image classifier

* incorporate review comments

* Update test case

* update backbone test

* add image classifier

* classifier cleanup

* code reformat

* add vgg16 image classifier

* make vgg generic

* update doc string

* update docstring

* add classifier test

* update tests

* update docstring

* address review comments

* code reformat

* update the configs

* address review comments

* fix task saved model test

* update init

* code reformatted

* Add `ResNetBackbone` and `ResNetImageClassifier` (#1765)

* Add ResNetV1 and ResNetV2

* Address comments

* Add CSP DarkNet backbone and classifier (#1774)

* Add CSP DarkNet

* Add CSP DarkNet

* snake_case function names

* change use_depthwise to block_type

* Add `FeaturePyramidBackbone` and port weights from `timm` for `ResNetBackbone` (#1769)

* Add FeaturePyramidBackbone and update ResNetBackbone

* Simplify the implementation

* Fix CI

* Make ResNetBackbone compatible with timm and add FeaturePyramidBackbone

* Add conversion implementation

* Update docstrings

* Address comments

* Add DenseNet (#1775)

* Add DenseNet

* fix testcase

* address comments

* nit

* fix lint errors

* move description

* Add ViTDetBackbone (#1776)

* add vit det vit_det_backbone

* update docstring

* code reformat

* fix tests

* address review comments

* bump year on all files

* address review comments

* rename backbone

* fix tests

* change back to ViT

* address review comments

* update image shape

* Add Mix transformer (#1780)

* Add MixTransformer

* fix testcase

* test changes and comments

* lint fix

* update config list

* modify testcase for 2 layers

* update input_image_shape -> image_shape (#1785)

* update input_image_shape -> image_shape

* update docstring example

* code reformat

* update tests

* Create __init__.py (#1788)

add missing __init__ file to vit_det

* Hack package build script to rename to keras-hub (#1793)

This is a temporary way to test out the keras-hub branch.
- Does a global rename of all symbols during package build.
- Registers the "old" name on symbol export for saving compat.
- Adds a github action to publish every commit to keras-hub as
  a new package.
- Removes our descriptions on PyPI temporarily, until we want
  to message this more broadly.

* Add CLIP and T5XXL for StableDiffusionV3 (#1790)

* Add `CLIPTokenizer`, `T5XXLTokenizer`, `CLIPTextEncoder` and `T5XXLTextEncoder`.

* Make CLIPTextEncoder as Backbone

* Add `T5XXLPreprocessor` and remove `T5XXLTokenizer`

Add `CLIPPreprocessor`

* Use `tf = None` at the top

* Replace manual implementation of `CLIPAttention` with `MultiHeadAttention`

* Add Bounding Box Utils (#1791)

* Bounding box utils

* - Correct test cases

* - Remove hard tensorflow dtype

* - fix api gen

* - Fix import for test cases
- Use setup for converters test case

* - fix api_gen issue

* - FIx api gen

* - Fix api gen error

* - Correct test cases as per new api changes

* mobilenet_v3 added in keras-nlp (#1782)

* mobilenet_v3 added in keras-nlp

* minor bug fixed in mobilenet_v3_backbone

* formatting corrected

* refactoring backbone

* correct_pad_downsample method added

* refactoring backbone

* parameters updated

* Testcaseupdated, expected output shape corrected

* code formatted with black

* testcase updated

* refactoring and description added

* comments updated

* added mobilenet v1 and v2

* merge conflict resolved

* version arg removed, and config options added

* input_shape changed to image_shape in arg

* config updated

* input shape corrected

* comments resolved

* activation function format changed

* minor bug fixed

* minor bug fixed

* added vision_backbone_test

* channel_first bug resolved

* channel_first cases working

* comments  resolved

* formatting fixed

* refactoring

---------

Co-authored-by: ushareng <[email protected]>

* Pkgoogle/efficient net migration (#1778)

* migrating efficientnet models to keras-hub

* merging changes from other sources

* autoformatting pass

* initial consolidation of efficientnet_backbone

* most updates and removing separate implementation

* cleanup, autoformatting, keras generalization

* removed layer examples outside of effiicient net

* many, mainly documentation changes, small test fixes

* Add the ResNet_vd backbone (#1766)

* Add ResNet_vd to ResNet backbone

* Addressed requested parameter changes

* Fixed tests and updated comments

* Added new parameters to docstring

* Add `VAEImageDecoder` for StableDiffusionV3 (#1796)

* Add `VAEImageDecoder` for StableDiffusionV3

* Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`

* Replace `Backbone` with `keras.Model` in `CLIPTextEncoder` and `T5XXLTextEncoder` (#1802)

* Add pyramid output for densenet, cspDarknet (#1801)

* add pyramid outputs

* fix testcase

* format fix

* make common testcase for pyramid outputs

* change default shape

* simplify testcase

* test case change and add channel axis

* Add `MMDiT` for StableDiffusionV3 (#1806)

* Add `MMDiT`

* Update

* Update

* Update implementation

* Add remaining bbox utils (#1804)

* - Add formats, iou, utils for bounding box

* - Add `AnchorGenerator`, `BoxMatcher` and `NonMaxSupression` layers

* - Remove scope_name  not required.

* use default keras name scope

* - Correct format error

* - Remove layers as of now and keep them at model level till keras core supports them

* - Correct api_gen

* Fix timm conversion for rersnet (#1814)

* Add `StableDiffusion3`

* Fix `_normalize_inputs`

* Separate CLIP encoders from SD3 backbone.

* Simplify `text_to_image` function.

* Address comments

* Minor update and add docstrings.

* Fix

* Update

* Rename to diffuser and decoder

* Define functional model

* Merge from upstream/master

* Delete old SD3

* Fix copyright

* Rename to keras_hub

* Address comments

* Update

* Fix CI

* Fix bugs occurred in keras3.1

---------

Co-authored-by: Divyashree Sreepathihalli <[email protected]>
Co-authored-by: Sachin Prasad <[email protected]>
Co-authored-by: Matt Watson <[email protected]>
Co-authored-by: Siva Sravana Kumar Neeli <[email protected]>
Co-authored-by: Usha Rengaraju <[email protected]>
Co-authored-by: ushareng <[email protected]>
Co-authored-by: pkgoogle <[email protected]>
Co-authored-by: gowthamkpr <[email protected]>