
feat: pipeline-level quantization config #11130


Merged 39 commits into `main` from `feat/pipeline-quant-config` on May 9, 2025.

Commits (39):
316ff46  feat: pipeline-level quant config. (sayakpaul, Mar 10, 2025)
eec5b98  Revert "feat: pipeline-level quant config." (sayakpaul, Mar 20, 2025)
c94d85a  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Mar 20, 2025)
4d3dede  feat: implement pipeline-level quantization config (sayakpaul, Mar 21, 2025)
dc79f32  update (sayakpaul, Mar 21, 2025)
df749e4  fixes (sayakpaul, Mar 21, 2025)
d0ad15e  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Mar 21, 2025)
f8b514b  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Mar 26, 2025)
9250941  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Mar 27, 2025)
13d5589  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 2, 2025)
f678437  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 10, 2025)
5a85871  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 11, 2025)
557136d  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 17, 2025)
0d9814f  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 21, 2025)
f8d1bd1  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 24, 2025)
c7e0774  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 25, 2025)
6861da1  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 28, 2025)
82bcce0  fix validation. (sayakpaul, Apr 29, 2025)
78f134b  add tests and other improvements. (sayakpaul, Apr 29, 2025)
f2b39e0  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, Apr 29, 2025)
3b76e0a  add tests (sayakpaul, Apr 29, 2025)
695061b  import quality (sayakpaul, Apr 29, 2025)
9693251  remove prints. (sayakpaul, Apr 29, 2025)
73f1ad1  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 1, 2025)
dc90b06  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 2, 2025)
872c91e  add docs. (sayakpaul, May 2, 2025)
fbdf4c6  fixes to docs. (sayakpaul, May 2, 2025)
da6df86  doc fixes. (sayakpaul, May 2, 2025)
9a418a9  doc fixes. (sayakpaul, May 2, 2025)
5b6ee10  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 3, 2025)
478a353  add validation to the input quantization_config. (sayakpaul, May 6, 2025)
f96bcc7  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 6, 2025)
0ae2a9a  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 8, 2025)
d6b48ea  clarify recommendations. (sayakpaul, May 8, 2025)
ca2e116  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 8, 2025)
ffb974f  docs (sayakpaul, May 8, 2025)
86ee773  add to ci. (sayakpaul, May 8, 2025)
037a68b  Merge branch 'main' into feat/pipeline-quant-config (sayakpaul, May 9, 2025)
7b8a73d  todo. (sayakpaul, May 9, 2025)
54 changes: 54 additions & 0 deletions .github/workflows/nightly_tests.yml
@@ -525,118 +525,172 @@
pip install slack_sdk tabulate
python utils/log_reports.py >> $GITHUB_STEP_SUMMARY

run_nightly_pipeline_level_quantization_tests:
name: Torch quantization nightly tests
strategy:
fail-fast: false
max-parallel: 2
runs-on:
group: aws-g6e-xlarge-plus
container:
image: diffusers/diffusers-pytorch-cuda
options: --shm-size "20gb" --ipc host --gpus 0
steps:
- name: Checkout diffusers
uses: actions/checkout@v3
with:
fetch-depth: 2
- name: NVIDIA-SMI
run: nvidia-smi
- name: Install dependencies
run: |
python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
python -m uv pip install -e [quality,test]
python -m uv pip install -U bitsandbytes optimum_quanto
python -m uv pip install pytest-reportlog
- name: Environment
run: |
python utils/print_env.py
- name: Pipeline-level quantization tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
CUBLAS_WORKSPACE_CONFIG: :16:8
BIG_GPU_MEMORY: 40
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
--make-reports=tests_pipeline_level_quant_torch_cuda \
--report-log=tests_pipeline_level_quant_torch_cuda.log \
tests/quantization/test_pipeline_level_quantization.py
- name: Failure short reports
if: ${{ failure() }}
run: |
cat reports/tests_pipeline_level_quant_torch_cuda_stats.txt
cat reports/tests_pipeline_level_quant_torch_cuda_failures_short.txt
- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v4
with:
name: torch_cuda_pipeline_level_quant_reports
path: reports
- name: Generate Report and Notify Channel
if: always()
run: |
pip install slack_sdk tabulate
python utils/log_reports.py >> $GITHUB_STEP_SUMMARY

# M1 runner currently not well supported
# TODO: (Dhruv) add these back when we setup better testing for Apple Silicon
# run_nightly_tests_apple_m1:
# name: Nightly PyTorch MPS tests on MacOS
# runs-on: [ self-hosted, apple-m1 ]
# if: github.event_name == 'schedule'
#
# steps:
# - name: Checkout diffusers
# uses: actions/checkout@v3
# with:
# fetch-depth: 2
#
# - name: Clean checkout
# shell: arch -arch arm64 bash {0}
# run: |
# git clean -fxd
# - name: Setup miniconda
# uses: ./.github/actions/setup-miniconda
# with:
# python-version: 3.9
#
# - name: Install dependencies
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python -m pip install --upgrade pip uv
# ${CONDA_RUN} python -m uv pip install -e [quality,test]
# ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
# ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate
# ${CONDA_RUN} python -m uv pip install pytest-reportlog
# - name: Environment
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python utils/print_env.py
# - name: Run nightly PyTorch tests on M1 (MPS)
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
# tests/
# - name: Failure short reports
# if: ${{ failure() }}
# run: cat reports/tests_torch_mps_failures_short.txt
#
# - name: Test suite reports artifacts
# if: ${{ always() }}
# uses: actions/upload-artifact@v4
# with:
# name: torch_mps_test_reports
# path: reports
#
# - name: Generate Report and Notify Channel
# if: always()
# run: |
# pip install slack_sdk tabulate
# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY run_nightly_tests_apple_m1:
# name: Nightly PyTorch MPS tests on MacOS
# runs-on: [ self-hosted, apple-m1 ]
# if: github.event_name == 'schedule'
#
# steps:
# - name: Checkout diffusers
# uses: actions/checkout@v3
# with:
# fetch-depth: 2
#
# - name: Clean checkout
# shell: arch -arch arm64 bash {0}
# run: |
# git clean -fxd
# - name: Setup miniconda
# uses: ./.github/actions/setup-miniconda
# with:
# python-version: 3.9
#
# - name: Install dependencies
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python -m pip install --upgrade pip uv
# ${CONDA_RUN} python -m uv pip install -e [quality,test]
# ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
# ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate
# ${CONDA_RUN} python -m uv pip install pytest-reportlog
# - name: Environment
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python utils/print_env.py
# - name: Run nightly PyTorch tests on M1 (MPS)
# shell: arch -arch arm64 bash {0}
# env:
# HF_HOME: /System/Volumes/Data/mnt/cache
# HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
# run: |
# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \
# --report-log=tests_torch_mps.log \
# tests/
# - name: Failure short reports
# if: ${{ failure() }}
# run: cat reports/tests_torch_mps_failures_short.txt
#
# - name: Test suite reports artifacts
# if: ${{ always() }}
# uses: actions/upload-artifact@v4
# with:
# name: torch_mps_test_reports
# path: reports
#
# - name: Generate Report and Notify Channel
# if: always()
# run: |
# pip install slack_sdk tabulate
# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY

Check warning (Code scanning / CodeQL): Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
7 changes: 4 additions & 3 deletions docs/source/en/api/quantization.md
@@ -13,16 +13,17 @@ specific language governing permissions and limitations under the License.

# Quantization

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index).

Quantization techniques that aren't supported in Transformers can be added with the [`DiffusersQuantizer`] class.
Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference.

<Tip>

Learn how to quantize models in the [Quantization](../quantization/overview) guide.

</Tip>

## PipelineQuantizationConfig

[[autodoc]] quantizers.PipelineQuantizationConfig

## BitsAndBytesConfig

87 changes: 87 additions & 0 deletions docs/source/en/quantization/overview.md
@@ -39,3 +39,90 @@ Diffusers currently supports the following quantization methods.
- [Quanto](./quanto.md)

[This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.

## Pipeline-level quantization

Diffusers allows you to directly initialize pipelines from checkpoints that already contain quantized models ([example](https://huggingface.co/hf-internal-testing/flux.1-dev-nf4-pkg)). However, you may also want to apply
quantization on the fly when initializing a pipeline from a pre-trained, non-quantized checkpoint. You can
do this with [`~quantizers.PipelineQuantizationConfig`].

Start by defining a `PipelineQuantizationConfig`:

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers.quantization_config import QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer": QuantoConfig(weights_dtype="int8"),
"text_encoder_2": BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
),
}
)
```

Then pass it to [`~DiffusionPipeline.from_pretrained`] and run inference:

```py
pipe = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```

This method allows more granular control over the quantization of individual
model-level components of a pipeline, and it allows different quantization backends for
different components. In the above example, you used a combination of Quanto and bitsandbytes. However,
one caveat of this method is that you need to know which components come from `transformers`
in order to import the right quantization config class.

The other method is simpler but less flexible. Start by defining a `PipelineQuantizationConfig`, this time in a different way:

```py
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
components_to_quantize=["transformer", "text_encoder_2"],
)
```

This `pipeline_quant_config` can now be passed to [`~DiffusionPipeline.from_pretrained`] similar to the above example.
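
For example, reusing the same checkpoint and dtype as before:

```py
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```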

In this case, `quant_kwargs` is used to initialize the quantization configuration class
associated with `quant_backend`, and `components_to_quantize` lists the components that will be
quantized. For most pipelines, you will want to keep `transformer` in the list, as it is often
the most compute- and memory-intensive component.

The config below will work for most diffusion pipelines that have a `transformer` component.
In most cases, you will want to quantize the `transformer` component because it is often the most
compute-intensive part of a diffusion pipeline.

```py
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
components_to_quantize=["transformer"],
)
```

Below is a list of the supported quantization backends available in both `diffusers` and `transformers`:

* `bitsandbytes_4bit`
* `bitsandbytes_8bit`
* `gguf`
* `quanto`
* `torchao`
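
For example, switching the simpler API to the Quanto backend only requires changing `quant_backend` and `quant_kwargs`. This is a sketch that assumes the `quanto` backend accepts the same `weights_dtype` argument as the `QuantoConfig` used earlier:

```py
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="quanto",
    # Assumed to mirror QuantoConfig(weights_dtype="int8") from the first example.
    quant_kwargs={"weights_dtype": "int8"},
    components_to_quantize=["transformer", "text_encoder_2"],
)
```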


Diffusion pipelines can have multiple text encoders; [`FluxPipeline`] has two, for example. It's
recommended to quantize the memory-intensive text encoders, such as T5, Llama, or Gemma. In the
above example, you quantized the T5 model of [`FluxPipeline`] through `text_encoder_2` while
keeping the CLIP model intact (accessible through `text_encoder`).
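
Conversely, if you only want to quantize a memory-heavy text encoder and keep the rest of the pipeline in `torch_dtype`, list just that component. A minimal sketch for [`FluxPipeline`]:

```py
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    # Only the T5 encoder (`text_encoder_2`) is quantized; the `transformer` and the
    # CLIP `text_encoder` are loaded normally.
    components_to_quantize=["text_encoder_2"],
)
```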
13 changes: 13 additions & 0 deletions src/diffusers/pipelines/pipeline_loading_utils.py
@@ -675,8 +675,10 @@ def load_sub_model(
use_safetensors: bool,
dduf_entries: Optional[Dict[str, DDUFEntry]],
provider_options: Any,
quantization_config: Optional[Any] = None,
):
"""Helper method to load the module `name` from `library_name` and `class_name`"""
from ..quantizers import PipelineQuantizationConfig

# retrieve class candidates

@@ -769,6 +771,17 @@ def load_sub_model(
else:
loading_kwargs["low_cpu_mem_usage"] = False

if (
quantization_config is not None
and isinstance(quantization_config, PipelineQuantizationConfig)
and issubclass(class_obj, torch.nn.Module)
):
model_quant_config = quantization_config._resolve_quant_config(
is_diffusers=is_diffusers_model, module_name=name
)
if model_quant_config is not None:
loading_kwargs["quantization_config"] = model_quant_config

# check if the module is in a subdirectory
if dduf_entries:
loading_kwargs["dduf_entries"] = dduf_entries
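
To make the new branch in `load_sub_model` easier to follow, here is a rough, hypothetical sketch of how per-component resolution could behave. The real `PipelineQuantizationConfig._resolve_quant_config` in `diffusers` may differ in names and details; bitsandbytes is used purely for illustration:

```py
from typing import Optional


def resolve_quant_config(pipeline_config, is_diffusers: bool, module_name: str) -> Optional[object]:
    """Hypothetical sketch of per-component quantization config resolution."""
    # An explicit `quant_mapping` takes precedence: return whatever the user
    # supplied for this component (or None, leaving it unquantized).
    quant_mapping = getattr(pipeline_config, "quant_mapping", None)
    if quant_mapping:
        return quant_mapping.get(module_name)

    # Shared-backend mode: only components listed in `components_to_quantize`
    # get a config, built from `quant_backend` and `quant_kwargs`.
    if module_name not in (getattr(pipeline_config, "components_to_quantize", None) or []):
        return None

    # Whether the sub-model is a diffusers or a transformers model decides which
    # library's config class is instantiated (bitsandbytes shown as an example).
    if is_diffusers:
        from diffusers.quantizers.quantization_config import BitsAndBytesConfig
    else:
        from transformers import BitsAndBytesConfig
    return BitsAndBytesConfig(**pipeline_config.quant_kwargs)
```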
6 changes: 6 additions & 0 deletions src/diffusers/pipelines/pipeline_utils.py
@@ -47,6 +47,7 @@
from ..models import AutoencoderKL
from ..models.attention_processor import FusedAttnProcessor2_0
from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, ModelMixin
from ..quantizers import PipelineQuantizationConfig
from ..quantizers.bitsandbytes.utils import _check_bnb_status
from ..schedulers.scheduling_utils import SCHEDULER_CONFIG_NAME
from ..utils import (
@@ -725,6 +726,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
use_safetensors = kwargs.pop("use_safetensors", None)
use_onnx = kwargs.pop("use_onnx", None)
load_connected_pipeline = kwargs.pop("load_connected_pipeline", False)
quantization_config = kwargs.pop("quantization_config", None)

if torch_dtype is not None and not isinstance(torch_dtype, dict) and not isinstance(torch_dtype, torch.dtype):
torch_dtype = torch.float32
@@ -741,6 +743,9 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
" install accelerate\n```\n."
)

if quantization_config is not None and not isinstance(quantization_config, PipelineQuantizationConfig):
raise ValueError("`quantization_config` must be an instance of `PipelineQuantizationConfig`.")

if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
raise NotImplementedError(
"Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
@@ -1001,6 +1006,7 @@ def load_module(name, value):
use_safetensors=use_safetensors,
dduf_entries=dduf_entries,
provider_options=provider_options,
quantization_config=quantization_config,
)
logger.info(
f"Loaded {name} as {class_name} from `{name}` subfolder of {pretrained_model_name_or_path}."