[quantization] use channel scales for w4a8 + misc fixes #23570
Conversation
assert bias is None, "bias not supported by CUTLASS W4A8"
c = self.config
w_q, w_s, _, _ = self._get_weight_params(layer)
w_ch_s = layer.weight_chan_scale
I feel this is better than modifying every place `self._get_weight_params` is called.
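For illustration, here is a minimal sketch of that trade-off. Apart from `weight_chan_scale` and `_get_weight_params`, the names below are hypothetical and not taken from the vLLM source; the point is only that the kernel-specific parameter is read directly off the layer, so the shared helper's signature (and every other caller of it) stays untouched.

```python
# Hypothetical sketch; not actual vLLM code.
import torch
from types import SimpleNamespace


def _get_weight_params(layer):
    # Shared helper with an unchanged signature, so existing callers
    # do not need to be modified.
    return layer.weight_packed, layer.weight_scale, None, None


def apply_w4a8(layer, x, bias=None):
    assert bias is None, "bias not supported by CUTLASS W4A8"
    w_q, w_s, _, _ = _get_weight_params(layer)
    # The W4A8-specific per-channel scale is read directly off the layer
    # instead of widening the return tuple for every kernel.
    w_ch_s = layer.weight_chan_scale
    return w_q, w_s, w_ch_s  # placeholder for the actual kernel call


# Stand-in layer just to show the access pattern end to end.
layer = SimpleNamespace(
    weight_packed=torch.zeros(16, 4, dtype=torch.int32),
    weight_scale=torch.ones(16, 2, dtype=torch.float32),
    weight_chan_scale=torch.ones(16, 1, dtype=torch.float32),
)
apply_w4a8(layer, torch.randn(1, 16))
```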
Agreed; we eventually need to refactor MPLinearKernel to get rid of _get_weight_params, which seems possible with torch 2.8.0, where we may not need to re-register the params in process_weights_after_loading and can just map everything to consistent names in MPLinearKernel.
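A hedged sketch of that direction (everything here is hypothetical and not tied to the actual MPLinearKernel interface): keep the parameters as loaded and only record a mapping to consistent internal names, instead of deleting and re-registering them.

```python
# Hypothetical sketch of the refactor direction discussed above; not vLLM code.
class MPLinearKernelSketch:
    # Kernel-internal name -> parameter name as registered on the layer.
    PARAM_NAMES = {
        "w_q": "weight_packed",
        "w_s": "weight_scale",
        "w_ch_s": "weight_chan_scale",
    }

    def process_weights_after_loading(self, layer):
        # Rather than re-registering parameters under kernel-specific
        # names, remember where each one lives under a consistent name.
        self.params = {
            internal: getattr(layer, registered, None)
            for internal, registered in self.PARAM_NAMES.items()
        }
```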
Nice work! Thanks for the contribution!
@LucasWilkinson can you attach the …
Purpose
Load per-channel scales for w4a8. This can recover the quality drop from naively casting bf16 scales to fp8 on certain benchmarks (MMLU Pro). Computationally, this is 'free' since the previous implementation used `torch.ones` as a placeholder.

The `fp8` group scales and `fp32` per-channel scales can be generated as a post-processing step once we have a w4a16 checkpoint with `bf16` scales. For testing we used an ad hoc workflow to generate the checkpoint, but will look at integrating it into `llm-compressor` as a next step (a rough sketch follows below):

1. `fp8_scales, fp32_chan_scales = quantfp8(bf16_scales)`
2. divide `fp8_scales` by 8 to avoid saturation when multiplied by int4
3. multiply `fp32_chan_scales` by 8 to compensate

Then pass the adjusted `fp8_scales`/`fp32_chan_scales` to the `w4a8` kernel.
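A rough sketch of that post-processing step. `quantfp8` here is a stand-in for whatever FP8 quantization routine is actually used; only the divide-by-8 / multiply-by-8 adjustment mirrors the description above, and the shapes are illustrative.

```python
import torch


def quantfp8(bf16_scales: torch.Tensor):
    # Stand-in: fold the per-channel dynamic range into an fp32 scale so
    # the remaining group scales fit comfortably in fp8.
    chan_scales = bf16_scales.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    fp8_scales = (bf16_scales.to(torch.float32) / chan_scales).to(torch.float8_e4m3fn)
    return fp8_scales, chan_scales


# [out_channels, num_groups] bf16 group scales from a w4a16 checkpoint.
bf16_scales = torch.rand(4096, 32, dtype=torch.bfloat16)

fp8_scales, fp32_chan_scales = quantfp8(bf16_scales)

# Divide the fp8 group scales by 8 so the scale * int4 product cannot
# saturate fp8, and multiply the per-channel scales by 8 to compensate.
fp8_scales = (fp8_scales.to(torch.float32) / 8).to(torch.float8_e4m3fn)
fp32_chan_scales = fp32_chan_scales * 8

# The adjusted fp8_scales / fp32_chan_scales are what the w4a8 kernel consumes.
```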
Test Plan
- lm-eval (gsm8k, MMLU Pro), comparing against w4a16 and the previous w4a8 implementation (Cohere Command A)
- also add an example model, `czhu-cohere/TinyLlama-1.1B-Chat-v1.0-W4A8-e2e`, and a corresponding test in `tests/quantization/test_compressed_tensors.py` (a rough test sketch follows below)
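A rough sketch of what such a test could look like; the real test should follow whatever conventions `tests/quantization/test_compressed_tensors.py` already uses, and the test name and assertion here are hypothetical.

```python
# Hypothetical smoke test; not the actual test added in this PR.
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize(
    "model", ["czhu-cohere/TinyLlama-1.1B-Chat-v1.0-W4A8-e2e"])
def test_w4a8_channel_scales_load_and_generate(model):
    # Loading the W4A8 example checkpoint exercises the per-channel scale
    # loading path end to end; a non-empty generation is the smoke check.
    llm = LLM(model=model)
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=8))
    assert outputs and outputs[0].outputs[0].text
```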
Test Result
(Optional) Documentation Update