
Conversation

@czhu-cohere (Contributor) commented Aug 25, 2025

Purpose

Load per-channel scales for w4a8. This recovers the quality drop, seen on certain benchmarks (mmlu pro), caused by naively casting bf16 scales to fp8. Computationally this is 'free', since the previous implementation used torch.ones as a placeholder for these scales.
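
For intuition, here is a small self-contained illustration (not from the PR; the shapes, value ranges, and e4m3 format are assumptions) of why keeping an fp32 per-channel scale recovers accuracy that a naive bf16-to-fp8 cast of the group scales throws away:

    import torch

    torch.manual_seed(0)
    # One output channel's bf16 group scales; values this small sit below
    # fp8 e4m3's smallest subnormal (~2e-3), so a direct cast flushes them to zero.
    bf16_scales = torch.rand(8, dtype=torch.bfloat16) * 1e-4
    ref = bf16_scales.float()

    # Naive: cast the group scales straight to fp8.
    naive_fp8 = ref.to(torch.float8_e4m3fn).to(torch.float32)

    # Two-level scheme: keep an fp32 per-channel scale that normalizes the
    # group scales into fp8's representable range, then cast the normalized
    # values to fp8.
    chan_scale = ref.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    fp8_scales = (ref / chan_scale).to(torch.float8_e4m3fn)
    recovered = fp8_scales.to(torch.float32) * chan_scale

    print("naive cast abs err :", (naive_fp8 - ref).abs().mean().item())
    print("with chan scale    :", (recovered - ref).abs().mean().item())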

The fp8 group scales and fp32 per-channel scales can be generated as a post-processing step once we have a w4a16 checkpoint with bf16 scales. For testing we used an ad-hoc workflow to generate the checkpoint; integrating it into llm-compressor is the next step:

  1. Take the bf16 scales from the w4a16 checkpoint
  2. Compute fp8_scales, fp32_chan_scales = quantfp8(bf16_scales)
  3. Divide fp8_scales by 8 to avoid saturation when multiplied by int4
  4. Multiply fp32_chan_scales by 8 to compensate

Then pass the adjusted fp8_scales/fp32_chan_scales to the w4a8 kernel.
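
A minimal sketch of steps 1-4 (illustrative only: quantfp8 above presumably maps to the project's fp8 quantization utility, and the [num_channels, num_groups] scale layout, the e4m3 format, and the clamping here are assumptions):

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def decompose_scales(bf16_scales: torch.Tensor):
        # bf16_scales: [num_channels, num_groups] group scales from the w4a16 checkpoint (step 1)
        scales = bf16_scales.to(torch.float32)
        # step 2: fp32 per-channel scale = per-channel absmax mapped onto the fp8 range
        chan_scales = scales.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
        # steps 3+4: move a factor of 8 from the fp8 group scales to the fp32 channel
        # scales so the fp8 scale cannot saturate when multiplied by an int4 value
        fp8_scales = (scales / chan_scales / 8.0).to(torch.float8_e4m3fn)
        chan_scales = chan_scales * 8.0
        return fp8_scales, chan_scales.squeeze(-1)

    # sanity check: the product of the two scales should reconstruct the originals
    s = torch.rand(16, 4, dtype=torch.bfloat16) * 0.01
    fp8_s, ch_s = decompose_scales(s)
    approx = fp8_s.to(torch.float32) * ch_s.unsqueeze(-1)
    print("max reconstruction err:", (approx - s.float()).abs().max().item())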

Test Plan

lm-eval (gsm8k, mmlu pro), comparing against w4a16 and the previous w4a8 implementation (Cohere Command A).

Also add an example model, czhu-cohere/TinyLlama-1.1B-Chat-v1.0-W4A8-e2e, and a corresponding test in tests/quantization/test_compressed_tensors.py:

pytest tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w4a8_fp8

Test Result

Cohere Command A (111B)
|model        |gsm8k       |mmlu pro    |
|-------------|------------|------------|
|w4a16        |0.8483699773|0.6907413564|
|w4a8 (before)|0.8476118271|0.6760305851|
|w4a8 (after) |0.8514025777|0.6880817819|
pytest tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w4a8_fp8
...
collected 1 item                                                                                                                

test_compressed_tensors.py::test_compressed_tensors_w4a8_fp8[args0] PASSED                                                [100%]

====================================================== 1 passed in 52.44s =======================================================


@czhu-cohere changed the title from "[quantization] use channel scales for w4a8 fp8" to "[quantization] use channel scales for w4a8 + misc fixes" on Aug 25, 2025
Inline review thread on the kernel code that consumes the new per-channel scale:

assert bias is None, "bias not supported by CUTLASS W4A8"
c = self.config
# quantized weight and group scales come from the existing helper
w_q, w_s, _, _ = self._get_weight_params(layer)
# the per-channel scale is read directly off the layer instead of being
# threaded through _get_weight_params (see the discussion below)
w_ch_s = layer.weight_chan_scale
@czhu-cohere (Contributor, Author) commented:
I feel this is better than modifying every place self._get_weight_params is called

A collaborator replied:
Agreed; we eventually need to refactor MPLinearKernel to get rid of _get_weight_params. That seems possible with torch 2.8.0, where we may not need to re-register the params in process_weights_after_loading and can instead map everything to consistent names in MPLinearKernel.

@czhu-cohere marked this pull request as ready for review August 25, 2025 17:59
@LucasWilkinson (Collaborator) left a comment:
Nice work! Thanks for the contribution!

@dsikka (Contributor) commented Aug 26, 2025

@LucasWilkinson can you attach the ready label so we can run the full quant test?

@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Aug 26, 2025
@vllm-bot merged commit 2c2b140 into vllm-project:main Aug 27, 2025
47 of 51 checks passed
tc-mb pushed a commit to tc-mb/vllm that referenced this pull request Aug 27, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025