
Conversation

zifeitong
Contributor

@zifeitong zifeitong commented Aug 24, 2025

This PR makes sure quantized model weights are loaded correctly. Currently, loading `RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic` crashes on A100s:

```
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 661, in process_weights_after_loading
[core.py:708]     layer.scheme.process_weights_after_loading(layer)
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 59, in process_weights_after_loading
[core.py:708]     prepare_fp8_layer_for_marlin(layer)
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 107, in prepare_fp8_layer_for_marlin
[core.py:708]     marlin_qweight = ops.gptq_marlin_repack(b_q_weight=qweight,
[core.py:708]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 938, in gptq_marlin_repack
[core.py:708]     return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n,
[core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
[core.py:708]     return self._op(*args, **kwargs)
[core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708] RuntimeError: size_n = 6840 is not divisible by tile_n_size = 64
```

The issue was introduced in #22066, which changed the model implementation to use a `MergedColumnParallelLinear` layer and pack the `gate_proj` and `up_proj` params into a single `gate_up_proj`.
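
For context, vLLM models declare fused checkpoint weights through a `packed_modules_mapping` class attribute, which the quantization code consults when matching per-module schemes to runtime layers. Below is a minimal sketch of the kind of mapping such a fix adds; the class used and the exact placement are assumptions for illustration, not the actual diff:

```python
from torch import nn


class Qwen2_5_VLForConditionalGeneration(nn.Module):
    # Hypothetical placement: map each merged runtime layer to the
    # checkpoint weights fused into it, so the quantization logic can
    # resolve schemes for "gate_proj"/"up_proj" even though the runtime
    # module is the merged "gate_up_proj".
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```

Without such a mapping, the quantization config cannot relate the checkpoint's `gate_proj`/`up_proj` entries to the merged layer, and the packed weight ends up on a code path whose shape assumptions it violates, surfacing as the tile-divisibility error above.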


@zifeitong zifeitong requested a review from sighingnow as a code owner August 24, 2025 22:49
@mergify mergify bot added the `qwen` (Related to Qwen models) label Aug 24, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a crash that occurs when loading quantized Qwen2.5-VL models. The issue stems from a missing `packed_modules_mapping` for the `gate_up_proj` layer, which was introduced when the model was updated to use a `MergedColumnParallelLinear` layer. By adding the necessary mapping, this change ensures that the quantization logic can correctly handle the packed weights, resolving the shape mismatch error during the `gptq_marlin_repack` operation. The fix is targeted, necessary, and follows the established pattern in vLLM for supporting packed layers in quantized models. The change is approved.
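
As a sanity check, loading the model end to end exercises the repaired path. A hypothetical verification script, assuming an A100 and access to the checkpoint (the PR itself does not state a test command):

```python
# Before the fix, weight loading crashed in gptq_marlin_repack during
# process_weights_after_loading; with the fix, the model should load
# and generate normally.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```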

@zifeitong zifeitong changed the title Fix Qwen2.5-VL quantized model weights loading [Bugfix] Fix Qwen2.5-VL quantized model weights loading Aug 24, 2025
Member

@ywang96 ywang96 left a comment

Thanks for the fix!

@ywang96 ywang96 added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label Aug 24, 2025
@DarkLight1337 DarkLight1337 merged commit a71e476 into vllm-project:main Aug 25, 2025
48 checks passed
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Sep 4, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025