
Conversation

zifeitong
Contributor

@zifeitong zifeitong commented Aug 24, 2025

This PR makes sure quantized model weights are loaded correctly. Currently, loading `RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic` crashes on A100s:

```
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 661, in process_weights_after_loading
[core.py:708]     layer.scheme.process_weights_after_loading(layer)
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py", line 59, in process_weights_after_loading
[core.py:708]     prepare_fp8_layer_for_marlin(layer)
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 107, in prepare_fp8_layer_for_marlin
[core.py:708]     marlin_qweight = ops.gptq_marlin_repack(b_q_weight=qweight,
[core.py:708]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 938, in gptq_marlin_repack
[core.py:708]     return torch.ops._C.gptq_marlin_repack(b_q_weight, perm, size_k, size_n,
[core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708]   File "/opt/venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
[core.py:708]     return self._op(*args, **kwargs)
[core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^
[core.py:708] RuntimeError: size_n = 6840 is not divisible by tile_n_size = 64
```

The issue was introduced in #22066, which changed the model implementation to use a `MergedColumnParallelLinear` layer and pack the `gate_proj` and `up_proj` params into a single `gate_up_proj`.
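
For context, vLLM models declare fused checkpoint weights through a `packed_modules_mapping` class attribute, which the quantization code consults when matching per-module schemes to runtime layers. Below is a minimal sketch of the kind of mapping such a fix adds; the class used and the exact placement are assumptions for illustration, not the actual diff:

```python
from torch import nn


class Qwen2_5_VLForConditionalGeneration(nn.Module):
    # Hypothetical placement: map each merged runtime layer to the
    # checkpoint weights fused into it, so the quantization logic can
    # resolve schemes for "gate_proj"/"up_proj" even though the runtime
    # module is the merged "gate_up_proj".
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```

Without such a mapping, the quantization config cannot relate the checkpoint's `gate_proj`/`up_proj` entries to the merged layer, and the packed weight ends up on a code path whose shape assumptions it violates, surfacing as the tile-divisibility error above.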


@zifeitong zifeitong requested a review from sighingnow as a code owner August 24, 2025 22:49
@mergify mergify bot added the `qwen` (Related to Qwen models) label Aug 24, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a crash that occurs when loading quantized Qwen2.5-VL models. The issue stems from a missing `packed_modules_mapping` for the `gate_up_proj` layer, which was introduced when the model was updated to use a `MergedColumnParallelLinear` layer. By adding the necessary mapping, this change ensures that the quantization logic can correctly handle the packed weights, resolving the shape mismatch error during the `gptq_marlin_repack` operation. The fix is targeted, necessary, and follows the established pattern in vLLM for supporting packed layers in quantized models. The change is approved.
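
As a sanity check, loading the model end to end exercises the repaired path. A hypothetical verification script, assuming an A100 and access to the checkpoint (the PR itself does not state a test command):

```python
# Before the fix, weight loading crashed in gptq_marlin_repack during
# process_weights_after_loading; with the fix, the model should load
# and generate normally.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```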

@zifeitong zifeitong changed the title Fix Qwen2.5-VL quantized model weights loading [Bugfix] Fix Qwen2.5-VL quantized model weights loading Aug 24, 2025
Member

@ywang96 ywang96 left a comment

Thanks for the fix!

@ywang96 ywang96 added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label Aug 24, 2025
@DarkLight1337 DarkLight1337 merged commit a71e476 into vllm-project:main Aug 25, 2025
48 checks passed
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Sep 4, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025