[Bugfix] Fix Qwen3 MoE GPTQ inference #23490
Conversation
Signed-off-by: Isotr0py <[email protected]>
Code Review
This pull request fixes GPTQ inference for Qwen3 MoE models. The core change is to skip quantization of the 'gate' layer inside the MoE blocks when GPTQ or GPTQ-Marlin quantization is used: a check is added that passes a None quantization configuration to the gate's ReplicatedLinear layer for these quantization methods. This is a common practice for MoE models, where the gating mechanism is numerically sensitive and is often kept at higher precision. The changes are well-contained and appear correct for the stated purpose. I have not found any high or critical severity issues in this pull request.
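For reference, the fix amounts to a small conditional at the point where the MoE gate projection is constructed. The following is a minimal sketch, assuming vLLM's ReplicatedLinear layer and the GPTQConfig / GPTQMarlinConfig classes; the build_gate helper is hypothetical and the exact wiring in the repository's Qwen3 MoE model file may differ.

```python
# Sketch only: assumes vLLM's ReplicatedLinear and GPTQ / GPTQ-Marlin configs.
from vllm.model_executor.layers.linear import ReplicatedLinear
from vllm.model_executor.layers.quantization.gptq import GPTQConfig
from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinConfig


def build_gate(config, quant_config, prefix: str) -> ReplicatedLinear:
    """Create the MoE router ("gate") projection (hypothetical helper).

    GPTQ and GPTQ-Marlin checkpoints for Qwen3 MoE do not quantize the gate
    weights, so the layer is built with quant_config=None and stays in the
    model's native precision; other quantization methods are passed through.
    """
    if isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig)):
        gate_quant_config = None
    else:
        gate_quant_config = quant_config

    return ReplicatedLinear(
        config.hidden_size,
        config.num_experts,
        bias=False,
        quant_config=gate_quant_config,
        prefix=f"{prefix}.gate",
    )
```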
Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Ekagra Ranjan <[email protected]>
Purpose
Test Plan
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.