Enable 4bit bnb prequant MOE #21548
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request aims to enable 4bit bnb prequant MOE. The changes involve modifying the validation logic for MoE quantization and updating a model implementation to selectively quantize the MoE router gate.
My review identified a critical issue in the validation logic that could lead to a runtime crash, and a high-severity design issue in the model implementation that impacts maintainability and correctness. I've provided suggestions to fix both issues.
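For context, a minimal usage sketch of what this PR enables, assuming a checkpoint that already ships 4-bit BitsAndBytes weights for an MoE model. The model id below is a placeholder, not taken from this PR:

```python
# Minimal sketch: serving a pre-quantized 4-bit BitsAndBytes MoE checkpoint.
# The model id is hypothetical; any bnb-4bit pre-quantized MoE repo would apply.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-moe-model-bnb-4bit",  # hypothetical pre-quantized repo
    quantization="bitsandbytes",
)
params = SamplingParams(max_tokens=32)
print(llm.generate(["Hello, world"], params)[0].outputs[0].text)
```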
This change incorrectly allows 8-bit MoE quantization when pre_quant is false. The BitsAndBytesMoEMethod for 8-bit weights is not implemented and raises a NotImplementedError (see vllm/model_executor/layers/quantization/bitsandbytes.py). This will lead to a runtime crash for a configuration that this check now allows.

To enable 4-bit pre-quantized MoE while correctly disallowing 8-bit MoE, you should only check for self.load_8bit. The original check for self.pre_quant was what blocked 4-bit pre-quant, and removing it is correct, but the self.load_8bit check should remain as it was to prevent using the unimplemented 8-bit MoE path.
if self.load_8bit:
    raise ValueError(
        "BitsAndBytes 8bit quantization with FusedMoE is not "
        "supported yet.")
This implementation for deciding whether to quantize the MoE gate has a couple of issues:
- Hardcoded Logic: It specifically checks for BitsAndBytesConfig, making it difficult to extend to other quantization methods that might also support gate quantization. This can lead to silent failures where a new quantization method is expected to quantize the gate but doesn't. A more extensible approach would be to have a method on the QuantizationConfig itself to indicate this capability (see the sketch after the suggested code below).
- Incorrect Type Hint: The type hint for quant_config in _maybe_ignore_quant_config is QuantizationConfig, but it can be None at the call site.
- Clarity: The function name _maybe_ignore_quant_config is a bit ambiguous.
self.gate = ReplicatedLinear(config.hidden_size,
                             config.num_experts,
                             bias=False,
                             quant_config=self._get_gate_quant_config(quant_config),
                             prefix=f"{prefix}.gate")

def _get_gate_quant_config(
        self, quant_config: Optional[QuantizationConfig]
) -> Optional[QuantizationConfig]:
    # The gate is only quantized for BitsAndBytes for now.
    if isinstance(quant_config, BitsAndBytesConfig):
        return quant_config
    return None
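As a follow-up to the extensibility point above, here is a hedged, self-contained sketch of how the capability could live on the config itself rather than in an isinstance check. The method name supports_moe_gate_quant and the simplified classes are hypothetical, not part of vLLM's API:

```python
from typing import Optional

class QuantizationConfig:  # simplified stand-in for vLLM's base config class
    def supports_moe_gate_quant(self) -> bool:
        # Default: the MoE router gate stays unquantized.
        return False

class BitsAndBytesConfig(QuantizationConfig):  # simplified stand-in
    def supports_moe_gate_quant(self) -> bool:
        return True

def gate_quant_config(
        quant_config: Optional[QuantizationConfig]
) -> Optional[QuantizationConfig]:
    # Pass the config through only if it opts into gate quantization.
    if quant_config is not None and quant_config.supports_moe_gate_quant():
        return quant_config
    return None
```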
LGTM, thank you for your contribution.
Please add a PR description. Thanks!
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Purpose
Test Plan
Test Result
(Optional) Documentation Update