Add routed_scaling_factor to MoE grouped topk #23123
Conversation
Code Review
This pull request introduces the routed_scaling_factor to the MoE grouped top-k logic, which is a valuable addition for controlling expert contributions. The changes are propagated through many layers, including various quantization methods.
However, I've noticed some inconsistencies in how routed_scaling_factor is applied. While it is correctly applied in some paths (such as grouped_topk and the standard top-k in SGLFusedMOE), it is missing in others. For example, in vllm/model_executor/layers/fused_moe/cpu_fused_moe.py, routed_scaling_factor is added to IPEXFusedMOE.__call__ but is not passed to the underlying ipex_fusion function, making it an unused parameter. Similar omissions exist for non-grouped top-k paths in other files, which are outside the scope of the current diff but should be addressed for consistency.
For correctness and consistency, I recommend applying routed_scaling_factor to all routing paths, or clarifying whether the omission is intentional. The issue with IPEXFusedMOE should also be addressed.
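As a reference for the pattern the review asks for, here is a minimal sketch (not the actual vLLM code; the function name, its arguments, and the plain softmax fallback are illustrative assumptions) in which both routing branches end by scaling the selected weights:

```python
import torch

def select_experts_sketch(router_logits: torch.Tensor,
                          top_k: int,
                          use_grouped_topk: bool,
                          routed_scaling_factor: float = 1.0):
    """Illustrative only: both routing branches apply the scaling factor."""
    if use_grouped_topk:
        # Grouped path: expert groups are selected first, then experts
        # within them (group selection omitted here for brevity).
        scores = router_logits.sigmoid()
    else:
        # Plain top-k path.
        scores = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(scores, k=top_k, dim=-1)
    # Applied unconditionally here for clarity; the PR later adds a check
    # that skips the multiply when the factor is 1.0 (see the discussion
    # further down).
    topk_weights = topk_weights * routed_scaling_factor
    return topk_weights, topk_ids
```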
Force-pushed from 2a6cdfb to a2d24d5, then from f8820de to 9f16b11.
LGTM overall, just would like some assertions or implementations added where the arg is ignored at the moment. I think this would be helped out if we made it an Optional[float] and defaulted to None. This would also allow us to skip the multiply in the noop case
Force-pushed from 6ef2b16 to 03254df.
@mgoin Since routed_scaling_factor doesn't default to None in the model config (see here), I kept routed_scaling_factor defaulting to 1.0 to keep the type consistent. But I have added a check to skip the multiply when routed_scaling_factor is 1.0:
Please let me know if this works. Thanks!
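The snippet that originally followed the colon is not preserved in this transcript; a minimal sketch of the described guard, assuming it wraps the final weight scaling, would be:

```python
import torch

def scale_topk_weights(topk_weights: torch.Tensor,
                       routed_scaling_factor: float = 1.0) -> torch.Tensor:
    # Keep the default a true no-op: only multiply when the factor
    # differs from 1.0, as described in the comment above.
    if routed_scaling_factor != 1.0:
        topk_weights = topk_weights * routed_scaling_factor
    return topk_weights
```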
Could you run an eval on DeepSeek? A bit worried about leaving something behind here, since we don't have great quantized MoE tests at the moment.
Thanks for your review! I have run the eval based on this PR and #23274 combined. The eval result is posted in #23274; pasting it here as well:
Baseline:
This PR:
LGTM, thanks for the work!
I noticed some buggy/weird outputs when building from main and running DeepSeek w4a16, and eventually narrowed it down to this commit. I've tested on H100, H200, and B200, and the issue is the same. To reproduce:
vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 8 --max-model-len 8192
Then send a simple request like "Respond with 'cat' and nothing more"; about 10% of the time it will go haywire and output a whole essay about cats or something. Building vLLM from before this commit works fine, i.e. it responds with 'cat' 100% of the time.
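One hedged way to quantify the flakiness against such a server is to send the same request repeatedly through the OpenAI-compatible API and count deviating answers; the port, sample count, and exact string matching below are assumptions, not part of the original report:

```python
from openai import OpenAI

# Assumes the `vllm serve` command above is running on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

deviations = 0
attempts = 50
for _ in range(attempts):
    resp = client.chat.completions.create(
        model="RedHatAI/DeepSeek-R1-0528-quantized.w4a16",
        messages=[{"role": "user",
                   "content": "Respond with 'cat' and nothing more"}],
        max_tokens=256,
    )
    text = (resp.choices[0].message.content or "").strip()
    # Count anything that is not a bare 'cat' (allowing quotes/punctuation).
    if text.lower().strip("'\".! ") != "cat":
        deviations += 1
print(f"{deviations}/{attempts} responses deviated from 'cat'")
```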
#24118
@yewentao256 I can revert this PR; please let me know if that works. Thanks.
Signed-off-by: Xin Yang <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
Purpose
This PR adds routed_scaling_factor to MoE grouped topk:
- In vllm/model_executor/layers/fused_moe/layer.py, add a routed_scaling_factor parameter to grouped_topk().
- In vllm/model_executor/layers/quantization/fp8.py, pass routed_scaling_factor to torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8() instead of a hardcoded 1.0.
Note: the ROCm aiter grouped topk already has routed_scaling_factor, see here.
transformers reference: https://github.com/huggingface/transformers/blob/v4.55.3/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L145