[Quantization] Quark MXFP4 format loading #16943
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
@mgoin thanks for taking a look! This PR is now ready for review. More PRs will follow.
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Can we remove the env var and always decompress weights at runtime? This is the expected behavior of other quantization methods, so it feels strange not to keep the weights compressed here.
We found ahead-of-time (AOT) weight dequantization to be more efficient for emulation evaluations. That said, this option can be removed once more efficient dequant kernels are supported. I would prefer keeping it for now, but let me know if you feel strongly about it.
Okay, we can keep it for now, but let's aim to remove it over time. We want to keep the list of env vars from ever-growing unless there is a good reason.
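The trade-off being discussed can be sketched as follows. This is a hedged illustration only: the env var name, class, and helper signatures below are hypothetical, not the actual vLLM/Quark implementation.

```python
# Sketch of the two strategies under discussion: an env var selects
# dequantizing weights once at load time (faster per-step emulation,
# higher memory) versus keeping packed weights and dequantizing on
# every forward pass. QUARK_DEQUANT_AT_LOAD is a made-up name.
import os

DEQUANT_AT_LOAD = os.environ.get("QUARK_DEQUANT_AT_LOAD", "1") == "1"


class MXFP4LinearSketch:
    def __init__(self, packed_weight, scales, dequant_fn, matmul_fn):
        self.dequant_fn = dequant_fn
        self.matmul_fn = matmul_fn
        if DEQUANT_AT_LOAD:
            # AOT path: pay the dequant cost once, store the float weight.
            self.weight = dequant_fn(packed_weight, scales)
            self.packed = None
        else:
            # Runtime path: lower memory, recurring dequant cost.
            self.weight = None
            self.packed = (packed_weight, scales)

    def forward(self, x):
        w = self.weight if self.weight is not None \
            else self.dequant_fn(*self.packed)
        return self.matmul_fn(x, w)
```

With the AOT path selected, `forward` never touches the packed representation, which is why emulation benchmarks favor it until fast dequant kernels land.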
vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py
Seems reasonable, thanks. Is there a small model for testing that you could add under …?
Test added. Skipped for now until the model is publicly released.
@mgoin I have addressed most of the comments. Please take a look again, thanks!
Commits:
- wip
- wip & debug
- update
- cleanup
- use quark realquantizer for pack/quant/dequant
- comment on cudagraph issue; remove prints
- keep only 1 place importing quark
- cudagraph issue resolved; dq weight at load time for efficiency
- lint
- turn on emulation based on platform
- add fused moe support (wip running version)
- add env var for dequantizing weight at load time
- Mxfp4 memory leak fixes (#2)
- add test
- revert rope local fix
- remove print
- rename scale calculation mode

Signed-off-by: Bowen Bao <[email protected]>
Signed-off-by: Felix Marty <[email protected]>
@mgoin friendly ping. We have a few more PRs lined up after this one; we would greatly appreciate it if you could take another look!
Thank you for the ping! Will look today.
This looks good to me for now as a skeleton. I think it would be good to get at least a basic emulation implementation in as a reference for when kernel tests are added, like https://github.com/vllm-project/vllm/blob/621ca2c0aba8268d72d380fa3e479ddafa529479/tests/kernels/quantization/test_nvfp4_quant.py. This can be done in follow-up work, though.
@mgoin makes sense. We will submit follow-ups regarding both suggestions.
Signed-off-by: 汪志鹏 <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
Signed-off-by: Yuqi Zhang <[email protected]>
Initial PR to integrate loading MXFP4 models quantized by Quark.
This PR supports running MXFP4 emulation on devices where micro-scaling data types are not natively supported.
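For illustration, MXFP4 emulation amounts to expanding 4-bit E2M1 codes and applying a shared power-of-two (E8M0) scale per 32-element block. The helper below is a hedged sketch of that idea, not the actual Quark/vLLM code; the function name and storage layout (one uint8 per code, one biased uint8 exponent per block) are assumptions.

```python
# Sketch: MXFP4 dequantization for emulation on hardware without
# native micro-scaling support. MXFP4 stores E2M1 elements in blocks
# of 32 that share one E8M0 (power-of-two) scale.
import numpy as np

# The 16 representable E2M1 values, indexed by the 4-bit code
# (sign bit in the high position).
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)


def dequantize_mxfp4(codes: np.ndarray, scale_exponents: np.ndarray,
                     block_size: int = 32) -> np.ndarray:
    """Dequantize 4-bit codes (one uint8 per element, values 0..15)
    using one biased-uint8 E8M0 exponent per block of `block_size`."""
    values = E2M1_LUT[codes]  # map each code to its E2M1 float value
    scales = 2.0 ** (scale_exponents.astype(np.float32) - 127.0)
    # Broadcast one shared scale over each block of elements.
    values = values.reshape(-1, block_size) * scales[:, None]
    return values.reshape(codes.shape)
```

On devices with native MX support this lookup-and-scale step would happen inside the matmul kernel; emulation simply materializes the float tensor first.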
Next Steps