BowenBao (Contributor) commented Apr 21, 2025

Initial PR to integrate loading of MXFP4 models quantized by Quark.
This PR supports running MXFP4 in emulation on devices where microscaling (MX) data types are not natively supported.
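
For readers unfamiliar with the format, here is a minimal sketch of what MXFP4 emulation involves. It is not the code in this PR. It assumes the OCP Microscaling layout (blocks of 32 FP4 E2M1 elements sharing one E8M0 power-of-two scale) and, as a further assumption, that two 4-bit codes are packed per byte with the low nibble first. The names `decode_e2m1` and `dequantize_mxfp4` are hypothetical.

```python
# Illustrative only: pure-Python MXFP4 dequantization, not vLLM or Quark code.

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale
E8M0_BIAS = 127   # an E8M0 byte e encodes the scale 2 ** (e - 127)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_mxfp4(packed, scales):
    """Expand packed FP4 codes into floats.

    packed: bytes with two codes per byte, low nibble first (assumption).
    scales: bytes with one E8M0 exponent per block of 32 elements.
    The E8M0 NaN encoding (255) is not handled in this sketch.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    if len(codes) != BLOCK_SIZE * len(scales):
        raise ValueError("expected one scale per 32 elements")
    return [
        decode_e2m1(code) * 2.0 ** (scales[i // BLOCK_SIZE] - E8M0_BIAS)
        for i, code in enumerate(codes)
    ]
```

On hardware without native MX support, an emulation path would typically run such a dequantization step and then perform the matrix multiply in a higher-precision type.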

Next Steps

  • MoE MXFP4 support.
  • Faster emulation.
  • Triton kernel integration.

mergify bot commented Apr 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 24, 2025
BowenBao (Contributor Author) commented

@mgoin thanks for taking a look! This PR is now ready for review. More PRs will follow.

Comment on lines +64 to +70
Member

Can we remove the env var and always decompress the weights at runtime? That is the expected behavior for other quantization methods, so it feels strange not to keep the weights compressed.

Contributor Author

We found ahead-of-time (AOT) weight dequantization to be more efficient for emulation evaluations. That said, the option can be removed once more efficient dequant kernels are supported. I would prefer keeping it for now, but let me know if you feel strongly about it.

Member

Okay, we can keep it for now, but let's aim to remove it over time. We want to keep the list of env vars from growing unless there is a good reason.
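
The trade-off discussed in this thread can be sketched as follows. This is a toy illustration, not the PR's implementation: the environment variable name `EXAMPLE_DEQUANT_AT_LOAD`, the class `EmulatedLinear`, and the stand-in `dequantize` function are all hypothetical.

```python
import os


def dequantize(packed, scale):
    """Toy stand-in for a real dequant routine: scale each stored integer."""
    return [q * scale for q in packed]


class EmulatedLinear:
    """Dot-product 'layer' that dequantizes either once at load or per call."""

    def __init__(self, packed, scale, dequant_at_load):
        self.dequant_calls = 0
        self.packed = packed
        self.scale = scale
        self.weight = None
        if dequant_at_load:
            # Ahead-of-time: pay the dequant cost once, hold the larger weights.
            self.weight = self._dequantize()
            self.packed = None

    def _dequantize(self):
        self.dequant_calls += 1
        return dequantize(self.packed, self.scale)

    def forward(self, x):
        # Runtime path: keep compressed weights and dequantize on every call.
        w = self.weight if self.weight is not None else self._dequantize()
        return sum(wi * xi for wi, xi in zip(w, x))


def flag_enabled(name="EXAMPLE_DEQUANT_AT_LOAD"):
    """Read a hypothetical opt-in flag from the environment."""
    return os.environ.get(name, "0") == "1"


layer = EmulatedLinear([1, 2, 3], 0.5, dequant_at_load=flag_enabled())
```

With the flag set, repeated `forward` calls reuse the dequantized weights at the cost of holding them in memory. Without it, the weights stay compressed and each call pays for dequantization again.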

mgoin (Member) commented May 1, 2025

Seems reasonable, thanks. Is there a small model for testing that you could add under vllm/tests/models/decoder_only/language/test_mxfp4.py, even if it is disabled/skipped for now?

mergify bot commented May 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BowenBao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 1, 2025
BowenBao (Contributor Author) commented May 1, 2025

> Seems reasonable, thanks. Is there a small model for testing that you could add under vllm/tests/models/decoder_only/language/test_mxfp4.py, even if it is disabled/skipped for now?

Test added. It is skipped for now until the model is publicly released.
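
As an illustration of a test that is checked in but skipped, here is a minimal sketch using the standard library's `unittest`. It is not the test added in this PR. The class and method names are hypothetical, and a pytest-based suite would typically use `pytest.mark.skip(reason=...)` instead.

```python
import unittest


class TestMXFP4(unittest.TestCase):
    """Placeholder test case; names and the skip reason are hypothetical."""

    @unittest.skip("model checkpoint is not publicly released yet")
    def test_generation_matches_reference(self):
        # Body is never executed while the skip decorator is in place.
        self.fail("unreachable while skipped")


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMXFP4)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The skipped test is still collected and reported, so it stays visible until the skip can be lifted.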

BowenBao (Contributor Author) commented May 2, 2025

@mgoin I have addressed most of the comments. Please take a look again, thanks!

fxmarty-amd and others added 6 commits May 5, 2025 18:33
wip

wip & debug

update

cleanup

use quark realquantizer for pack/quant/dequant

comment on cudagraph issue; remove prints

Keep only 1 place importing quark

cudagraph issue resolved; dq weight at load time for efficiency

Signed-off-by: Bowen Bao

lint

Signed-off-by: Bowen Bao

turn on emulation based on platform

Signed-off-by: Bowen Bao

add fused moe support - ugly wip

running version

Add envar if dequant weight at load time

Signed-off-by: Bowen Bao

Mxfp4 memory leak fixes (#2)

Signed-off-by: Felix Marty
Signed-off-by: Bowen Bao
Signed-off-by: Bowen Bao

Add test

Signed-off-by: Bowen Bao

revert rope local fix

Signed-off-by: Bowen Bao

remove print

Signed-off-by: Bowen Bao

rename scale calculation mode

Signed-off-by: Bowen Bao
BowenBao (Contributor Author) commented May 6, 2025

@mgoin friendly ping. We have a few more PRs lined up after this one and would greatly appreciate it if you could take another look!

mgoin (Member) commented May 6, 2025

Thank you for the ping! Will look today.

mgoin (Member) left a comment

This looks good to me for now as a skeleton. It would be good to add at least a basic emulation implementation to serve as a reference once kernel tests are added, like https://github.com/vllm-project/vllm/blob/621ca2c0aba8268d72d380fa3e479ddafa529479/tests/kernels/quantization/test_nvfp4_quant.py
This can be done in follow-up work, though.
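
A reference emulation of the kind suggested here could look roughly like the sketch below. It is an assumption-laden illustration, not the linked NVFP4 test and not code from this PR. It uses the floor-based shared-exponent rule from the OCP Microscaling specification, which is only one possible scale calculation mode, and it simplifies tie-breaking so that ties round toward the smaller magnitude. The name `reference_quantize_block` is hypothetical.

```python
import math

# Representable E2M1 magnitudes and the largest E2M1 exponent (6.0 = 1.5 * 2**2).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2


def reference_quantize_block(block):
    """Quantize-dequantize one block of floats through emulated MXFP4.

    Returns (shared_exponent, values), where values are the floats a
    bit-exact kernel would be expected to reproduce after dequantization.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    values = []
    for x in block:
        # Clamp to the largest representable magnitude, then pick the nearest.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        values.append(math.copysign(nearest, x) * scale)
    return shared_exp, values
```

A kernel test could then compare the kernel's dequantized output for each block against `reference_quantize_block` applied to the same inputs.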

@mgoin mgoin added the ready label ("ONLY add when PR is ready to merge/full CI is needed") May 6, 2025
BowenBao (Contributor Author) commented May 7, 2025

@mgoin makes sense. We will submit follow-ups regarding both suggestions.

@mgoin mgoin merged commit db593aa into vllm-project:main May 7, 2025
77 checks passed
princepride pushed a commit to princepride/vllm that referenced this pull request May 10, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@fxmarty-amd fxmarty-amd deleted the mxfp4 branch May 19, 2025 14:37
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025