[Attention] Unify mamba and attention backend selection #23171
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the 🚀 label.
Code Review
This pull request effectively unifies the attention backend selection logic for Mamba and standard attention layers, which is a great improvement for code maintainability and pluggability. The changes in the model runner and layer classes are well-implemented. My main feedback is on the testing side, where I've identified an opportunity to consolidate and improve the new tests for better clarity and correctness.
Force-pushed from a9d32c9 to f648aab
Force-pushed from f648aab to 13ce598
This pull request has merge conflicts that must be resolved before it can be merged.
I think this is a good and cleaner approach overall. Let's let @tdoublep / @heheda12345 give their opinions as well.
Force-pushed from 13ce598 to 6c756ad
Force-pushed from 6c756ad to 956451c
Force-pushed from 956451c to e0b056e
Hi @heheda12345 @LucasWilkinson @Josephasafg, this PR needed a rebase because …
Hi @LucasWilkinson @heheda12345 @Josephasafg, this is a gentle reminder to take a look at this PR. I have added all the suggested changes and addressed all the concerns raised so far. Please let me know if anything else needs to change.
Thanks! Only a few small comments.
Use Layer.get_attn_backend() interface for all layer types instead of separate mamba-specific backend selection logic. Signed-off-by: Ayush Satyam <[email protected]>
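For context, a minimal sketch of the layer-side interface this commit describes is shown below. The class and backend names mirror those mentioned in the PR, but the bodies are illustrative stand-ins rather than vLLM's actual implementation.

```python
# Illustrative stand-ins only: the real vLLM classes carry much more state.
from typing import Type


class AttentionBackend:
    """Placeholder for vLLM's attention backend base class (assumption)."""


class Mamba2AttentionBackend(AttentionBackend):
    """Placeholder for the backend that handles Mamba2 attention metadata."""


class MambaMixer2:
    """Simplified stand-in for the MambaMixer2 layer updated in this PR."""

    def get_attn_backend(self) -> Type[AttentionBackend]:
        # The layer reports its own backend, matching the interface that
        # regular attention layers already expose.
        return Mamba2AttentionBackend
```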
Force-pushed from cb65706 to 9901487
Thanks @heheda12345 for reviewing again. I addressed your comments; there was also another merge conflict that needed to be resolved, so I fixed that too. PTAL when you get time. Thanks!
LGTM! Thanks for the clean-up.
LGTM; thanks for doing this!
The CI checks have failed, @heheda12345 @LucasWilkinson, and I am not sure whether it is because of my changes. What should I do in this case? Am I supposed to patch a fix for this, or is this a flaky CI check?
Purpose
Fixes #23073. This PR unifies the mamba and attention backend selection logic by removing the separate `get_mamba_attn_backend()` function and implementing the standard `Layer.get_attn_backend()` interface for all layer types. It fixes the issue where mamba models used duplicate backend selection logic instead of the unified approach.

Changes Made:
- Removed `vllm/v1/attention/backends/mamba_selectors.py` to eliminate duplicate code
- Updated the `MambaMixer` and `MambaMixer2` classes to implement the `get_attn_backend()` method
- Updated `gpu_model_runner.py` to use the unified `get_attn_backends_for_layers()` function for all layer types

This makes mamba models more pluggable and follows the same pattern as regular attention layers:
- `MambaMixer.get_attn_backend()` → `Mamba1AttentionBackend`
- `MambaMixer2.get_attn_backend()` → `Mamba2AttentionBackend`
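As a rough illustration of why this unification helps, the sketch below shows how a model runner can group layers purely through `get_attn_backend()`, with no mamba-specific branch. The helper `group_by_backend` and the toy layer classes are hypothetical; only the `get_attn_backend()` interface and the backend names come from the PR.

```python
# Hypothetical sketch: toy layers expose get_attn_backend(), and the runner
# groups layer names by backend without any mamba-specific special casing.
from collections import defaultdict
from typing import Any, Dict, List, Type


class AttentionBackend: ...


class FlashAttentionBackend(AttentionBackend): ...


class Mamba1AttentionBackend(AttentionBackend): ...


class Attention:
    def get_attn_backend(self) -> Type[AttentionBackend]:
        return FlashAttentionBackend


class MambaMixer:
    def get_attn_backend(self) -> Type[AttentionBackend]:
        return Mamba1AttentionBackend


def group_by_backend(layers: Dict[str, Any]) -> Dict[Type[AttentionBackend], List[str]]:
    """Group layer names by the backend each layer reports for itself."""
    groups: Dict[Type[AttentionBackend], List[str]] = defaultdict(list)
    for name, layer in layers.items():
        groups[layer.get_attn_backend()].append(name)
    return groups


layers = {
    "model.layers.0.self_attn": Attention(),
    "model.layers.1.mixer": MambaMixer(),
}
print(group_by_backend(layers))
```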