[v1] Hybrid Memory Allocator #17996
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
@heheda12345 Please rebase. It will help me review this PR.
Sure. Rebasing right now.
Force-pushed from 91941eb to ec55021
@WoosukKwon I've rebased this PR. It is ready for an initial code review. I'm working on unit tests and benchmarks now.
@heheda12345 I need some help to understand this PR. I've spent several hours reading this, but didn't get a clear picture. Let's chat offline.
@heheda12345 LGTM! Thanks so much for the tremendous effort on this PR. It must’ve been really tough. Really appreciate your hard work and patience!
@classmethod
def create_empty(cls) -> "KVCacheBlocks":
Is there a reason why this method was removed?
It's used here, so this change broke main:
vllm/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py
Lines 163 to 164 in 8267f99
c.update_state_after_alloc(request,
                           KVCacheBlocks.create_empty(), 0)
Yeah, I am also looking into the same thing.
It's replaced by `kv_cache_manager.create_empty_block_list()`, because the `KVCacheBlocks` class does not know the number of KV cache groups.
Not sure why this is not detected in CI 🤔
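For context, here is a minimal sketch of the shape of that change. The class bodies below are illustrative assumptions, not the actual vLLM code: the point is that building an empty KVCacheBlocks requires knowing how many KV cache groups exist, which only the manager does, so the factory naturally moves onto the manager.

```python
# Illustrative sketch only; field names and constructor details are assumptions.
from dataclasses import dataclass


@dataclass
class KVCacheBlocks:
    # One list of block IDs per KV cache group
    # (e.g. full-attention layers vs. sliding-window layers).
    blocks: list[list[int]]


class KVCacheManager:
    def __init__(self, num_kv_cache_groups: int) -> None:
        self.num_kv_cache_groups = num_kv_cache_groups

    def create_empty_block_list(self) -> KVCacheBlocks:
        # Only the manager knows the number of groups, so an empty
        # container can no longer be built by a classmethod on
        # KVCacheBlocks alone.
        return KVCacheBlocks(blocks=[[] for _ in range(self.num_kv_cache_groups)])


manager = KVCacheManager(num_kv_cache_groups=2)
assert manager.create_empty_block_list().blocks == [[], []]
```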
Looks like CI runs only on the branch, and the branch was rebased on main before the multi-connector change that causes the problem was merged, even though there were no "conflicts".
I can fix this in multi-connector.
@njhill Thank you!
Fixed by #19291
assert self.other_block_size % self.full_attention_block_size == 0, (
    "KVCacheCoordinator assumes the block_size of full attention "
    "layers is divisible by other layers now.")
@heheda12345 - according to the assert message, shouldn't it be self.full_attention_block_size % self.other_block_size == 0? Or should the message be updated?
Thanks for catching this problem. The assert message should be updated.
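A small sketch of the intended invariant, with the message reworded to match the code. The class here is a stand-in for illustration, not the real KVCacheCoordinator; only the assert mirrors the snippet above.

```python
# Stand-in class to illustrate the divisibility assumption discussed above.
class CoordinatorSketch:
    def __init__(self, other_block_size: int, full_attention_block_size: int) -> None:
        self.other_block_size = other_block_size
        self.full_attention_block_size = full_attention_block_size
        # The code checks other_block_size % full_attention_block_size == 0,
        # i.e. the block size of the non-full-attention layers must be a
        # multiple of the full-attention block size, so the message should
        # state that relationship rather than the reverse.
        assert self.other_block_size % self.full_attention_block_size == 0, (
            "KVCacheCoordinator assumes the block_size of other layers is "
            "divisible by the block_size of full attention layers.")


# Example: other layers use 32-token blocks, full attention uses 16-token blocks.
CoordinatorSketch(other_block_size=32, full_attention_block_size=16)
```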
vLLM v0.9.1 contains a bug that causes vllm-spyre to hang on boot-up: it does not respect `num_gpu_blocks_overrides`. The bug was introduced in vllm-project/vllm#17996 and fixed in vllm-project/vllm#19503.
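As a rough illustration of what "respecting the override" means (the function and variable names below are assumptions, not vLLM's actual API): a user-supplied block count must take precedence over the value derived from memory profiling.

```python
# Illustrative only; vLLM's real resolution logic lives in its engine/cache
# configuration code, not in a helper like this.
from typing import Optional


def resolve_num_gpu_blocks(profiled_num_blocks: int,
                           num_gpu_blocks_override: Optional[int]) -> int:
    # If the user set an explicit override, it wins over the number of blocks
    # computed from GPU memory profiling. The reported regression was that the
    # override was silently ignored.
    if num_gpu_blocks_override is not None:
        return num_gpu_blocks_override
    return profiled_num_blocks


assert resolve_num_gpu_blocks(8192, None) == 8192
assert resolve_num_gpu_blocks(8192, 2048) == 2048
```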
Should be merged after the worker-side change #17945 and the KV cache manager changes #17999, #18001, #18003.
This PR contains the KV cache manager part of the hybrid memory allocator. Most worker-side changes in this PR are already included in #17945, so please only review code inside vllm/v1/core at this moment. See #16101 and #13296 for the design.
Correctness
model: google/gemma-3-12b-it
this PR, gsm8k
lm_eval --model vllm --tasks gsm8k --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, gsm8k
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks gsm8k --model_args pretrained=google/gemma-3-12b-it --batch_size auto
this PR, mmlu
lm_eval --model vllm --tasks mmlu --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, mmlu
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks mmlu --model_args pretrained=google/gemma-3-12b-it --batch_size auto
this PR, mmlu_pro
lm_eval --model vllm --tasks mmlu_pro --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, mmlu_pro
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks mmlu_pro --model_args pretrained=google/gemma-3-12b-it --batch_size auto
Will add performance benchmark results later.