[v1] Hybrid Memory Allocator #17996
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
@heheda12345 Please rebase. It will help me review this PR.
Sure. Rebasing right now.
Force-pushed from 91941eb to ec55021
@WoosukKwon I've rebased this PR. It is ready for an initial code review. I'm working on unit tests and benchmarks now.
@heheda12345 I need some help to understand this PR. I've spent several hours reading this, but didn't get a clear picture. Let's chat offline.
@heheda12345 LGTM! Thanks so much for the tremendous effort on this PR. It must’ve been really tough. Really appreciate your hard work and patience!
@classmethod
def create_empty(cls) -> "KVCacheBlocks":
Is there a reason why this method was removed?
It's used here, so this change broke main:
vllm/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py
Lines 163 to 164 in 8267f99
c.update_state_after_alloc(request,
                           KVCacheBlocks.create_empty(), 0)
Yeah, I am also looking into the same thing.
It's replaced by `kv_cache_manager.create_empty_block_list()`, because the `KVCacheBlocks` class does not know the number of KV cache groups.
Not sure why this is not detected in CI 🤔
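For context, here is a minimal sketch of the shape of that change. The class bodies below are illustrative assumptions, not the actual vLLM code: the point is that building an empty KVCacheBlocks requires knowing how many KV cache groups exist, which only the manager does, so the factory naturally moves onto the manager.

```python
# Illustrative sketch only; field names and constructor details are assumptions.
from dataclasses import dataclass


@dataclass
class KVCacheBlocks:
    # One list of block IDs per KV cache group
    # (e.g. full-attention layers vs. sliding-window layers).
    blocks: list[list[int]]


class KVCacheManager:
    def __init__(self, num_kv_cache_groups: int) -> None:
        self.num_kv_cache_groups = num_kv_cache_groups

    def create_empty_block_list(self) -> KVCacheBlocks:
        # Only the manager knows the number of groups, so an empty
        # container can no longer be built by a classmethod on
        # KVCacheBlocks alone.
        return KVCacheBlocks(blocks=[[] for _ in range(self.num_kv_cache_groups)])


manager = KVCacheManager(num_kv_cache_groups=2)
assert manager.create_empty_block_list().blocks == [[], []]
```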
Looks like CI runs only on the branch, and the branch was rebased on main before the multi-connector change that causes the problem was merged, even though there were no "conflicts".
I can fix this in multi-connector.
@njhill Thank you!
Fixed by #19291
assert self.other_block_size % self.full_attention_block_size == 0, (
    "KVCacheCoordinator assumes the block_size of full attention "
    "layers is divisible by other layers now.")
@heheda12345 - according to the assert message, shouldn't it be self.full_attention_block_size % self.other_block_size == 0? Or should the message be updated?
Thanks for catching this problem. The assert message should be updated.
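A small sketch of the intended invariant, with the message reworded to match the code. The class here is a stand-in for illustration, not the real KVCacheCoordinator; only the assert mirrors the snippet above.

```python
# Stand-in class to illustrate the divisibility assumption discussed above.
class CoordinatorSketch:
    def __init__(self, other_block_size: int, full_attention_block_size: int) -> None:
        self.other_block_size = other_block_size
        self.full_attention_block_size = full_attention_block_size
        # The code checks other_block_size % full_attention_block_size == 0,
        # i.e. the block size of the non-full-attention layers must be a
        # multiple of the full-attention block size, so the message should
        # state that relationship rather than the reverse.
        assert self.other_block_size % self.full_attention_block_size == 0, (
            "KVCacheCoordinator assumes the block_size of other layers is "
            "divisible by the block_size of full attention layers.")


# Example: other layers use 32-token blocks, full attention uses 16-token blocks.
CoordinatorSketch(other_block_size=32, full_attention_block_size=16)
```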
vLLM v0.9.1 contains a bug that causes vllm-spyre to hang on boot-up: it does not respect `num_gpu_blocks_overrides`. The bug was introduced in vllm-project/vllm#17996 and fixed in vllm-project/vllm#19503.
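As a rough illustration of what "respecting the override" means (the function and variable names below are assumptions, not vLLM's actual API): a user-supplied block count must take precedence over the value derived from memory profiling.

```python
# Illustrative only; vLLM's real resolution logic lives in its engine/cache
# configuration code, not in a helper like this.
from typing import Optional


def resolve_num_gpu_blocks(profiled_num_blocks: int,
                           num_gpu_blocks_override: Optional[int]) -> int:
    # If the user set an explicit override, it wins over the number of blocks
    # computed from GPU memory profiling. The reported regression was that the
    # override was silently ignored.
    if num_gpu_blocks_override is not None:
        return num_gpu_blocks_override
    return profiled_num_blocks


assert resolve_num_gpu_blocks(8192, None) == 8192
assert resolve_num_gpu_blocks(8192, 2048) == 2048
```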
Should be merged after the worker-side change #17945 and the KV cache manager changes #17999, #18001, #18003.
This PR contains the KV cache manager part of the hybrid memory allocator. Most worker-side changes in this PR are already included in #17945, so please only review code inside vllm/v1/core at this moment. See #16101 and #13296 for the design.
Correctness
model: google/gemma-3-12b-it
this PR, gsm8k
lm_eval --model vllm --tasks gsm8k --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, gsm8k
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks gsm8k --model_args pretrained=google/gemma-3-12b-it --batch_size auto
this PR, mmlu
lm_eval --model vllm --tasks mmlu --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, mmlu
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks mmlu --model_args pretrained=google/gemma-3-12b-it --batch_size auto
this PR, mmlu_pro
lm_eval --model vllm --tasks mmlu_pro --model_args pretrained=google/gemma-3-12b-it --batch_size auto
main, mmlu_pro
uvx --with vllm --extra-index-url https://wheels.vllm.ai/e60f550b3825cbce2d3c7e882b029e2c1d914d8d lm_eval --model vllm --tasks mmlu_pro --model_args pretrained=google/gemma-3-12b-it --batch_size auto
Will add performance benchmark results later.