Skip to content

Conversation

ApostaC
Copy link
Collaborator

@ApostaC ApostaC commented Aug 26, 2025

Inspired by the Dynamo team, this PR adds support for the "async remote lookup" into the connector API.

Purpose

For remote KV cache connectors, it may need some RPC calls to look up the number of remote hit tokens. However, this may slow down the scheduler loop and cause performance degradation on prefill heavy workloads.

This PR gets the first step towards addressing this problem -- it introduces a new semantics of "try again" in get_num_new_matched_tokens function.

When calling the function, it may return None as the number of matched tokens, indicating that the connector needs more time to look up the hit tokens for this request. In this case, the scheduler should schedule other requests first.

Test Plan

There is a PoC showing how much scheduling overhead this PR can reduce in #23622

Test Result

see #23622

Follow-ups


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an important enhancement to the KV connector API by allowing get_num_new_matched_tokens to return None, indicating an asynchronous lookup is in progress. This change will help prevent the scheduler from being blocked by remote KV cache lookups. The implementation in the scheduler to handle this new None state appears correct. However, I've found a critical issue in MultiConnector where the change in the return type of get_num_new_matched_tokens was not correctly handled in its implementation, which will lead to a TypeError. Please see my detailed comment.

@ApostaC
Copy link
Collaborator Author

ApostaC commented Aug 26, 2025

@njhill @robertgshaw2-redhat This is the one I mentioned today. Relatively simple change just to open up the new semantics.
Feel free to review and leave your comments, thanks!

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ApostaC, LGTM.

Would be good to get @ryanolson's thoughts too

Comment on lines +148 to +151
# If there is a connector still looking up the matches,
# we return None to indicate that we are not done yet.
if toks is None:
return (None, False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable but in theory you might prefer a different policy like prefer tokens from a connector that doesn't require a lookup delay.

Copy link
Collaborator Author

@ApostaC ApostaC Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree with it!
Just to keep this PR simple, I created a follow-up issue (#24059) for this and will add a follow-up PR once this is merged. So far, the code here would not have any impact since none of the connectors would return None for now.

@ApostaC
Copy link
Collaborator Author

ApostaC commented Sep 4, 2025

@njhill Hey Nick, do we want to merge this anytime soon, or do we still want to wait for anything?

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @ApostaC. I've pinged NVIDIA folks again, we might want to hold off a bit longer in case they have comments.

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 5, 2025
@YaoJiayi YaoJiayi mentioned this pull request Sep 6, 2025
4 tasks
@ptarasiewiczNV
Copy link
Contributor

LGTM as well!

@simon-mo simon-mo merged commit b4a01aa into vllm-project:main Sep 10, 2025
40 of 43 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants