-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
[KV Connector] More async support for get_num_new_matched_tokens
#23620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: ApostaC <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces an important enhancement to the KV connector API by allowing get_num_new_matched_tokens
to return None
, indicating an asynchronous lookup is in progress. This change will help prevent the scheduler from being blocked by remote KV cache lookups. The implementation in the scheduler to handle this new None
state appears correct. However, I've found a critical issue in MultiConnector
where the change in the return type of get_num_new_matched_tokens
was not correctly handled in its implementation, which will lead to a TypeError
. Please see my detailed comment.
Signed-off-by: ApostaC <[email protected]>
@njhill @robertgshaw2-redhat This is the one I mentioned today. Relatively simple change just to open up the new semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ApostaC, LGTM.
Would be good to get @ryanolson's thoughts too
# If there is a connector still looking up the matches, | ||
# we return None to indicate that we are not done yet. | ||
if toks is None: | ||
return (None, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reasonable but in theory you might prefer a different policy like prefer tokens from a connector that doesn't require a lookup delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agree with it!
Just to keep this PR simple, I created a follow-up issue (#24059) for this and will add a follow-up PR once this is merged. So far, the code here would not have any impact since none of the connectors would return None
for now.
Signed-off-by: ApostaC <[email protected]>
@njhill Hey Nick, do we want to merge this anytime soon, or do we still want to wait for anything? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @ApostaC. I've pinged NVIDIA folks again, we might want to hold off a bit longer in case they have comments.
LGTM as well! |
…llm-project#23620) Signed-off-by: ApostaC <[email protected]>
…llm-project#23620) Signed-off-by: ApostaC <[email protected]>
…llm-project#23620) Signed-off-by: ApostaC <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Inspired by the Dynamo team, this PR adds support for the "async remote lookup" into the connector API.
Purpose
For remote KV cache connectors, it may need some RPC calls to look up the number of remote hit tokens. However, this may slow down the scheduler loop and cause performance degradation on prefill heavy workloads.
This PR gets the first step towards addressing this problem -- it introduces a new semantics of "try again" in
get_num_new_matched_tokens
function.When calling the function, it may return
None
as the number of matched tokens, indicating that the connector needs more time to look up the hit tokens for this request. In this case, the scheduler should schedule other requests first.Test Plan
There is a PoC showing how much scheduling overhead this PR can reduce in #23622
Test Result
see #23622
Follow-ups
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.