[P/D] KV Load Failure Recovery/Abort Configuration #26813

wseaton · 2025-10-14T14:31:22Z

Purpose

In some deployments, an operator may not want KV load failure recovery to fall back to a local prefill on a decode node. This PR adds the plumbing to make KV load failures bubble up to the API server as a 502 Bad Gateway, which can then be handled properly at the proxy layer in a P/D setup.

We introduce a new FINISHED_ERROR RequestStatus that the API server process can catch in order to return the semantically correct error.
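As a rough illustration of the intended flow, here is a minimal sketch of how an API layer could translate the new status into a 502. Only `FINISHED_ERROR` and `RequestStatus` come from this PR; the other enum members and the `http_status_for` helper are hypothetical names used for illustration and are not the actual vLLM code.

```python
import enum
from http import HTTPStatus


class RequestStatus(enum.IntEnum):
    # Hypothetical subset for illustration; only FINISHED_ERROR is named in this PR.
    RUNNING = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_ERROR = enum.auto()


def http_status_for(status: RequestStatus) -> HTTPStatus:
    """Map a terminal request status to the HTTP status returned to the client.

    Hypothetical helper: a KV load failure surfaces as 502 Bad Gateway so that
    a proxy in front of the decode node can decide how to handle it.
    """
    if status is RequestStatus.FINISHED_ERROR:
        return HTTPStatus.BAD_GATEWAY
    return HTTPStatus.OK
```

With this shape, the proxy only needs to look at the response code to tell a KV load failure apart from a normal completion.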

Signed-off-by: Will Eaton <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a configurable policy for handling KV cache load failures, allowing operators to choose between recomputing failed blocks or aborting the request. The implementation involves adding a new FinishReason.ERROR and RequestStatus.FINISHED_ERROR, updating the scheduler to handle the new policy, and propagating the error up to the OpenAI API layer to return an appropriate error to the client.
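The two policies can be sketched roughly as follows. This is a simplified illustration, not the scheduler code from the PR: `KVLoadFailurePolicy`, `SketchRequest`, and `handle_kv_load_failure` are hypothetical names, and treating recovery as "roll the computed-token count back to the first failed position" is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class KVLoadFailurePolicy(Enum):
    # Hypothetical name and values; not taken from the PR.
    RECOMPUTE = "recompute"
    ABORT = "abort"


@dataclass
class SketchRequest:
    # Hypothetical stand-in for the scheduler's request object.
    request_id: str
    num_computed_tokens: int
    status: str = "RUNNING"


def handle_kv_load_failure(
    req: SketchRequest,
    first_failed_token: int,
    policy: KVLoadFailurePolicy,
) -> bool:
    """Apply the configured policy; return True if the request was aborted."""
    if policy is KVLoadFailurePolicy.ABORT:
        # Fail the request so the error can propagate to the API layer.
        req.status = "FINISHED_ERROR"
        return True
    # Recompute: discard everything from the first failed position onward,
    # so those tokens are prefilled locally on the next scheduling step.
    req.num_computed_tokens = min(req.num_computed_tokens, first_failed_token)
    return False
```

Under the abort policy the request's progress is left untouched and only its status changes; under the recompute policy the request stays alive and redoes the affected portion.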

The changes are well-structured. However, I've found one critical issue where an internal data structure (FINISH_REASON_STRINGS) was not updated to reflect the new error state, which will lead to an IndexError and a server crash when an error needs to be reported through the API. Please see the detailed comment.
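The failure mode described here can be reproduced in isolation. The sketch below assumes that the finish reason is an integer enum used to index a tuple of strings; the enum values and both tuples are illustrative assumptions, not copied from the vLLM source.

```python
import enum


class FinishReason(enum.IntEnum):
    # Assumed values for illustration; ERROR is the member added by this PR.
    STOP = 0
    LENGTH = 1
    ABORT = 2
    ERROR = 3


# Lookup table that was not extended when ERROR was added (assumed contents).
STALE_STRINGS = ("stop", "length", "abort")
# Lookup table with an entry for every enum member.
UPDATED_STRINGS = ("stop", "length", "abort", "error")


def finish_reason_str(reason: FinishReason, table: tuple) -> str:
    """Look up the string for a finish reason by using the enum as an index."""
    return table[reason]
```

Calling `finish_reason_str(FinishReason.ERROR, STALE_STRINGS)` raises `IndexError` because index 3 is past the end of the three-element tuple, while the same call with `UPDATED_STRINGS` succeeds.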

vllm/v1/engine/__init__.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

vllm/v1/engine/__init__.py

wseaton · 2025-10-14T20:24:53Z

vllm/v1/core/sched/scheduler.py

+            return set()
+
+        # --- Take action based on policy ---
+        if abort:


I'm undecided on whether to extract this branch into a helper method.

wseaton · 2025-10-14T20:25:34Z

@njhill @NickLucche this is ready for review. Also cc @sdavidbd, since it interacts with the block-level recovery mechanism.

initial pass

09924b5

Signed-off-by: Will Eaton <[email protected]>

wseaton requested review from ApostaC, ProExpertProg, WoosukKwon, aarnphm, alexm-redhat, chaunceyjiang, comaniac, heheda12345, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256, youkaichao and ywang96 as code owners October 14, 2025 14:31

mergify bot added frontend v1 kv-connector labels Oct 14, 2025

gemini-code-assist bot reviewed Oct 14, 2025

chatgpt-codex-connector bot reviewed Oct 14, 2025

wseaton requested review from DarkLight1337 and NickLucche as code owners October 14, 2025 17:46

wseaton added 2 commits October 14, 2025 13:57

fix logging; throw 502; compactness

db3d51e

Signed-off-by: Will Eaton <[email protected]>

add supporting entrypoint and scheduler unit tests

755e628

Signed-off-by: Will Eaton <[email protected]>

wseaton force-pushed the configurable-prefill-recovery branch from 7b72907 to 755e628 Compare October 14, 2025 17:58

error helpers

5cedeb3

Signed-off-by: Will Eaton <[email protected]>

parameterize tests

5247421

Signed-off-by: Will Eaton <[email protected]>
