[V1][performance] add multi step #26796
Conversation
Code Review
This pull request introduces a multi-step execution strategy to improve performance, which is a valuable addition. However, the current implementation has several critical correctness issues related to how it tracks the number of computed/processed tokens across multiple steps, especially when combined with speculative decoding. The logic incorrectly assumes that the number of processed tokens per step is constant, which leads to incorrect token positions and scheduler state. These issues need to be addressed to ensure the correctness of the model's output.
vllm/v1/worker/gpu_model_runner.py (Outdated)
```python
if self.curr_step > 0:
    last_step_computed_tokens = np.array(
        [len(x) for x in last_step_valid_sampled_token_ids],
        dtype=np.int32)
    self.input_batch.num_computed_tokens_cpu[req_indices] += \
        last_step_computed_tokens[req_indices]
```
The update to `self.input_batch.num_computed_tokens_cpu` inside `_prepare_inputs` for multi-step execution (`self.curr_step > 0`) is incorrect for two reasons:

- It increments `num_computed_tokens_cpu` by the number of accepted tokens (`len(x)` where `x` is from `last_step_valid_sampled_token_ids`), but it should be incremented by the number of tokens processed in the previous step in order to compute correct positions for the current step. The number of processed tokens is `1 + num_draft_tokens`.
- The update `self.input_batch.num_computed_tokens_cpu[req_indices] += ...` is applied per token via `req_indices`, which incorrectly updates the per-request `num_computed_tokens_cpu` array multiple times for requests that have more than one token in the current step.

These two bugs will lead to incorrect token positions being fed to the model, which is a critical correctness issue.
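A minimal sketch of one possible fix, updating the counter once per request by the number of tokens processed in the previous step; names such as `num_draft_tokens_last_step` and `num_reqs` are illustrative assumptions and not identifiers from this PR:

```python
import numpy as np

# Sketch only (assumed names: `num_draft_tokens_last_step` is a per-request
# list of draft-token counts from the previous step, `num_reqs` is the number
# of running requests). Each request's counter advances exactly once, by the
# tokens it actually processed last step: 1 sampled position + its draft tokens.
if self.curr_step > 0:
    last_step_processed_tokens = 1 + np.asarray(
        num_draft_tokens_last_step[:num_reqs], dtype=np.int32)
    self.input_batch.num_computed_tokens_cpu[:num_reqs] += \
        last_step_processed_tokens
```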
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: Zhao Xiaopeng <[email protected]>
Thanks @chengda-wu! Have you tried the async scheduling?
Signed-off-by: chengda_wu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: chengda-wu <[email protected]>
I tried async scheduling with Qwen2.5 just now, and it works well with multi-step.
Signed-off-by: chengda_wu <[email protected]>
Purpose
#20727
Although v1 introduced a lighter preprocessing and postprocessing pipeline compared to v0, we still observe notable inefficiencies on certain platforms (e.g., ARM + XPU). Specifically, there remains a latency gap between input preparation and the forward pass launch.
To address these issues, we introduce a multi-step execution strategy that improves efficiency and reduces scheduling overhead. It can be enabled via additional config, i.e., --additional-config='{"multi_step": N}', where N is the number of steps (see the example below).
This multi-step solution has already been extensively tested in omni-infer (https://gitee.com/omniai/omniinfer/blob/master/omni/adaptors/vllm/worker/npu_model_runner.py#L706). Based on our tests, ITL can be reduced by 1-2 ms on ARM + Ascend NPU (910C) for Qwen models.
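For illustration, a hedged sketch of enabling the feature from the Python API; it assumes `additional_config` is plumbed through the `LLM` entrypoint to the engine arguments, and the model name and step count are placeholders:

```python
from vllm import LLM, SamplingParams

# Sketch only: mirrors --additional-config='{"multi_step": 4}' on the CLI.
# The "multi_step" key comes from this PR; the step count of 4 and the model
# name are placeholder choices, not recommendations.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    additional_config={"multi_step": 4},
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```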
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.