
Conversation

@chengda-wu chengda-wu commented Oct 14, 2025

Purpose

#20727

Although v1 introduced a lighter preprocessing and postprocessing pipeline compared to v0, we still observe notable inefficiencies on certain platforms (e.g., ARM + XPU). Specifically, there remains a latency gap between input preparation and the forward pass launch.

To address these issues, we introduce a multi-step execution strategy that improves efficiency and reduces scheduling overhead. It can be enabled via an additional config, i.e., --additional-config='{"multi_step": <num_steps>}'. The key aspects are as follows:

  • We introduce a multi-step loop to reduce redundant CPU-side operations in the model runner, alleviating scheduling overhead between consecutive forward passes (a minimal sketch of this loop follows this list).
  • We schedule multiple model-runner steps at once within the engine core and scheduler, modifying the postprocessing logic to handle multi-step outputs efficiently.
  • We enable the multi-step mechanism to support speculative decoding when the number of draft tokens is the same for all requests, as with MTP.
  • We plan to add stop-condition checks within the multi-step loop to prevent unnecessary computation once most requests are completed.
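
For illustration, here is a minimal sketch (not the actual implementation) of what such a multi-step loop could look like inside the model runner; the runner argument and the helper names prepare_inputs, run_forward, and sample_tokens are placeholders for the corresponding model-runner internals:

# Hypothetical sketch of the multi-step loop. Each iteration reuses the
# same scheduler output and only advances per-request CPU-side state
# (e.g. num_computed_tokens), avoiding a round trip through the engine
# core between consecutive forward passes.
def run_multi_step(runner, scheduler_output, num_steps):
    outputs = []
    for step in range(num_steps):
        runner.curr_step = step
        model_input = runner.prepare_inputs(scheduler_output)  # CPU-side prep
        hidden_states = runner.run_forward(model_input)        # device forward pass
        outputs.append(runner.sample_tokens(hidden_states, model_input))
        # Planned: break early once most requests hit a stop condition.
    return outputs

With, e.g., --additional-config='{"multi_step": 4}', the runner would execute four such steps per scheduling round before returning control to the engine core and postprocessing.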

This multi-step solution has already been tested extensively in omni-infer (https://gitee.com/omniai/omniinfer/blob/master/omni/adaptors/vllm/worker/npu_model_runner.py#L706). Based on our tests, ITL can be reduced by 1~2 ms on ARM + Ascend NPU (910C) for Qwen models.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a multi-step execution strategy to improve performance, which is a valuable addition. However, the current implementation has several critical correctness issues related to how it tracks the number of computed/processed tokens across multiple steps, especially when combined with speculative decoding. The logic incorrectly assumes that the number of processed tokens per step is constant, which leads to incorrect token positions and scheduler state. These issues need to be addressed to ensure the correctness of the model's output.

Comment on lines 1117 to 1122
if self.curr_step > 0:
    last_step_computed_tokens = np.array(
        [len(x) for x in last_step_valid_sampled_token_ids],
        dtype=np.int32)
    self.input_batch.num_computed_tokens_cpu[req_indices] += \
        last_step_computed_tokens[req_indices]

critical

The update to self.input_batch.num_computed_tokens_cpu inside _prepare_inputs for multi-step execution (self.curr_step > 0) is incorrect for two reasons:

  1. It increments num_computed_tokens_cpu by the number of accepted tokens (len(x) where x is from last_step_valid_sampled_token_ids), but it should be incremented by the number of processed tokens in the previous step to calculate correct positions for the current step. The number of processed tokens is 1 + num_draft_tokens.
  2. The update self.input_batch.num_computed_tokens_cpu[req_indices] += ... is applied per-token using req_indices, which will incorrectly update the per-request num_computed_tokens_cpu array multiple times for requests that have more than one token in the current step.

These two bugs will lead to incorrect token positions being fed to the model, which is a critical correctness issue.
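
For concreteness, a sketch of the kind of per-request update this comment is pointing toward, assuming (as this PR does) that every request has the same number of draft tokens; num_draft_tokens is illustrative here, while self, np, and req_indices come from the surrounding code:

if self.curr_step > 0:
    # Tokens processed per request in the previous step: one verified
    # token plus the draft tokens, regardless of how many were accepted.
    num_processed = 1 + num_draft_tokens
    # Increment each request exactly once, not once per token.
    unique_req_indices = np.unique(req_indices)
    self.input_batch.num_computed_tokens_cpu[unique_req_indices] += num_processed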

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Member

njhill commented Oct 14, 2025

Thanks @chengda-wu! Have you tried the --async-scheduling option, which is now in V1? Is the issue that this doesn't work with XPU?

mergify bot commented Oct 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chengda-wu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 15, 2025
@mergify mergify bot removed the needs-rebase label Oct 15, 2025
@chengda-wu
Author

Thanks @chengda-wu! Have you tried the --async-scheduling option, which is now in V1? Is the issue that this doesn't work with XPU?

I tried async scheduling with Qwen2.5 just now, and it works well with multi-step.
