[V1][performance] add multi step #26796
Conversation
Code Review
This pull request introduces a multi-step execution strategy to improve performance, which is a valuable addition. However, the current implementation has several critical correctness issues related to how it tracks the number of computed/processed tokens across multiple steps, especially when combined with speculative decoding. The logic incorrectly assumes that the number of processed tokens per step is constant, which leads to incorrect token positions and scheduler state. These issues need to be addressed to ensure the correctness of the model's output.
vllm/v1/worker/gpu_model_runner.py (Outdated)
```python
if self.curr_step > 0:
    last_step_computed_tokens = np.array(
        [len(x) for x in last_step_valid_sampled_token_ids],
        dtype=np.int32)
    self.input_batch.num_computed_tokens_cpu[req_indices] += \
        last_step_computed_tokens[req_indices]
```
The update to `self.input_batch.num_computed_tokens_cpu` inside `_prepare_inputs` for multi-step execution (`self.curr_step > 0`) is incorrect for two reasons:

- It increments `num_computed_tokens_cpu` by the number of accepted tokens (`len(x)` where `x` is from `last_step_valid_sampled_token_ids`), but it should be incremented by the number of tokens processed in the previous step in order to compute correct positions for the current step. The number of processed tokens is `1 + num_draft_tokens`.
- The update `self.input_batch.num_computed_tokens_cpu[req_indices] += ...` is applied per token via `req_indices`, which incorrectly updates the per-request `num_computed_tokens_cpu` array multiple times for requests that have more than one token in the current step.

These two bugs will lead to incorrect token positions being fed to the model, which is a critical correctness issue.
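A minimal sketch of one possible fix, updating the counter once per request by the number of tokens processed in the previous step; names such as `num_draft_tokens_last_step` and `num_reqs` are illustrative assumptions and not identifiers from this PR:

```python
import numpy as np

# Sketch only (assumed names: `num_draft_tokens_last_step` is a per-request
# list of draft-token counts from the previous step, `num_reqs` is the number
# of running requests). Each request's counter advances exactly once, by the
# tokens it actually processed last step: 1 sampled position + its draft tokens.
if self.curr_step > 0:
    last_step_processed_tokens = 1 + np.asarray(
        num_draft_tokens_last_step[:num_reqs], dtype=np.int32)
    self.input_batch.num_computed_tokens_cpu[:num_reqs] += \
        last_step_processed_tokens
```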
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: Zhao Xiaopeng <[email protected]>
Thanks @chengda-wu! Have you tried the async scheduling?
Signed-off-by: chengda_wu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: chengda-wu <[email protected]>
I tried async scheduling with Qwen2.5 just now, and it works well with multi-step.
Signed-off-by: chengda_wu <[email protected]>
Purpose
#20727
Although v1 introduced a lighter preprocessing and postprocessing pipeline compared to v0, we still observe notable inefficiencies on certain platforms (e.g., ARM + XPU). Specifically, there remains a latency gap between input preparation and the forward pass launch.
To address these issues, we introduce a multi-step execution strategy that improves efficiency and reduces scheduling overhead. It can be enabled via additional config, i.e., --additional-config='{"multi_step": N}', where N is the number of steps (see the example below).
This multi-step solution has already been extensively tested in omni-infer (https://gitee.com/omniai/omniinfer/blob/master/omni/adaptors/vllm/worker/npu_model_runner.py#L706). Based on our tests, ITL can be reduced by 1-2 ms on ARM + Ascend NPU (910C) for Qwen models.
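For illustration, a hedged sketch of enabling the feature from the Python API; it assumes `additional_config` is plumbed through the `LLM` entrypoint to the engine arguments, and the model name and step count are placeholders:

```python
from vllm import LLM, SamplingParams

# Sketch only: mirrors --additional-config='{"multi_step": 4}' on the CLI.
# The "multi_step" key comes from this PR; the step count of 4 and the model
# name are placeholder choices, not recommendations.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    additional_config={"multi_step": 4},
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```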
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.