
Conversation


@ericehanley ericehanley commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Fix an issue in the auto_tune.sh script where prefix_len and input_len were being passed to benchmark_serving.py incorrectly.

Also switch the python calls to python3 to stay in step with vllm/vllm-openai:latest.

When running the script with the following parameters:

TAG=$(date +"%Y_%m_%d_%H_%M")
BASE="/vllm-workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="GPU"
TP=1
DOWNLOAD_DIR="/vllm-workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="10"
NUM_BATCHED_TOKENS_LIST="512 1024 2048 4096"
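
For context, here is a quick arithmetic sketch (illustration only, not part of the script) of why this configuration fails, assuming the server is launched with a context of INPUT_LEN + OUTPUT_LEN tokens (consistent with the 1700-token limit in the error below) and that, before this fix, the full INPUT_LEN was passed as --random-input-len on top of the prefix:

    # Illustration only (not part of auto_tune.sh): why these parameters overflow the context.
    INPUT_LEN=1500; OUTPUT_LEN=200; MIN_CACHE_HIT_PCT=50

    max_model_len=$(( INPUT_LEN + OUTPUT_LEN ))            # 1700, the limit reported in the error below
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))  # 750
    old_prompt=$(( INPUT_LEN + prefix_len ))               # 2250: prefix was added on top of the full INPUT_LEN
    echo "requested $(( old_prompt + OUTPUT_LEN )) of $max_model_len tokens"   # 2450 of 1700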

The following error appears in the vllm_... logs:

INFO:     Started server process [6434]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:34636 - "GET /health HTTP/1.1" 200 OK
ERROR 07-22 08:51:06 [serving_completion.py:131] Error in preprocessing prompt inputs
ERROR 07-22 08:51:06 [serving_completion.py:131] Traceback (most recent call last):
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 123, in create_completion
ERROR 07-22 08:51:06 [serving_completion.py:131]     request_prompts, engine_prompts = await self._preprocess_completion(
ERROR 07-22 08:51:06 [serving_completion.py:131]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 728, in _preprocess_completion
ERROR 07-22 08:51:06 [serving_completion.py:131]     ) = await self._tokenize_prompt_input_or_inputs_async(
ERROR 07-22 08:51:06 [serving_completion.py:131]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
ERROR 07-22 08:51:06 [serving_completion.py:131]     result = self.fn(*self.args, **self.kwargs)
ERROR 07-22 08:51:06 [serving_completion.py:131]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 666, in _tokenize_prompt_input_or_inputs
ERROR 07-22 08:51:06 [serving_completion.py:131]     self._normalize_prompt_text_to_input(
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
ERROR 07-22 08:51:06 [serving_completion.py:131]     return self._validate_input(request, input_ids, input_text)
ERROR 07-22 08:51:06 [serving_completion.py:131]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 569, in _validate_input
ERROR 07-22 08:51:06 [serving_completion.py:131]     raise ValueError(
ERROR 07-22 08:51:06 [serving_completion.py:131] ValueError: This model's maximum context length is 1700 tokens. However, you requested 2450 tokens (2250 in the messages, 200 in the completion). Please reduce the length of the messages or completion.
INFO:     127.0.0.1:54458 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO 07-22 08:51:06 [api_server.py:931] Resetting prefix cache with specific None...
INFO 07-22 08:51:06 [block_pool.py:316] Successfully reset prefix cache
INFO:     127.0.0.1:54462 - "POST /reset_prefix_cache HTTP/1.1" 200 OK
ERROR 07-22 08:51:18 [serving_completion.py:131] Error in preprocessing prompt inputs
ERROR 07-22 08:51:18 [serving_completion.py:131] Traceback (most recent call last):
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 123, in create_completion
ERROR 07-22 08:51:18 [serving_completion.py:131]     request_prompts, engine_prompts = await self._preprocess_completion(
ERROR 07-22 08:51:18 [serving_completion.py:131]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 728, in _preprocess_completion
ERROR 07-22 08:51:18 [serving_completion.py:131]     ) = await self._tokenize_prompt_input_or_inputs_async(
ERROR 07-22 08:51:18 [serving_completion.py:131]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
ERROR 07-22 08:51:18 [serving_completion.py:131]     result = self.fn(*self.args, **self.kwargs)
ERROR 07-22 08:51:18 [serving_completion.py:131]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 666, in _tokenize_prompt_input_or_inputs
ERROR 07-22 08:51:18 [serving_completion.py:131]     self._normalize_prompt_text_to_input(
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
ERROR 07-22 08:51:18 [serving_completion.py:131]     return self._validate_input(request, input_ids, input_text)
ERROR 07-22 08:51:18 [serving_completion.py:131]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 569, in _validate_input
ERROR 07-22 08:51:18 [serving_completion.py:131]     raise ValueError(
**ERROR 07-22 08:51:18 [serving_completion.py:131] ValueError: This model's maximum context length is 1700 tokens. However, you requested 2450 tokens (2250 in the messages, 200 in the completion). Please reduce the length of the messages or completion.**
INFO:     127.0.0.1:52262 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO 07-22 08:51:18 [launcher.py:80] Shutting down FastAPI HTTP server.
[rank0]:[W722 08:51:19.972474209 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.

Per the current benchmark_serving.py documentation, random-prefix-len and random-input-len are additive, so input_len and prefix_len need to be calculated differently before being passed to benchmark_serving.py, as done in this PR.
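
Concretely, with the parameters above the corrected split keeps the total request inside the context window (a worked example, illustration only):

    # Worked example of the corrected calculation (illustration only).
    INPUT_LEN=1500; OUTPUT_LEN=200; MIN_CACHE_HIT_PCT=50

    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))                  # 750 shared prefix tokens
    adjusted_input_len=$(( INPUT_LEN * (100 - MIN_CACHE_HIT_PCT) / 100 ))  # 750 random tokens
    echo "prompt $(( prefix_len + adjusted_input_len )), total $(( prefix_len + adjusted_input_len + OUTPUT_LEN ))"   # prompt 1500, total 1700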

Test Plan

Update the code to calculate adjusted_input_len and pass it to benchmark_serving.py as shown:

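    # Split INPUT_LEN into a cached prefix (prefix_len) and a randomized remainder
    # (adjusted_input_len) so the total prompt stays at INPUT_LEN tokens.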
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
    adjusted_input_len=$(( INPUT_LEN * (100 - MIN_CACHE_HIT_PCT) / 100))
    python3 benchmarks/benchmark_serving.py \
        --backend vllm \
        --model $MODEL  \
        --dataset-name random \
        --random-input-len $adjusted_input_len \
        --random-output-len $OUTPUT_LEN \
        --ignore-eos \
        --disable-tqdm \
        --request-rate inf \
        --percentile-metrics ttft,tpot,itl,e2el \
        --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
        --num-prompts 1000 \
        --random-prefix-len $prefix_len \
        --port 8004 \
        --profile &> "$bm_log"

    throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
    e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
    goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')

    if (( $(echo "$e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        meet_latency_requirement=1
        request_rate=inf
    fi

    if (( ! meet_latency_requirement )); then
    # start from request-rate as int(throughput) + 1
        request_rate=$((${throughput%.*} + 1))
        while ((request_rate > 0)); do
            profile_index=$((profile_index+1))
            # clear prefix cache
            curl -X POST http://0.0.0.0:8004/reset_prefix_cache
            sleep 5
            bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
            python3 benchmarks/benchmark_serving.py \
                --backend vllm \
                --model $MODEL  \
                --dataset-name random \
                --random-input-len $adjusted_input_len \
                --random-output-len $OUTPUT_LEN \
                --ignore-eos \
                --disable-tqdm \
                --request-rate $request_rate \
                --percentile-metrics ttft,tpot,itl,e2el \
                --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
                --num-prompts 100 \
                --random-prefix-len $prefix_len \
                --port 8004 &> "$bm_log"
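
As a sanity check on the result below (illustrative only, not part of the change), the expected prefix cache hit rate follows directly from the split computed above:

    # Illustrative check, reusing prefix_len and adjusted_input_len from above:
    # expected hit rate ~ prefix tokens / total prompt tokens once the cache is warm.
    expected_hit_pct=$(( 100 * prefix_len / (prefix_len + adjusted_input_len) ))
    echo "expected prefix cache hit rate ~${expected_hit_pct}%"   # 50% here; the logs below report ~51%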

Test Result

Logs report a successful run with the expected prefix cache hit rate:

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:39676 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:55424 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 07-22 09:39:08 [api_server.py:1012] Starting profiler...
INFO 07-22 09:39:09 [api_server.py:1014] Profiler started.
INFO:     127.0.0.1:55428 - "POST /start_profile HTTP/1.1" 200 OK
INFO:     127.0.0.1:55434 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55440 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55446 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55458 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55466 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55482 - "POST /v1/completions HTTP/1.1" 200 OK
...
INFO 07-22 09:39:21 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 163.0 tokens/s, Running: 10 reqs, Waiting: 990 reqs, GPU KV cache usage: 19.0%, Prefix cache hit rate: 50.9%
INFO 07-22 09:39:31 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 158.9 tokens/s, Running: 10 reqs, Waiting: 980 reqs, GPU KV cache usage: 18.2%, Prefix cache hit rate: 51.0%
INFO 07-22 09:39:41 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 157.0 tokens/s, Running: 10 reqs, Waiting: 970 reqs, GPU KV cache usage: 17.4%, Prefix cache hit rate: 51.1%
INFO 07-22 09:39:51 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 156.0 tokens/s, Running: 10 reqs, Waiting: 960 reqs, GPU KV cache usage: 16.6%, Prefix cache hit rate: 51.1%
INFO 07-22 09:40:01 [loggers.py:118] Engine 000: Avg prompt throughput: 1499.9 tokens/s, Avg generation throughput: 158.0 tokens/s, Running: 10 reqs, Waiting: 950 reqs, GPU KV cache usage: 15.9%, Prefix cache hit rate: 51.1%

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the "performance" (Performance-related issues) label on Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug in the auto_tune.sh script where the total number of tokens for benchmarking was miscalculated when using prefix caching. The changes to adjust input_len and prefix_len are sound. I've provided one comment to make the calculation more robust against integer division truncation errors. The update to use python3 is also a good improvement.
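
For illustration, one way to address the truncation concern is to derive adjusted_input_len from prefix_len instead of computing both independently, so the two parts always sum exactly to INPUT_LEN (a sketch of the idea only; the follow-up commit may differ):

    # Sketch only: subtracting guarantees prefix_len + adjusted_input_len == INPUT_LEN
    # even when MIN_CACHE_HIT_PCT does not divide INPUT_LEN evenly.
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
    adjusted_input_len=$(( INPUT_LEN - prefix_len ))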

Collaborator

@yaochengji yaochengji left a comment


Thanks for catching this issue! Left a nit.

ericehanley and others added 2 commits July 22, 2025 12:39
Make adjusted_input_len calculation more robust

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Eric Hanley <[email protected]>
Contributor

@Chenyaaang Chenyaaang left a comment


Thanks for the fix, LGTM.

Collaborator

@yaochengji yaochengji left a comment


LGTM!

@yaochengji yaochengji enabled auto-merge (squash) July 22, 2025 19:16
@github-actions github-actions bot added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label on Jul 22, 2025
@vllm-bot vllm-bot merged commit 4f76a05 into vllm-project:main Jul 23, 2025
52 of 54 checks passed
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: qizixi <[email protected]>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: x22x22 <[email protected]>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jinzhen Lin <[email protected]>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Paul Pak <[email protected]>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Diego-Castan <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>