
Conversation


@ericehanley ericehanley commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Fix an issue in the auto_tune.sh script where prefix_len and input_len were being passed to benchmark_serving.py incorrectly.

Also switch the python calls to python3 to stay in step with vllm/vllm-openai:latest.

When running the script with the following parameters:

TAG=$(date +"%Y_%m_%d_%H_%M")
BASE="/vllm-workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="GPU"
TP=1
DOWNLOAD_DIR="/vllm-workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="10"
NUM_BATCHED_TOKENS_LIST="512 1024 2048 4096"
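
For context, here is a quick arithmetic sketch (illustration only, not part of the script) of why this configuration fails, assuming the server is launched with a context of INPUT_LEN + OUTPUT_LEN tokens (consistent with the 1700-token limit in the error below) and that, before this fix, the full INPUT_LEN was passed as --random-input-len on top of the prefix:

    # Illustration only (not part of auto_tune.sh): why these parameters overflow the context.
    INPUT_LEN=1500; OUTPUT_LEN=200; MIN_CACHE_HIT_PCT=50

    max_model_len=$(( INPUT_LEN + OUTPUT_LEN ))            # 1700, the limit reported in the error below
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))  # 750
    old_prompt=$(( INPUT_LEN + prefix_len ))               # 2250: prefix was added on top of the full INPUT_LEN
    echo "requested $(( old_prompt + OUTPUT_LEN )) of $max_model_len tokens"   # 2450 of 1700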

The following error appears in the vllm_... logs:

INFO:     Started server process [6434]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:34636 - "GET /health HTTP/1.1" 200 OK
ERROR 07-22 08:51:06 [serving_completion.py:131] Error in preprocessing prompt inputs
ERROR 07-22 08:51:06 [serving_completion.py:131] Traceback (most recent call last):
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 123, in create_completion
ERROR 07-22 08:51:06 [serving_completion.py:131]     request_prompts, engine_prompts = await self._preprocess_completion(
ERROR 07-22 08:51:06 [serving_completion.py:131]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 728, in _preprocess_completion
ERROR 07-22 08:51:06 [serving_completion.py:131]     ) = await self._tokenize_prompt_input_or_inputs_async(
ERROR 07-22 08:51:06 [serving_completion.py:131]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
ERROR 07-22 08:51:06 [serving_completion.py:131]     result = self.fn(*self.args, **self.kwargs)
ERROR 07-22 08:51:06 [serving_completion.py:131]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 666, in _tokenize_prompt_input_or_inputs
ERROR 07-22 08:51:06 [serving_completion.py:131]     self._normalize_prompt_text_to_input(
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
ERROR 07-22 08:51:06 [serving_completion.py:131]     return self._validate_input(request, input_ids, input_text)
ERROR 07-22 08:51:06 [serving_completion.py:131]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:06 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 569, in _validate_input
ERROR 07-22 08:51:06 [serving_completion.py:131]     raise ValueError(
ERROR 07-22 08:51:06 [serving_completion.py:131] ValueError: This model's maximum context length is 1700 tokens. However, you requested 2450 tokens (2250 in the messages, 200 in the completion). Please reduce the length of the messages or completion.
INFO:     127.0.0.1:54458 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO 07-22 08:51:06 [api_server.py:931] Resetting prefix cache with specific None...
INFO 07-22 08:51:06 [block_pool.py:316] Successfully reset prefix cache
INFO:     127.0.0.1:54462 - "POST /reset_prefix_cache HTTP/1.1" 200 OK
ERROR 07-22 08:51:18 [serving_completion.py:131] Error in preprocessing prompt inputs
ERROR 07-22 08:51:18 [serving_completion.py:131] Traceback (most recent call last):
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 123, in create_completion
ERROR 07-22 08:51:18 [serving_completion.py:131]     request_prompts, engine_prompts = await self._preprocess_completion(
ERROR 07-22 08:51:18 [serving_completion.py:131]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 728, in _preprocess_completion
ERROR 07-22 08:51:18 [serving_completion.py:131]     ) = await self._tokenize_prompt_input_or_inputs_async(
ERROR 07-22 08:51:18 [serving_completion.py:131]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
ERROR 07-22 08:51:18 [serving_completion.py:131]     result = self.fn(*self.args, **self.kwargs)
ERROR 07-22 08:51:18 [serving_completion.py:131]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 666, in _tokenize_prompt_input_or_inputs
ERROR 07-22 08:51:18 [serving_completion.py:131]     self._normalize_prompt_text_to_input(
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
ERROR 07-22 08:51:18 [serving_completion.py:131]     return self._validate_input(request, input_ids, input_text)
ERROR 07-22 08:51:18 [serving_completion.py:131]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-22 08:51:18 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 569, in _validate_input
ERROR 07-22 08:51:18 [serving_completion.py:131]     raise ValueError(
**ERROR 07-22 08:51:18 [serving_completion.py:131] ValueError: This model's maximum context length is 1700 tokens. However, you requested 2450 tokens (2250 in the messages, 200 in the completion). Please reduce the length of the messages or completion.**
INFO:     127.0.0.1:52262 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO 07-22 08:51:18 [launcher.py:80] Shutting down FastAPI HTTP server.
[rank0]:[W722 08:51:19.972474209 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.

Per the current benchmark_serving.py documentation, random-prefix-len and random-input-len are additive, so input_len and prefix_len need to be calculated differently before being passed to benchmark_serving.py, as done in this PR.
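
Concretely, with the parameters above the corrected split keeps the total request inside the context window (a worked example, illustration only):

    # Worked example of the corrected calculation (illustration only).
    INPUT_LEN=1500; OUTPUT_LEN=200; MIN_CACHE_HIT_PCT=50

    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))                  # 750 shared prefix tokens
    adjusted_input_len=$(( INPUT_LEN * (100 - MIN_CACHE_HIT_PCT) / 100 ))  # 750 random tokens
    echo "prompt $(( prefix_len + adjusted_input_len )), total $(( prefix_len + adjusted_input_len + OUTPUT_LEN ))"   # prompt 1500, total 1700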

Test Plan

Update the code to calculate adjusted_input_len and pass it to benchmark_serving.py as shown:

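    # Split INPUT_LEN into a cached prefix (prefix_len) and a randomized remainder
    # (adjusted_input_len) so the total prompt stays at INPUT_LEN tokens.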
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
    adjusted_input_len=$(( INPUT_LEN * (100 - MIN_CACHE_HIT_PCT) / 100))
    python3 benchmarks/benchmark_serving.py \
        --backend vllm \
        --model $MODEL  \
        --dataset-name random \
        --random-input-len $adjusted_input_len \
        --random-output-len $OUTPUT_LEN \
        --ignore-eos \
        --disable-tqdm \
        --request-rate inf \
        --percentile-metrics ttft,tpot,itl,e2el \
        --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
        --num-prompts 1000 \
        --random-prefix-len $prefix_len \
        --port 8004 \
        --profile &> "$bm_log"

    throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
    e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
    goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')

    if (( $(echo "$e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        meet_latency_requirement=1
        request_rate=inf
    fi

    if (( ! meet_latency_requirement )); then
    # start from request-rate as int(throughput) + 1
        request_rate=$((${throughput%.*} + 1))
        while ((request_rate > 0)); do
            profile_index=$((profile_index+1))
            # clear prefix cache
            curl -X POST http://0.0.0.0:8004/reset_prefix_cache
            sleep 5
            bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
            python3 benchmarks/benchmark_serving.py \
                --backend vllm \
                --model $MODEL  \
                --dataset-name random \
                --random-input-len $adjusted_input_len \
                --random-output-len $OUTPUT_LEN \
                --ignore-eos \
                --disable-tqdm \
                --request-rate $request_rate \
                --percentile-metrics ttft,tpot,itl,e2el \
                --goodput e2el:$MAX_LATENCY_ALLOWED_MS \
                --num-prompts 100 \
                --random-prefix-len $prefix_len \
                --port 8004 &> "$bm_log"
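
As a sanity check on the result below (illustrative only, not part of the change), the expected prefix cache hit rate follows directly from the split computed above:

    # Illustrative check, reusing prefix_len and adjusted_input_len from above:
    # expected hit rate ~ prefix tokens / total prompt tokens once the cache is warm.
    expected_hit_pct=$(( 100 * prefix_len / (prefix_len + adjusted_input_len) ))
    echo "expected prefix cache hit rate ~${expected_hit_pct}%"   # 50% here; the logs below report ~51%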

Test Result

Logs report a successful run with the expected prefix cache hit rate:

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:39676 - "GET /health HTTP/1.1" 200 OK
INFO:     127.0.0.1:55424 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 07-22 09:39:08 [api_server.py:1012] Starting profiler...
INFO 07-22 09:39:09 [api_server.py:1014] Profiler started.
INFO:     127.0.0.1:55428 - "POST /start_profile HTTP/1.1" 200 OK
INFO:     127.0.0.1:55434 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55440 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55446 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55458 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55466 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:55482 - "POST /v1/completions HTTP/1.1" 200 OK
...
INFO 07-22 09:39:21 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 163.0 tokens/s, Running: 10 reqs, Waiting: 990 reqs, GPU KV cache usage: 19.0%, Prefix cache hit rate: 50.9%
INFO 07-22 09:39:31 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 158.9 tokens/s, Running: 10 reqs, Waiting: 980 reqs, GPU KV cache usage: 18.2%, Prefix cache hit rate: 51.0%
INFO 07-22 09:39:41 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 157.0 tokens/s, Running: 10 reqs, Waiting: 970 reqs, GPU KV cache usage: 17.4%, Prefix cache hit rate: 51.1%
INFO 07-22 09:39:51 [loggers.py:118] Engine 000: Avg prompt throughput: 1500.0 tokens/s, Avg generation throughput: 156.0 tokens/s, Running: 10 reqs, Waiting: 960 reqs, GPU KV cache usage: 16.6%, Prefix cache hit rate: 51.1%
INFO 07-22 09:40:01 [loggers.py:118] Engine 000: Avg prompt throughput: 1499.9 tokens/s, Avg generation throughput: 158.0 tokens/s, Running: 10 reqs, Waiting: 950 reqs, GPU KV cache usage: 15.9%, Prefix cache hit rate: 51.1%

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the "performance" (Performance-related issues) label on Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug in the auto_tune.sh script where the total number of tokens for benchmarking was miscalculated when using prefix caching. The changes to adjust input_len and prefix_len are sound. I've provided one comment to make the calculation more robust against integer division truncation errors. The update to use python3 is also a good improvement.
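
For illustration, one way to address the truncation concern is to derive adjusted_input_len from prefix_len instead of computing both independently, so the two parts always sum exactly to INPUT_LEN (a sketch of the idea only; the follow-up commit may differ):

    # Sketch only: subtracting guarantees prefix_len + adjusted_input_len == INPUT_LEN
    # even when MIN_CACHE_HIT_PCT does not divide INPUT_LEN evenly.
    prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
    adjusted_input_len=$(( INPUT_LEN - prefix_len ))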

Collaborator

@yaochengji yaochengji left a comment


Thanks for catching this issue! Left a nit.

ericehanley and others added 2 commits July 22, 2025 12:39
Make adjusted_input_len calculation more robust

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Eric Hanley <[email protected]>
Contributor

@Chenyaaang Chenyaaang left a comment


Thanks for the fix, LGTM.

Collaborator

@yaochengji yaochengji left a comment


LGTM!

@yaochengji yaochengji enabled auto-merge (squash) July 22, 2025 19:16
@github-actions github-actions bot added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label on Jul 22, 2025
@vllm-bot vllm-bot merged commit 4f76a05 into vllm-project:main Jul 23, 2025
52 of 54 checks passed
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: qizixi <[email protected]>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: x22x22 <[email protected]>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jinzhen Lin <[email protected]>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Paul Pak <[email protected]>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Diego-Castan <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
… calculations. (vllm-project#21391)

Signed-off-by: Eric Hanley <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>