
Conversation

yankay
Contributor

@yankay yankay commented Mar 20, 2025

The request_latency_buckets histogram is currently capped at 60 seconds. However, many requests to large language models can take significantly longer than this threshold. While p90 and p99 are commonly used to analyze latency distributions in histograms, the top bucket’s cap limits the accuracy of these metrics for higher latencies.
To address this, additional buckets should be added to the request_latency_buckets metric, such as [120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0]. This would enable vLLM to support p99 latency estimation for requests up to 8 minutes (480 seconds) out of the box.

At the same time, this PR adds buckets for request_latency, time_to_first_token, and time_per_output_token.
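To make the intent concrete, here is a minimal sketch using prometheus_client directly. The metric name, label set, and the pre-60-second bucket values are illustrative stand-ins rather than vLLM's actual registration code; only the new long-tail bucket values come from this PR.

from prometheus_client import Histogram

# Existing buckets are capped at 60 s; the extended tail lets
# histogram_quantile() resolve p90/p99 for requests that run for minutes.
REQUEST_LATENCY_BUCKETS = [
    1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0,       # illustrative existing range
    120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0,  # extension proposed in this PR
]

request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",          # name modeled on vLLM's metric, for illustration
    "End-to-end request latency in seconds.",
    labelnames=["model_name"],
    buckets=REQUEST_LATENCY_BUCKETS,
)

# Example p99 query in Prometheus:
#   histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))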

FIX #15167


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 20, 2025
@simon-mo simon-mo requested a review from markmc March 23, 2025 22:43
@simon-mo
Collaborator

This does increase the cardinality of the metrics we expose, but it can be reasonable. Letting @markmc decide!

Contributor

@smarterclayton smarterclayton Mar 24, 2025


I'm not aware of anyone running 16m or 32m long requests to a language model today. However, an 8B model with 128k context could in theory generate about 20 tokens per request per second (at a TPOT of 50ms at ~15 req/s on an H100, which is close to saturation in a 4:1, 100 output : 1000 input scenario). If there is only one very long output request running alongside a lot of prefill-heavy requests, you could imagine completing it in 6400s (128k / 20 output tokens per second), or about 106 minutes.

So there is at least some data to suggest that these bucket sizes may even be too small for today's models.
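A quick, purely illustrative check of that back-of-the-envelope estimate, using only the numbers quoted above:

# Worst-case single-request duration from the figures above (illustrative only).
tpot_s = 0.050                    # 50 ms time-per-output-token
tokens_per_s = 1 / tpot_s         # ~20 output tokens/s for one long request
output_tokens = 128_000           # generating a full 128k context
duration_s = output_tokens / tokens_per_s
print(duration_s, duration_s / 60)   # 6400.0 s, ~106.7 minutes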


If metric cardinality could be a problem, should we consider doubling or quadrupling the bucket size past 1 minute? Perhaps 4 minutes, 16 minutes, and 64 minutes?
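For illustration, one way to express that quadrupling scheme in code; the values reflect this suggestion (1, 4, 16, 64 minutes), not necessarily what the PR ends up merging:

# Quadruple the bucket width for everything past one minute.
long_tail_buckets = [60.0 * 4**i for i in range(4)]
print(long_tail_buckets)   # [60.0, 240.0, 960.0, 3840.0], i.e. 1, 4, 16, 64 minutes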

Contributor Author


Thanks @smarterclayton @RenaultAI
It has been changed :-)

@yankay
Contributor Author

yankay commented Mar 25, 2025

This does increase the cardinality of the metrics we expose, but it can be reasonable. Letting @markmc decide!

Hi @markmc

Would you please help review it? :-)

@smarterclayton
Contributor

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

@RenaultAI

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

I agree with you. I'm also interested in seeing the largest TTFT histogram bucket grow by at least 10x. For a 120k input prompt on the Llama 70B model on a L40s.8x, the TTFT can easily be in the minutes. The number is even larger for reasoning models. Thanks!

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch 3 times, most recently from 605000d to 02f26f2 on April 1, 2025 08:46
@yankay
Contributor Author

yankay commented Apr 1, 2025

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

I agree with you. I'm also interested in seeing the largest TTFT histogram bucket grow by at least 10x. For a 120k input prompt on the Llama 70B model on a L40s.8x, the TTFT can easily be in the minutes. The number is even larger for reasoning models. Thanks!

Thanks @RenaultAI @smarterclayton

The histogram_time_to_first_token buckets have been extended with [20.0, 40.0, 80.0, 160.0, 640.0, 2560.0] :-)

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch 6 times, most recently from c9752a0 to 96f09ac on April 1, 2025 09:37
@markmc markmc added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 1, 2025
@markmc
Member

markmc commented Apr 1, 2025

looks good to me - I'm sure the buckets for these or other metrics will need more tweaking over time, but this is a nice improvement 👍

@markmc
Member

markmc commented Apr 1, 2025

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required
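For readers less familiar with Prometheus histograms: the _bucket series are cumulative, so per-bucket shares come from subtracting neighbouring counters. A tiny sketch using the two values quoted above (numbers copied from the snippet; illustrative only):

# Cumulative counts copied from the metrics above.
le_0_01, le_0_025 = 0.0, 7152.0
total = 7293.0                   # _count
share = (le_0_025 - le_0_01) / total
print(round(share, 3))           # ~0.981 -> ~98% land in the (0.01, 0.025] bucket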

@RenaultAI

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required

In case you want more data: I've been doing large-prompt testing, and it looks like at least 20% of all requests fall outside of the largest 2.5-second bucket.

vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.3-70B-Instruct"} 62225.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.3-70B-Instruct"} 654371.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.3-70B-Instruct"} 655316.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.3-70B-Instruct"} 655945.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.3-70B-Instruct"} 657994.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.3-70B-Instruct"} 660443.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.3-70B-Instruct"} 928603.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.3-70B-Instruct"} 928838.0
vllm:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.3-70B-Instruct"} 928838.0
vllm:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.3-70B-Instruct"} 527695.3850569725
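Applying the same cumulative-bucket arithmetic to the per-token counters above (illustrative only; these are per-sample fractions, which is not the same thing as the per-request share mentioned above):

# From the counters above: how the per-token TPOT samples distribute in the tail.
le_1_0, le_2_5, total = 660443.0, 928603.0, 928838.0
print((le_2_5 - le_1_0) / total)   # ~0.289 -> ~29% of samples land in (1.0, 2.5]
print((total - le_2_5) / total)    # ~0.00025 -> only ~0.03% of samples exceed 2.5 s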

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 96f09ac to 296a1e5 on April 2, 2025 02:58
@yankay yankay changed the title [Metrics] Add bucket for request_latency_buckets [Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token Apr 2, 2025
@yankay
Contributor Author

yankay commented Apr 2, 2025

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required

Thanks @markmc @RenaultAI

Added buckets [5.0, 7.5, 10.0, 20.0, 40.0, 80.0] to time_per_output_token_seconds for better performance tracking. 🚀

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 296a1e5 to 704ef75 on April 2, 2025 04:46
@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 704ef75 to 6bcef9d on April 3, 2025 01:26
@yankay
Contributor Author

yankay commented Apr 3, 2025

Hi @robertgshaw2-redhat
Would you please help review it? :-)

@robertgshaw2-redhat robertgshaw2-redhat merged commit 86fc232 into vllm-project:main Apr 7, 2025
35 checks passed
lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 7, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
… `time_per_output_token` (vllm-project#15202)

Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
… `time_per_output_token` (vllm-project#15202)

Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Mu Huai <[email protected]>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1


Development

Successfully merging this pull request may close these issues.

[Bug]: vllm:request_inference_time_seconds_bucket has too few buckets for long inference requests

6 participants