
Conversation

yankay
Contributor

@yankay yankay commented Mar 20, 2025

The request_latency_buckets histogram is currently capped at 60 seconds. However, many requests to large language models can take significantly longer than this threshold. While p90 and p99 are commonly used to analyze latency distributions in histograms, the top bucket’s cap limits the accuracy of these metrics for higher latencies.
To address this, additional buckets should be added to the request_latency_buckets metric, such as [120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0]. This would enable vLLM to support p99 latency estimation for requests up to 8 minutes (480 seconds) out of the box.

At the same time, this PR adds buckets for request_latency, time_to_first_token, and time_per_output_token.
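To make the intent concrete, here is a minimal sketch using prometheus_client directly. The metric name, label set, and the pre-60-second bucket values are illustrative stand-ins rather than vLLM's actual registration code; only the new long-tail bucket values come from this PR.

from prometheus_client import Histogram

# Existing buckets are capped at 60 s; the extended tail lets
# histogram_quantile() resolve p90/p99 for requests that run for minutes.
REQUEST_LATENCY_BUCKETS = [
    1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0,       # illustrative existing range
    120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0,  # extension proposed in this PR
]

request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",          # name modeled on vLLM's metric, for illustration
    "End-to-end request latency in seconds.",
    labelnames=["model_name"],
    buckets=REQUEST_LATENCY_BUCKETS,
)

# Example p99 query in Prometheus:
#   histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))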

FIX #15167


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 20, 2025
@simon-mo simon-mo requested a review from markmc March 23, 2025 22:43
@simon-mo
Collaborator

This does increase the cardinality of the metrics we expose, but it can be reasonable. Letting @markmc decide!

Contributor

@smarterclayton smarterclayton Mar 24, 2025


I'm not aware of anyone running 16m or 32m long requests to a language model today. However, an 8B model with 128k context could in theory generate about 20 tokens per request per second (at a TPOT of 50ms at ~15 req/s on an H100, which is close to saturation in a 4:1, 100 output : 1000 input scenario). If there is only one very long output request running alongside a lot of prefill-heavy requests, you could imagine completing it in 6400s (128k / 20 output tokens per second), or about 106 minutes.

So there is at least some data to suggest that these bucket sizes may even be too small for today's models.
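A quick, purely illustrative check of that back-of-the-envelope estimate, using only the numbers quoted above:

# Worst-case single-request duration from the figures above (illustrative only).
tpot_s = 0.050                    # 50 ms time-per-output-token
tokens_per_s = 1 / tpot_s         # ~20 output tokens/s for one long request
output_tokens = 128_000           # generating a full 128k context
duration_s = output_tokens / tokens_per_s
print(duration_s, duration_s / 60)   # 6400.0 s, ~106.7 minutes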


If metric cardinality could be a problem, should we consider doubling or quadrupling the bucket size past 1 minute? Perhaps 4 minutes, 16 minutes, and 64 minutes?
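For illustration, one way to express that quadrupling scheme in code; the values reflect this suggestion (1, 4, 16, 64 minutes), not necessarily what the PR ends up merging:

# Quadruple the bucket width for everything past one minute.
long_tail_buckets = [60.0 * 4**i for i in range(4)]
print(long_tail_buckets)   # [60.0, 240.0, 960.0, 3840.0], i.e. 1, 4, 16, 64 minutes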

Contributor Author


Thanks @smarterclayton @RenaultAI
It has been changed :-)

@yankay
Contributor Author

yankay commented Mar 25, 2025

This does increase the cardinality of the metrics we expose, but it can be reasonable. Letting @markmc decide!

Hi @markmc

Would you please help review it? :-)

@smarterclayton
Contributor

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

@RenaultAI

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

I agree with you. I'm also interested in seeing the largest TTFT histogram bucket grow by at least 10x. For a 120k input prompt on the Llama 70B model on a L40s.8x, the TTFT can easily be in the minutes. The number is even larger for reasoning models. Thanks!

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch 3 times, most recently from 605000d to 02f26f2 on April 1, 2025 08:46
@yankay
Contributor Author

yankay commented Apr 1, 2025

Not sure if this should be a separate PR, but histogram_time_to_first_token should use request_latency_buckets, since it will cover roughly the same range of durations as request_queue_time_seconds_bucket. If requests can take minutes, then so can time to first token (IIUC).

I agree with you. I'm also interested in seeing the largest TTFT histogram bucket grow by at least 10x. For a 120k input prompt on the Llama 70B model on a L40s.8x, the TTFT can easily be in the minutes. The number is even larger for reasoning models. Thanks!

Thanks @RenaultAI @smarterclayton

The histogram_time_to_first_token buckets have been extended with [20.0, 40.0, 80.0, 160.0, 640.0, 2560.0] :-)

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch 6 times, most recently from c9752a0 to 96f09ac on April 1, 2025 09:37
@markmc markmc added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 1, 2025
@markmc
Member

markmc commented Apr 1, 2025

looks good to me - I'm sure the buckets for these or other metrics will need more tweaking over time, but this is a nice improvement 👍

@markmc
Member

markmc commented Apr 1, 2025

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required
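For readers less familiar with Prometheus histograms: the _bucket series are cumulative, so per-bucket shares come from subtracting neighbouring counters. A tiny sketch using the two values quoted above (numbers copied from the snippet; illustrative only):

# Cumulative counts copied from the metrics above.
le_0_01, le_0_025 = 0.0, 7152.0
total = 7293.0                   # _count
share = (le_0_025 - le_0_01) / total
print(round(share, 3))           # ~0.981 -> ~98% land in the (0.01, 0.025] bucket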

@RenaultAI

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required

In case you want more data: I've been doing large-prompt testing, and it looks like at least 20% of all requests fall outside of the largest 2.5-second bucket.

vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="meta-llama/Llama-3.3-70B-Instruct"} 62225.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="meta-llama/Llama-3.3-70B-Instruct"} 654371.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="meta-llama/Llama-3.3-70B-Instruct"} 655316.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="meta-llama/Llama-3.3-70B-Instruct"} 655945.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="meta-llama/Llama-3.3-70B-Instruct"} 657994.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="meta-llama/Llama-3.3-70B-Instruct"} 660443.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="meta-llama/Llama-3.3-70B-Instruct"} 928603.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="meta-llama/Llama-3.3-70B-Instruct"} 928838.0
vllm:time_per_output_token_seconds_count{model_name="meta-llama/Llama-3.3-70B-Instruct"} 928838.0
vllm:time_per_output_token_seconds_sum{model_name="meta-llama/Llama-3.3-70B-Instruct"} 527695.3850569725
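Applying the same cumulative-bucket arithmetic to the per-token counters above (illustrative only; these are per-sample fractions, which is not the same thing as the per-request share mentioned above):

# From the counters above: how the per-token TPOT samples distribute in the tail.
le_1_0, le_2_5, total = 660443.0, 928603.0, 928838.0
print((le_2_5 - le_1_0) / total)   # ~0.289 -> ~29% of samples land in (1.0, 2.5]
print((total - le_2_5) / total)    # ~0.00025 -> only ~0.03% of samples exceed 2.5 s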

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 96f09ac to 296a1e5 on April 2, 2025 02:58
@yankay yankay changed the title [Metrics] Add bucket for request_latency_buckets [Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token Apr 2, 2025
@yankay
Contributor Author

yankay commented Apr 2, 2025

As an example of where further tweaks might be needed - with Llama-3.1-8B-Instruct, A100, V1, ngram spec decoding, and sharegpt dataset:

vllm:time_per_output_token_seconds_bucket{engine="0",le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_per_output_token_seconds_bucket{engine="0",le="0.025",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7152.0
...
vllm:time_per_output_token_seconds_count{engine="0",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7293.0

i.e. ~98% of TPOT measurements fall into a single bucket ... which suggests more granularity is required

Thanks @markmc @RenaultAI

Added buckets [5.0, 7.5, 10.0, 20.0, 40.0, 80.0] to time_per_output_token_seconds for better performance tracking. 🚀

@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 296a1e5 to 704ef75 on April 2, 2025 04:46
@yankay yankay force-pushed the add-metrics-bucket-for-request-latency branch from 704ef75 to 6bcef9d on April 3, 2025 01:26
@yankay
Contributor Author

yankay commented Apr 3, 2025

Hi @robertgshaw2-redhat
Would you please help review it? :-)

@robertgshaw2-redhat robertgshaw2-redhat merged commit 86fc232 into vllm-project:main Apr 7, 2025
35 checks passed
lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 7, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
… `time_per_output_token` (vllm-project#15202)

Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
… `time_per_output_token` (vllm-project#15202)

Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Mu Huai <[email protected]>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1


Development

Successfully merging this pull request may close these issues.

[Bug]: vllm:request_inference_time_seconds_bucket has too few buckets for long inference requests

6 participants