[Metrics] Add buckets for `request_latency`, `time_to_first_token` and `time_per_output_token` #15202
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This does increase the cardinality of the metrics we expose, but it can be reasonable. Letting @markmc decide!
vllm/engine/metrics.py
Outdated
I'm not aware of anyone running 16m or 32m long requests to a language model today. However, an 8B model with 128k context could in theory generate about 20 tokens per request per second (at a TPOT of 50ms at ~15 req/s on an H100, which is close to saturation in a 4:1 100 output:1000 output scenario). If there is only one very long output request running alongside a lot of prefill-heavy requests, you could imagine it completing in 6400s (128k / 20 output tokens per second), or roughly 106 minutes.
So there is at least some data to suggest that these bucket sizes may even be too small for today's models.
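A quick back-of-the-envelope check of that arithmetic (the 128k output length and ~20 tokens/s decode rate are the assumptions from the comment above, not measurements):

```python
# Rough sanity check of the worst-case request duration discussed above.
# Both figures are assumptions from the comment, not benchmark results.
max_output_tokens = 128_000        # a request that decodes the full 128k context
decode_rate_tok_per_s = 20         # ~20 output tokens/s per request under load

duration_s = max_output_tokens / decode_rate_tok_per_s
print(f"{duration_s:.0f} s ~= {duration_s / 60:.1f} minutes")   # 6400 s ~= 106.7 minutes
```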
If metric cardinality could be a problem, should we consider doubling or quadrupling the bucket size past 1 minute? Perhaps 4 minutes, 16 minutes, and 64 minutes?
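For illustration, a quadrupling scheme past one minute could look like this (the base bucket values here are placeholders, not vLLM's actual defaults):

```python
# Hypothetical example of bucket boundaries that quadruple past 60 s,
# as suggested above: 4 min, 16 min, 64 min. Values are illustrative only.
base_buckets = [1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
extended_buckets = base_buckets + [60.0 * 4**i for i in range(1, 4)]
print(extended_buckets[-3:])   # [240.0, 960.0, 3840.0]
```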
Thanks @smarterclayton @RenaultAI
It has been changed :-)
Not sure if this should be a separate PR, but
I agree with you. I'm also interested in seeing the largest TTFT histogram bucket increased by at least 10x. For a 120k input prompt to the Llama 70B model on an L40s.8x, the TTFT can easily be in the minutes. The number is even larger for reasoning models. Thanks!
(force-pushed from 605000d to 02f26f2)
Thanks @RenaultAI @smarterclayton The
(force-pushed from c9752a0 to 96f09ac)
Looks good to me - I'm sure the buckets for these or other metrics will need more tweaking over time, but this is a nice improvement 👍
As an example of where further tweaks might be needed - with the data I'm seeing, 98% of TPOT measurements are falling into a single bucket, which suggests more granularity is required.
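One way to spot this kind of skew from the raw Prometheus `_bucket` counters (the counts below are invented purely to show the calculation, not real vLLM output):

```python
# Cumulative bucket counts as Prometheus exposes them (le -> count).
# These numbers are made up for illustration; real values come from /metrics.
cumulative = {0.01: 50, 0.025: 100, 0.05: 8040, 0.075: 8090, float("inf"): 8100}
total = 8100

previous = 0
for le, count in cumulative.items():
    share = (count - previous) / total
    print(f"le={le}: {share:.1%} of TPOT observations")
    previous = count
# A single bucket holding ~98% of observations means quantile estimates
# inside that range are effectively interpolation guesses.
```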
In case you want more data: I've been doing large-prompt testing, and it looks like at least 20% of all requests fall outside of the largest 2.5-second bucket.
(force-pushed from 96f09ac to 296a1e5)
Thanks @markmc @RenaultAI Added
(force-pushed from 296a1e5 to 704ef75)
…utput_token Signed-off-by: Kay Yan <[email protected]>
(force-pushed from 704ef75 to 6bcef9d)
Hi @robertgshaw2-redhat
… `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <[email protected]>
… `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <[email protected]> Signed-off-by: Yang Wang <[email protected]>
… `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <[email protected]>
… `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <[email protected]> Signed-off-by: Mu Huai <[email protected]>
The `request_latency_buckets` histogram is currently capped at 60 seconds. However, many requests to large language models can take significantly longer than this threshold. While p90 and p99 are commonly used to analyze latency distributions in histograms, the top bucket's cap limits the accuracy of these metrics for higher latencies. To address this, additional buckets should be added to the `request_latency_buckets` metric, such as [120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0]. This would enable vLLM to support p99 latency estimation for requests up to 8 minutes (480 seconds) out of the box.

At the same time, the PR adds buckets for `request_latency`, `time_to_first_token` and `time_per_output_token`.

FIX #15167
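For illustration only, a minimal prometheus_client sketch of a request-latency histogram using the extended bucket list proposed above (the metric name and the base bucket values are hypothetical, not vLLM's actual definitions):

```python
from prometheus_client import Histogram

# Illustrative only: the bucket tail follows the proposal above; the metric
# name and base buckets are placeholders, not vLLM's real metric definition.
REQUEST_LATENCY_BUCKETS = [
    1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0,
    120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0,
]

request_latency = Histogram(
    "example_e2e_request_latency_seconds",
    "End-to-end request latency in seconds (illustrative).",
    buckets=REQUEST_LATENCY_BUCKETS,
)

# A 95-second request now lands in the 120 s bucket instead of overflowing
# into +Inf, so p99 estimates past 60 s become meaningful.
request_latency.observe(95.0)
```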