
[Feature] metrics support #3534


Merged: 99 commits into InternLM:main on Jul 9, 2025

Conversation

@CUHKSZzxy (Collaborator) commented on May 9, 2025

Objective

Align with the vLLM v1 metrics system and beyond. Here are the key alignments:

  • Monotonic Timestamps:
    -- Uses time.perf_counter() for interval calculations, avoiding wall-clock drift issues.
  • Metric Types:
    -- Gauges: active requests, cache usage, etc.
    -- Counters: token totals, request success / failure counts, etc.
    -- Histograms: TTFT (Time-To-First-Token), TPOT (Time-Per-Output-Token, i.e. inter-token latency), end-to-end latency, etc.
  • Metrics Publishing:
    -- CLI logging
    -- Prometheus & Grafana
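The three metric types above map onto prometheus_client's Gauge, Counter, and Histogram. Here is a pared-down, illustrative sketch of their semantics (hypothetical stand-in classes, not the actual lmdeploy or prometheus_client implementations):

```python
import bisect

class Gauge:
    """A value that can go up or down, e.g. the number of running requests."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

class Counter:
    """A monotonically increasing total, e.g. generated tokens."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        assert amount >= 0, 'counters only go up'
        self.value += amount

class Histogram:
    """Bucketed observations, e.g. TTFT in seconds.

    Buckets are upper bounds, like Prometheus 'le' labels; the extra
    last slot counts observations above every bound (+Inf).
    """
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)
        self.total = 0.0

    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
        self.total += v
```

In the real system these would be prometheus_client objects registered with metric names and help strings; the sketch only shows the update semantics each type provides.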

We only record critical timestamps and events inside the engine process, without further processing there. Heavyweight metric calculation and publishing are kept out of the main loop to minimize overhead.
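A minimal sketch of this record-now, compute-later pattern (the class and function names are assumed for illustration, not the actual lmdeploy API):

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestTimestamps:
    """Raw per-request events; recording is just a perf_counter() read."""
    arrival: float = field(default_factory=time.perf_counter)
    first_token: Optional[float] = None
    finished: Optional[float] = None

    # Cheap calls inside the engine loop: each one stores a single timestamp.
    def mark_first_token(self):
        self.first_token = time.perf_counter()

    def mark_finished(self):
        self.finished = time.perf_counter()

# Heavyweight derivations happen outside the engine's main loop:
def e2e_latency(ts: RequestTimestamps) -> float:
    return ts.finished - ts.arrival
```

Because time.perf_counter() is monotonic, the derived intervals are immune to wall-clock adjustments such as NTP corrections.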

For convenient Grafana visualization and usage, the dashboard layout aligns with SGLang's.

TODO

  • Refactor: after the MP engine lands, things change: the global singleton context becomes local to each process. Carrying the information from the engine to the async engine seems the most convenient and least error-prone approach; otherwise we would have to perform IPC frequently.
  • Refactor: (1) avoid parameter passing (singleton context); (2) reduce computation overhead (high CPU overhead, but solvable with the MP engine).
  • Refactor: decouple prometheus_client; install / import it only when needed.
  • Update: add a user guide.
  • Refactor: reduce messy parameters; pack things into a class.
  • Feature: Grafana visualization.
  • Feature: expert information collection (deferred to another PR).
  • Refactor: minimize the modifications to the async engine's generate() and the engine's _async_loop_main().
  • Fix: use time.perf_counter().

Usage

Start the server with --enable-metrics

lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct --enable-metrics
  • Metrics Publishing - Logging
    With --enable-metrics, key metrics (e.g., finished / unfinished / running / waiting requests, token throughputs, cache usage) are printed to the terminal every 10 seconds.
    (screenshot: CLI metrics log)

  • Metrics Publishing - Prometheus & Grafana
    -- Raw Metrics
    Access the raw Prometheus metrics via http://localhost:23333/metrics/ .
    You can also curl the metrics endpoint (curl http://localhost:23333/metrics/) to view the raw Prometheus results. No extra setup is required for this step.
    (screenshot: raw Prometheus metrics)
    -- Prometheus Panel
    Access the Prometheus panel via http://localhost:9090 (9090 is the current default port for the Prometheus panel). Extra setup is required; please check the user guide for details.
    (screenshot: Prometheus panel)
    -- Grafana Panel
    Access the Grafana panel via http://localhost:3000 (3000 is the current default port for the Grafana panel). Extra setup is required; please check the user guide for details.
    (screenshot: Grafana panel)

Request Timeline

The following diagram depicts how we define and calculate time intervals during the request lifecycle; the definitions adhere to vLLM's.
(diagram: request timeline)
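As a rough sketch of the interval arithmetic behind the timeline (the formulas follow vLLM's common definitions; the argument names are illustrative):

```python
def ttft(arrival_time: float, first_token_time: float) -> float:
    """Time-To-First-Token: queueing plus prefill."""
    return first_token_time - arrival_time

def tpot(first_token_time: float, finish_time: float, num_output_tokens: int) -> float:
    """Time-Per-Output-Token (inter-token latency), averaged over decode steps."""
    decode_tokens = max(num_output_tokens - 1, 1)
    return (finish_time - first_token_time) / decode_tokens

def e2e_latency(arrival_time: float, finish_time: float) -> float:
    """End-to-end request latency."""
    return finish_time - arrival_time
```

For example, a request that arrives at t=0.0, emits its first token at t=0.5, and finishes its 21st token at t=2.5 has a TTFT of 0.5 s and an average TPOT of 0.1 s.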

Performance Impacts

  • Conclusion

Tested with Qwen2.5-0.5B / Qwen2.5-7B / Qwen2.5-32B: no obvious performance impact. (Requires #3627)

Check the following tables for output throughput details. We conducted tests using 1,000 prompts, with input length 1k and output length 1k. Each model was tested three times to reduce the impact of performance fluctuations.

  • Qwen2.5-0.5B, TP1

| W/O metrics (tokens/s) | W/ metrics (tokens/s) |
| --- | --- |
| 20387 | 20555 |
| 20341 | 20877 |
| 20746 | 20771 |

  • Qwen2.5-7B, TP1

| W/O metrics (tokens/s) | W/ metrics (tokens/s) |
| --- | --- |
| 8836 | 8721 |
| 8780 | 8736 |
| 8800 | 8723 |

  • Qwen2.5-32B, TP2

| W/O metrics (tokens/s) | W/ metrics (tokens/s) |
| --- | --- |
| 3019 | 3160 |
| 3167 | 3165 |
| 3189 | 3173 |
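As a quick sanity check, the per-model averages from the three tables above can be compared directly:

```python
# Throughput runs copied from the tables above: (without metrics, with metrics).
runs = {
    'Qwen2.5-0.5B (TP1)': ([20387, 20341, 20746], [20555, 20877, 20771]),
    'Qwen2.5-7B (TP1)': ([8836, 8780, 8800], [8721, 8736, 8723]),
    'Qwen2.5-32B (TP2)': ([3019, 3167, 3189], [3160, 3165, 3173]),
}

deltas = {}
for model, (without, with_metrics) in runs.items():
    base = sum(without) / len(without)
    deltas[model] = (sum(with_metrics) / len(with_metrics) - base) / base * 100
    print(f'{model}: {deltas[model]:+.2f}%')
```

All deltas land within roughly ±1.5%, i.e. inside normal run-to-run fluctuation, supporting the no-impact conclusion.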

Related Issues & PRs

CUHKSZzxy added 2 commits May 9, 2025 20:38
Conflicts:
	lmdeploy/messages.py
	lmdeploy/pytorch/engine/engine.py
	lmdeploy/pytorch/engine/engine_instance.py
	lmdeploy/pytorch/messages.py
	lmdeploy/pytorch/paging/scheduler.py
@CUHKSZzxy CUHKSZzxy added the WIP label May 9, 2025
@CUHKSZzxy CUHKSZzxy removed the WIP label May 26, 2025
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review May 26, 2025 13:24
@lvhan028 lvhan028 requested a review from RunningLeon June 19, 2025 13:42
@lvhan028 lvhan028 dismissed stale reviews from grimoire and RunningLeon June 19, 2025 14:08

The design has changed.

@CUHKSZzxy CUHKSZzxy removed the WIP label Jul 3, 2025
@lvhan028 lvhan028 mentioned this pull request Jul 7, 2025
```yaml
- job_name: lmdeploy
  static_configs:
    - targets:
        - '$host_ip:$api_server_port1' # <= Modify this
```
@RunningLeon (Collaborator) commented on Jul 8, 2025:

Can we configure all DP server URLs here and show the data on the Grafana board?

@RunningLeon (Collaborator) left a comment:

LGTM

@grimoire (Collaborator) left a comment:

LGTM

@lvhan028 lvhan028 merged commit 1e8ce56 into InternLM:main Jul 9, 2025
5 checks passed
Labels: enhancement (New feature or request)
5 participants