[Dashboard] Ray TPU metrics not rendered on dashboard #57829

@ryanaoleary

Description

What happened + What you expected to happen

When running a TPU workload with Prometheus/Grafana enabled in the RayCluster, the dashboard shows "Panel with id X not found" for panels 50-53. These are the TPU metrics panels added in PR #53898. The metrics are exposed by the tpu-device-plugin DaemonSet while libtpu is running, and Ray reads them from the address set in the TPU_DEVICE_PLUGIN_ADDR environment variable.
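For context, a minimal sketch of what that polling looks like from inside the worker pod (this is an illustration of the pattern, not Ray's actual implementation; the TPU metric names are the ones visible in the curl output further down):

import os
import urllib.request

# Address injected into the pod by the KubeRay TPU webhook, e.g. "10.130.0.98:2112".
addr = os.environ["TPU_DEVICE_PLUGIN_ADDR"]

# The device plugin serves Prometheus text format on /metrics.
with urllib.request.urlopen(f"http://{addr}/metrics", timeout=5) as resp:
    body = resp.read().decode()

# Keep only the TPU-specific series.
tpu_lines = [
    line
    for line in body.splitlines()
    if line.startswith(("tensorcore_utilization", "memory_bandwidth_utilization"))
]
print("\n".join(tpu_lines) if tpu_lines else "no TPU metrics exposed")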

I can see that the correct environment variables are set in the TPU worker Pod:

(base) ray@ray-serve-llm-tpu-tpu-singlehost-worker-gffj6:~$ set | grep TPU
KUBERAY_GEN_RAY_START_CMD='ray start  --address=ray-serve-llm-tpu-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=40000000000  --metrics-export-port=8080  --num-cpus=8  --resources='\''{"TPU":8}'\'' '
TPU_ACCELERATOR_TYPE=v6e-8
TPU_CHIPS_PER_HOST_BOUNDS=2,4,1
TPU_DEVICE_PLUGIN_ADDR=10.130.0.98:2112
TPU_DEVICE_PLUGIN_HOST_IP=10.130.0.98
TPU_HOST_BOUNDS=1,1,1
TPU_NAME=tpu-singlehost-0
TPU_RUNTIME_METRICS_PORTS=8431,8432,8433,8434,8435,8436,8437,8438
TPU_SKIP_MDS_QUERY=true
TPU_TOPOLOGY=2x4
TPU_TOPOLOGY_ALT=false
TPU_TOPOLOGY_WRAP=false,false,false
TPU_WORKER_HOSTNAMES=localhost
TPU_WORKER_ID=0

I'm also currently running a TPU workload, and I've verified that libtpu is writing logs, so I'd expect the metrics to be available. If I curl the metrics port from inside the TPU pod, I see:

(base) ray@ray-serve-llm-tpu-tpu-singlehost-worker-gffj6:~$ echo $TPU_DEVICE_PLUGIN_ADDR
10.130.0.98:2112
(base) ray@ray-serve-llm-tpu-tpu-singlehost-worker-gffj6:~$ curl $TPU_DEVICE_PLUGIN_ADDR/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.0001663
go_gc_duration_seconds{quantile="0.25"} 0.00026017
go_gc_duration_seconds{quantile="0.5"} 0.00029977
go_gc_duration_seconds{quantile="0.75"} 0.00034882
go_gc_duration_seconds{quantile="1"} 0.00054986
go_gc_duration_seconds_sum 0.042141375
go_gc_duration_seconds_count 135
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 38
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.23.11 X:boringcrypto"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 9.479656e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.263356152e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.535623e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 9.775667e+06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.778168e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 9.479656e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.8975104e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.4180352e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 59999
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.7459584e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 4.3155456e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.7606641220690265e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 9.835666e+06
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 216000
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 218400
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 464320
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 538560
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.6012496e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 4.240169e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 2.981888e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 2.981888e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 5.6448264e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 29
# HELP memory_bandwidth_utilization Memory bandwidth utilization of the TPU device
# TYPE memory_bandwidth_utilization gauge
memory_bandwidth_utilization{accelerator_id="4371783459163393063-0",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-1",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-2",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-3",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-4",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-5",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-6",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
memory_bandwidth_utilization{accelerator_id="4371783459163393063-7",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
# HELP memory_bandwidth_utilization_node Memory bandwidth utilization of the TPU device per node
# TYPE memory_bandwidth_utilization_node gauge
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-0",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-1",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-2",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-3",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-4",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-5",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-6",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
memory_bandwidth_utilization_node{accelerator_id="4371783459163393063-7",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 40.1
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 6.6523136e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.76065409994e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.307283456e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 1411
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP tensorcore_utilization Tensorcore percent utilization of the TPU device
# TYPE tensorcore_utilization gauge
tensorcore_utilization{accelerator_id="4371783459163393063-0",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-1",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-2",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-3",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-4",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-5",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-6",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
tensorcore_utilization{accelerator_id="4371783459163393063-7",container="ray-worker",make="cloud-tpu",model="tpu-v6e-slice",namespace="default",pod="ray-serve-llm-tpu-tpu-singlehost-worker-gffj6",tpu_topology="2x4"} 0
# HELP tensorcore_utilization_node Tensorcore percent utilization of the TPU device per node
# TYPE tensorcore_utilization_node gauge
tensorcore_utilization_node{accelerator_id="4371783459163393063-0",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-1",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-2",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-3",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-4",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-5",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-6",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0
tensorcore_utilization_node{accelerator_id="4371783459163393063-7",make="cloud-tpu",model="tpu-v6e-slice",tpu_topology="2x4"} 0

The utilization values are currently 0 because I don't have a workload running at the moment, but I'd expect the panels to at least render and show these values. I'm wondering whether I'm missing something in the setup or whether there's a bug in the default TPU dashboard panels.
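One way to narrow this down (an ad-hoc check, not an official debugging step) is to verify whether the Ray node itself re-exports any TPU series on its metrics export port, which is 8080 in the ray start command above; if nothing TPU-related shows up there, the data is being dropped on the Ray side rather than in Prometheus/Grafana:

import urllib.request

# Port taken from --metrics-export-port=8080 in the ray start command above.
# Run this inside the worker pod, or swap "localhost" for the pod IP.
url = "http://localhost:8080/metrics"

with urllib.request.urlopen(url, timeout=5) as resp:
    body = resp.read().decode()

# Assumption: Ray-exported TPU series contain "tpu" in the metric name.
tpu_series = [
    line for line in body.splitlines()
    if "tpu" in line.lower() and not line.startswith("#")
]
print(f"{len(tpu_series)} TPU series exported on the Ray metrics port")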

Versions / Dependencies

  • Ray 2.50.0
  • KubeRay v1.4.2

Reproduction script

  1. Install the latest version of the KubeRay TPU webhook, which sets TPU_DEVICE_PLUGIN_ADDR on Pods by default:
helm install kuberay-tpu-webhook oci://us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook-helm/kuberay-tpu-webhook --set tpuWebhook.image.tag=v1.2.6-gke.0
  2. Create a RayCluster with Prometheus/Grafana enabled and a TPU worker group:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-serve-llm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-llm:2.46.0-py311-cu124
          resources:
            limits:
              memory: 16Gi
            requests:
              cpu: 8 
              memory: 16Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 44217
            name: as-metrics # autoscaler
          - containerPort: 44227
            name: dash-metrics # dashboard
          env:
          - name: RAY_GRAFANA_IFRAME_HOST
            value: http://127.0.0.1:3000
          - name: RAY_GRAFANA_HOST
            value: http://prometheus-grafana.prometheus-system.svc:80
          - name: RAY_PROMETHEUS_HOST
            value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
          - name: TPU_DEVICE_PLUGIN_HOST_IP
            # The value was missing here; assuming the usual downward-API node IP.
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
  workerGroupSpecs:
  - replicas: 1
    groupName: tpu-singlehost
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-llm:2.46.0-py311-cu124
          resources:
            limits:
              cpu: "8"
              memory: 40G
              google.com/tpu: "8"
            requests:
              cpu: "8"
              memory: 40G
              google.com/tpu: "8"
          env:
          - name: JAX_PLATFORMS
            value: "tpu"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          cloud.google.com/gke-tpu-topology: 2x4
  3. Run a TPU workload (a minimal sketch is shown after this list).
  4. Follow the steps to configure Prometheus/Grafana and view the metrics in the Ray dashboard: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#step-10-access-prometheus-web-ui
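For step 3, a minimal hypothetical workload that keeps the TPUs busy long enough for utilization to show up (the "TPU" resource name matches the --resources='{"TPU":8}' generated by the webhook; this assumes JAX is available in the image):

import ray

ray.init()  # connects to the running cluster when submitted as a Ray job

@ray.remote(resources={"TPU": 8})
def burn_tpu() -> int:
    import jax
    import jax.numpy as jnp

    # Repeated matmuls so libtpu reports non-zero tensorcore utilization.
    x = jnp.ones((4096, 4096))
    for _ in range(1_000):
        x = jnp.dot(x, x) / 4096.0
    x.block_until_ready()
    return jax.device_count()

print("TPU devices seen by JAX:", ray.get(burn_tpu.remote()))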

Issue Severity

Medium: It is a significant difficulty but I can work around it.
