See real GPU utilization in Grafana

Until now, the only GPU signal you had was whether a GPU was claimed by a pod, not how hard it was actually working. A node could report its GPUs as “fully used” while the cards sat nearly idle, which made it hard to tell whether your GPU nodepool was the right size. We’ve fixed that.

Every cluster now runs NVIDIA’s DCGM Exporter, whose pods are scheduled automatically on GPU nodes. It collects real per-GPU telemetry (utilization, GPU memory used, and clock speeds) and ships it straight into your existing Prometheus and Grafana.

What you get

Two ready-made Grafana dashboards: Kubernetes / GPU Resources / Nodes (per-GPU and per-node views) and Kubernetes / GPU Resources / Pods (utilization and memory broken down by the pod using each GPU). You also get metrics to build your own dashboards and alerts on, such as:

DCGM_FI_DEV_GPU_UTIL (GPU utilization)
DCGM_FI_DEV_MEM_COPY_UTIL (memory bandwidth utilization)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE (GPU memory used / free)
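
If you want to act on these metrics outside Grafana, here is a minimal Python sketch that asks Prometheus for each GPU's average DCGM_FI_DEV_GPU_UTIL over the last hour and lists the GPUs below a threshold. It is an illustration under assumptions, not part of the platform: PROMETHEUS_URL is a placeholder, the 10% threshold is arbitrary, the helper names are hypothetical, and the label names (Hostname, gpu, pod) are the ones DCGM Exporter commonly attaches but may differ in your cluster.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder address; replace with your own Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

# Average utilization (percent) per GPU series over the last hour.
QUERY = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])"


def fetch_instant_query(base_url, promql, timeout=10):
    """Run an instant query via the Prometheus HTTP API and return the parsed JSON."""
    params = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def find_idle_gpus(response, threshold_percent=10.0):
    """Return GPUs whose value is below threshold_percent, least busy first.

    `response` is the JSON body of a Prometheus instant query (a vector result).
    Label names are assumptions; missing labels fall back to defaults.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % response.get("error"))

    idle = []
    for series in response["data"]["result"]:
        labels = series.get("metric", {})
        # Instant-vector samples are [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value < threshold_percent:
            idle.append({
                "node": labels.get("Hostname", "unknown"),
                "gpu": labels.get("gpu", "unknown"),
                "pod": labels.get("pod", ""),
                "avg_util": value,
            })
    return sorted(idle, key=lambda g: g["avg_util"])
```

Calling find_idle_gpus(fetch_instant_query(PROMETHEUS_URL, QUERY)) returns one entry per underused GPU, which is a quick way to check whether a GPU nodepool is oversized. The same PromQL expression can back a Grafana panel or an alert rule.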

Resources

Using GPU accelerated instances, see the “Monitoring GPU usage” section.
Our monitoring stack and Grafana documentation.