Using GPU accelerated instances

To use GPU accelerated instances in Kubernetes, there are some prerequisites and steps to follow. These cover both the configuration of the cluster and the application's Docker image.

Infrastructure pre-requisites

A Karpenter NodePool that can provision GPU accelerated instances, paired with an EC2NodeClass whose AMI uses the correct container runtime (ref: NVIDIA container runtime). Amazon EKS optimized AMIs are available with the NVIDIA container runtime pre-configured, for example amazon-eks-node-al2023-x86_64-nvidia-1.31-v2024112. You can check the available EKS optimized AMIs in the [amazon-eks-ami changelog](https://awslabs.github.io/amazon-eks-ami/CHANGELOG/).
A device plugin for the specific GPU type, e.g. the NVIDIA GPU device plugin or the AWS Neuron device plugin. It runs as a DaemonSet on the cluster and exposes the GPU resources to the cluster.
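
Once the device plugin is running, GPU nodes should advertise the nvidia.com/gpu resource in their allocatable capacity. The following is a minimal sketch of how that could be checked, assuming the node list has been exported with `kubectl get nodes -o json`. The function name `allocatable_gpus` is hypothetical and only used for illustration.

```python
import json

# Resource name exposed by the NVIDIA device plugin (as used in the test pod).
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list_json: str) -> dict:
    """Map node name -> allocatable GPU count for nodes that advertise GPUs.

    `node_list_json` is the raw output of `kubectl get nodes -o json`.
    """
    nodes = json.loads(node_list_json).get("items", [])
    result = {}
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports quantities as strings, e.g. "1".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result
```

An empty result on a cluster that has GPU nodes usually points at the device plugin DaemonSet not running on those nodes.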

To test the setup, you can use a simple test pod that will run on a GPU node.

ref: NVIDIA CUDA sample

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
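
To confirm that the test pod above actually ran to completion on a GPU node, its status can be inspected. The sketch below assumes the pod status was exported with `kubectl get pod gpu-test -o json`. The helper name `gpu_test_passed` is hypothetical, and the default container name is taken from the manifest above.

```python
import json


def gpu_test_passed(pod_json: str, container: str = "cuda-vector-add") -> bool:
    """Return True if the pod finished and the named container exited with code 0."""
    status = json.loads(pod_json).get("status", {})
    # With restartPolicy OnFailure, a pod whose containers all exit 0 ends up Succeeded.
    if status.get("phase") != "Succeeded":
        return False
    for cs in status.get("containerStatuses", []):
        if cs.get("name") == container:
            terminated = cs.get("state", {}).get("terminated")
            return terminated is not None and terminated.get("exitCode") == 0
    return False
```

A pod that stays Pending is not covered by this check. In that case the pod events are the place to look, since they show whether any node could satisfy the nvidia.com/gpu limit.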

Application pre-requisites

The application must be built with the correct CUDA libraries and drivers for the GPU type. The easiest way to do this is to use a base image that already has the correct libraries installed and configured. For NVIDIA GPUs, the nvidia/cuda images are a good starting point. For Neuron-based applications, the AWS Neuron Dockerfile reference is a good starting point.

Testing the GPU availability in the application

To test the GPU availability in the application, for CUDA, you can run a script that checks the availability of the GPU device:

import torch
print(torch.cuda.is_available())
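
A slightly more detailed variant of the check above can also report which devices are visible and cope with PyTorch being absent from the image. This is a minimal sketch, assuming PyTorch is the framework in use. The function name `cuda_report` is hypothetical.

```python
def cuda_report() -> dict:
    """Summarize CUDA visibility from inside the application container."""
    try:
        import torch
    except ImportError:
        # The image does not ship PyTorch, so nothing can be said about CUDA here.
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        # One entry per GPU visible to this container.
        devices = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "cuda_available": available, "devices": devices}


print(cuda_report())
```

If `cuda_available` is false on a GPU node, check that the pod requests nvidia.com/gpu and that the image's CUDA libraries match the node.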

For AWS Neuron, you can run the neuron-top tool to check that the Neuron device is available:

neuron-top
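
For a non-interactive check from application code, one option is to look for the Neuron device files directly. The sketch below assumes that the Neuron driver exposes devices as /dev/neuron0, /dev/neuron1, and so on inside the container. Both that path convention and the function name `neuron_devices` are assumptions made for illustration.

```python
import os
import re


def neuron_devices(dev_dir: str = "/dev") -> list:
    """List Neuron device file names (neuron0, neuron1, ...) found under dev_dir."""
    try:
        names = os.listdir(dev_dir)
    except FileNotFoundError:
        return []
    # Assumption: device files are named "neuron" followed by an index.
    return sorted(name for name in names if re.fullmatch(r"neuron\d+", name))
```

An empty list suggests the pod was not granted a Neuron device, for example because it does not request the resource exposed by the AWS Neuron device plugin.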

Monitoring GPU usage

Every cluster with a GPU nodepool automatically runs NVIDIA’s DCGM Exporter as a DaemonSet on the GPU nodes. It collects real per-GPU usage telemetry and ships it into the monitoring stack, so the metrics show up in Prometheus and Grafana alongside everything else. There’s nothing to enable: it’s on by default on GPU clusters and does nothing on clusters without GPU nodes.

Note

This is different from nvidia.com/gpu resource usage (e.g. karpenter_nodepools_usage), which only tells you whether a GPU has been claimed by a pod, not how hard it is actually working. A node can show its GPUs fully allocated while the cards sit nearly idle. Use the DCGM metrics below to see real utilization.
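
The difference between allocation and utilization can be made concrete with a small sketch. It assumes you have already collected, per GPU, whether it is claimed by a pod and its average DCGM_FI_DEV_GPU_UTIL over some window. The `GpuSample` type, the `claimed_but_idle` function, and the 10% threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    node: str
    gpu: str
    allocated: bool   # a pod has claimed this GPU via nvidia.com/gpu
    avg_util: float   # average DCGM_FI_DEV_GPU_UTIL over the window, in percent


def claimed_but_idle(samples, idle_below: float = 10.0):
    """Return GPUs that are allocated to a pod but average below the idle threshold."""
    return [s for s in samples if s.allocated and s.avg_util < idle_below]
```

GPUs returned by this function look fully used from an allocation point of view, yet they are candidates for right-sizing.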

Two ready-made Grafana dashboards ship with it: Kubernetes / GPU Resources / Nodes (per-GPU / per-node view) and Kubernetes / GPU Resources / Pods (utilization and memory broken down by the pod using each GPU). The most useful metrics for right-sizing a GPU nodepool are:

| Metric | Meaning |
| --- | --- |
| `DCGM_FI_DEV_GPU_UTIL` | GPU utilization (%) |
| `DCGM_FI_DEV_MEM_COPY_UTIL` | Memory bandwidth utilization (%) |
| `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` | Framebuffer (VRAM) memory used / free (MiB) |
| `DCGM_FI_DEV_SM_CLOCK` | SM clock frequency (MHz) |

You can build your own dashboards and Prometheus alerts on top of these metrics, the same way you would for any other workload.
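
As one example of building on these metrics, the sketch below derives a VRAM usage percentage per GPU from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE. It assumes the inputs are the JSON bodies of two Prometheus instant queries (the `/api/v1/query` endpoint) and that the series can be joined on a `UUID` label. The label name and both function names are assumptions. Using used + free as the total is an approximation.

```python
import json


def vector_by_label(response_json: str, label: str) -> dict:
    """Turn a Prometheus instant-query response into {label value: sample value}."""
    body = json.loads(response_json)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    out = {}
    for series in body["data"]["result"]:
        key = series["metric"].get(label)
        if key is not None:
            # Instant vectors carry [timestamp, "value-as-string"].
            out[key] = float(series["value"][1])
    return out


def vram_used_percent(used_json: str, free_json: str, label: str = "UUID") -> dict:
    """Approximate VRAM usage per GPU as used / (used + free) * 100."""
    used = vector_by_label(used_json, label)
    free = vector_by_label(free_json, label)
    result = {}
    for key in used.keys() & free.keys():
        total = used[key] + free[key]
        if total > 0:
            result[key] = 100.0 * used[key] / total
    return result
```

The same ratio can be written directly in PromQL for a dashboard panel or alert rule. The Python form is only meant to make the arithmetic explicit.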

GPU metrics are also attributed to the pod holding the GPU: the DCGM_FI_DEV_* series carry namespace, pod, and container labels (the “Kubernetes / GPU Resources / Pods” dashboard is built on these). This works when a GPU is dedicated to a single pod, which is the normal case. If you share a single GPU across pods with time-slicing or MPS, DCGM can’t split utilization per pod (an upstream constraint), so those metrics reflect the whole GPU rather than each sharing pod.

Note

The fine-grained profiling metrics (DCGM_FI_PROF_*, e.g. per-engine SM and tensor-core activity) are intentionally not collected, to avoid contending with the GPU health monitoring that already runs on these nodes. The utilization and memory metrics cover the common capacity-planning needs.

Last updated on June 17, 2026
