Grafana
Grafana is deployed as part of the monitoring stack and comes pre-wired to Prometheus, Loki, CloudWatch, Alertmanager, and (optionally) Tempo. It’s exposed at https://grafana.<cluster_fqdn> and authenticated through DEX.
This page collects the day-to-day tasks you may need to perform on Grafana: adjusting configuration via the cluster definition, shipping dashboards, and firing alerts.
Configure Grafana via the cluster definition
All Grafana configuration lives under spec.cluster_monitoring.grafana in your cluster definition. The schema descriptions in .vscode/cluster-definition-schema-eks.yaml are authoritative for per-field details; the recipes below cover the common cases.
High-availability mode (opt-in)
By default Grafana runs as a single-replica StatefulSet backed by SQLite on a 10 GiB EBS volume. That setup is cheap (<$10/mo per cluster) and fine for most use cases, but Grafana is briefly unavailable during rolling upgrades, node drains, or AZ disruptions.
When you need zero-downtime Grafana, opt into HA mode:
spec:
  cluster_monitoring:
    grafana:
      ha:
        enabled: true
That flag swaps the StatefulSet for a multi-replica Deployment with HPA, topology-spread constraints across AZs, and a PodDisruptionBudget. The shared state moves into an external RDS Postgres instance with storage autoscaling, automated backups (retention follows your cluster velero.backup_ttl setting), and an AWS-Secrets-Manager-managed password. Killing one Grafana pod or losing an AZ no longer interrupts access.
Defaults and overrides
All fields are optional; the defaults below match what’s applied when only enabled: true is set:
spec:
  cluster_monitoring:
    grafana:
      ha:
        enabled: true
        autoscaling:
          min: 2 # min replicas (also the initial replica count)
          max: 5 # HPA scales up to this on CPU load (70%)
        database:
          instance_class: db.t4g.micro
          multi_az: true
          storage_min_gb: 20 # initial allocation; storage autoscales
          storage_max_gb: 100 # hard cap on autoscaled storage
          backup_window: "02:00-03:00" # UTC, RDS-format hh24:mi-hh24:mi
          maintenance_window: "Sun:03:30-Sun:04:30" # UTC, RDS format
Cost
Per-cluster breakdown (eu-west-1 on-demand, 720 hrs/mo):
| Item | Spec | 1. Single pod, no external DB | 2. HA deployment + RDS single-AZ | 3. HA + RDS Multi-AZ |
|---|---|---|---|---|
| Grafana pods | option 1: 1 replica; options 2-3: 2-5 replicas (HPA). Each ~100m CPU + 256-512 Mi mem on existing on-demand node capacity | ~$3-5 | ~$6-10 | ~$6-10 |
| Grafana PVC | option 1: 10 GiB gp3 (StatefulSet, SQLite + dashboards/plugins cache); options 2-3: none (state in RDS) | $0.93 | $0.00 | $0.00 |
| RDS instance | db.t4g.micro — $0.018/hr single-AZ, $0.036/hr Multi-AZ × 720h | — | $12.96 | $25.92 |
| RDS storage (gp3) | 20 GiB allocated × $0.115; Multi-AZ physically doubles to 40 GiB | — | $2.30 | $4.60 |
| RDS backup storage (14-day) | ~8 GiB used; within 20 GiB free tier | — | $0.00 | $0.00 |
| RDS Multi-AZ replication | included in Multi-AZ instance price | — | — | $0.00 |
| RDS Performance Insights | 7-day free tier | — | $0.00 | $0.00 |
| RDS log export to CloudWatch | ~0.5 GiB/mo, $0.50/GiB ingest + $0.03/GiB storage | — | ~$0.30 | ~$0.30 |
| Grafana ↔ RDS data transfer | options 2-3: ~1-2 GiB cross-AZ (pods spread across AZs, RDS primary in one AZ) | — | ~$0.01 | ~$0.03 |
| Total | | ~$4-6 | ~$22-26 | ~$37-41 |
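The totals in the last row can be re-derived from the rows above. The Python sketch below is only a sanity check and not part of the stack: it copies the monthly figures from the table and sums them per option. The names `OPTIONS` and `monthly_range` are made up for this illustration.

```python
# Sanity check for the cost table: all figures are copied from the rows above (USD/month).
HOURS_PER_MONTH = 720

# Per option: a (low, high) estimate for the Grafana pods plus the fixed line items.
OPTIONS = {
    "1. Single pod, no external DB": {
        "pods": (3.0, 5.0),
        "fixed": [0.93],  # 10 GiB gp3 PVC
    },
    "2. HA deployment + RDS single-AZ": {
        "pods": (6.0, 10.0),
        "fixed": [
            0.018 * HOURS_PER_MONTH,  # db.t4g.micro, single-AZ
            20 * 0.115,               # 20 GiB RDS storage
            0.30,                     # RDS log export to CloudWatch
            0.01,                     # Grafana <-> RDS cross-AZ transfer
        ],
    },
    "3. HA + RDS Multi-AZ": {
        "pods": (6.0, 10.0),
        "fixed": [
            0.036 * HOURS_PER_MONTH,  # db.t4g.micro, Multi-AZ
            40 * 0.115,               # storage is physically doubled
            0.30,                     # RDS log export to CloudWatch
            0.03,                     # Grafana <-> RDS cross-AZ transfer
        ],
    },
}


def monthly_range(option):
    """Return the (low, high) monthly cost for one option, rounded to cents."""
    fixed = sum(option["fixed"])
    low, high = option["pods"]
    return round(low + fixed, 2), round(high + fixed, 2)


for name, option in OPTIONS.items():
    low, high = monthly_range(option)
    print(f"{name}: ${low:.2f} to ${high:.2f} per month")
```

The computed ranges round to the ~$4-6, ~$22-26, and ~$37-41 shown in the Total row; the spread within each option comes entirely from the pod estimate.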
Background and design rationale (including the alternatives we ruled out: in-cluster CNPG, Grafana Cloud, AWS Managed Grafana) are covered in the Roadmap research.
Migrating an existing cluster to the HA setup
Flipping the flag rolls out the new Deployment against an empty Postgres, so anything that previously lived in Grafana’s embedded SQLite database (UI-created dashboards, users, orgs, annotations, alert configurations, library panels, datasources, …) won’t appear in the new Grafana automatically. Skyscrapers can run a one-shot in-cluster migration job that copies the SQLite data over. The script for this is located in our internal tools repository.
If everything you care about already lives in ConfigMaps (dashboards, datasources, alerts) or is recreated on first OAuth login (users, org membership, basic role), the migration step is unnecessary and you can start fresh.
Ship dashboards as ConfigMaps
Beyond creating dashboards directly in the Grafana UI, you can version-control them as ConfigMaps — useful if you want them deployed alongside your application. Grafana picks up any ConfigMap labeled grafana_dashboard (any value will do):
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    grafana_dashboard: application
  name: grafana-dashboard-myapplication
  namespace: mynamespace
data:
  grafana-dashboard-myapplication.json: |
    { <Grafana dashboard json> }
Make sure dashboards reference "Prometheus" (or one of your declared datasources) as the datasource UID.
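If you export dashboards from the UI as JSON, wrapping them by hand gets tedious. The following Python sketch shows one possible way to generate the ConfigMap above from a dashboard. The helper name `dashboard_configmap` and the sample dashboard are hypothetical; the manifest is printed as JSON, which `kubectl apply -f` accepts just like YAML.

```python
import json


def dashboard_configmap(name, namespace, dashboard, label_value="application"):
    """Wrap a dashboard (a dict) in a ConfigMap manifest carrying the grafana_dashboard label."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Any label value works; only the label key matters for pickup.
            "labels": {"grafana_dashboard": label_value},
        },
        # One key per dashboard file; the value is the dashboard JSON as a string.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


# Placeholder dashboard; in practice, load the JSON exported from the Grafana UI.
sample_dashboard = {"title": "My application", "panels": []}
manifest = dashboard_configmap(
    "grafana-dashboard-myapplication", "mynamespace", sample_dashboard
)
print(json.dumps(manifest, indent=2))
```

Redirect the output to a file and commit it next to your application manifests so the dashboard ships with the same release.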
You can get inspired by some of the dashboards already deployed in the cluster:
kubectl get configmaps -l grafana_dashboard --all-namespaces
Add custom datasources
Beyond the built-in Prometheus/Loki/CloudWatch/Alertmanager/Tempo datasources, you can add your own in two ways.
As a ConfigMap (declarative)
Similar to dashboards — create a ConfigMap labeled grafana_datasource (any value will do) containing a Grafana datasource provisioning YAML. The Grafana sidecar picks it up and writes it into Grafana’s provisioning directory.
Simple example — point at an Elasticsearch cluster reachable from the EKS cluster (e.g. an AWS OpenSearch domain):
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-logs-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Elasticsearch-Logs
        type: elasticsearch
        uid: elasticsearch-logs
        url: https://vpc-logs.eu-west-1.es.amazonaws.com
        access: proxy
        database: "logs-*"
        editable: false
        jsonData:
          esVersion: "8.0.0"
          timeField: "@timestamp"
          logMessageField: message
          logLevelField: level
More advanced: multi-tenant Loki, where each tenant gets its own datasource with a distinct X-Scope-OrgID header, optionally scoped to a specific Grafana org:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-tenants-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki-TenantA
        type: loki
        uid: loki-tenant-a
        url: http://loki-read:3100
        access: proxy
        orgId: 2
        editable: false
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: tenant-a
Each entry follows Grafana’s datasource provisioning schema, so any field (jsonData, secureJsonData, orgId, …) is supported.
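With more than a couple of tenants, the per-tenant entries are repetitive enough to generate. The Python sketch below renders the datasources.yaml payload for a list of tenants, following the same field layout as the multi-tenant Loki example. The function name, the second tenant (`tenant-b`), and its org ID are assumptions made for illustration.

```python
def tenant_datasources_yaml(tenants, url="http://loki-read:3100"):
    """Render a datasources.yaml payload for (tenant_id, org_id) pairs."""
    lines = ["apiVersion: 1", "datasources:"]
    for tenant_id, org_id in tenants:
        # "tenant-a" -> "TenantA", matching the naming used in the example above.
        display = "".join(part.capitalize() for part in tenant_id.split("-"))
        lines += [
            f"  - name: Loki-{display}",
            "    type: loki",
            f"    uid: loki-{tenant_id}",
            f"    url: {url}",
            "    access: proxy",
            f"    orgId: {org_id}",
            "    editable: false",
            "    jsonData:",
            "      httpHeaderName1: X-Scope-OrgID",
            "    secureJsonData:",
            f"      httpHeaderValue1: {tenant_id}",
        ]
    return "\n".join(lines) + "\n"


# tenant-b and org 3 are hypothetical; substitute your own tenants and Grafana org IDs.
print(tenant_datasources_yaml([("tenant-a", 2), ("tenant-b", 3)]))
```

Paste the output under the datasources.yaml key of the ConfigMap. The sketch assumes tenant IDs are plain lowercase identifiers that need no YAML quoting.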
Important
The datasource sidecar runs as an initContainer only (live reload is incompatible with our OAuth setup), so Grafana must be restarted to pick up a new or changed ConfigMap.
# Default single-replica setup:
kubectl rollout restart statefulset cluster-monitoring-grafana -n infrastructure
# Under HA mode (see above):
kubectl rollout restart deployment cluster-monitoring-grafana -n infrastructure
Warning
Values in secureJsonData are stored in the ConfigMap in plaintext. Don’t commit real secrets to git — encrypt sensitive ConfigMaps at rest (sops, sealed-secrets) or add them manually via the UI instead.
Manually via the Grafana UI
Add a datasource under Connections → Data sources in the Grafana UI. Grafana persists this in its database — the StatefulSet’s EBS PVC by default, or the external RDS Postgres when HA mode is enabled — so manually-created datasources survive pod restarts.
This is convenient for quick exploration or for datasources that need real secrets you don’t want in git. Caveat: it’s not declarative — the datasource lives only in this cluster’s Grafana database and won’t appear elsewhere; prefer ConfigMaps for anything long-lived.
Promote specific users to GrafanaAdmin
The DEX-backed [auth.generic_oauth] section is preset with the minimum required keys (enabled, client_id, scopes, OAuth URLs, etc.). To extend it — typically to map OAuth claims onto Grafana roles — use generic_oauth_extras:
spec:
  cluster_monitoring:
    grafana:
      generic_oauth_extras:
        role_attribute_path: "contains(email, 'alice@example.com') && 'GrafanaAdmin' || contains(email, 'bob@example.com') && 'GrafanaAdmin' || 'Editor'"
        allow_assign_grafana_admin: true
The expression elevates two specific users to GrafanaAdmin and defaults everyone else to Editor. See Grafana’s generic OAuth reference for the full list of supported keys.
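The expression grows by one clause per admin, so it is easy to mistype once the list gets longer. As a minimal sketch, the hypothetical Python helper below assembles the same expression from a list of email addresses; it is an illustration, not something the stack provides.

```python
def role_attribute_path(admin_emails, default_role="Editor"):
    """Build a role_attribute_path that maps the given emails to GrafanaAdmin."""
    # One clause per admin, each in the same form as the example above.
    clauses = [
        f"contains(email, '{email}') && 'GrafanaAdmin'" for email in admin_emails
    ]
    # Everyone who matches no clause falls through to the default role.
    clauses.append(f"'{default_role}'")
    return " || ".join(clauses)


print(role_attribute_path(["alice@example.com", "bob@example.com"]))
```

For the two addresses above this reproduces the expression from the example. The helper does no escaping, so it assumes the addresses contain no single quotes.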
Important
Don’t override keys already set by the stack (enabled, auto_login, client_id, scopes, auth_url, token_url, api_url, skip_org_role_sync) via this field — doing so would break OAuth.
Add an extra OAuth provider
custom_auth_config adds a new grafana.ini auth section alongside the built-in DEX flow (auth.generic_oauth). For the provider’s secret, use custom_auth_secrets, which creates KMS-decrypted env vars in the Grafana pod.
Example for Entra ID (Azure AD):
spec:
  cluster_monitoring:
    grafana:
      custom_auth_config:
        auth.azuread:
          name: "Entra ID"
          enabled: true
          auto_login: false
          client_id: CLIENT_ID
          scopes: "openid email profile groups"
          auth_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/authorize"
          token_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token"
          allowed_organizations: TENANT_ID
          use_pkce: true
      custom_auth_secrets:
        GF_AUTH_AZUREAD_CLIENT_SECRET: a-kms-encrypted-payload
Each value in custom_auth_secrets must be KMS-encrypted with the context k8s_stack=secrets.
Enable feature toggles
Experimental Grafana features are enabled via featureToggles (merged into the [feature_toggles] section of grafana.ini):
spec:
  cluster_monitoring:
    grafana:
      featureToggles:
        provisioning: true
        kubernetesDashboards: true
See Grafana’s feature-toggles reference for the full catalog.
Install plugins
spec:
  cluster_monitoring:
    grafana:
      plugins:
        - grafana-piechart-panel
Enable extra built-in dashboards
Skyscrapers bundles a set of optional dashboards you can enable by name:
spec:
  cluster_monitoring:
    grafana:
      extra_dashboards:
        - sqs
Fire alerts from Grafana
Beyond PrometheusRule, you can define alerts using Grafana’s alerting system, which is useful for multi-datasource queries (for example, correlating Loki logs with Prometheus metrics). The Skyscrapers-managed Grafana ships with a Main Alertmanager Contact Point wired to the cluster’s Alertmanager — use it so your Grafana alerts get the same routing as Prometheus alerts.
Start by building the alert in the Grafana UI, then export it as YAML and ship it as a ConfigMap labeled grafana_alert (any value will do). The following example triggers when unexpected HTTP status codes appear in the public nginx ingress logs:
kind: ConfigMap
apiVersion: v1
metadata:
  name: http-errors-nginx-ingress-alert-rule
  namespace: infrastructure
  labels:
    grafana_alert: "logs-nginx-http-errors"
data:
  rules.yaml: |
    groups:
      - name: logs-nginx-http-errors
        folder: Ingress Alerts
        interval: 5m
        rules:
          - title: Unexpected HTTP status codes discovered in the nginx-ingress logs
            condition: A
            data:
              - refId: A
                queryType: range
                relativeTimeRange:
                  from: 600
                  to: 0
                datasourceUid: P8E80F9AEF21F6940
                model:
                  datasource:
                    type: loki
                    uid: P8E80F9AEF21F6940
                  editorMode: code
                  expr: |-
                    sum by () (
                      count_over_time(
                        {app_kubernetes_io_name="nginx-ingress"}
                        | json
                        | httpRequest_status !~ "^2\\d\\d$"
                        [5m])
                    )
                  instant: true
                  intervalMs: 1000
                  maxDataPoints: 43200
                  queryType: range
                  refId: A
            noDataState: OK
            execErrState: KeepLast
            for: 5m
            annotations:
              summary: Unexpected HTTP status codes are found in the logs of our public nginx ingress
            labels:
              severity: "critical"
            isPaused: false
            notification_settings:
              receiver: Main Alertmanager
Note
Grafana alerts live in Grafana’s own database: SQLite on the StatefulSet’s PVC by default, or in RDS Postgres under HA mode. Either way they persist across pod restarts.
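Before shipping a rule like the nginx example, it helps to check which status codes its filter actually counts. The query keeps log lines whose httpRequest_status does not match the pattern ^2\d\d$ (the doubled backslashes in the rule are string escaping). The Python sketch below applies the same pattern to a few sample codes; the function name and the sample list are made up, and Python's `re` behaves the same as Loki's regex engine for a pattern this simple.

```python
import re

# The alert's pattern after unescaping: a three-digit status starting with 2.
EXPECTED_STATUS = re.compile(r"^2\d\d$")


def counts_as_unexpected(status):
    """Return True when a log line with this status would be counted by the alert query."""
    return EXPECTED_STATUS.search(status) is None


for status in ["200", "204", "301", "404", "502"]:
    verdict = "counted" if counts_as_unexpected(status) else "ignored"
    print(f"{status}: {verdict}")
```

Only 2xx responses are ignored, so redirects such as 301 are counted alongside 4xx and 5xx errors. If redirects are normal for your ingress, one option is to widen the pattern, for example to ^[23]\d\d$.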