Grafana

Grafana is deployed as part of the monitoring stack and wired to Prometheus, Loki, CloudWatch, Alertmanager, and (optionally) Tempo by default. It’s exposed at https://grafana.<cluster_fqdn> and authenticated through DEX.

This page collects the day-to-day tasks you may need to perform on Grafana: adjusting configuration via the cluster definition, shipping dashboards, and firing alerts.

Configure Grafana via the cluster definition

All Grafana configuration lives under spec.cluster_monitoring.grafana in your cluster definition. The schema descriptions in .vscode/cluster-definition-schema-eks.yaml are authoritative for per-field details; the recipes below cover the common cases.

High-availability mode (opt-in)

By default Grafana runs as a single-replica StatefulSet backed by SQLite on a 10 GiB EBS volume. That setup is cheap (<$10/mo per cluster) and fine for most use cases, but Grafana is briefly unavailable during rolling upgrades, node drains, or AZ disruptions.

When you need zero-downtime Grafana, opt into HA mode:

spec:
  cluster_monitoring:
    grafana:
      ha:
        enabled: true

That flag swaps the StatefulSet for a multi-replica Deployment with an HPA, topology-spread constraints across AZs, and a PodDisruptionBudget. The shared state moves into an external RDS Postgres instance with storage autoscaling, automated backups (retention follows your cluster’s velero.backup_ttl setting), and a password managed by AWS Secrets Manager. Killing one Grafana pod or losing an AZ no longer interrupts access.

Defaults and overrides

All fields are optional; the defaults below match what’s applied when only enabled: true is set:

spec:
  cluster_monitoring:
    grafana:
      ha:
        enabled: true
        autoscaling:
          min: 2          # min replicas (also the initial replica count)
          max: 5          # HPA scales up to this on CPU load (70%)
        database:
          instance_class: db.t4g.micro
          multi_az: true
          storage_min_gb: 20                         # initial allocation; storage autoscales
          storage_max_gb: 100                        # hard cap on autoscaled storage
          backup_window: "02:00-03:00"               # UTC, RDS-format hh24:mi-hh24:mi
          maintenance_window: "Sun:03:30-Sun:04:30"  # UTC, RDS format
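
Because every field is optional, an override only needs to list the keys that differ from the defaults. As a minimal sketch (the values are illustrative, not recommendations), the following keeps HA mode but runs RDS in a single AZ, which corresponds to option 2 in the cost table, and lowers the HPA ceiling:

```yaml
spec:
  cluster_monitoring:
    grafana:
      ha:
        enabled: true
        autoscaling:
          max: 3            # illustrative value; min stays at the default of 2
        database:
          multi_az: false   # single-AZ RDS; all other database fields keep their defaults
```

With multi_az: false the Grafana pods are still spread across AZs, but the database lives in a single AZ, so an outage of that AZ would presumably still interrupt Grafana.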

Cost

Per-cluster breakdown (eu-west-1 on-demand, 720 hrs/mo):

| Item | Spec | 1. Single pod, no external DB | 2. HA deployment + RDS single-AZ | 3. HA + RDS Multi-AZ |
| --- | --- | --- | --- | --- |
| Grafana pods | option 1: 1 replica; options 2-3: 2-5 replicas (HPA). Each ~100m CPU + 256-512 Mi mem on existing on-demand node capacity | ~$3-5 | ~$6-10 | ~$6-10 |
| Grafana PVC | option 1: 10 GiB gp3 (StatefulSet, SQLite + dashboards/plugins cache); options 2-3: none (state in RDS) | $0.93 | $0.00 | $0.00 |
| RDS instance | db.t4g.micro: $0.018/hr single-AZ, $0.036/hr Multi-AZ × 720h | - | $12.96 | $25.92 |
| RDS storage (gp3) | 20 GiB allocated × $0.115; Multi-AZ physically doubles to 40 GiB | - | $2.30 | $4.60 |
| RDS backup storage (14-day) | ~8 GiB used; within 20 GiB free tier | - | $0.00 | $0.00 |
| RDS Multi-AZ replication | included in Multi-AZ instance price | - | - | $0.00 |
| RDS Performance Insights | 7-day free tier | - | $0.00 | $0.00 |
| RDS log export to CloudWatch | ~0.5 GiB/mo, $0.50/GiB ingest + $0.03/GiB storage | - | ~$0.30 | ~$0.30 |
| Grafana ↔ RDS data transfer | options 2-3: ~1-2 GiB cross-AZ (pods spread across AZs, RDS primary in one AZ) | - | ~$0.01 | ~$0.03 |
| Total | | ~$4-6 | ~$22-26 | ~$37-41 |

Background and design rationale, including the alternatives we ruled out (in-cluster CNPG, Grafana Cloud, AWS Managed Grafana), are documented in the Roadmap research.

Migrating an existing cluster to the HA setup

Flipping the flag rolls out the new Deployment against an empty Postgres, so anything that previously lived in Grafana’s embedded SQLite database (UI-created dashboards, users, orgs, annotations, alert configurations, library panels, datasources, …) won’t appear in the new Grafana automatically. Skyscrapers can run a one-shot in-cluster migration job that copies the SQLite data over. The script for this is located in our internal tools repository.

If everything you care about already lives in ConfigMaps (dashboards, datasources, alerts) or is recreated on first OAuth login (users, org membership, basic role), the migration step is unnecessary and you can start fresh.

Ship dashboards as ConfigMaps

Beyond creating dashboards directly in the Grafana UI, you can version-control them as ConfigMaps — useful if you want them deployed alongside your application. Grafana picks up any ConfigMap labeled grafana_dashboard (any value will do):

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    grafana_dashboard: application
  name: grafana-dashboard-myapplication
  namespace: mynamespace
data:
  grafana-dashboard-myapplication.json: |
    { <Grafana dashboard json> }

Make sure dashboards reference "Prometheus" (or one of your declared datasources) as the datasource UID.
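
As a rough illustration of that datasource reference, a minimal dashboard could look like the sketch below. The dashboard title, uid, panel, and the up query are hypothetical placeholders; only the ConfigMap label and the "Prometheus" datasource UID come from this page.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    grafana_dashboard: application
  name: grafana-dashboard-myapplication
  namespace: mynamespace
# Hypothetical dashboard: title, uid, panel and query are placeholders.
# Both the panel and its target point at the datasource UID "Prometheus".
data:
  grafana-dashboard-myapplication.json: |
    {
      "title": "My application",
      "uid": "myapplication",
      "panels": [
        {
          "type": "timeseries",
          "title": "Targets up",
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "datasource": { "type": "prometheus", "uid": "Prometheus" },
          "targets": [
            {
              "refId": "A",
              "expr": "up",
              "datasource": { "type": "prometheus", "uid": "Prometheus" }
            }
          ]
        }
      ]
    }
```

In practice it is usually easier to build the dashboard in the UI and paste its exported JSON into the ConfigMap than to write the JSON by hand.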

You can get inspired by some of the dashboards already deployed in the cluster:

kubectl get configmaps -l grafana_dashboard --all-namespaces

Add custom datasources

Beyond the built-in Prometheus/Loki/CloudWatch/Alertmanager/Tempo datasources, you can add your own in two ways.

As a ConfigMap (declarative)

Similar to dashboards — create a ConfigMap labeled grafana_datasource (any value will do) containing a Grafana datasource provisioning YAML. The Grafana sidecar picks it up and writes it into Grafana’s provisioning directory.

Simple example — point at an Elasticsearch cluster reachable from the EKS cluster (e.g. an AWS OpenSearch domain):

apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-logs-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Elasticsearch-Logs
        type: elasticsearch
        uid: elasticsearch-logs
        url: https://vpc-logs.eu-west-1.es.amazonaws.com
        access: proxy
        database: "logs-*"
        editable: false
        jsonData:
          esVersion: "8.0.0"
          timeField: "@timestamp"
          logMessageField: message
          logLevelField: level

More advanced — multi-tenant Loki, where each tenant gets its own datasource with a distinct X-Scope-OrgID header, optionally scoped to a specific Grafana org:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-tenants-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki-TenantA
        type: loki
        uid: loki-tenant-a
        url: http://loki-read:3100
        access: proxy
        orgId: 2
        editable: false
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: tenant-a

Each entry follows Grafana’s datasource provisioning schema, so any field (jsonData, secureJsonData, orgId, …) is supported.
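
To make the "one datasource per tenant" pattern concrete, the sketch below extends the ConfigMap above with a second entry. The Loki-TenantB name, its uid, the tenant-b header value, and orgId 3 are hypothetical, and the example assumes Grafana orgs with IDs 2 and 3 already exist.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-tenants-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki-TenantA
        type: loki
        uid: loki-tenant-a
        url: http://loki-read:3100
        access: proxy
        orgId: 2
        editable: false
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: tenant-a
      # Hypothetical second tenant: same Loki endpoint, different header value and org.
      - name: Loki-TenantB
        type: loki
        uid: loki-tenant-b
        url: http://loki-read:3100
        access: proxy
        orgId: 3
        editable: false
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: tenant-b
```

Each entry needs its own name and uid; the only functional difference between the two is the X-Scope-OrgID value and the Grafana org that sees the datasource.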

Important

The datasource sidecar runs as an initContainer only (live reload is incompatible with our OAuth setup), so Grafana must be restarted to pick up a new or changed ConfigMap.

# Default single-replica setup:
kubectl rollout restart statefulset cluster-monitoring-grafana -n infrastructure
# Under HA mode (see above):
kubectl rollout restart deployment cluster-monitoring-grafana -n infrastructure

Warning

Values in secureJsonData are stored in the ConfigMap in plaintext. Don’t commit real secrets to git — encrypt sensitive ConfigMaps at rest (sops, sealed-secrets) or add them manually via the UI instead.

Manually via the Grafana UI

Add a datasource under Connections → Data sources in the Grafana UI. Grafana persists this in its database — the StatefulSet’s EBS PVC by default, or the external RDS Postgres when HA mode is enabled — so manually-created datasources survive pod restarts.

This is convenient for quick exploration or for datasources that need real secrets you don’t want in git. Caveat: it’s not declarative — the datasource lives only in this cluster’s Grafana database and won’t appear elsewhere; prefer ConfigMaps for anything long-lived.

Promote specific users to GrafanaAdmin

The DEX-backed [auth.generic_oauth] section is preset with the minimum required keys (enabled, client_id, scopes, OAuth URLs, etc.). To extend it — typically to map OAuth claims onto Grafana roles — use generic_oauth_extras:

spec:
  cluster_monitoring:
    grafana:
      generic_oauth_extras:
        role_attribute_path: "contains(email, 'alice@example.com') && 'GrafanaAdmin' || contains(email, 'bob@example.com') && 'GrafanaAdmin' || 'Editor'"
        allow_assign_grafana_admin: true

The expression elevates two specific users to GrafanaAdmin and defaults everyone else to Editor. See Grafana’s generic OAuth reference for the full list of supported keys.
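
Listing individual email addresses gets unwieldy as the team grows. A possible alternative, sketched below, maps group membership to roles. It assumes the token issued through DEX carries a groups claim, which depends on the connector and scopes the stack configures; the group names grafana-admins and grafana-editors are hypothetical.

```yaml
spec:
  cluster_monitoring:
    grafana:
      generic_oauth_extras:
        # Assumes a "groups" claim is present in the token; group names are placeholders.
        role_attribute_path: "contains(groups[*], 'grafana-admins') && 'GrafanaAdmin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'"
        allow_assign_grafana_admin: true
```

The expression is evaluated left to right: members of the first group become GrafanaAdmin, members of the second become Editor, and everyone else falls back to Viewer.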

Important

Don’t override keys already set by the stack (enabled, auto_login, client_id, scopes, auth_url, token_url, api_url, skip_org_role_sync) via this field — doing so would break OAuth.

Add an extra OAuth provider

custom_auth_config adds a new grafana.ini auth section alongside the built-in DEX flow (auth.generic_oauth). For the provider’s secret, use custom_auth_secrets, which creates KMS-decrypted env vars in the Grafana pod.

Example for Entra ID (Azure AD):

spec:
  cluster_monitoring:
    grafana:
      custom_auth_config:
        auth.azuread:
          name: "Entra ID"
          enabled: true
          auto_login: false
          client_id: CLIENT_ID
          scopes: "openid email profile groups"
          auth_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/authorize"
          token_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token"
          allowed_organizations: TENANT_ID
          use_pkce: true
      custom_auth_secrets:
        GF_AUTH_AZUREAD_CLIENT_SECRET: a-kms-encrypted-payload

Each value in custom_auth_secrets must be KMS-encrypted with the context k8s_stack=secrets.

Enable feature toggles

Experimental Grafana features are enabled via featureToggles (merged into the [feature_toggles] section of grafana.ini):

spec:
  cluster_monitoring:
    grafana:
      featureToggles:
        provisioning: true
        kubernetesDashboards: true

See Grafana’s feature-toggles reference for the full catalog.

Install plugins

spec:
  cluster_monitoring:
    grafana:
      plugins:
        - grafana-piechart-panel

Enable extra built-in dashboards

Skyscrapers bundles a set of optional dashboards you can enable by name:

spec:
  cluster_monitoring:
    grafana:
      extra_dashboards:
        - sqs

Fire alerts from Grafana

Beyond PrometheusRule, you can define alerts using Grafana’s alerting system, which is useful for multi-datasource queries (for example, correlating Loki logs with Prometheus metrics). The Skyscrapers-managed Grafana ships with a contact point named Main Alertmanager that is wired to the cluster’s Alertmanager. Use it so your Grafana alerts get the same routing as Prometheus alerts.

Start by building the alert in the Grafana UI, then export it as YAML and ship it as a ConfigMap labeled grafana_alert (any value will do). The following example triggers when unexpected HTTP status codes appear in the public nginx ingress logs:

kind: ConfigMap
apiVersion: v1
metadata:
  name: http-errors-nginx-ingress-alert-rule
  namespace: infrastructure
  labels:
    grafana_alert: "logs-nginx-http-errors"
data:
  rules.yaml: |
    groups:
        - name: logs-nginx-http-errors
          folder: Ingress Alerts
          interval: 5m
          rules:
            - title: Unexpected HTTP status codes discovered in the nginx-ingress logs
              condition: A
              data:
                - refId: A
                  queryType: range
                  relativeTimeRange:
                    from: 600
                    to: 0
                  datasourceUid: P8E80F9AEF21F6940
                  model:
                    datasource:
                        type: loki
                        uid: P8E80F9AEF21F6940
                    editorMode: code
                    expr: |-
                        sum by () (
                          count_over_time(
                            {app_kubernetes_io_name="nginx-ingress"}
                            | json
                            | httpRequest_status !~ "^2\\d\\d$"
                          [5m])
                        )
                    instant: true
                    intervalMs: 1000
                    maxDataPoints: 43200
                    queryType: range
                    refId: A
              noDataState: OK
              execErrState: KeepLast
              for: 5m
              annotations:
                summary: Unexpected HTTP status codes are found in the logs of our public nginx ingress
              labels:
                severity: "critical"
              isPaused: false
              notification_settings:
                receiver: Main Alertmanager

Note

Grafana alerts live in Grafana’s own database: SQLite on the StatefulSet’s PVC by default, or in RDS Postgres under HA mode. Either way they persist across pod restarts.
