Grafana

Grafana is deployed as part of the monitoring stack and wired to Prometheus, Loki, CloudWatch, Alertmanager, and (optionally) Tempo by default. It’s exposed at https://grafana.<cluster_fqdn> and authenticated through DEX.

This page collects the day-to-day tasks you may need to perform on Grafana: adjusting configuration via the cluster definition, shipping dashboards, and firing alerts.

Configure Grafana via the cluster definition

All Grafana configuration lives under spec.cluster_monitoring.grafana in your cluster definition. The schema descriptions in .vscode/cluster-definition-schema-eks.yaml are authoritative for per-field details; the recipes below cover the common cases.

Ship dashboards as ConfigMaps

Beyond creating dashboards directly in the Grafana UI, you can version-control them as ConfigMaps — useful if you want them deployed alongside your application. Grafana picks up any ConfigMap labeled grafana_dashboard (any value will do):

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    grafana_dashboard: application
  name: grafana-dashboard-myapplication
  namespace: mynamespace
data:
  grafana-dashboard-myapplication.json: |
    { <Grafana dashboard json> }

Make sure the dashboard JSON references an existing datasource UID: "Prometheus" for the built-in Prometheus datasource, or the uid of any datasource you declared yourself.
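For reference, this is roughly what that reference looks like inside the dashboard JSON. The panel title and query below are illustrative; only the datasource block matters:

```json
{
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "Prometheus" },
      "targets": [
        { "refId": "A", "expr": "sum(rate(nginx_ingress_controller_requests[5m]))" }
      ]
    }
  ]
}
```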

For inspiration, have a look at the dashboard ConfigMaps already deployed in the cluster:

kubectl get configmaps -l grafana_dashboard --all-namespaces

Add custom datasources

Beyond the built-in Prometheus/Loki/CloudWatch/Alertmanager/Tempo datasources, you can add your own in two ways.

As a ConfigMap (declarative)

Similar to dashboards — create a ConfigMap labeled grafana_datasource (any value will do) containing a Grafana datasource provisioning YAML. The Grafana sidecar picks it up and writes it into Grafana’s provisioning directory.

Simple example — point at an Elasticsearch cluster reachable from the EKS cluster (e.g. an AWS OpenSearch domain):

apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-logs-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Elasticsearch-Logs
        type: elasticsearch
        uid: elasticsearch-logs
        url: https://vpc-logs.eu-west-1.es.amazonaws.com
        access: proxy
        database: "logs-*"
        editable: false
        jsonData:
          esVersion: "8.0.0"
          timeField: "@timestamp"
          logMessageField: message
          logLevelField: level

More advanced — multi-tenant Loki, where each tenant gets its own datasource with a distinct X-Scope-OrgID header, optionally scoped to a specific Grafana org:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-tenants-datasource
  namespace: mynamespace
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki-TenantA
        type: loki
        uid: loki-tenant-a
        url: http://loki-read:3100
        access: proxy
        orgId: 2
        editable: false
        jsonData:
          httpHeaderName1: X-Scope-OrgID
        secureJsonData:
          httpHeaderValue1: tenant-a

Each entry follows Grafana’s datasource provisioning schema, so any field (jsonData, secureJsonData, orgId, …) is supported.
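The same schema also lets you clean up after yourself: when you rename or retire a provisioned datasource, the old entry lingers in Grafana until it's deleted explicitly. Grafana's provisioning format supports a deleteDatasources list for this (the datasource name below is illustrative):

```yaml
apiVersion: 1
deleteDatasources:
  - name: Elasticsearch-Logs
    orgId: 1
```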

Important

The datasource sidecar runs as an initContainer only (live reload is incompatible with our OAuth setup), so Grafana must be restarted to pick up a new or changed ConfigMap.

kubectl rollout restart statefulset cluster-monitoring-grafana -n infrastructure

Warning

Values in secureJsonData are stored in the ConfigMap in plaintext. Don’t commit real secrets to git — encrypt sensitive ConfigMaps at rest (sops, sealed-secrets) or add them manually via the UI instead.

Manually via the Grafana UI

Add a datasource under Connections → Data sources in the Grafana UI. Grafana’s database lives on the StatefulSet’s PVC (cluster_monitoring.grafana.persistence is enabled by default), so manually-created datasources persist across pod restarts.

This is convenient for quick exploration or for datasources that need real secrets you don’t want in git. Caveat: it’s not declarative — the datasource won’t survive a PVC loss and won’t appear in another cluster; prefer ConfigMaps for anything long-lived.

Promote specific users to GrafanaAdmin

The DEX-backed [auth.generic_oauth] section is preset with the minimum required keys (enabled, client_id, scopes, OAuth URLs, etc.). To extend it — typically to map OAuth claims onto Grafana roles — use generic_oauth_extras:

spec:
  cluster_monitoring:
    grafana:
      generic_oauth_extras:
        role_attribute_path: "contains(email, 'alice@example.com') && 'GrafanaAdmin' || contains(email, 'bob@example.com') && 'GrafanaAdmin' || 'Editor'"
        allow_assign_grafana_admin: true

The expression elevates two specific users to GrafanaAdmin and defaults everyone else to Editor. See Grafana’s generic OAuth reference for the full list of supported keys.
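If your identity provider exposes a groups claim instead of matching on individual emails, the same mechanism can map a whole group onto the role. A sketch, assuming a group named platform-team is present in the OAuth userinfo (the groups scope is already requested by the preset configuration):

```yaml
spec:
  cluster_monitoring:
    grafana:
      generic_oauth_extras:
        role_attribute_path: "contains(groups[*], 'platform-team') && 'GrafanaAdmin' || 'Editor'"
        allow_assign_grafana_admin: true
```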

Important

Don’t override keys already set by the stack (enabled, auto_login, client_id, scopes, auth_url, token_url, api_url, skip_org_role_sync) via this field — doing so would break OAuth.

Add an extra OAuth provider

custom_auth_config adds a new grafana.ini auth section alongside the built-in DEX flow (auth.generic_oauth). For the provider’s secret, use custom_auth_secrets, which creates KMS-decrypted env vars in the Grafana pod.

Example for Entra ID (Azure AD):

spec:
  cluster_monitoring:
    grafana:
      custom_auth_config:
        auth.azuread:
          name: "Entra ID"
          enabled: true
          auto_login: false
          client_id: CLIENT_ID
          scopes: "openid email profile groups"
          auth_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/authorize"
          token_url: "https://login.microsoftonline.com/TENANT_ID/oauth2/v2.0/token"
          allowed_organizations: TENANT_ID
          use_pkce: true
      custom_auth_secrets:
        GF_AUTH_AZUREAD_CLIENT_SECRET: a-kms-encrypted-payload

Each value in custom_auth_secrets must be KMS-encrypted with the context k8s_stack=secrets.
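One way to produce such a payload is with the AWS CLI. A sketch, assuming a key alias alias/k8s-secrets (the alias and the plaintext file name are illustrative; the encryption context is the required one):

```shell
# Encrypt the client secret with the k8s_stack=secrets context;
# the base64 output is what goes into custom_auth_secrets.
aws kms encrypt \
  --key-id alias/k8s-secrets \
  --encryption-context k8s_stack=secrets \
  --plaintext fileb://client-secret.txt \
  --query CiphertextBlob \
  --output text
```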

Enable feature toggles

Experimental Grafana features are enabled via featureToggles (merged into the [feature_toggles] section of grafana.ini):

spec:
  cluster_monitoring:
    grafana:
      featureToggles:
        provisioning: true
        kubernetesDashboards: true

See Grafana’s feature-toggles reference for the full catalog.

Install plugins

Extra panel and datasource plugins can be installed by plugin ID:

spec:
  cluster_monitoring:
    grafana:
      plugins:
        - grafana-piechart-panel

Enable extra built-in dashboards

Skyscrapers bundles a set of optional dashboards you can enable by name:

spec:
  cluster_monitoring:
    grafana:
      extra_dashboards:
        - sqs

Fire alerts from Grafana

Beyond PrometheusRule, you can define alerts using Grafana’s alerting system, which is useful for multi-datasource queries (for example, correlating Loki logs with Prometheus metrics). The Skyscrapers-managed Grafana ships with a Main Alertmanager Contact Point wired to the cluster’s Alertmanager — use it so your Grafana alerts get the same routing as Prometheus alerts.

Start by building the alert in the Grafana UI, then export it as YAML and ship it as a ConfigMap labeled grafana_alert (any value will do). The following example triggers when unexpected HTTP status codes appear in the public nginx ingress logs:

kind: ConfigMap
apiVersion: v1
metadata:
  name: http-errors-nginx-ingress-alert-rule
  namespace: infrastructure
  labels:
    grafana_alert: "logs-nginx-http-errors"
data:
  rules.yaml: |
    groups:
      - name: logs-nginx-http-errors
        folder: Ingress Alerts
        interval: 5m
        rules:
          - title: Unexpected HTTP status codes discovered in the nginx-ingress logs
            condition: A
            data:
              - refId: A
                queryType: range
                relativeTimeRange:
                  from: 600
                  to: 0
                datasourceUid: P8E80F9AEF21F6940
                model:
                  datasource:
                    type: loki
                    uid: P8E80F9AEF21F6940
                  editorMode: code
                  expr: |-
                    sum by () (
                      count_over_time(
                        {app_kubernetes_io_name="nginx-ingress"}
                        | json
                        | httpRequest_status !~ "^2\\d\\d$"
                      [5m])
                    )
                  instant: true
                  intervalMs: 1000
                  maxDataPoints: 43200
                  queryType: range
                  refId: A
            noDataState: OK
            execErrState: KeepLast
            for: 5m
            annotations:
              summary: Unexpected HTTP status codes are found in the logs of our public nginx ingress
            labels:
              severity: "critical"
            isPaused: false
            notification_settings:
              receiver: Main Alertmanager

Note

Grafana alerts live in Grafana’s SQLite DB on the StatefulSet’s PVC, so they persist across pod restarts as long as cluster_monitoring.grafana.persistence stays enabled (default).
