# Runbook

This is the Skyscrapers alerts runbook (or playbook), inspired by the upstream kubernetes-mixin runbook.

On this page you should find detailed information about specific alerts coming from your monitoring system. It's possible, though, that some alerts haven't been documented yet or that the information is incomplete or outdated. If you find a missing alert or inaccurate information, feel free to submit an issue or pull request.

In addition to the alerts listed on this page, there are other system alerts that are described in upstream runbooks, like the one linked above. Always follow the runbook_url annotation link (:notebook:) in the alert notification to get the most recent and up-to-date information about that alert.

## Kubernetes alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| MemoryOvercommitted | {{$labels.node}} is overcommitted by {{$value}}% | warning | Check actual memory usage. If it isn't high, adjust Pod resource requests; otherwise, add more nodes. |
| CPUUsageHigh | {{$labels.instance}} is using more than 90% CPU for >1h | warning | |
| NodeWithImpairedVolumes | EBS volumes are failing to attach to node {{$labels.node}} | critical | Check the AWS Console for stuck volumes. Detach the volume, delete the Pod, and untaint the node once resolved (example below). |
| ContainerExcessiveCPU | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} has had high CPU usage for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing the CPU request/limit. |
| ContainerExcessiveMEM | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} has had high memory usage for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing the memory request/limit. |
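
A minimal sketch of the usual checks for MemoryOvercommitted and the ContainerExcessive* alerts, plus the detach/cleanup steps for NodeWithImpairedVolumes. Node, pod and volume names are placeholders, and the taint key shown is an assumption; use whatever taint is actually present on the impaired node.

```sh
# Check actual node and per-container usage before touching requests/limits.
kubectl top node <node-name>
kubectl -n <namespace> top pod <pod-name> --containers

# NodeWithImpairedVolumes: find the stuck EBS attachment, detach it, delete the
# affected Pod and remove the taint once the node is healthy again.
aws ec2 describe-volumes --volume-ids <volume-id> --query 'Volumes[0].Attachments'
aws ec2 detach-volume --volume-id <volume-id> --force
kubectl -n <namespace> delete pod <pod-name>
kubectl taint node <node-name> NodeWithImpairedVolumes=true:NoSchedule-
```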

## ElasticSearch alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ElasticsearchExporterDown | The Elasticsearch metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs (example below). |
| ElasticsearchCloudwatchExporterDown | The Elasticsearch Cloudwatch metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs. |
| ElasticsearchClusterHealthYellow | Elasticsearch cluster health for {{ $labels.cluster }} is yellow. | warning | |
| ElasticsearchClusterHealthRed | Elasticsearch cluster health for {{ $labels.cluster }} is RED! | critical | |
| ElasticsearchClusterEndpointDown | Elasticsearch cluster endpoint for {{ $labels.cluster }} is DOWN! | critical | |
| ElasticsearchAWSLowDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} is low on free disk space | warning | See the extended description in the original doc. |
| ElasticsearchAWSNoDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} has no free disk space | critical | See the extended description in the original doc. |
| ElasticsearchIndexWritesBlocked | AWS Elasticsearch cluster {{ $labels.cluster }} is blocking incoming write requests | critical | See the AWS docs. |
| ElasticsearchLowDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} is low on free disk space | warning | |
| ElasticsearchNoDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} has no free disk space | critical | |
| ElasticsearchHeapTooHigh | JVM heap usage for cluster {{ $labels.cluster }} on node {{ $labels.node }} > 90% for 15m | warning | |
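
A minimal sketch of the exporter and cluster-health checks. The namespace and label selector are assumptions and `<es-endpoint>` is a placeholder for your cluster endpoint.

```sh
# Find the exporter pods and check their logs.
kubectl -n infrastructure get pods -l app=elasticsearch-exporter
kubectl -n infrastructure logs -l app=elasticsearch-exporter --tail=100

# Check cluster health and per-node disk allocation directly against the cluster.
curl -s "https://<es-endpoint>/_cluster/health?pretty"
curl -s "https://<es-endpoint>/_cat/allocation?v"
```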

## Fluent Bit alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| FluentbitDroppedRecords | Fluent Bit {{ $labels.pod }} is failing to save records to output {{ $labels.name }} | critical | Check the pod logs (example below). See Buffer_Size False. |
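
A quick way to spot the failing output in the affected pod's logs; the namespace and pod name are placeholders.

```sh
# Look for retry/error messages from the output plugin named in the alert.
kubectl -n <namespace> logs <fluent-bit-pod> --tail=500 | grep -Ei 'error|retry|failed'
```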

## Flux alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| FluxResourceNotReady | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} hasn't been READY for 15 minutes! | critical/info/warning | Use the Flux debugging docs (example below). |
| FluxResourceSuspended | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been SUSPENDED for 8 hours! | warning | Use flux resume ... to unsuspend. |
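
A minimal sketch using the flux CLI, assuming the resource kind is a Kustomization; substitute helmrelease, source git, etc. according to the alert labels.

```sh
# List everything that isn't ready, then drill into the failing resource.
flux get all -A --status-selector ready=false
flux logs --kind=Kustomization --name=<name> -n <namespace>
kubectl -n <namespace> describe kustomization <name>

# FluxResourceSuspended: resume the suspended resource.
flux resume kustomization <name> -n <namespace>
```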

## Loki alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| LokiDiscardingSamples | Loki is discarding ingested samples for reason {{ $labels.reason }} | critical | Increase ingestion_rate_mb or ingestion_burst_size_mb (example below). |
| LokiNotFlushingChunks | Loki writer {{ $labels.pod }} has not flushed any chunks for >40 min | critical | Check the writer pod logs and the S3 config. |
| LokiBackendCompactorFailing | Loki compactor {{ $labels.pod }} is failing | warning | Check the backend pod logs. |
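
A sketch of the usual checks, assuming a Loki simple-scalable install (the namespace and component labels may differ in your deployment); the values shown in the comment follow the upstream limits_config naming.

```sh
# Writer and backend logs for flush / compaction errors.
kubectl -n loki logs -l app.kubernetes.io/component=write --tail=200 | grep -i error
kubectl -n loki logs -l app.kubernetes.io/component=backend --tail=200 | grep -i compactor

# LokiDiscardingSamples: raise the ingestion limits in the Helm values, e.g.
#   limits_config:
#     ingestion_rate_mb: 8
#     ingestion_burst_size_mb: 16
```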

## MongoDB alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| MongodbMetricsDown | The MongoDB metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter service and endpoints. |
| MongodbLowConnectionsAvailable | Low connections available on {{$labels.instance}} | warning | Investigate query buildup (example below). |
| MongodbUnhealthyMember | A mongo node with issues has been detected on {{$labels.instance}} | critical | |
| MongodbReplicationLagWarning | Replication lag >10s on {{$labels.instance}} | warning | |
| MongodbReplicationLagCritical | Replication lag >60s on {{$labels.instance}} | critical | |
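
A minimal sketch of the replica-set and connection checks with mongosh; the connection string is a placeholder.

```sh
# Replica set member states and replication lag.
mongosh "mongodb://<host>:27017" --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'
mongosh "mongodb://<host>:27017" --eval 'rs.printSecondaryReplicationInfo()'

# MongodbLowConnectionsAvailable: current/available connection counts.
mongosh "mongodb://<host>:27017" --eval 'db.serverStatus().connections'
```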

## RDS alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| RDSCPUCreditBalanceLow | CPU credit balance for {{ $labels.dbinstance_identifier }} < 5! | critical | |
| RDSFreeableMemoryLow | RDS instance {{ $labels.dbinstance_identifier }} is low on memory (< 100 MB)! | critical | |
| RDSFreeStorageSpaceRunningLow | Disk storage is expected to fill up within 4 days for {{ $labels.dbinstance_identifier }} | critical | |
| RDSDiskQueueDepthHigh | Outstanding IO requests are high for {{ $labels.dbinstance_identifier }} | critical | |
| RDSCPUUsageHigh | CPU usage for {{ $labels.dbinstance_identifier }} > 95%. | info | |
| RDSReplicaLagHigh | Replica lag for {{ $labels.dbinstance_identifier }} > 30s. | warning | |
| RDSBurstBalanceLow | EBS BurstBalance for {{ $labels.dbinstance_identifier }} < 20%. | warning | Check for heavy IO queries and consider increasing IOPS (example below). |
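
A sketch for pulling the instance details and the recent BurstBalance trend from CloudWatch; the instance identifier is a placeholder and the date invocation assumes GNU date.

```sh
aws rds describe-db-instances --db-instance-identifier <dbinstance-identifier>

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name BurstBalance \
  --dimensions Name=DBInstanceIdentifier,Value=<dbinstance-identifier> \
  --start-time "$(date -u -d '3 hours ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum
```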

## Concourse alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ConcourseWorkersMismatch | Stale Concourse workers for >1h | critical | See the example below. |
| ConcourseWorkerCPUCreditBalanceLow | Minimum CPU credit balance of Concourse workers = 0 for 1h | critical | |
| ConcourseWorkerEBSIOBalanceLow | EBS IO balance of Concourse worker volumes = 0 for 1h | critical | |
| ConcourseEndpointDown | Concourse endpoint down for 5 min. | critical | |
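
A minimal sketch for ConcourseWorkersMismatch using the fly CLI; the target and worker names are placeholders.

```sh
# List workers and their state, then prune any that are stalled.
fly -t <target> workers
fly -t <target> prune-worker -w <worker-name>
```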

## Redshift alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| RedshiftExporterDown | Redshift metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs. |
| RedshiftHealthStatus | Redshift cluster not healthy for {{ $labels.cluster }}! | critical | Check the AWS Redshift dashboard (example below). |
| RedshiftMaintenanceMode | Redshift cluster in maintenance mode for {{ $labels.cluster }}! | warning | Maintenance mode may impact availability. |
| RedshiftLowDiskSpace | AWS Redshift cluster {{ $labels.cluster }} is low on free disk space | warning | Check and increase disk space if needed. |
| RedshiftNoDiskSpace | AWS Redshift cluster {{ $labels.cluster }} is out of free disk space | critical | Take immediate action to increase storage. |
| RedshiftCPUHigh | AWS Redshift cluster {{ $labels.cluster }} has been at max CPU for 30 min | warning | Investigate the cause and take action. |
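
A sketch for checking the cluster from the CLI; the cluster identifier is a placeholder and the date invocation assumes GNU date.

```sh
# Cluster status / maintenance details.
aws redshift describe-clusters --cluster-identifier <cluster-identifier>

# Recent disk usage trend.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Redshift --metric-name PercentageDiskSpaceUsed \
  --dimensions Name=ClusterIdentifier,Value=<cluster-identifier> \
  --start-time "$(date -u -d '3 hours ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Average
```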

## Vault alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VaultIsSealed | Vault is sealed and unable to auto-unseal | critical | Check the logs for unseal issues (example below). |
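
A minimal sketch for checking the seal status and auto-unseal errors, assuming the official Helm chart layout (namespace, pod name and label selector may differ in your setup).

```sh
kubectl -n vault exec vault-0 -- vault status
kubectl -n vault logs -l app.kubernetes.io/name=vault --tail=200 | grep -i seal
```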

## cert-manager alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| CertificateNotReady | A cert-manager certificate cannot be issued/updated | warning | Check the certificate events and pod logs (example below). |
| CertificateAboutToExpire | A cert-manager certificate is about to expire | warning | Check the certificate events and pod logs. |
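
A minimal sketch for tracing a stuck certificate; the certificate name and namespace come from the alert, and the controller deployment name assumes a default cert-manager Helm install.

```sh
# The Certificate and the ACME resources it spawned.
kubectl -n <namespace> describe certificate <certificate-name>
kubectl -n <namespace> get certificaterequests,orders,challenges

# Controller logs.
kubectl -n cert-manager logs deploy/cert-manager --tail=200
```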

## AmazonMQ alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| AmazonMQCWExporterDown | An AmazonMQ for RabbitMQ metrics exporter is down | warning/critical | Check why the Cloudwatch exporter is failing. |
| AmazonMQMemoryAboveLimit | AmazonMQ for RabbitMQ node memory usage is above the limit | warning/critical | Switch to a bigger instance type (example below). |
| AmazonMQDiskFreeBelowLimit | AmazonMQ for RabbitMQ node free disk space is below the limit | warning/critical | Switch to a bigger instance type. |
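
A sketch for checking the broker and moving it to a larger host instance type; the broker ID and the target instance type are placeholders, and the change only takes effect at the next maintenance window unless the broker is rebooted.

```sh
aws mq describe-broker --broker-id <broker-id> \
  --query '{State: BrokerState, InstanceType: HostInstanceType}'
aws mq update-broker --broker-id <broker-id> --host-instance-type mq.m5.large
```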

## ExternalDNS alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ExternalDnsRegistryErrorsIncrease | External DNS registry errors are increasing constantly | warning | See the notes in the original doc (example below). |
| ExternalDNSSourceErrorsIncrease | External DNS source errors are increasing constantly | warning | See the notes in the original doc. |
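
A quick check of the controller logs; the namespace and deployment name are assumptions.

```sh
kubectl -n infrastructure logs deploy/external-dns --tail=200 | grep -iE 'error|fail'
```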

## Velero alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VeleroBackupPartialFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} partial failures. | warning | |
| VeleroBackupFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} failures. | warning | |
| VeleroVolumeSnapshotFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} volume snapshot failures. | warning | |
| VeleroBackupTooOld | The cluster hasn't been backed up for more than 3 days. | critical | Check other alerts for the root cause (example below). |
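
A minimal sketch with the velero CLI for finding the failing backup and its errors; the backup name comes from velero backup get.

```sh
velero backup get
velero backup describe <backup-name> --details
velero backup logs <backup-name>
```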

## VPA alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VPAAdmissionControllerDown | The VPA AdmissionController is down | warning | Debug via the logs; see upstream (example below). |
| VPAAdmissionControllerSlow | The VPA AdmissionController is slow | warning | Debug via the logs; see upstream. |
| VPARecommenderDown | The VPA Recommender is down | warning | Debug via the logs; see upstream. |
| VPARecommenderSlow | The VPA Recommender is slow | warning | Debug via the logs; see upstream. |
| VPAUpdaterDown | The VPA Updater is down | warning | Debug via the logs; see upstream. |
| VPAUpdaterSlow | The VPA Updater is slow | warning | Debug via the logs; see upstream. |
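
A sketch for checking the VPA components, assuming the upstream autoscaler deployment names in kube-system (adjust namespace and names for your install).

```sh
kubectl -n kube-system get pods | grep vpa
kubectl -n kube-system logs deploy/vpa-recommender --tail=200
kubectl -n kube-system logs deploy/vpa-admission-controller --tail=200
kubectl -n kube-system logs deploy/vpa-updater --tail=200
```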

## Other Kubernetes Runbooks
