Runbook
This is the Skyscrapers alerts runbook (or playbook), inspired by the upstream kubernetes-mixin runbook.
This page documents specific alerts coming from your monitoring system. Some alerts may not be documented yet, and some of the information may be incomplete or outdated. If you find a missing alert or inaccurate information, feel free to submit an issue or pull request.
In addition to the alerts listed on this page, other system alerts are described in upstream runbooks, like the one linked above. Always follow the runbook_url annotation link (:notebook:) in the alert notification to get the most up-to-date information about that alert.
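As a rough illustration of where the runbook_url annotation lives, the sketch below pulls it out of an Alertmanager webhook payload. The payload shape (an `alerts` list whose entries carry `labels` and `annotations`) follows the Alertmanager webhook format; the function name `runbook_urls`, the sample alert, and the example.com URL are hypothetical and only serve the example.

```python
import json
from typing import Dict


def runbook_urls(payload: str) -> Dict[str, str]:
    """Map alert name to runbook_url for every alert that carries the annotation."""
    data = json.loads(payload)
    urls: Dict[str, str] = {}
    for alert in data.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "<unknown>")
        url = alert.get("annotations", {}).get("runbook_url")
        if url:  # alerts without the annotation are skipped
            urls[name] = url
    return urls


# Hypothetical payload: the URL below is a placeholder, not a real runbook link.
sample = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "VaultIsSealed", "severity": "critical"},
            "annotations": {"runbook_url": "https://example.com/runbook#vaultissealed"},
        },
        {
            "status": "firing",
            "labels": {"alertname": "CPUUsageHigh", "severity": "warning"},
            "annotations": {},
        },
    ],
})

print(runbook_urls(sample))
```

Alerts that have no runbook_url annotation are left out of the result, which makes it easy to spot alerts that still need documentation.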
Kubernetes alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| MemoryOvercommitted | {{$labels.node}} is overcommitted by {{$value}}% | warning | Check actual memory usage. If it is not high, adjust Pod resource requests. Otherwise, add more nodes. |
| CPUUsageHigh | {{$labels.instance}} is using more than 90% CPU for >1h | warning | |
| NodeWithImpairedVolumes | EBS volumes are failing to attach to node {{$labels.node}} | critical | Check the AWS Console for stuck volumes. Detach the volume, delete the Pod, and untaint the node once resolved. |
| ContainerExcessiveCPU | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} high CPU for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing CPU request/limit. |
| ContainerExcessiveMEM | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} high Memory for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing Memory request/limit. |
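The decision described in the MemoryOvercommitted row can be sketched as a small triage helper. This is only an illustration under stated assumptions: the runbook does not show how the alert computes its percentage, so the formula below (committed memory in excess of allocatable memory) is assumed, and the `NodeMemory` type, the function names, and the 0.8 "high usage" cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    """Memory figures for one node, in bytes (hypothetical inputs)."""
    allocatable: int  # memory the node can hand out to Pods
    committed: int    # memory committed to Pods on the node
    used: int         # memory actually in use


def overcommit_percent(node: NodeMemory) -> float:
    """Percentage by which committed memory exceeds allocatable (0 if it does not)."""
    excess = node.committed - node.allocatable
    return max(0.0, 100.0 * excess / node.allocatable)


def triage(node: NodeMemory, high_usage_ratio: float = 0.8) -> str:
    """Pick the action from the table row; the 0.8 cutoff is an assumed threshold."""
    if overcommit_percent(node) == 0.0:
        return "no action"
    if node.used / node.allocatable < high_usage_ratio:
        return "adjust Pod resource requests"
    return "add more nodes"


GIB = 1024 ** 3
node = NodeMemory(allocatable=16 * GIB, committed=20 * GIB, used=6 * GIB)
print(overcommit_percent(node))  # 25.0
print(triage(node))              # adjust Pod resource requests
```

In practice the real usage figure would come from kubectl top or Grafana, as the ContainerExcessiveCPU and ContainerExcessiveMEM rows suggest.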
Elasticsearch alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| ElasticsearchExporterDown | The Elasticsearch metrics exporter for {{ $labels.job }} is down! | critical | Check exporter pods and logs. |
| ElasticsearchCloudwatchExporterDown | The Elasticsearch Cloudwatch metrics exporter for {{ $labels.job }} is down! | critical | Check exporter pods and logs. |
| ElasticsearchClusterHealthYellow | Elasticsearch cluster health for {{ $labels.cluster }} is yellow. | warning | |
| ElasticsearchClusterHealthRed | Elasticsearch cluster health for {{ $labels.cluster }} is RED! | critical | |
| ElasticsearchClusterEndpointDown | Elasticsearch cluster endpoint for {{ $labels.cluster }} is DOWN! | critical | |
| ElasticsearchAWSLowDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} is low on free disk space | warning | See extended description in original doc. |
| ElasticsearchAWSNoDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} has no free disk space | critical | See extended description in original doc. |
| ElasticsearchIndexWritesBlocked | AWS Elasticsearch cluster {{ $labels.cluster }} is blocking incoming write requests | critical | See AWS docs |
| ElasticsearchLowDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} is low on free disk space | warning | |
| ElasticsearchNoDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} has no free disk space | critical | |
| ElasticsearchHeapTooHigh | JVM heap usage for cluster {{ $labels.cluster }} on node {{ $labels.node }} > 90% for 15m | warning | |
Fluent Bit alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| FluentbitDroppedRecords | Fluent Bit {{ $labels.pod }} is failing to save records to output {{ $labels.name }} | critical | Check pod logs. See Buffer_Size False. |
Flux alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| FluxResourceNotReady | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} hasn't been READY for 15 minutes! | critical/info/warning | Use Flux debugging docs. |
| FluxResourceSuspended | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been SUSPENDED for 8 hours! | warning | Use flux resume ... to unsuspend. |
Loki alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| LokiDiscardingSamples | Loki is discarding ingested samples for reason {{ $labels.reason }} | critical | Increase ingestion_rate_mb or ingestion_burst_size_mb. |
| LokiNotFlushingChunks | Loki writer {{ $labels.pod }} has not flushed any chunks for >40 min | critical | Check writer pod logs and S3 config. |
| LokiBackendCompactorFailing | Loki compactor {{ $labels.pod }} is failing | warning | Check backend pod logs. |
MongoDB alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| MongodbMetricsDown | The MongoDB metrics exporter for {{ $labels.job }} is down! | critical | Check exporter service and endpoints. |
| MongodbLowConnectionsAvailable | Low connections available on {{$labels.instance}} | warning | Investigate query buildup. |
| MongodbUnhealthyMember | A mongo node with issues has been detected on {{$labels.instance}} | critical | |
| MongodbReplicationLagWarning | Replication lag >10s on {{$labels.instance}} | warning | |
| MongodbReplicationLagCritical | Replication lag >60s on {{$labels.instance}} | critical | |
RDS alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| RDSCPUCreditBalanceLow | CPU credit balance for {{ $labels.dbinstance_identifier }} < 5! | critical | |
| RDSFreeableMemoryLow | RDS instance {{ $labels.dbinstance_identifier }} low memory (< 100 MB)! | critical | |
| RDSFreeStorageSpaceRunningLow | Disk storage expected to fill up in 4 days for {{ $labels.dbinstance_identifier }} | critical | |
| RDSDiskQueueDepthHigh | Outstanding IO requests high for {{ $labels.dbinstance_identifier }} | critical | |
| RDSCPUUsageHigh | CPU usage for {{ $labels.dbinstance_identifier }} > 95%. | info | |
| RDSReplicaLagHigh | Replica lag for {{ $labels.dbinstance_identifier }} > 30s. | warning | |
| RDSBurstBalanceLow | EBS BurstBalance for {{ $labels.dbinstance_identifier }} < 20%. | warning | Check for heavy IO queries, consider increasing IOPS. |
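The RDSFreeStorageSpaceRunningLow row describes a forecast rather than a fixed threshold. The actual alert rule is not shown in this runbook; the sketch below assumes a simple least-squares linear extrapolation, similar in spirit to PromQL's `predict_linear`. The function names and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, free bytes)


def seconds_until_full(samples: List[Sample]) -> Optional[float]:
    """Fit a least-squares line to free space and return seconds until it hits zero.

    Returns None when free space is flat or growing.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need at least two samples with distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_v - slope * mean_t
    zero_at = -intercept / slope            # time at which the fitted line reaches 0
    latest = max(t for t, _ in samples)
    return max(0.0, zero_at - latest)


def fills_within(samples: List[Sample], days: float = 4.0) -> bool:
    """True if the fitted trend reaches zero free space within the given number of days."""
    remaining = seconds_until_full(samples)
    return remaining is not None and remaining <= days * 86400


# Hypothetical data: 90 GB free, shrinking by 1 GB per hour, sampled hourly.
samples = [(h * 3600.0, (90 - h) * 1e9) for h in range(5)]
print(seconds_until_full(samples) / 3600)  # roughly 86 hours
print(fills_within(samples))               # True
```

A linear fit is a coarse model: bursty writes or periodic cleanup jobs can make the forecast swing, so treat the result as a prompt to look at the storage graph rather than an exact deadline.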
Concourse alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| ConcourseWorkersMismatch | Stale Concourse workers for >1h | critical | |
| ConcourseWorkerCPUCreditBalanceLow | Minimum CPU credit balance of Concourse workers = 0 for 1h | critical | |
| ConcourseWorkerEBSIOBalanceLow | EBS IO balance of Concourse workers volumes = 0 for 1h | critical | |
| ConcourseEndpointDown | Concourse endpoint down for 5 min. | critical | |
Redshift alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| RedshiftExporterDown | Redshift metrics exporter for {{ $labels.job }} is down! | critical | Check exporter pods and logs. |
| RedshiftHealthStatus | Redshift cluster not healthy for {{ $labels.cluster }}! | critical | Check AWS Redshift dashboard. |
| RedshiftMaintenanceMode | Redshift cluster in maintenance mode for {{ $labels.cluster }}! | warning | Maintenance mode may impact availability. |
| RedshiftLowDiskSpace | AWS Redshift cluster {{ $labels.cluster }} low on free disk space | warning | Check and increase disk space if needed. |
| RedshiftNoDiskSpace | AWS Redshift cluster {{ $labels.cluster }} out of free disk space | critical | Take immediate action to increase storage. |
| RedshiftCPUHigh | AWS Redshift cluster {{ $labels.cluster }} at max CPU for 30 min | warning | Investigate cause and take action. |
Vault alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| VaultIsSealed | Vault is sealed and unable to auto-unseal | critical | Check logs for unseal issues. |
cert-manager alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| CertificateNotReady | A cert-manager certificate cannot be issued/updated | warning | Check certificate events and pod logs. |
| CertificateAboutToExpire | A cert-manager certificate is about to expire | warning | Check certificate events and pod logs. |
AmazonMQ alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| AmazonMQCWExporterDown | An AmazonMQ for RabbitMQ metrics exporter is down | warning/critical | Check why the Cloudwatch exporter is failing. |
| AmazonMQMemoryAboveLimit | AmazonMQ for RabbitMQ node memory usage above limit | warning/critical | Switch to a bigger instance type. |
| AmazonMQDiskFreeBelowLimit | AmazonMQ for RabbitMQ node free disk space below limit | warning/critical | Switch to a bigger instance type. |
ExternalDNS alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| ExternalDnsRegistryErrorsIncrease | External DNS registry Errors increasing constantly | warning | See notes in original doc. |
| ExternalDNSSourceErrorsIncrease | External DNS source Errors increasing constantly | warning | See notes in original doc. |
Velero alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| VeleroBackupPartialFailures | `Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} partial failures.` | warning | |
| VeleroBackupFailures | `Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} failures.` | warning | |
| VeleroVolumeSnapshotFailures | `Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} volume snapshot failures.` | warning | |
| VeleroBackupTooOld | Cluster hasn't been backed up for more than 3 days. | critical | Check other alerts for root cause. |
VPA alerts
| Alert Name | Description | Severity | Action / Notes |
|---|---|---|---|
| VPAAdmissionControllerDown | The VPA AdmissionController is down | warning | Debug logs, see upstream |
| VPAAdmissionControllerSlow | The VPA AdmissionController is slow | warning | Debug logs, see upstream |
| VPARecommenderDown | The VPA Recommender is down | warning | Debug logs, see upstream |
| VPARecommenderSlow | The VPA Recommender is slow | warning | Debug logs, see upstream |
| VPAUpdaterDown | The VPA Updater is down | warning | Debug logs, see upstream |
| VPAUpdaterSlow | The VPA Updater is slow | warning | Debug logs, see upstream |
Other Kubernetes Runbooks