# Runbook

This is the Skyscrapers alerts runbook (or playbook), inspired by the upstream kubernetes-mixin runbook.

On this page you should find detailed information about specific alerts coming from your monitoring system. It's possible, though, that some alerts haven't been documented yet or that the information is incomplete or outdated. If you find a missing alert or inaccurate information, feel free to submit an issue or pull request.

In addition to the alerts listed on this page, there are other system alerts that are described in upstream runbooks, like the one linked above. Always follow the runbook_url annotation link (:notebook:) in the alert notification to get the most recent and up-to-date information about that alert.

## Kubernetes alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| MemoryOvercommitted | {{$labels.node}} is overcommitted by {{$value}}% | warning | Check actual memory usage. If it isn't high, adjust Pod resource requests; otherwise, add more nodes. |
| CPUUsageHigh | {{$labels.instance}} is using more than 90% CPU for >1h | warning | |
| NodeWithImpairedVolumes | EBS volumes are failing to attach to node {{$labels.node}} | critical | Check the AWS Console for stuck volumes. Detach the volume, delete the Pod, and untaint the node once resolved (example below). |
| ContainerExcessiveCPU | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} has had high CPU usage for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing the CPU request/limit. |
| ContainerExcessiveMEM | Container {{ $labels.container }} of pod {{ $labels.namespace }}/{{ $labels.pod }} has had high memory usage for 30 min | warning | Check usage via kubectl top or Grafana. Consider increasing the memory request/limit. |
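
A minimal sketch of the usual checks for MemoryOvercommitted and the ContainerExcessive* alerts, plus the detach/cleanup steps for NodeWithImpairedVolumes. Node, pod and volume names are placeholders, and the taint key shown is an assumption; use whatever taint is actually present on the impaired node.

```sh
# Check actual node and per-container usage before touching requests/limits.
kubectl top node <node-name>
kubectl -n <namespace> top pod <pod-name> --containers

# NodeWithImpairedVolumes: find the stuck EBS attachment, detach it, delete the
# affected Pod and remove the taint once the node is healthy again.
aws ec2 describe-volumes --volume-ids <volume-id> --query 'Volumes[0].Attachments'
aws ec2 detach-volume --volume-id <volume-id> --force
kubectl -n <namespace> delete pod <pod-name>
kubectl taint node <node-name> NodeWithImpairedVolumes=true:NoSchedule-
```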

## ElasticSearch alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ElasticsearchExporterDown | The Elasticsearch metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs (example below). |
| ElasticsearchCloudwatchExporterDown | The Elasticsearch Cloudwatch metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs. |
| ElasticsearchClusterHealthYellow | Elasticsearch cluster health for {{ $labels.cluster }} is yellow. | warning | |
| ElasticsearchClusterHealthRed | Elasticsearch cluster health for {{ $labels.cluster }} is RED! | critical | |
| ElasticsearchClusterEndpointDown | Elasticsearch cluster endpoint for {{ $labels.cluster }} is DOWN! | critical | |
| ElasticsearchAWSLowDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} is low on free disk space | warning | See the extended description in the original doc. |
| ElasticsearchAWSNoDiskSpace | AWS Elasticsearch cluster {{ $labels.cluster }} has no free disk space | critical | See the extended description in the original doc. |
| ElasticsearchIndexWritesBlocked | AWS Elasticsearch cluster {{ $labels.cluster }} is blocking incoming write requests | critical | See the AWS docs. |
| ElasticsearchLowDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} is low on free disk space | warning | |
| ElasticsearchNoDiskSpace | Elasticsearch node {{ $labels.node }} on cluster {{ $labels.cluster }} has no free disk space | critical | |
| ElasticsearchHeapTooHigh | JVM heap usage for cluster {{ $labels.cluster }} on node {{ $labels.node }} > 90% for 15m | warning | |
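
A minimal sketch of the exporter and cluster-health checks. The namespace and label selector are assumptions and `<es-endpoint>` is a placeholder for your cluster endpoint.

```sh
# Find the exporter pods and check their logs.
kubectl -n infrastructure get pods -l app=elasticsearch-exporter
kubectl -n infrastructure logs -l app=elasticsearch-exporter --tail=100

# Check cluster health and per-node disk allocation directly against the cluster.
curl -s "https://<es-endpoint>/_cluster/health?pretty"
curl -s "https://<es-endpoint>/_cat/allocation?v"
```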

## Fluent Bit alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| FluentbitDroppedRecords | Fluent Bit {{ $labels.pod }} is failing to save records to output {{ $labels.name }} | critical | Check the pod logs (example below). See Buffer_Size False. |
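
A quick way to spot the failing output in the affected pod's logs; the namespace and pod name are placeholders.

```sh
# Look for retry/error messages from the output plugin named in the alert.
kubectl -n <namespace> logs <fluent-bit-pod> --tail=500 | grep -Ei 'error|retry|failed'
```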

## Flux alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| FluxResourceNotReady | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} hasn't been READY for 15 minutes! | critical/info/warning | Use the Flux debugging docs (example below). |
| FluxResourceSuspended | {{ $labels.customresource_kind }} {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been SUSPENDED for 8 hours! | warning | Use flux resume ... to unsuspend. |
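
A minimal sketch using the flux CLI, assuming the resource kind is a Kustomization; substitute helmrelease, source git, etc. according to the alert labels.

```sh
# List everything that isn't ready, then drill into the failing resource.
flux get all -A --status-selector ready=false
flux logs --kind=Kustomization --name=<name> -n <namespace>
kubectl -n <namespace> describe kustomization <name>

# FluxResourceSuspended: resume the suspended resource.
flux resume kustomization <name> -n <namespace>
```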

## Loki alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| LokiDiscardingSamples | Loki is discarding ingested samples for reason {{ $labels.reason }} | critical | Increase ingestion_rate_mb or ingestion_burst_size_mb (example below). |
| LokiNotFlushingChunks | Loki writer {{ $labels.pod }} has not flushed any chunks for >40 min | critical | Check the writer pod logs and the S3 config. |
| LokiBackendCompactorFailing | Loki compactor {{ $labels.pod }} is failing | warning | Check the backend pod logs. |
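
A sketch of the usual checks, assuming a Loki simple-scalable install (the namespace and component labels may differ in your deployment); the values shown in the comment follow the upstream limits_config naming.

```sh
# Writer and backend logs for flush / compaction errors.
kubectl -n loki logs -l app.kubernetes.io/component=write --tail=200 | grep -i error
kubectl -n loki logs -l app.kubernetes.io/component=backend --tail=200 | grep -i compactor

# LokiDiscardingSamples: raise the ingestion limits in the Helm values, e.g.
#   limits_config:
#     ingestion_rate_mb: 8
#     ingestion_burst_size_mb: 16
```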

## MongoDB alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| MongodbMetricsDown | The MongoDB metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter service and endpoints. |
| MongodbLowConnectionsAvailable | Low connections available on {{$labels.instance}} | warning | Investigate query buildup (example below). |
| MongodbUnhealthyMember | A mongo node with issues has been detected on {{$labels.instance}} | critical | |
| MongodbReplicationLagWarning | Replication lag >10s on {{$labels.instance}} | warning | |
| MongodbReplicationLagCritical | Replication lag >60s on {{$labels.instance}} | critical | |
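
A minimal sketch of the replica-set and connection checks with mongosh; the connection string is a placeholder.

```sh
# Replica set member states and replication lag.
mongosh "mongodb://<host>:27017" --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'
mongosh "mongodb://<host>:27017" --eval 'rs.printSecondaryReplicationInfo()'

# MongodbLowConnectionsAvailable: current/available connection counts.
mongosh "mongodb://<host>:27017" --eval 'db.serverStatus().connections'
```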

## RDS alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| RDSCPUCreditBalanceLow | CPU credit balance for {{ $labels.dbinstance_identifier }} < 5! | critical | |
| RDSFreeableMemoryLow | RDS instance {{ $labels.dbinstance_identifier }} is low on memory (< 100 MB)! | critical | |
| RDSFreeStorageSpaceRunningLow | Disk storage is expected to fill up within 4 days for {{ $labels.dbinstance_identifier }} | critical | |
| RDSDiskQueueDepthHigh | Outstanding IO requests are high for {{ $labels.dbinstance_identifier }} | critical | |
| RDSCPUUsageHigh | CPU usage for {{ $labels.dbinstance_identifier }} > 95%. | info | |
| RDSReplicaLagHigh | Replica lag for {{ $labels.dbinstance_identifier }} > 30s. | warning | |
| RDSBurstBalanceLow | EBS BurstBalance for {{ $labels.dbinstance_identifier }} < 20%. | warning | Check for heavy IO queries and consider increasing IOPS (example below). |
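
A sketch for pulling the instance details and the recent BurstBalance trend from CloudWatch; the instance identifier is a placeholder and the date invocation assumes GNU date.

```sh
aws rds describe-db-instances --db-instance-identifier <dbinstance-identifier>

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name BurstBalance \
  --dimensions Name=DBInstanceIdentifier,Value=<dbinstance-identifier> \
  --start-time "$(date -u -d '3 hours ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum
```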

## Concourse alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ConcourseWorkersMismatch | Stale Concourse workers for >1h | critical | See the example below. |
| ConcourseWorkerCPUCreditBalanceLow | Minimum CPU credit balance of Concourse workers = 0 for 1h | critical | |
| ConcourseWorkerEBSIOBalanceLow | EBS IO balance of Concourse worker volumes = 0 for 1h | critical | |
| ConcourseEndpointDown | Concourse endpoint down for 5 min. | critical | |
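
A minimal sketch for ConcourseWorkersMismatch using the fly CLI; the target and worker names are placeholders.

```sh
# List workers and their state, then prune any that are stalled.
fly -t <target> workers
fly -t <target> prune-worker -w <worker-name>
```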

## Redshift alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| RedshiftExporterDown | Redshift metrics exporter for {{ $labels.job }} is down! | critical | Check the exporter pods and their logs. |
| RedshiftHealthStatus | Redshift cluster not healthy for {{ $labels.cluster }}! | critical | Check the AWS Redshift dashboard (example below). |
| RedshiftMaintenanceMode | Redshift cluster in maintenance mode for {{ $labels.cluster }}! | warning | Maintenance mode may impact availability. |
| RedshiftLowDiskSpace | AWS Redshift cluster {{ $labels.cluster }} is low on free disk space | warning | Check and increase disk space if needed. |
| RedshiftNoDiskSpace | AWS Redshift cluster {{ $labels.cluster }} is out of free disk space | critical | Take immediate action to increase storage. |
| RedshiftCPUHigh | AWS Redshift cluster {{ $labels.cluster }} has been at max CPU for 30 min | warning | Investigate the cause and take action. |
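
A sketch for checking the cluster from the CLI; the cluster identifier is a placeholder and the date invocation assumes GNU date.

```sh
# Cluster status / maintenance details.
aws redshift describe-clusters --cluster-identifier <cluster-identifier>

# Recent disk usage trend.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Redshift --metric-name PercentageDiskSpaceUsed \
  --dimensions Name=ClusterIdentifier,Value=<cluster-identifier> \
  --start-time "$(date -u -d '3 hours ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Average
```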

## Vault alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VaultIsSealed | Vault is sealed and unable to auto-unseal | critical | Check the logs for unseal issues (example below). |
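
A minimal sketch for checking the seal status and auto-unseal errors, assuming the official Helm chart layout (namespace, pod name and label selector may differ in your setup).

```sh
kubectl -n vault exec vault-0 -- vault status
kubectl -n vault logs -l app.kubernetes.io/name=vault --tail=200 | grep -i seal
```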

## cert-manager alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| CertificateNotReady | A cert-manager certificate cannot be issued/updated | warning | Check the certificate events and pod logs (example below). |
| CertificateAboutToExpire | A cert-manager certificate is about to expire | warning | Check the certificate events and pod logs. |
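
A minimal sketch for tracing a stuck certificate; the certificate name and namespace come from the alert, and the controller deployment name assumes a default cert-manager Helm install.

```sh
# The Certificate and the ACME resources it spawned.
kubectl -n <namespace> describe certificate <certificate-name>
kubectl -n <namespace> get certificaterequests,orders,challenges

# Controller logs.
kubectl -n cert-manager logs deploy/cert-manager --tail=200
```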

## AmazonMQ alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| AmazonMQCWExporterDown | An AmazonMQ for RabbitMQ metrics exporter is down | warning/critical | Check why the Cloudwatch exporter is failing. |
| AmazonMQMemoryAboveLimit | AmazonMQ for RabbitMQ node memory usage is above the limit | warning/critical | Switch to a bigger instance type (example below). |
| AmazonMQDiskFreeBelowLimit | AmazonMQ for RabbitMQ node free disk space is below the limit | warning/critical | Switch to a bigger instance type. |
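
A sketch for checking the broker and moving it to a larger host instance type; the broker ID and the target instance type are placeholders, and the change only takes effect at the next maintenance window unless the broker is rebooted.

```sh
aws mq describe-broker --broker-id <broker-id> \
  --query '{State: BrokerState, InstanceType: HostInstanceType}'
aws mq update-broker --broker-id <broker-id> --host-instance-type mq.m5.large
```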

## ExternalDNS alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| ExternalDnsRegistryErrorsIncrease | External DNS registry errors are increasing constantly | warning | See the notes in the original doc (example below). |
| ExternalDNSSourceErrorsIncrease | External DNS source errors are increasing constantly | warning | See the notes in the original doc. |
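
A quick check of the controller logs; the namespace and deployment name are assumptions.

```sh
kubectl -n infrastructure logs deploy/external-dns --tail=200 | grep -iE 'error|fail'
```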

## Velero alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VeleroBackupPartialFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} partial failures. | warning | |
| VeleroBackupFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} failures. | warning | |
| VeleroVolumeSnapshotFailures | Velero backup {{ $labels.schedule }} has {{ $value \| humanizePercentage }} volume snapshot failures. | warning | |
| VeleroBackupTooOld | The cluster hasn't been backed up for more than 3 days. | critical | Check other alerts for the root cause (example below). |
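
A minimal sketch with the velero CLI for finding the failing backup and its errors; the backup name comes from velero backup get.

```sh
velero backup get
velero backup describe <backup-name> --details
velero backup logs <backup-name>
```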

## VPA alerts

| Alert Name | Description | Severity | Action / Notes |
| --- | --- | --- | --- |
| VPAAdmissionControllerDown | The VPA AdmissionController is down | warning | Debug via the logs; see upstream (example below). |
| VPAAdmissionControllerSlow | The VPA AdmissionController is slow | warning | Debug via the logs; see upstream. |
| VPARecommenderDown | The VPA Recommender is down | warning | Debug via the logs; see upstream. |
| VPARecommenderSlow | The VPA Recommender is slow | warning | Debug via the logs; see upstream. |
| VPAUpdaterDown | The VPA Updater is down | warning | Debug via the logs; see upstream. |
| VPAUpdaterSlow | The VPA Updater is slow | warning | Debug via the logs; see upstream. |
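
A sketch for checking the VPA components, assuming the upstream autoscaler deployment names in kube-system (adjust namespace and names for your install).

```sh
kubectl -n kube-system get pods | grep vpa
kubectl -n kube-system logs deploy/vpa-recommender --tail=200
kubectl -n kube-system logs deploy/vpa-admission-controller --tail=200
kubectl -n kube-system logs deploy/vpa-updater --tail=200
```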

## Other Kubernetes Runbooks
