2026-04-22

Upgraded cluster add-ons and resilience improvements

#add-on  #kubernetes  #update  #upgrade  #component  #eks 

The following updates are being rolled out to all clusters. Alongside the regular add-on upgrades, this round includes resilience improvements for critical monitoring components and stricter anti-affinity rules across our platform controllers, so that pods are spread across nodes (a hard requirement) and across availability zones (a soft preference).

Resilience improvements

PodDisruptionBudgets for Alertmanager and Prometheus

Alertmanager and Prometheus now get a PodDisruptionBudget (PDB) when configured with more than one replica (the default), ensuring that at least one instance stays available during voluntary disruptions such as node drains or cluster autoscaling events. The PDB sets unhealthyPodEvictionPolicy: AlwaysAllow, so unhealthy pods can still be evicted and replaced instead of blocking the disruption.
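
As a rough illustration, a PDB of the kind described above could look like the sketch below. The resource name, namespace, and label selector are hypothetical, and minAvailable: 1 is an assumption derived from "at least one instance stays available"; the manifests actually rolled out may differ.

```yaml
# Sketch only: name, namespace, and labels are illustrative assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: alertmanager            # hypothetical name
  namespace: monitoring         # hypothetical namespace
spec:
  # Assumed: keep at least one replica running during voluntary disruptions.
  minAvailable: 1
  # Allow eviction of pods that are running but not Ready,
  # so a broken pod never blocks a node drain.
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed pod label
```

Without AlwaysAllow, the default policy only permits evicting unhealthy pods while the budget is otherwise satisfied, which can leave a drain stuck behind a pod that will never become Ready on its own.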

Strengthened pod anti-affinity rules

Critical multi-replica controllers — Traefik, ingress-nginx, KEDA, AWS Load Balancer Controller, Alertmanager, Prometheus, and Tailscale — now enforce two anti-affinity rules:

  • Required anti-affinity on kubernetes.io/hostname: no two replicas can land on the same node, which prevents a single-node failure from taking down every replica at once. Previously this rule was only preferred, not enforced.
  • Preferred anti-affinity on topology.kubernetes.io/zone: the scheduler spreads replicas across Availability Zones when capacity allows, improving resilience to zonal outages.

Both rules use matchLabelKeys (pod-template-hash for Deployments, controller-revision-hash for StatefulSets) to scope the anti-affinity per revision. This avoids rolling-update stalls where a new pod has nowhere to schedule because the old replicas still occupy every node.
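
The two rules and the per-revision scoping can be sketched as the following pod-template fragment for a Deployment. This is a minimal sketch, not the exact configuration shipped: the app.kubernetes.io/name label and the weight value are assumptions, and a StatefulSet would list controller-revision-hash instead of pod-template-hash.

```yaml
# Fragment of spec.template.spec in a Deployment (sketch only).
affinity:
  podAntiAffinity:
    # Hard rule: never co-locate two replicas of the same revision on one node.
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: traefik   # assumed pod label
        # Scope the rule to pods sharing this pod's revision hash,
        # so old-revision pods do not block new ones during a rollout.
        matchLabelKeys:
          - pod-template-hash
    # Soft rule: prefer placing replicas in different zones when capacity allows.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                           # assumed weight
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: traefik # assumed pod label
          matchLabelKeys:
            - pod-template-hash
```

With matchLabelKeys, the scheduler merges the pod's own value for each listed key into the label selector, so the anti-affinity only applies between pods of the same revision.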

Add-on upgrades