2026-04-22
Upgraded cluster add-ons and resilience improvements
#add-on #kubernetes #update #upgrade #component #eks
The following updates are being rolled out to all clusters. Alongside the regular add-on upgrades, this round also includes resilience improvements for critical monitoring components and a hardening of the anti-affinity rules across our platform controllers, so pods are actively spread across nodes (hard) and availability zones (soft).
Resilience improvements
PodDisruptionBudgets for Alertmanager and Prometheus
Alertmanager and Prometheus now get a PodDisruptionBudget when configured with more than one replica (the default), so at least one instance stays available during voluntary disruptions such as node drains or cluster autoscaling events. The PDB sets unhealthyPodEvictionPolicy: AlwaysAllow so that unhealthy pods can still be evicted and self-healing can kick in.
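As a rough illustration, a PodDisruptionBudget of this kind could look like the sketch below. The name, namespace, and label selector are placeholders, and minAvailable: 1 is an assumption about how "at least one instance stays available" is expressed; the manifest rendered on your cluster may differ.

```yaml
# Illustrative sketch only: name, namespace, and labels are assumed,
# not copied from the platform's rendered manifests.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: alertmanager              # assumed name
  namespace: monitoring           # assumed namespace
spec:
  # Assumed way of keeping at least one replica up during voluntary disruptions.
  minAvailable: 1
  # Unhealthy pods can always be evicted, even if the budget is not met.
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed label
```

With the default policy (IfHealthyBudget), a pod that never becomes Ready can block a node drain; AlwaysAllow avoids that at the cost of occasionally evicting a pod that was still starting up.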
Strengthened pod anti-affinity rules
Critical multi-replica controllers — Traefik, ingress-nginx, KEDA, AWS Load Balancer Controller, Alertmanager, Prometheus, and Tailscale — now enforce two anti-affinity rules:
- Required anti-affinity on kubernetes.io/hostname: no two replicas can land on the same node, so a single node failure can no longer take down every replica at once. Previously this anti-affinity was only preferred, not enforced.
- Preferred anti-affinity on topology.kubernetes.io/zone: the scheduler prefers to spread replicas across Availability Zones when capacity allows, improving survivability of zonal outages.
Both rules use matchLabelKeys (pod-template-hash for Deployments, controller-revision-hash for StatefulSets) to scope the anti-affinity per revision. This avoids rolling-update stalls where a new pod has nowhere to schedule because the old replicas still occupy every node.
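To make the two rules concrete, here is a minimal sketch of what the affinity stanza in a Deployment's pod template could look like. The label selector and the weight are illustrative assumptions, not values taken from our charts; a StatefulSet would list controller-revision-hash under matchLabelKeys instead.

```yaml
# Illustrative pod template fragment for a Deployment-managed controller.
# The app label and the weight are assumed for the example.
affinity:
  podAntiAffinity:
    # Hard rule: never co-locate two replicas of the same revision on one node.
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: example-controller   # assumed label
        matchLabelKeys:
          - pod-template-hash
    # Soft rule: prefer a different Availability Zone when capacity allows.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                                      # assumed weight
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: example-controller # assumed label
          matchLabelKeys:
            - pod-template-hash
```

Because matchLabelKeys adds the incoming pod's own pod-template-hash value to the selector, pods from the previous revision do not count against the new pod, which is what lets a rolling update proceed on a cluster where every node already hosts an old replica.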
Add-on upgrades
- alloy v1.15.0 (chart v1.7.0)
- amazon-eks-ami v20260403
- aws-ebs-csi-driver v1.58.0-eksbuild.1
- aws-efs-csi-driver v3.0.1 (chart v4.0.1)
  - Major app version bump; the deprecated path volume attribute has been removed upstream
- aws-load-balancer-controller v3.2.2
  - Gateway API v1.5.0 enhancements
- aws-mountpoint-s3-csi-driver v2.5.0-eksbuild.1
  - Enhanced pod-termination ordering
- aws-vpc-cni v1.21.1-eksbuild.7
- cert-manager v1.20.2
- eks-node-monitoring-agent v1.6.3-eksbuild.1
  - tcpdump-based packet capture support and configurable probes/affinities
- fluent-bit v5.0.3 (chart v0.57.3)
  - Major version bump. Some http_server.* metric prefix renames; these are not used in our configuration
- flux v2.8.6
- gha-runner-scale-set v0.14.1
- ingress-nginx v1.15.1 (chart v4.15.1)
- karpenter v1.11.1
  - Support for EC2 placement groups (required a new ec2:DescribePlacementGroups IAM permission on our Karpenter controller role)
  - Includes the CPU-usage regression fix from v1.11.0
  - The Karpenter IAM policy, SQS queue policy, and EventBridge rules have also been synced against the upstream CloudFormation reference, picking up ec2:DescribeInstanceStatus, an SQS DenyHTTP statement blocking non-TLS access, and a new EventBridge rule for Capacity Reservation Interruption warnings
- kube-prometheus-stack v83.6.0 — bundled subcomponent highlights:
  - Prometheus Operator v0.90.1 (from v0.89.0): new schedulerName field on Prometheus/Alertmanager/PrometheusAgent/ThanosRuler CRDs, new --repair-policy-for-statefulsets flag recommended on Kubernetes 1.35+ to auto-recover pods stuck on a bad revision, and expanded Alertmanager config support (messageText for Slack, forceImplicitTLS for SMTP, global Telegram bot token)
  - Alertmanager v0.32.0 (from v0.31.1): silence annotations, silence logging, multi-matcher-set silences, full payload templating for the webhook notifier, new dict/map/append template functions, and Slack receivers can now edit existing messages
  - Prometheus v3.11.2 (from v3.10.0): security fix for a stored XSS in the UI (CVE-2026-40179), new pod-based Kubernetes SD labels for deployment/cronjob/job controller names, experimental histogram_quantiles PromQL function, new storage.tsdb.retention.percentage option, and a fix for alert state incorrectly resetting to pending when the FOR period is increased via config reload
  - Grafana v12.4.3 (from v12.4.1, subchart 11.3.2 → 11.6.1): patch bumps with security fixes for CVE-2026-27876, CVE-2026-27877, CVE-2026-28375, and CVE-2026-27879
- kube-proxy v1.35.3-eksbuild.2
- metrics-server v0.8.1-eksbuild.6
- nvidia-device-plugin v0.19.0
- oauth2-proxy v7.15.2 (chart v10.4.3)
- prometheus-blackbox-exporter v0.28.0 (chart v11.9.1)
- secrets-store-csi-driver-provider-aws v3.0.1
  - Major version bump; the provider now requires explicit tokenRequests audiences to be declared since we install the CSI driver separately. This is handled transparently in our Helm values
- tailscale v1.96.5
- traefik v3.6.13 (chart v39.0.8)
- velero v1.18.0 (chart v12.0.0)