Upgraded cluster add-ons and resilience improvements

The following updates are being rolled out to all clusters. Alongside the regular add-on upgrades, this round includes resilience improvements for critical monitoring components and stricter anti-affinity rules across our platform controllers, so replicas are spread across nodes (a required rule) and across availability zones (a preferred rule).

Resilience improvements

PodDisruptionBudgets for Alertmanager and Prometheus

Alertmanager and Prometheus now get a PodDisruptionBudget (PDB) when configured with more than one replica (the default), ensuring at least one instance stays available during voluntary disruptions such as node drains or cluster autoscaling events. The PDB sets unhealthyPodEvictionPolicy: AlwaysAllow, so pods that are running but not Ready can still be evicted rather than blocking a drain, which lets self-healing kick in.
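
As a rough illustration of the PDB described above, a manifest could look like the sketch below. The resource name, namespace, and selector label are hypothetical placeholders, and minAvailable: 1 is an assumption derived from "at least one instance stays available"; the manifests we actually ship may differ.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: alertmanager-example        # hypothetical name
  namespace: monitoring             # hypothetical namespace
spec:
  # Assumption: keep at least one replica up during voluntary disruptions.
  minAvailable: 1
  # Running-but-not-Ready pods may be evicted even if the budget is not met.
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed selector label
```

With a budget like this, a node drain evicts at most as many healthy pods as still leaves one available, while pods that never became Ready do not hold the drain up.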

Strengthened pod anti-affinity rules

Critical multi-replica controllers — Traefik, ingress-nginx, KEDA, AWS Load Balancer Controller, Alertmanager, Prometheus, and Tailscale — now enforce two anti-affinity rules:

Required anti-affinity on kubernetes.io/hostname: no two replicas can land on the same node, so a single-node failure can no longer take down every replica at once. Previously this rule was only preferred, not required.
Preferred anti-affinity on topology.kubernetes.io/zone: the scheduler spreads replicas across Availability Zones when capacity allows, improving survivability of zonal outages.

Both rules use matchLabelKeys (pod-template-hash for Deployments, controller-revision-hash for StatefulSets) to scope the anti-affinity per revision. This avoids rolling-update stalls where a new pod has nowhere to schedule because the old replicas still occupy every node.
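
A minimal sketch of how the two rules might look in a Deployment's pod template (under spec.template.spec) is shown below. The app.kubernetes.io/name label value and the weight are hypothetical; for a StatefulSet, controller-revision-hash would take the place of pod-template-hash.

```yaml
affinity:
  podAntiAffinity:
    # Hard rule: never co-locate two replicas of the same revision on one node.
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: example-controller   # hypothetical label
        matchLabelKeys:
          - pod-template-hash
    # Soft rule: prefer placing replicas in different zones when capacity allows.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                                      # assumed weight
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: example-controller # hypothetical label
          matchLabelKeys:
            - pod-template-hash
```

Because matchLabelKeys adds the incoming pod's own pod-template-hash value to the selector, pods from the old ReplicaSet no longer count against the new revision, so a new pod can share a node with an old replica while the rollout is in progress.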

Add-on upgrades

alloy v1.15.0 (chart v1.7.0)
amazon-eks-ami v20260403
aws-ebs-csi-driver v1.58.0-eksbuild.1
aws-efs-csi-driver v3.0.1 (chart v4.0.1)
- Major app version bump; the deprecated path volume attribute has been removed upstream
aws-load-balancer-controller v3.2.2
- Gateway API v1.5.0 enhancements
aws-mountpoint-s3-csi-driver v2.5.0-eksbuild.1
- Enhanced pod-termination ordering
aws-vpc-cni v1.21.1-eksbuild.7
cert-manager v1.20.2
eks-node-monitoring-agent v1.6.3-eksbuild.1
- tcpdump-based packet capture support and configurable probes/affinities
fluent-bit v5.0.3 (chart v0.57.3)
- Major version bump. Some http_server.* metric prefixes were renamed upstream; our configuration does not use these metrics
flux v2.8.6
gha-runner-scale-set v0.14.1
ingress-nginx v1.15.1 (chart v4.15.1)
karpenter v1.11.1
- Support for EC2 placement groups (required a new ec2:DescribePlacementGroups IAM permission on our Karpenter controller role)
- Includes the CPU-usage regression fix from v1.11.0
- The Karpenter IAM policy, SQS queue policy, and EventBridge rules have also been synced against the upstream CloudFormation reference, picking up ec2:DescribeInstanceStatus, an SQS DenyHTTP statement blocking non-TLS access, and a new EventBridge rule for Capacity Reservation Interruption warnings
kube-prometheus-stack v83.6.0 — bundled subcomponent highlights:
- Prometheus Operator v0.90.1 (from v0.89.0) — new schedulerName field on Prometheus/Alertmanager/PrometheusAgent/ThanosRuler CRDs, new --repair-policy-for-statefulsets flag recommended on Kubernetes 1.35+ to auto-recover pods stuck on a bad revision, and expanded Alertmanager config support (messageText for Slack, forceImplicitTLS for SMTP, global Telegram bot token)
- Alertmanager v0.32.0 (from v0.31.1) — silence annotations, silence logging, multi-matcher-set silences, full payload templating for the webhook notifier, new dict/map/append template functions, and Slack receivers can now edit existing messages
- Prometheus v3.11.2 (from v3.10.0) — security fix for a stored XSS in the UI (CVE-2026-40179), new pod-based Kubernetes SD labels for deployment/cronjob/job controller names, experimental histogram_quantiles PromQL function, new storage.tsdb.retention.percentage option, and a fix for alert state incorrectly resetting to pending when the FOR period is increased via config reload
- Grafana v12.4.3 (from v12.4.1, subchart 11.3.2 → 11.6.1) — patch bumps with security fixes for CVE-2026-27876, CVE-2026-27877, CVE-2026-28375, and CVE-2026-27879
coredns v1.13.2-eksbuild.7
kube-proxy v1.35.3-eksbuild.5
metrics-server v0.8.1-eksbuild.6
nvidia-device-plugin v0.19.0
oauth2-proxy v7.15.2 (chart v10.4.3)
prometheus-blackbox-exporter v0.28.0 (chart v11.9.1)
secrets-store-csi-driver-provider-aws v3.0.1
- Major version bump; the provider now requires explicit tokenRequests audiences to be declared since we install the CSI driver separately. This is handled transparently in our Helm values
tailscale v1.96.5
traefik v3.6.13 (chart v39.0.8)
velero v1.18.0 (chart v12.0.0)