Earlier alerting on nodes that can't run their pods

Occasionally a node joins a cluster and reports itself Ready, yet can’t actually run the pods scheduled onto it: its networking never finishes initialising (for example a VPC CNI problem at startup), so every pod placed there stays stuck in Pending. Because the node looks healthy, the scheduler keeps sending it work, turning it into a black hole until someone notices.

We’ve added a new platform alert, NodePodsStuckPending, for exactly this: when several pods pile up Pending on a single node for a few minutes, it fires so the node can be cordoned and replaced quickly.

Why this helps

Karpenter’s automatic node repair already replaces genuinely unhealthy nodes, but it waits 30 minutes before acting (to avoid reacting to transient blips). This alert surfaces the problem earlier, typically within about 10 minutes, so a stuck node can be dealt with well before the auto-repair window and the impact on your workloads is kept short.

What you need to do

Nothing. The alert is part of the platform and rolls out across clusters automatically.

Resources

NodePodsStuckPending runbook