Karpenter

Introduction

Karpenter is an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods. Karpenter evaluates the aggregate resource requirements of the pending pods and chooses the optimal instance type to run them. It automatically scales in by terminating instances that don’t run any non-DaemonSet pods, reducing waste. It also supports a consolidation feature that actively moves pods around and deletes or replaces nodes with cheaper alternatives to reduce cluster cost.

Without Karpenter, Kubernetes users relied primarily on Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler (CAS) to dynamically adjust the compute capacity of their clusters. The downside is that each node group is locked to a single instance type across all its nodes, so nodes are often not optimally utilised for the pods running on them.

With Karpenter, the cluster’s needs determine (within the constraints you put up) which instance type is best suited to accommodate the pending pods, and Karpenter launches exactly that node.

Pros and cons

Pros

  1. Efficiency: Karpenter is designed to efficiently pack pods onto nodes to minimize costs. It does this by considering the aggregate resource requests of pending pods and choosing the smallest instance types that fit them.

  2. Scalability: Karpenter can rapidly scale up and down in response to workload changes. This makes it suitable for environments with highly variable workloads.

  3. Simplicity: Karpenter aims to be simpler to set up and manage than some other Kubernetes autoscalers. It integrates directly with Kubernetes Scheduling and doesn’t require a separate cluster autoscaler.

Cons

  1. Maturity: As of now, Karpenter is a relatively new project and may not have the same level of maturity or feature completeness as some other autoscalers.

  2. Potential for Overspending: While Karpenter can help reduce costs by packing pods efficiently, it can also lead to overspending if not configured correctly. For example, if it scales up too aggressively or if the pod resource requirements are set too high.

Note

Remember to evaluate Karpenter in the context of your specific needs and environment before deciding to use it.

How it works

Karpenter observes the aggregate resource requests of unscheduled pods and makes decisions to launch and terminate nodes to minimize scheduling latencies and infrastructure cost.

Infrastructure

It is not recommended to run Karpenter on a node that is managed by Karpenter. Therefore we opted to deploy Karpenter on Fargate nodes.

Karpenter launches EC2 instances directly through the AWS API, so you will no longer see an Auto Scaling group in AWS for the NodePool that provisions your instances. If you want a visual representation of the nodes and their usage, we recommend the eks-node-viewer tool from AWS Labs.
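For example, assuming a Go toolchain is available (the tool can also be installed via Homebrew; check the eks-node-viewer README for current options):

```shell
# Install eks-node-viewer (module path per the awslabs/eks-node-viewer README)
go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest

# Visualise node usage for the current kube context
eks-node-viewer --resources cpu,memory
```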

Pre-requisites

Due to its nature, Karpenter can be quite aggressive in killing pods to reach its desired cluster state. Therefore we need to set some safeguards to make sure the workloads running on the cluster are not negatively affected.

We recommend following these best practices. This is a general recommendation, but it is especially important when using Karpenter.

Usage

Concepts

Configuration

A Skyscrapers engineer can help you enable Karpenter, or you can update your cluster definition file yourself through a pull request:

  karpenter:
    node_pools:
      default:
        node_class:
          extra_securitygroup_ids: []
          gpu_enabled: false
          public: false
          tags:
            team: myteam
          volumeSize: "100Gi"
        annotations:
          role: foo
        labels:
          role: foo
        limits:
          cpu: 111
          memory: "100Gi"
        requirements:
          - key: "karpenter.k8s.aws/instance-category"
            operator: In
            values: ["c", "m", "r"]
          - key: karpenter.k8s.aws/instance-generation
            operator: In
            values: ["5", "6"]
          - key: "karpenter.sh/capacity-type"
            operator: In
            values: ["spot", "on-demand"]
        taints:
          - effect: NoSchedule
            key: role
            value: foo

node_pools is a dictionary that defines the configuration for different pools of nodes. In this case, there is one node pool named default.

Under default, there are several properties that define the configuration of the nodes in this pool:

  • node_class (optional): defines the properties of the nodes, such as whether they have a GPU (gpu_enabled), whether they are public (public), their volume size (volumeSize), etc.

  • annotations (optional): are key-value pairs that can be used to attach arbitrary non-identifying metadata to nodes.

  • labels (required): are key-value pairs that can be used to select nodes for scheduling pods.

  • limits (required): define the maximum total amount of CPU and memory the nodes in this pool may consume. Setting this is important to avoid unexpected costs!

  • requirements (required): define the conditions that must be met for a node to be included in the node pool. See also the requirements section below.

  • taints (optional): are used to repel pods from being scheduled on the nodes. In this case, pods that do not tolerate the taint role: foo will not be scheduled on the nodes.
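Putting the pieces together, a pod targeting this example pool needs a nodeSelector matching the labels and a toleration matching the taint. A minimal sketch (the role: foo values mirror the example above; the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-workload   # hypothetical name
spec:
  nodeSelector:
    role: foo              # matches the node pool's labels
  tolerations:
    - key: role            # tolerates the node pool's taint
      operator: Equal
      value: foo
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:alpine  # hypothetical image
```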

Defining requirements

        requirements:
          - key: "karpenter.k8s.aws/instance-category"
            operator: In
            values: ["c", "m", "r"]
          - key: karpenter.k8s.aws/instance-generation
            operator: Gt
            values: ["4"]
          - key: "topology.kubernetes.io/zone"
            operator: In
            values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
          - key: "kubernetes.io/arch"
            operator: In
            values: ["amd64"]
          - key: "karpenter.sh/capacity-type"
            operator: In
            values: ["on-demand"]
          - key: node.kubernetes.io/instance-type
            operator: NotIn
            values: ["m5a.16xlarge", "m5a.24xlarge"]

This YAML snippet is part of a Karpenter configuration file. It defines a list of requirements that nodes must meet to be included in a specific node pool. Each requirement is defined by a key, an operator, and a list of values. Every requirement in this list is optional; set them to constrain the choices Karpenter can make.

The above example explained:

  1. specifies the instance-category of the node, as defined by the key karpenter.k8s.aws/instance-category, must be either “c”, “m”, or “r”. The In operator means the actual value must be in the provided list.

  2. specifies the instance-generation of the node, as defined by the key karpenter.k8s.aws/instance-generation, must be greater than “4”. The Gt operator stands for “greater than”.

  3. specifies the zone of the node, as defined by the key topology.kubernetes.io/zone, must be either “eu-west-1a”, “eu-west-1b”, or “eu-west-1c”.

  4. specifies the architecture of the node, as defined by the key kubernetes.io/arch, must be “amd64”.

  5. specifies the capacity-type of the node, as defined by the key karpenter.sh/capacity-type, must be “on-demand”.

  6. specifies the instance-type of the node, as defined by the key node.kubernetes.io/instance-type, must not be “m5a.16xlarge” or “m5a.24xlarge”. The NotIn operator means the actual value must not be in the provided list.

In summary, these requirements define the characteristics that nodes must have or not have to be included in the node pool.

Important

At this moment we recommend excluding the 7th generation of instances by default, as that generation is priced higher than the other generations!
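In the requirements list, such an exclusion can be expressed with a NotIn entry — a sketch to combine with your other requirements:

```yaml
        requirements:
          - key: karpenter.k8s.aws/instance-generation
            operator: NotIn
            values: ["7"]
```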

How to deal with node management

Once Karpenter detects a change in its NodePool(s) it will automatically take action to reach that desired state. Some examples:

  • A new AMI is published: Karpenter will take action and rotate all nodes to the new AMI version.
  • A change in the requirements is published: Karpenter will take action so all nodes match with the requirements.

Understanding Disruptions

Karpenter disrupts nodes for various reasons. Understanding the difference between voluntary and involuntary disruptions is critical to configuring your workloads correctly.

Involuntary Disruptions (Cannot Be Prevented)

These happen regardless of any Karpenter, PDB, or budget configuration:

1. Spot Interruptions

  • AWS reclaims the spot instance with a 2-minute warning
  • Cannot be prevented or delayed — this is the spot contract
  • Mitigation: maximum instance type diversity (15+ types); PDBs ensure replacement capacity before the old pod terminates

2. Node Health Issues

  • Node becomes NotReady, hardware failure, kubelet crashes
  • Cannot be prevented or delayed
  • Mitigation: multi-zone deployments, proper resource requests to prevent node resource exhaustion

3. Instance Termination Events

  • AWS-initiated maintenance, retirement, or stop
  • Cannot be prevented or delayed — AWS controls timing

Voluntary Disruptions (Karpenter-Controlled)

These are initiated by Karpenter and have configurable levels of control:

1. Consolidation — Empty Nodes

Removes nodes with no running pods after consolidateAfter duration.

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m

Can be prevented via: disruption budgets, keeping a pod on the node, or the karpenter.sh/do-not-disrupt: "true" annotation.

Respects PDBs | Respects disruption budgets
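The do-not-disrupt annotation goes on the pod (or the pod template of a Deployment), not on the node. A minimal sketch (abridged — selector and containers omitted; the Deployment name is hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app       # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Blocks voluntary Karpenter disruptions for these pods
        karpenter.sh/do-not-disrupt: "true"
```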

2. Consolidation — Underutilized Nodes

Actively packs pods onto fewer nodes when utilization falls below ~50%.

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 1m

Can be prevented via: consolidationPolicy: WhenEmpty, disruption budgets with reasons: ["Underutilized"], or karpenter.sh/do-not-disrupt: "true" on pods.

Respects PDBs | Respects disruption budgets

3. Drift — Configuration Changes

Not all configuration changes cause drift. Here’s the distinction:

Changes that CAUSE drift (nodes become “drifted”):

  • Instance type removed from NodePool requirements
  • Changes to NodePool taints or labels
  • Changing subnetSelectorTerms or securityGroupSelectorTerms
  • AMI changes (e.g., updating amiSelectorTerms)
  • Block device mapping changes
  • UserData changes in EC2NodeClass

Changes that DO NOT cause drift:

  • Adding instance types to requirements (existing nodes are still valid)
  • Expanding instance family list
  • Changing NodePool limits
  • Changing consolidationPolicy or consolidateAfter
  • Changing disruption budgets

Warning

Disruption budgets can delay drift but cannot prevent it — drifted nodes will eventually be replaced.

Respects PDBs | Respects disruption budgets (partially — timing only)

4. Expiration — Node Age

Nodes are removed after reaching expireAfter. This is intentional: it ensures nodes are always running the latest patched AMIs and reduces attack surface.

disruption:
  expireAfter: 168h  # 7 days
  # expireAfter: Never  # Disables expiration — not recommended

Expired nodes are made unschedulable and drained. Misconfigured PDBs may delay this indefinitely until terminationGracePeriod is exceeded, at which point the node is forcefully terminated.

Respects PDBs | Respects disruption budgets (timing only)

We set our defaults to expire the node after 30 days.
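In recent Karpenter versions the force-termination deadline mentioned above is configured on the NodePool itself. A sketch of the upstream spec (field placement varies between Karpenter versions, so check the docs for yours):

```yaml
spec:
  template:
    spec:
      # Maximum time a draining node may be blocked (e.g. by PDBs)
      # before Karpenter forcefully terminates it
      terminationGracePeriod: 48h
```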

NodePool Disruption Budgets — What They Do and Don’t Do

What disruption budgets DO:

  • Control the rate of disruptions (max N nodes at once)
  • Control the timing of disruptions (schedule-based)
  • Delay disruptions to appropriate maintenance windows
  • Prevent consolidation entirely (if desired)

What disruption budgets DO NOT do:

  • Prevent drift indefinitely
  • Prevent expiration indefinitely
  • Override involuntary disruptions (spot, health, AWS events)
  • Prevent security-critical updates

Disruption Type    Preventable?
Consolidation      Can prevent forever
Drift              Can delay, but node WILL be replaced
Expiration         Can delay, but node WILL be replaced
Spot / Health      Cannot delay at all
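A minimal budget illustrating the rate and timing controls above might look like this (a sketch; the values are examples, not recommendations):

```yaml
disruption:
  budgets:
    # Never disrupt more than 1 node at a time
    - nodes: "1"
    # Freeze voluntary disruptions during business hours
    - nodes: "0"
      schedule: "0 9 * * mon-fri"
      duration: 8h
```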

How PodDisruptionBudgets Interact with Karpenter

During consolidation:

  1. Karpenter: “I want to drain this node”
  2. Checks PDB: “Can I disrupt 1 pod?”
  3. If yes → drains pod, waits for it to reschedule
  4. If no → waits, tries again later

During drift/expiration: Same process as consolidation. Respects PDB during drain, but the disruption WILL happen (just delayed).

During spot interruption: 2-minute warning from AWS. Karpenter immediately starts draining. PDB is checked, but if drain takes >2 min, the node terminates anyway — this is why graceful shutdown must complete in <30s.
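To keep graceful shutdown comfortably inside that window, cap the pod’s termination grace period and keep any preStop hook short. A sketch (image and timings are illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 30  # hard cap on shutdown time
  containers:
    - name: app
      image: myapp:latest            # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Brief pause so load balancers can deregister the pod
            command: ["sh", "-c", "sleep 5"]
```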

PDB Best Practices:

# Good — allows consolidation
spec:
  maxUnavailable: 1       # for 3–10 replicas
  # or
  maxUnavailable: "20%"   # for larger deployments

# Good — maintains minimum capacity
spec:
  minAvailable: 2         # works for 5+ replicas

Warning

Avoid setting minAvailable equal to the replica count (e.g., minAvailable: 5 with exactly 5 replicas). This blocks all disruption and can prevent node updates indefinitely.

Cost Optimisation: The Two Pillars

When configuring Karpenter you’re constantly balancing:

Cost Optimisation          High Availability
Aggressive consolidation   Minimize disruptions
Tight bin-packing          Extra capacity
100% Spot instances        On-Demand fallback
Scale to zero              Warm pools
Short node lifetime        Long-lived nodes

Most organisations lean heavily towards high availability by limiting consolidation, leaving significant savings on the table. Here’s an example for a 50-node cluster at ~$20,000/month EC2 cost:

Configuration                                     Avg Utilisation   Monthly Cost   vs Optimised
WhenEmpty with long consolidateAfter              35%               $20,000        Baseline
WhenEmpty with short consolidateAfter             45%               $16,000        -20%
WhenEmptyOrUnderutilized with conservative PDBs   60%               $12,000        -40%
WhenEmptyOrUnderutilized with optimised PDBs      75%               $8,000         -60%

Why Organisations Limit Consolidation (and Why They Shouldn’t)

  1. “Consolidation will cause downtime” — With proper PDBs and health checks, consolidation should be zero downtime. Karpenter waits for pods to be ready before continuing.
  2. “There was an incident once during consolidation” — Usually caused by missing PDBs or bad health checks, not consolidation. Fix the root cause.
  3. “The apps are too fragile” — Apps must handle graceful pod restarts. Nodes can die from hardware failure, AWS maintenance, or platform rollouts — it is unavoidable.
  4. “It’s too much complexity” — The complexity is already there; you’re compensating with cost. Investment in proper workload design pays dividends.

Advanced Topics

These techniques address specific scenarios where the standard patterns need to be extended. Use them when you have a concrete reason, not as defaults.

Warm Pools (Pre-provisioned Capacity)

Problem: When a scheduled job fires or a traffic spike hits, Karpenter needs time to provision a new node (typically 60–120s). If your workload can’t wait, you need nodes already warm.

Solution: Deploy a low-priority placeholder Deployment that keeps nodes running. When a real workload needs the capacity, Kubernetes preempts the placeholder and the node is immediately available.

# 1. Define a low-priority class for the placeholder
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: warm-pool-priority
value: -10
globalDefault: false
description: "Low priority for warm pool placeholders"
---
# 2. Deploy the placeholder (uses pause container — near-zero overhead)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-pool-placeholder
spec:
  replicas: 3  # One placeholder per node you want to keep warm
  selector:
    matchLabels:
      app: warm-pool
  template:
    metadata:
      labels:
        app: warm-pool
    spec:
      priorityClassName: warm-pool-priority

      # Spread placeholder pods across separate nodes (preferred, not guaranteed)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: warm-pool
              topologyKey: kubernetes.io/hostname

      nodeSelector:
        karpenter.sh/nodepool: general

      tolerations:
      - key: role
        operator: Equal
        value: general
        effect: NoSchedule

      containers:
      - name: placeholder
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
        resources:
          requests:
            cpu: 1000m    # Size this to match the real workload's footprint
            memory: 1500Mi

Warning

Size the placeholder resource requests to match what the real workload needs. If the placeholder is too small, the real pod may not fit on the warmed node.

Static Capacity Mode

Warning

This is an experimental feature that is still in development and may contain bugs.

To activate static capacity mode for Karpenter and disable node autoscaling for a NodePool, configure:

karpenter:
  featureGates:
    staticCapacity: true # has to be enabled
  nodepools:
    nodepool-name:
      # ...rest of nodepool spec...
      replicas: 2  # Always maintains 2 nodes in this nodepool, no autoscaling
      limits:
        nodes: 4 # Maximum nodes during manual scaling/drift

Warning

When specified, the NodePool maintains a fixed number of nodes regardless of pod demand. Take into account that this disables Karpenter’s dynamic scaling for this NodePool! Once defined, you cannot change back to dynamic mode for this NodePool. In static mode you also need to specify spec.limits.nodes. IMPORTANT: this needs the featureGates.staticCapacity flag to be enabled in our main Karpenter config.

Check out the relevant Karpenter docs for more information.

Coordinated Disruption Budgets

Problem: You want different disruption behaviour depending on the time of day, day of week, and disruption reason (e.g. consolidation vs drift vs expiration).

Solution: Stack multiple budget entries with reasons to get precise control:

karpenter:
  nodepools:
    nodepool-name:
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 2m

        budgets:
          # 1. No consolidation during business hours
          - nodes: "0"
            schedule: "0 9 * * mon-fri"
            duration: 8h
            reasons: ["Underutilized", "Empty"]

          # 2. Allow drift/AMI updates at night (controlled rate)
          - nodes: "30%"
            schedule: "0 2 * * *"
            duration: 4h
            reasons: ["Drifted"]

          # 3. Aggressive consolidation over the weekend
          - nodes: "100%"
            schedule: "0 0 * * sat"
            duration: 48h
            reasons: ["Underutilized", "Empty"]

What each entry does:

  • Entry 1: Freezes all consolidation Monday–Friday 09:00–17:00 — zero voluntary disruptions during production hours
  • Entry 2: Allows up to 30% of nodes to be replaced for drift/AMI updates at 2am — keeps AMIs fresh without business impact
  • Entry 3: Fully opens consolidation on weekends so the cluster can repack tightly after a busy week

Warning

Budgets with reasons only match those specific disruption types. If you omit reasons, the budget applies to all voluntary disruptions.

Common Misunderstandings

“Disruption budgets prevent all disruptions” — They control timing and rate, not whether disruption happens.

“Adding instance types causes drift” — Only REMOVING instance types or changing other configs causes drift.

“PDBs protect against spot interruptions” — PDBs are respected during drain, but if drain takes >2 min the node terminates anyway.

“Setting expireAfter: Never means nodes never update” — Drift (AMI changes) can still replace nodes.

"karpenter.sh/do-not-disrupt prevents spot interruptions" — Only prevents voluntary Karpenter disruptions, not AWS-initiated events.

Troubleshooting

Pod Stuck in Pending

kubectl -n <namespace> describe pod <pod-name>

Scroll to the Events section at the bottom. Karpenter will have emitted an event explaining why the pod is unschedulable. Common causes:

1. NodePool limits reached

  • Symptom: maximum limit on nodepool reached
  • Fix: Increase the cpu or memory limits on the relevant nodepool in your cluster definition

2. Typo in nodeSelector or tolerations

  • Symptom: Pod never matches any nodepool
  • Fix: Verify the nodeSelector key/value matches the nodepool labels, and that tolerations match the nodepool taints
# Inspect a nodepool's labels and taints
kubectl get nodepool <nodepool-name> -o yaml

3. Karpenter cannot provision a node with the requirements

  • Symptom: Karpenter logs show no instance types available
  • Fix: Ensure you have sufficient instance type flexibility (especially for spot-only nodepools). Check Karpenter logs and default namespace events:
# Karpenter controller logs
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter

# Events in default namespace
kubectl get events -n default --sort-by='.lastTimestamp'

Check Instance Types Running in Your Cluster

kubectl get nodes -l karpenter.sh/initialized=true \
  -o jsonpath='{range .items[*]}{.metadata.labels.node\.kubernetes\.io/instance-type}{"\n"}{end}' | \
  sort | uniq -c

Use this to verify that spot diversity is healthy and that no single instance type dominates.

Check for Overprovisioned Nodes

kubectl top nodes

Nodes consistently below 30–40% CPU/memory utilisation are candidates for consolidation tuning. If you’re seeing persistent low utilization, check whether consolidation is being blocked by:

  • consolidationPolicy: WhenEmpty instead of WhenEmptyOrUnderutilized
  • Disruption budgets that are too restrictive
  • PDBs with minAvailable set equal to replica count
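To spot the last case quickly, this one-liner lists PDBs that currently allow zero disruptions (assumes jq is installed):

```shell
# PDBs with no disruptions allowed right now — these block consolidation
kubectl get pdb -A -o json | \
  jq -r '.items[] | select(.status.disruptionsAllowed == 0) |
  "\(.metadata.namespace)/\(.metadata.name)"'
```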

Review Resource Requests Across Pods

# List all pods with their CPU and memory requests
kubectl get pods -A -o json | \
  jq -r '.items[] | .spec.containers[] |
  "\(.name): cpu=\(.resources.requests.cpu) mem=\(.resources.requests.memory)"'

Look for:

  • Pods with no requests set at all (they’ll be scheduled as BestEffort and are first to be evicted)
  • Severely underprovisioned requests (leads to OOMKills during bin-packing)
  • Severely overprovisioned requests (wastes node capacity, prevents consolidation)

Note

Rule of thumb: set requests at P90 actual usage + 20% buffer. Use kubectl top pod over 1–2 weeks to gather baseline data.

Grafana Dashboard

Skyscrapers provisions the upstream Karpenter Grafana dashboard, backed by metrics scraped via Prometheus. It gives you:

  • % of Spot vs On-Demand capacity per nodepool
  • Reconciliation actions (consolidation, drift, expiration) with reasons
  • Node churn over time
  • Instance type distribution

The dashboard can be scoped to a specific nodepool using the nodepool selector at the top. Use it to validate that consolidation is running as expected and to catch any unexpected disruption spikes.

FailedScheduling pods with PV claim

Some older PVs may have no nodeAffinity rules set, so Karpenter doesn’t know which zone it needs to map that volume to when scheduling the pod. To fix this, add the following to your PV:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/region
          operator: In
          values:
          - <your-region>
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - <your-region><your-pv-zone-id>

Important

Using proper topology constraints is always important, but even more so with Karpenter. Our CSI driver should do this automatically for new volumes, but older ones might not have it.

Karpenter Controller Logs

Karpenter itself can be debugged by looking at the logs of the containers. This can either be done in CloudWatch logs (in the /aws/eks/<cluster_name>/pods group) or through the CLI:

kubectl -n kube-system logs -f deployment/karpenter

When nodes act up, node debugging can be done with SSM. Nodes can be removed by running kubectl delete node ip-x-x-x-x.eu-west-1.compute.internal.

More in depth documentation on how to troubleshoot Karpenter can be found here.

Useful One-Liners

# List all nodepools and their current node counts
kubectl get nodepools

# Check disruption budget status
kubectl get nodepools -o json | jq '.items[] | {name: .metadata.name, budgets: .spec.disruption.budgets}'

# Find nodes that are expired or drifted
kubectl get nodes -l karpenter.sh/initialized=true -o json | \
  jq '.items[] | select(.metadata.annotations["karpenter.sh/nodepool"] != null) |
  {name: .metadata.name, disruption: .metadata.annotations}'

# Watch Karpenter decisions in real time
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter | grep -E '(launch|terminate|consolidat|drift|expire)'
