Karpenter
Introduction
Karpenter is an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods. Karpenter evaluates the aggregate resource requirements of the pending pods and chooses the optimal instance type to run them. It automatically scales in by terminating instances that no longer run any non-daemonset pods, reducing waste. It also supports a consolidation feature, which actively moves pods around and deletes or replaces nodes with cheaper alternatives to reduce cluster cost.
Without Karpenter, Kubernetes users relied primarily on Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler (CAS) to dynamically adjust the compute capacity of their clusters. The downside is that each node group is locked to a single instance type across all of its nodes, so nodes are often not optimally used by the pods running on them.
With Karpenter, the cluster's needs determine (within the constraints you put up) which instance type best accommodates the pending pods, and Karpenter launches exactly that node.
Pros and cons
Pros
Efficiency: Karpenter is designed to efficiently pack pods onto nodes to minimize costs. It does this by considering the actual resource requirements of the pods, rather than just the requests.
Scalability: Karpenter can rapidly scale up and down in response to workload changes. This makes it suitable for environments with highly variable workloads.
Simplicity: Karpenter aims to be simpler to set up and manage than some other Kubernetes autoscalers. It integrates directly with Kubernetes Scheduling and doesn’t require a separate cluster autoscaler.
Cons
Maturity: As of now, Karpenter is a relatively new project and may not have the same level of maturity or feature completeness as some other autoscalers.
Potential for Overspending: While Karpenter can help reduce costs by packing pods efficiently, it can also lead to overspending if not configured correctly. For example, if it scales up too aggressively or if the pod resource requirements are set too high.
Note
Remember to evaluate Karpenter in the context of your specific needs and environment before deciding to use it.
How it works

Karpenter observes the aggregate resource requests of unscheduled pods and makes decisions to launch and terminate nodes to minimize scheduling latencies and infrastructure cost.
Infrastructure
It is not recommended to run Karpenter on a node that is managed by Karpenter. Therefore we opted to deploy Karpenter on Fargate nodes.
Karpenter launches EC2 instances directly through the AWS API, so you won't see an Auto Scaling group in AWS anymore for the NodePool that is provisioned for your instances. If you want a visual representation of the nodes and their usage, we recommend using the eks-node-viewer command from AWS.
Pre-requisites
Due to its nature, Karpenter can be quite aggressive in killing pods to reach its desired cluster state. We therefore need to set some safeguards to make sure the workloads running on the cluster are not negatively affected.
We recommend following these best practices. This is a general recommendation, but it is especially important when using Karpenter.
- Use topologySpreadConstraints in your deployments
- Use PodDisruptionBudget in your deployments
- Have nodeAffinity rules set on PVs
- Have do-not-disrupt annotations set on pods that may not be evicted
- Have requests=limits on Memory resources configured
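A minimal sketch of what several of these recommendations can look like in practice (all names and values are illustrative, not prescribed defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      # For pods that must never be voluntarily evicted, also add the
      # annotation karpenter.sh/do-not-disrupt: "true"
    spec:
      # Spread replicas across zones so one node disruption never
      # takes out all replicas at once
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: myapp:latest # placeholder image
          resources:
            requests:
              memory: 512Mi
            limits:
              memory: 512Mi # requests=limits on memory
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  maxUnavailable: 1 # never drain more than one replica at a time
  selector:
    matchLabels:
      app: my-app
```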
Usage
Concepts
Configuration
A Skyscrapers engineer can help you enable Karpenter, or you can update your cluster definition file yourself through a pull request:
```yaml
karpenter:
  node_pools:
    default:
      node_class:
        extra_securitygroup_ids: []
        gpu_enabled: false
        public: false
        tags:
          team: myteam
        volumeSize: "100Gi"
      annotations:
        role: foo
      labels:
        role: foo
      limits:
        cpu: 111
        memory: "100Gi"
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: ["5", "6"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - effect: NoSchedule
          key: role
          value: foo
```

node_pools is a dictionary that defines the configuration for different pools of nodes. In this case, there is one node pool named default.
Under default, there are several properties that define the configuration of the nodes in this pool:
- node_class (optional): defines the properties of the nodes, such as whether they have a GPU (gpu_enabled), whether they are public (public), their volume size (volumeSize), etc.
- annotations (optional): key-value pairs that can be used to attach arbitrary non-identifying metadata to nodes.
- labels (required): key-value pairs that can be used to select nodes for scheduling pods.
- limits (required): define the maximum amount of CPU and memory that can be used by the nodes. This is important to set to avoid unexpected costs!
- requirements (required): define the conditions that must be met for a node to be included in the node pool. See also the requirements section below.
- taints (optional): used to repel pods from being scheduled on the nodes. In this case, pods that do not tolerate the taint role: foo will not be scheduled on the nodes.
Defining requirements
```yaml
requirements:
  - key: "karpenter.k8s.aws/instance-category"
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["4"]
  - key: "topology.kubernetes.io/zone"
    operator: In
    values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  - key: "kubernetes.io/arch"
    operator: In
    values: ["amd64"]
  - key: "karpenter.sh/capacity-type"
    operator: In
    values: ["on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: NotIn
    values: ["m5a.16xlarge", "m5a.24xlarge"]
```

This YAML snippet is part of a Karpenter configuration file. It defines a list of requirements that nodes must meet to be included in a specific node pool. Each requirement is defined by a key, an operator, and a list of values. All requirements in this list are optional and should be set to limit Karpenter's options.
The above example explained:
- The instance-category of the node, as defined by the key karpenter.k8s.aws/instance-category, must be either "c", "m", or "r". The In operator means the actual value must be in the provided list.
- The instance-generation of the node, as defined by the key karpenter.k8s.aws/instance-generation, must be greater than "4". The Gt operator stands for "greater than".
- The zone of the node, as defined by the key topology.kubernetes.io/zone, must be either "eu-west-1a", "eu-west-1b", or "eu-west-1c".
- The architecture of the node, as defined by the key kubernetes.io/arch, must be "amd64".
- The capacity-type of the node, as defined by the key karpenter.sh/capacity-type, must be "on-demand".
- The instance-type of the node, as defined by the key node.kubernetes.io/instance-type, must not be "m5a.16xlarge" or "m5a.24xlarge". The NotIn operator means the actual value must not be in the provided list.
In summary, these requirements define the characteristics that nodes must have or not have to be included in the node pool.
Important
At this moment we recommend excluding the 7th generation of instances by default, as that generation is priced higher than the other generations!
How to deal with node management
Once Karpenter detects a change in its NodePool(s) it will automatically take action to reach that desired state. Some examples:
- A new AMI is published: Karpenter will take action and rotate all nodes to the new AMI version.
- A change in the requirements is published: Karpenter will take action so all nodes match with the requirements.
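If you need to control when AMI rotation happens, rather than letting every new AMI release trigger a rollout, one option is to pin the AMI explicitly. This is a sketch against the upstream Karpenter v1 `EC2NodeClass` API; in our setup such changes would normally go through the cluster definition file, and the name and AMI ID below are placeholders:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: pinned-ami # hypothetical name
spec:
  # Pinning to a specific AMI ID means new AMI releases no longer
  # drift existing nodes; you roll forward by updating this ID.
  amiSelectorTerms:
    - id: ami-0123456789abcdef0 # placeholder AMI ID
```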
Understanding Disruptions
Karpenter disrupts nodes for various reasons. Understanding the difference between voluntary and involuntary disruptions is critical to configuring your workloads correctly.
Involuntary Disruptions (Cannot Be Prevented)
These happen regardless of any Karpenter, PDB, or budget configuration:
1. Spot Interruptions
- AWS reclaims the spot instance with a 2-minute warning
- Cannot be prevented or delayed — this is the spot contract
- Mitigation: maximum instance type diversity (15+ types); PDBs ensure replacement capacity before the old pod terminates
2. Node Health Issues
- Node becomes NotReady, hardware failure, kubelet crashes
- Cannot be prevented or delayed
- Mitigation: multi-zone deployments, proper resource requests to prevent node resource exhaustion
3. Instance Termination Events
- AWS-initiated maintenance, retirement, or stop
- Cannot be prevented or delayed — AWS controls timing
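The spot mitigation above (maximum instance type diversity) translates into NodePool requirements that deliberately leave many instance types in play. A sketch, reusing the requirement keys shown earlier in this document:

```yaml
requirements:
  # Keep categories and generations broad so Karpenter can fall back
  # to whichever spot pools still have capacity.
  - key: "karpenter.k8s.aws/instance-category"
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["4"]
  - key: "karpenter.sh/capacity-type"
    operator: In
    values: ["spot", "on-demand"] # on-demand fallback when spot is scarce
```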
Voluntary Disruptions (Karpenter-Controlled)
These are initiated by Karpenter and have configurable levels of control:
1. Consolidation — Empty Nodes
Removes nodes with no running pods after consolidateAfter duration.
```yaml
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m
```

Can be prevented via: disruption budgets, keeping a non-daemonset pod on the node, or the karpenter.sh/do-not-disrupt: "true" annotation.
Respects PDBs | Respects disruption budgets
2. Consolidation — Underutilized Nodes
Actively packs pods onto fewer nodes when utilization falls below ~50%.
```yaml
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 1m
```

Can be prevented via: consolidationPolicy: WhenEmpty, disruption budgets with reasons: ["Underutilized"], or karpenter.sh/do-not-disrupt: "true" on pods.
Respects PDBs | Respects disruption budgets
3. Drift — Configuration Changes
Not all configuration changes cause drift. Here’s the distinction:
Changes that CAUSE drift (nodes become “drifted”):
- Instance type removed from NodePool requirements
- Changes to NodePool taints or labels
- Changing subnetSelectorTerms or securityGroupSelectorTerms
- AMI changes (e.g., updating amiSelectorTerms)
- Block device mapping changes
- UserData changes in EC2NodeClass
Changes that DO NOT cause drift:
- Adding instance types to requirements (existing nodes are still valid)
- Expanding instance family list
- Changing NodePool limits
- Changing consolidationPolicy or consolidateAfter
- Changing disruption budgets
Warning
Disruption budgets can delay drift but cannot prevent it — drifted nodes will eventually be replaced.
Respects PDBs | Respects disruption budgets (partially — timing only)
4. Expiration — Node Age
Nodes are removed after reaching expireAfter. This is intentional: it ensures nodes are always running the latest patched AMIs and reduces attack surface.
```yaml
disruption:
  expireAfter: 168h # 7 days
  # expireAfter: Never # Disables expiration — not recommended
```

Expired nodes are made unschedulable and drained. Misconfigured PDBs may delay this indefinitely until terminationGracePeriod is exceeded, at which point the node is forcefully terminated.
Respects PDBs | Respects disruption budgets (timing only)
We set our defaults to expire the node after 30 days.
NodePool Disruption Budgets — What They Do and Don’t Do
What disruption budgets DO:
- Control the rate of disruptions (max N nodes at once)
- Control the timing of disruptions (schedule-based)
- Delay disruptions to appropriate maintenance windows
- Prevent consolidation entirely (if desired)
What disruption budgets DO NOT do:
- Prevent drift indefinitely
- Prevent expiration indefinitely
- Override involuntary disruptions (spot, health, AWS events)
- Prevent security-critical updates
| Disruption Type | Preventable? |
|---|---|
| Consolidation | Can prevent forever |
| Drift | Can delay, but node WILL be replaced |
| Expiration | Can delay, but node WILL be replaced |
| Spot / Health | Cannot delay at all |
How PodDisruptionBudgets Interact with Karpenter
During consolidation:
- Karpenter: “I want to drain this node”
- Checks PDB: “Can I disrupt 1 pod?”
- If yes → drains pod, waits for it to reschedule
- If no → waits, tries again later
During drift/expiration: Same process as consolidation. Respects PDB during drain, but the disruption WILL happen (just delayed).
During spot interruption: 2-minute warning from AWS. Karpenter immediately starts draining. PDB is checked, but if drain takes >2 min, the node terminates anyway — this is why graceful shutdown must complete in <30s.
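A minimal sketch of a pod template tuned to shut down well within the 2-minute spot window (the grace period and preStop sleep are illustrative assumptions, not prescribed defaults):

```yaml
spec:
  terminationGracePeriodSeconds: 30 # keep total shutdown under ~30s
  containers:
    - name: app
      image: myapp:latest # placeholder image
      lifecycle:
        preStop:
          exec:
            # Give the load balancer a few seconds to stop routing
            # traffic to this pod before SIGTERM reaches the process.
            command: ["sleep", "5"]
```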
PDB Best Practices:
```yaml
# Good — allows consolidation
spec:
  maxUnavailable: 1       # for 3–10 replicas
  # or
  # maxUnavailable: "20%" # for larger deployments

# Good — maintains minimum capacity
spec:
  minAvailable: 2         # works for 5+ replicas
```

Warning
Avoid setting minAvailable equal to the replica count (e.g., minAvailable: 5 with exactly 5 replicas). This blocks all disruption and can prevent node updates indefinitely.
Cost Optimisation: The Two Pillars
When configuring Karpenter you’re constantly balancing:
| Cost Optimisation | High Availability |
|---|---|
| Aggressive consolidation | Minimize disruptions |
| Tight bin-packing | Extra capacity |
| 100% Spot instances | On-Demand fallback |
| Scale to zero | Warm pools |
| Short node lifetime | Long-lived nodes |
Most organisations lean heavily towards high availability by limiting consolidation, leaving significant savings on the table. Here’s an example for a 50-node cluster at ~$20,000/month EC2 cost:
| Configuration | Avg Utilisation | Monthly Cost | vs Optimised |
|---|---|---|---|
| WhenEmpty with long consolidateAfter | 35% | $20,000 | Baseline |
| WhenEmpty with short consolidateAfter | 45% | $16,000 | -20% |
| WhenEmptyOrUnderutilized with conservative PDBs | 60% | $12,000 | -40% |
| WhenEmptyOrUnderutilized with optimised PDBs | 75% | $8,000 | -60% |
Why Organisations Limit Consolidation (and Why They Shouldn’t)
- “Consolidation will cause downtime” — With proper PDBs and health checks, consolidation should be zero downtime. Karpenter waits for pods to be ready before continuing.
- “There was an incident once during consolidation” — Usually caused by missing PDBs or bad health checks, not consolidation. Fix the root cause.
- “The apps are too fragile” — Apps must handle graceful pod restarts. Nodes can die from hardware failure, AWS maintenance, or platform rollouts — it is unavoidable.
- “It’s too much complexity” — The complexity is already there; you’re compensating with cost. Investment in proper workload design pays dividends.
Advanced Topics
These techniques address specific scenarios where the standard patterns need to be extended. Use them when you have a concrete reason, not as defaults.
Warm Pools (Pre-provisioned Capacity)
Problem: When a scheduled job fires or a traffic spike hits, Karpenter needs time to provision a new node (typically 60–120s). If your workload can’t wait, you need nodes already warm.
Solution: Deploy a low-priority placeholder Deployment that keeps nodes running. When a real workload needs the capacity, Kubernetes preempts the placeholder and the node is immediately available.
```yaml
# 1. Define a low-priority class for the placeholder
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: warm-pool-priority
value: -10
globalDefault: false
description: "Low priority for warm pool placeholders"
---
# 2. Deploy the placeholder (uses pause container — near-zero overhead)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-pool-placeholder
spec:
  replicas: 3 # One placeholder per node you want to keep warm
  selector:
    matchLabels:
      app: warm-pool
  template:
    metadata:
      labels:
        app: warm-pool
    spec:
      priorityClassName: warm-pool-priority
      # Force each placeholder pod onto a separate node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: warm-pool
                topologyKey: kubernetes.io/hostname
      nodeSelector:
        karpenter.sh/nodepool: general
      tolerations:
        - key: role
          operator: Equal
          value: general
          effect: NoSchedule
      containers:
        - name: placeholder
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1000m # Size this to match the real workload's footprint
              memory: 1500Mi
```

Warning
Size the placeholder resource requests to match what the real workload needs. If the placeholder is too small, the real pod may not fit on the warmed node.
Static Capacity Mode
Warning
This is an experimental feature that is still in development and may contain bugs.
If you want to activate static capacity mode for Karpenter and disable node autoscaling, configure it as follows:
```yaml
karpenter:
  featureGates:
    staticCapacity: true # has to be enabled
  nodepools:
    nodepool-name:
      # ...rest of nodepool spec...
      replicas: 2 # Always maintains 2 nodes in this nodepool, no autoscaling
      limits:
        nodes: 4 # Maximum nodes during manual scaling/drift
```

Warning
When specified, the NodePool maintains a fixed number of nodes regardless of pod demand. Take into account that this disables Karpenter's dynamic scaling for this NodePool! Once defined, you cannot change back to dynamic mode for this NodePool. In static mode you also need to specify spec.limits.nodes. IMPORTANT: this requires the featureGates.staticCapacity flag to be enabled in our main Karpenter config.
Check out the relevant Karpenter docs for more information.
Coordinated Disruption Budgets
Problem: You want different disruption behaviour depending on the time of day, day of week, and disruption reason (e.g. consolidation vs drift vs expiration).
Solution: Stack multiple budget entries with reasons to get precise control:
```yaml
karpenter:
  nodepools:
    nodepool-name:
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 2m
        budgets:
          # 1. No consolidation during business hours
          - nodes: "0"
            schedule: "0 9 * * mon-fri"
            duration: 8h
            reasons: ["Underutilized", "Empty"]
          # 2. Allow drift/AMI updates at night (controlled rate)
          - nodes: "30%"
            schedule: "0 2 * * *"
            duration: 4h
            reasons: ["Drifted"]
          # 3. Aggressive consolidation over the weekend
          - nodes: "100%"
            schedule: "0 0 * * sat"
            duration: 48h
            reasons: ["Underutilized", "Empty"]
```

What each entry does:
- Entry 1: Freezes all consolidation Monday–Friday 09:00–17:00 — zero voluntary disruptions during production hours
- Entry 2: Allows up to 30% of nodes to be replaced for drift/AMI updates at 2am — keeps AMIs fresh without business impact
- Entry 3: Fully opens consolidation on weekends so the cluster can repack tightly after a busy week
Warning
Budgets with reasons only match those specific disruption types. If you omit reasons, the budget applies to all voluntary disruptions.
Common Misunderstandings
“Disruption budgets prevent all disruptions” — They control timing and rate, not whether disruption happens.
“Adding instance types causes drift” — Only REMOVING instance types or changing other configs causes drift.
“PDBs protect against spot interruptions” — PDBs are respected during drain, but if drain takes >2 min the node terminates anyway.
“Setting expireAfter: Never means nodes never update” — Drift (AMI changes) can still replace nodes.
"karpenter.sh/do-not-disrupt prevents spot interruptions" — Only prevents voluntary Karpenter disruptions, not AWS-initiated events.
Troubleshooting
Pod Stuck in Pending
```shell
kubectl -n <namespace> describe pod <pod-name>
```

Scroll to the Events section at the bottom. Karpenter will have emitted an event explaining why the pod is unschedulable. Common causes:
1. NodePool limits reached
- Symptom: maximum limit on nodepool reached
- Fix: Increase the cpu or memory limits on the relevant nodepool in your cluster definition
2. Typo in nodeSelector or tolerations
- Symptom: Pod never matches any nodepool
- Fix: Verify the nodeSelector key/value matches the nodepool labels, and that tolerations match the nodepool taints

```shell
# Inspect a nodepool's labels and taints
kubectl get nodepool <nodepool-name> -o yaml
```

3. Karpenter cannot provision a node with the requirements
- Symptom: Karpenter logs show no instance types available
- Fix: Ensure you have sufficient instance type flexibility (especially for spot-only nodepools). Check Karpenter logs and default namespace events:
```shell
# Karpenter controller logs
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter

# Events in default namespace
kubectl get events -n default --sort-by='.lastTimestamp'
```

Check Instance Types Running in Your Cluster
```shell
kubectl get nodes -l karpenter.sh/initialized=true \
  -o jsonpath='{range .items[*]}{.metadata.labels.node\.kubernetes\.io/instance-type}{"\n"}{end}' | \
  sort | uniq -c
```

Use this to verify that spot diversity is healthy and that no single instance type dominates.
Check for Overprovisioned Nodes
```shell
kubectl top nodes
```

Nodes consistently below 30–40% CPU/memory utilisation are candidates for consolidation tuning. If you're seeing persistent low utilisation, check whether consolidation is being blocked by:
- consolidationPolicy: WhenEmpty instead of WhenEmptyOrUnderutilized
- Disruption budgets that are too restrictive
- PDBs with minAvailable set equal to replica count
Review Resource Requests Across Pods
```shell
# List all pods with their CPU and memory requests
kubectl get pods -A -o json | \
  jq -r '.items[] | .spec.containers[] |
    "\(.name): cpu=\(.resources.requests.cpu) mem=\(.resources.requests.memory)"'
```

Look for:
- Pods with no requests set at all (they'll be scheduled as BestEffort and are first to be evicted)
- Severely underprovisioned requests (leads to OOMKills during bin-packing)
- Severely overprovisioned requests (wastes node capacity, prevents consolidation)
Note
Rule of thumb: set requests at P90 actual usage + 20% buffer. Use kubectl top pod over 1–2 weeks to gather baseline data.
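For example, if kubectl top pod shows a container's P90 usage around 400m CPU and 800Mi memory, the rule of thumb above would translate into roughly the following (illustrative numbers only):

```yaml
resources:
  requests:
    cpu: 500m   # ~P90 (400m) + 20% buffer
    memory: 1Gi # ~P90 (800Mi) + 20% buffer
  limits:
    memory: 1Gi # requests=limits on memory, per the best practice above
```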
Grafana Dashboard
Skyscrapers provisions the upstream Karpenter Grafana dashboard, fed by metrics scraped via Prometheus. It gives you:
- % of Spot vs On-Demand capacity per nodepool
- Reconciliation actions (consolidation, drift, expiration) with reasons
- Node churn over time
- Instance type distribution
The dashboard can be scoped to a specific nodepool using the nodepool selector at the top. Use it to validate that consolidation is running as expected and to catch any unexpected disruption spikes.
FailedScheduling pods with PV claim
On some old PVs, there may be no nodeAffinity rules set. Karpenter therefore doesn't know which zone it needs to map that volume to in order to schedule the pod. To fix this, add the following to your PV:

```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
              - <your-region>
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - <your-region><your-pv-zone-id>
```

Important
Using proper topology constraints is always important, but even more so with Karpenter. Our CSI driver should do this automatically for new volumes, but older ones might not have it.
Karpenter Controller Logs
Karpenter itself can be debugged by looking at the logs of its containers. This can be done either in CloudWatch Logs (in the /aws/eks/<cluster_name>/pods group) or through the CLI:

```shell
kubectl -n kube-system logs -f deployment/karpenter
```

When nodes act up, node debugging can be done with SSM. Nodes can be removed by running kubectl delete node ip-x-x-x-x.eu-west-1.compute.internal.
More in depth documentation on how to troubleshoot Karpenter can be found here.
Useful One-Liners
```shell
# List all nodepools and their current node counts
kubectl get nodepools

# Check disruption budget status
kubectl get nodepools -o json | jq '.items[] | {name: .metadata.name, budgets: .spec.disruption.budgets}'

# Find nodes that are expired or drifted
kubectl get nodes -l karpenter.sh/initialized=true -o json | \
  jq '.items[] | select(.metadata.annotations["karpenter.sh/nodepool"] != null) |
    {name: .metadata.name, disruption: .metadata.annotations}'

# Watch Karpenter decisions in real time
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter | grep -E '(launch|terminate|consolidat|drift|expire)'
```