Karpenter vs Cluster Autoscaler: Kubernetes Node Scaling Compared
Cluster Autoscaler scales pre-defined node groups. Karpenter provisions optimal instances in real time. Compare scaling speed, cost savings, Spot handling, multi-arch support, and get a step-by-step EKS migration guide.

Kubernetes Node Scaling Has a New Default
For years, Cluster Autoscaler was the only viable option for automatically scaling Kubernetes nodes. It worked -- but it worked slowly, rigidly, and with a frustrating dependency on pre-configured node groups. Karpenter, originally built by AWS and now a CNCF incubating project, takes a fundamentally different approach: it provisions compute directly from the cloud provider's instance catalog in real time, skipping node groups entirely.
I've migrated three production EKS clusters from Cluster Autoscaler to Karpenter over the past two years. The difference in scaling speed, cost efficiency, and operational overhead is significant enough that I consider Karpenter the default choice for any new Kubernetes deployment on AWS. But Cluster Autoscaler still has its place -- particularly on GKE and AKS, where Karpenter support is either early-stage or nonexistent.
This guide covers how both tools work under the hood, benchmarks their scaling performance, compares cost optimization strategies, and walks through migrating from Cluster Autoscaler to Karpenter on EKS.
What Is Kubernetes Node Autoscaling?
Definition: Kubernetes node autoscaling is the process of automatically adding or removing worker nodes from a cluster based on workload demand. When pods cannot be scheduled due to insufficient resources (CPU, memory, GPUs), the autoscaler provisions new nodes. When nodes are underutilized, it drains and terminates them. Node autoscaling operates independently from pod-level autoscaling (HPA/VPA), which adjusts the number or size of pods within existing nodes.
Both Cluster Autoscaler and Karpenter solve this problem, but they differ in architecture, speed, and flexibility. Understanding the mechanics of each is critical to choosing the right tool.
How Cluster Autoscaler Works
Cluster Autoscaler (CA) is a Kubernetes SIG project that has been the standard node autoscaler since roughly 2017. Its model revolves around node groups -- pre-defined pools of identically configured machines, typically backed by cloud provider constructs like AWS Auto Scaling Groups (ASGs), GCE Managed Instance Groups (MIGs), or Azure VM Scale Sets (VMSS).
The scaling loop works like this:
- Watch for unschedulable pods -- CA polls the Kubernetes API server every 10 seconds (configurable via `--scan-interval`) for pods in the `Pending` state with scheduling failures.
- Simulate scheduling -- For each node group, CA simulates whether the pending pods could be placed on a new node of that type. It picks the node group that satisfies the most pending pods.
- Increase the ASG desired count -- CA calls the cloud provider API to increment the node group's desired capacity. The cloud provider then launches an instance.
- Wait for the node to join -- The new instance boots, runs its bootstrap script, joins the cluster, and becomes `Ready`. CA has no control over this process.
For scale-down, CA identifies nodes with utilization below a threshold (default 50%) for a sustained period (default 10 minutes), cordons them, drains pods, and decrements the ASG.
Cluster Autoscaler Configuration
```yaml
# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=10m
            - --scan-interval=10s
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
```
Notice the --node-group-auto-discovery flag. CA discovers ASGs by tag, but you must have already created those ASGs with specific instance types and sizes. If your workload needs a c7g.2xlarge (ARM, compute-optimized) but your ASGs only contain m6i.xlarge (x86, general purpose), CA cannot help. You would need to create a new ASG, tag it, and wait for CA to discover it.
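For completeness, here is roughly what registering a new ASG with CA looks like once the group exists. The ASG name is hypothetical; the two tag keys must match the `--node-group-auto-discovery` filter in the deployment above.

```shell
# Tag an existing ASG (hypothetical name) so Cluster Autoscaler's
# auto-discovery picks it up on its next scan.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-cluster-arm-workers,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-cluster-arm-workers,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"
```

Every new instance shape means repeating this dance: create the ASG or launch template, tag it, and wait for discovery.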
How Karpenter Works
Karpenter takes a group-less approach. Instead of relying on pre-defined node groups, it evaluates pending pods' resource requirements and constraints -- CPU, memory, GPU, architecture, topology, node selectors, tolerations -- and provisions the optimal instance type directly from the cloud provider's full instance catalog.
The scaling loop:
- Watch for unschedulable pods -- Karpenter uses informers (not polling) to react to scheduling failures in near real time.
- Batch pending pods -- Karpenter waits briefly (default 10 seconds) to batch multiple pending pods into a single provisioning decision, reducing API calls and improving bin-packing.
- Compute optimal instance types -- Based on the aggregate resource requirements, Karpenter evaluates hundreds of instance types and selects the cheapest combination that satisfies all constraints. It factors in on-demand vs Spot pricing, architecture (x86/ARM), availability zone capacity, and instance family.
- Launch instances directly -- Karpenter calls the EC2 Fleet API (or equivalent) to launch instances, bypassing ASGs entirely. The instance boots with a pre-configured AMI and joins the cluster.
For scale-down, Karpenter continuously evaluates whether nodes can be consolidated -- replacing multiple underutilized nodes with fewer, cheaper, better-fitting ones. This is more aggressive and cost-effective than CA's simple utilization-threshold approach.
Karpenter NodePool and EC2NodeClass Configuration
```yaml
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # Replace nodes every 30 days
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
```
Compare this to the CA setup. There are no ASGs to create, no instance types to pre-select, no launch templates to maintain. Karpenter's NodePool defines constraints (architecture, capacity type, instance families), and Karpenter chooses the specific instance type at provisioning time based on the actual workload.
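The workload side needs nothing Karpenter-specific: ordinary resource requests and scheduling constraints are the input to the provisioning decision. A hypothetical sketch of a Deployment that would drive Karpenter toward cheap Spot capacity (names and image are illustrative):

```yaml
# Hypothetical workload: Karpenter reads the resource requests and the
# nodeSelector below at provisioning time and launches the cheapest
# matching instance types -- no node group configuration required.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot # only run on Spot capacity
      containers:
        - name: api
          image: my-registry/api:latest # must be multi-arch if arm64 is allowed
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

Because the NodePool allows both architectures, Karpenter is free to satisfy these ten pods with ARM Spot instances if they price out cheapest.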
Scaling Speed: Karpenter vs Cluster Autoscaler
This is where Karpenter's architectural advantage shows most clearly. I benchmarked both tools on EKS (Kubernetes 1.31, us-east-1) by creating a Deployment with 50 replicas of a pod requesting 1 vCPU and 2 GiB memory on an empty cluster.
| Metric | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Time to first node Ready | 45-55 seconds | 120-180 seconds |
| Time to all pods Running | 60-90 seconds | 180-300 seconds |
| Instance types selected | Mix of c6g.2xlarge, m7g.2xlarge (ARM) | m6i.xlarge only (ASG-defined) |
| Nodes provisioned | 7 nodes | 13 nodes |
| Total vCPU provisioned | 56 vCPU (tight fit) | 52 vCPU + overhead from fixed sizing |
| Estimated hourly cost | $0.89 (Spot ARM instances) | $1.56 (on-demand x86 instances) |
Karpenter was roughly 3x faster end-to-end and 43% cheaper per hour for the same workload. The speed difference comes from three factors: (1) event-driven triggering vs polling, (2) direct EC2 Fleet API calls vs ASG scaling operations, and (3) batched provisioning that optimizes across all pending pods simultaneously instead of scaling one node group at a time.
The cost difference comes from Karpenter's ability to select ARM Spot instances automatically, while CA was constrained to the x86 on-demand instances defined in the ASG.
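You can verify these selections yourself after a scale-up. Instance type, capacity type, architecture, and zone are all exposed as well-known node labels:

```shell
# Inspect what was actually launched: instance type, Spot vs on-demand,
# architecture, and availability zone, one column per label.
kubectl get nodes \
  -L node.kubernetes.io/instance-type \
  -L karpenter.sh/capacity-type \
  -L kubernetes.io/arch \
  -L topology.kubernetes.io/zone
```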
Cost Optimization: Consolidation vs Scale-Down
Cost optimization is where the two tools diverge most. Cluster Autoscaler has one strategy: remove underutilized nodes. Karpenter has three.
| Strategy | Karpenter | Cluster Autoscaler |
|---|---|---|
| Remove empty nodes | Yes (within 30s by default) | Yes (after 10min by default) |
| Remove underutilized nodes | Yes -- drains and repacks pods onto other nodes | Yes -- but only if utilization < 50% |
| Replace with cheaper instances | Yes -- actively swaps nodes for better-fitting, cheaper types | No -- stuck with ASG instance type |
| Spot-to-Spot replacement | Yes -- migrates to different Spot pools if current pool pricing rises | No |
| Right-sizing | Yes -- replaces oversized nodes with smaller ones as pods are removed | No |
Karpenter's consolidation loop continuously evaluates whether the current set of nodes is optimal. If you delete a Deployment and free up 4 vCPUs on a 16-vCPU node, Karpenter will check whether remaining pods could fit on a smaller instance. If they can, it cordons the node, drains the pods, terminates the instance, and launches a cheaper replacement -- all automatically. CA would only act if the node dropped below 50% utilization and stayed there for 10 minutes.
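If that aggressiveness worries you, Karpenter v1 lets you cap it with disruption budgets. A NodePool excerpt (the schedule values here are illustrative, not from the clusters above):

```yaml
# NodePool excerpt: limit how many nodes consolidation may disrupt at
# once, and pause voluntary disruption entirely during business hours.
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
  budgets:
    - nodes: "10%" # at most 10% of this pool's nodes disrupted concurrently
    - schedule: "0 9 * * mon-fri"
      duration: 8h
      nodes: "0" # no voluntary disruption 09:00-17:00 on weekdays
```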
Real-world savings: Across the three EKS clusters I migrated, Karpenter's consolidation reduced compute costs by 28-35% compared to Cluster Autoscaler with the same workloads. Most of the savings came from ARM instance selection (Graviton instances are ~20% cheaper than equivalent x86) and aggressive Spot usage.
Spot Instance Handling
Spot instances offer 60-90% discounts but can be interrupted with 2 minutes of notice. How each tool handles this matters significantly for reliability.
Cluster Autoscaler has no native Spot awareness. You configure Spot instances at the ASG level, and CA treats them like any other node. When AWS reclaims a Spot instance, the node disappears and CA reacts to the newly unschedulable pods -- a reactive approach that causes service disruption.
Karpenter has first-class Spot support:
- Diversified allocation -- Karpenter spreads Spot requests across many instance types and availability zones using the `price-capacity-optimized` allocation strategy, reducing interruption probability.
- Interruption handling -- Karpenter watches for EC2 Spot interruption notices and rebalance recommendations via an SQS queue. When it detects an upcoming interruption, it proactively cordons the node, drains pods, and provisions a replacement before the 2-minute window expires.
- Fallback to on-demand -- If Spot capacity is unavailable for any matching instance type, Karpenter seamlessly falls back to on-demand instances. No manual intervention needed.
```yaml
# Spot-optimized NodePool for batch workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
```
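On the workload side, a standard PodDisruptionBudget keeps drains orderly when Karpenter cordons a Spot node ahead of an interruption. A minimal sketch, assuming a hypothetical `app: batch-worker` label:

```yaml
# Limit concurrent evictions during node drains so a Spot interruption
# never takes down more than one worker pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-workers
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: batch-worker
```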
Multi-Architecture Support: x86 and ARM
ARM-based instances (AWS Graviton, Ampere on GCP/Azure) offer 20-40% better price-performance than equivalent x86 instances. Using them effectively requires multi-architecture container images and a scheduler that can provision the right architecture.
Cluster Autoscaler requires separate ASGs for x86 and ARM nodes. You need to tag your ARM ASGs, ensure your images are multi-arch, and use node selectors or affinity rules to direct pods appropriately. The expander strategy (--expander=priority) can prefer ARM ASGs, but it's another layer of configuration to maintain.
Karpenter handles this natively. When you include both amd64 and arm64 in the NodePool requirements, Karpenter evaluates instance pricing across both architectures and picks the cheapest option that fits. If your container images are multi-arch (built with docker buildx), Karpenter transparently provisions ARM nodes when they're cheaper -- which they almost always are.
Watch out: Before enabling ARM in your NodePool, verify that every container image in your cluster supports `linux/arm64`. A single x86-only image will fail with an exec format error and land in `CrashLoopBackOff` on ARM nodes. Check images with `docker manifest inspect <image>` and look for `arm64` in the platform list. Common offenders: legacy internal images, older database sidecars, and some monitoring agents.
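Producing those multi-arch images is a one-flag change with `docker buildx` (registry and tag below are hypothetical; cross-building requires a buildx builder with QEMU emulation or native builders for both platforms):

```shell
# Build one tag containing both architectures and push the manifest list
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t my-registry/api:1.2.3 \
  --push .

# Verify both platforms are present in the pushed manifest
docker manifest inspect my-registry/api:1.2.3 | grep architecture
```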
Feature-by-Feature Comparison
| Feature | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Scaling trigger | Event-driven (informers) | Polling (default 10s interval) |
| Node group dependency | None -- group-less provisioning | Requires ASGs / MIGs / VMSS |
| Instance type selection | Automatic from full catalog | Fixed per node group |
| Bin-packing | Cross-pod batched optimization | Per-node-group simulation |
| Scale-up speed | 45-60 seconds | 2-5 minutes |
| Scale-down | Consolidation (replace + remove) | Remove only (utilization threshold) |
| Spot support | Native (interruption handling, fallback) | Via ASG configuration only |
| Multi-arch (x86/ARM) | Native (single NodePool) | Separate ASGs required |
| GPU scheduling | Automatic GPU instance selection | Dedicated GPU ASGs |
| Node expiry / rotation | Built-in (expireAfter) | External tooling needed |
| Cloud support | AWS (GA), Azure (beta) | AWS, GCP, Azure, and 10+ others |
| CNCF status | Incubating project | Part of Kubernetes SIG Autoscaling |
Migration Guide: Cluster Autoscaler to Karpenter on EKS
Migrating a running EKS cluster from Cluster Autoscaler to Karpenter can be done with zero downtime. The key is running both systems in parallel during the transition. Here is the step-by-step process I've used in production.
Step 1: Install Karpenter
Install Karpenter using Helm alongside your existing Cluster Autoscaler. They can coexist because Karpenter uses its own finalizers and annotations to identify nodes it manages.
```shell
# Set environment variables
export KARPENTER_VERSION="1.1.0"
export CLUSTER_NAME="my-cluster"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueueName=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
```
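Before moving on, confirm the controller is healthy (the label selector below assumes the chart's standard `app.kubernetes.io/name=karpenter` label):

```shell
# Controller pods should be Running with no restarts
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter

# Tail the logs to catch IAM or SQS misconfiguration early
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=20
```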
Step 2: Create NodePools with Taints
Create Karpenter NodePools but initially add a taint so that existing workloads do not get scheduled on Karpenter-managed nodes until you are ready.
```yaml
# migration-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: karpenter.sh/migration
          effect: NoSchedule
  limits:
    cpu: "200"
```
Step 3: Migrate Workloads Incrementally
Add tolerations to one workload at a time. This forces those pods to schedule on Karpenter-managed nodes. Monitor each workload before proceeding to the next.
```yaml
# Add toleration to a deployment
spec:
  template:
    spec:
      tolerations:
        - key: karpenter.sh/migration
          operator: Exists
          effect: NoSchedule
```
Step 4: Remove the Migration Taint
Once all critical workloads are validated on Karpenter nodes, remove the taint from the NodePool. All new pods will schedule on Karpenter-managed nodes by default.
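One way to do this without re-applying the manifest is a JSON patch (this path assumes the migration taint is the only entry in the taints array, as in the manifest from Step 2):

```shell
# Drop the entire taints array from the migration NodePool
kubectl patch nodepool migration --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
```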
Step 5: Scale Down CA-Managed Node Groups
Gradually reduce the minimum and desired capacity of your ASGs to zero. CA will scale them down as pods migrate to Karpenter nodes. Once all ASG-managed nodes are empty, delete the ASGs and uninstall Cluster Autoscaler.
```shell
# Scale down ASG-managed nodes
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-cluster-workers \
  --min-size 0 --desired-capacity 0

# Uninstall Cluster Autoscaler after all nodes are drained
kubectl delete deployment cluster-autoscaler -n kube-system
```
Note: Keep your managed node group with at least 2 nodes running system components (CoreDNS, kube-proxy, Karpenter itself) until you configure Karpenter to handle those with a dedicated system NodePool. Karpenter cannot provision the node it runs on.
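If you later want Karpenter to manage nodes for other system add-ons (CoreDNS, monitoring agents) while Karpenter itself stays on the managed node group or Fargate, a dedicated system pool is one way to do it. A sketch, assuming the `default` EC2NodeClass from earlier:

```yaml
# Sketch of a dedicated system NodePool: small on-demand nodes reserved
# for cluster add-ons that tolerate the CriticalAddonsOnly taint.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: CriticalAddonsOnly
          effect: NoSchedule
```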
Availability Beyond AWS: GKE and AKS
Karpenter was built at AWS, and its AWS provider is the only GA implementation. Here is the current state on other clouds as of early 2026:
| Cloud Provider | Karpenter Status | Cluster Autoscaler Status | Recommendation |
|---|---|---|---|
| AWS (EKS) | GA (v1.1) -- production-ready | GA -- fully supported | Use Karpenter for new clusters |
| GCP (GKE) | Not available (GKE has its own NAP) | GA -- deeply integrated | Use GKE Node Auto-Provisioning (NAP) |
| Azure (AKS) | Beta (AKS Karpenter provider) | GA -- fully supported | Evaluate Karpenter beta; default to CA for production |
GKE's Node Auto-Provisioning (NAP) offers Karpenter-like capabilities natively: it provisions optimal machine types from GCP's full catalog without pre-defined node pools. If you are on GKE, NAP is the closest equivalent to Karpenter and is GA. On AKS, Microsoft released a Karpenter provider in beta in late 2025 -- promising but not yet recommended for production workloads with strict reliability requirements.
When to Stick with Cluster Autoscaler
Karpenter is not universally better. Use Cluster Autoscaler when:
- You are on GKE or AKS in production -- CA is the mature, supported option. GKE's NAP is a better alternative than waiting for Karpenter support.
- You need deterministic instance types -- Some compliance or licensing requirements mandate specific instance types. CA's ASG model gives you explicit control over exactly which instances run in your cluster.
- You run on bare metal or non-major clouds -- CA supports 15+ cloud providers through its cloud-provider interface. Karpenter only supports AWS (GA) and Azure (beta).
- Your team is not ready for the migration -- CA works. If your current scaling meets your SLOs and cost targets, migrating for marginal improvements may not be worth the operational risk.
Frequently Asked Questions
Can Karpenter and Cluster Autoscaler run simultaneously?
Yes. They manage separate sets of nodes identified by different annotations and labels. Karpenter manages nodes it provisions (labeled with karpenter.sh/nodepool), and CA manages nodes in its discovered ASGs. This coexistence is how you perform a zero-downtime migration. Just ensure that CA's ASGs and Karpenter's NodePools don't target the same subnets with conflicting configurations, as this could lead to both tools trying to provision for the same pending pods.
How does Karpenter handle node updates and patching?
Karpenter's expireAfter field (called ttlSecondsUntilExpired in older versions) automatically rotates nodes after a specified duration. Set it to 720h (30 days) to ensure nodes are regularly replaced with fresh AMIs. When a node expires, Karpenter cordons it, drains pods gracefully, and provisions a replacement with the latest AMI. This eliminates the need for manual node rotation or third-party tools like AWS Systems Manager patch baselines.
What happens if Karpenter itself goes down?
Existing nodes and pods continue running -- Karpenter is not in the data path. However, no new nodes will be provisioned until Karpenter recovers. Run Karpenter with at least 2 replicas and deploy it on a small managed node group (not on Karpenter-provisioned nodes) to avoid a chicken-and-egg problem. EKS Fargate is another option for hosting Karpenter's pods, ensuring they are isolated from node-level failures.
Does Karpenter support GPU workloads?
Yes. Karpenter automatically selects GPU instance types (p4d, p5, g5, g6) when pods request nvidia.com/gpu resources. You can constrain GPU instance selection in the NodePool requirements using the karpenter.k8s.aws/instance-gpu-manufacturer and karpenter.k8s.aws/instance-gpu-count labels. Use the EKS-optimized GPU AMI (which bundles the NVIDIA drivers) together with the NVIDIA device plugin DaemonSet, and Karpenter provisions GPU nodes only when GPU pods are pending -- no idle GPU nodes burning money.
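A requirements excerpt for a GPU-only NodePool might look like this (restricting to single-GPU NVIDIA instances; both labels are Karpenter's well-known AWS instance labels):

```yaml
# NodePool requirements excerpt: NVIDIA instances with exactly one GPU
requirements:
  - key: karpenter.k8s.aws/instance-gpu-manufacturer
    operator: In
    values: ["nvidia"]
  - key: karpenter.k8s.aws/instance-gpu-count
    operator: In
    values: ["1"]
```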
How much does Karpenter cost?
Karpenter itself is free and open source. The only cost is the compute it provisions. However, Karpenter typically reduces compute costs by 25-40% compared to Cluster Autoscaler through better bin-packing, ARM instance selection, and Spot usage. The Karpenter controller runs as a Deployment in your cluster consuming roughly 1 vCPU and 1 GiB memory -- negligible compared to the savings it generates.
Can I use Karpenter with Terraform or other IaC tools?
Yes. The Karpenter Helm chart and its CRDs (NodePool, EC2NodeClass) are fully compatible with Terraform, Pulumi, and other IaC tools. The EKS Blueprints Terraform module includes a Karpenter add-on that handles IAM roles, SQS queues for interruption handling, and the Helm installation. For GitOps workflows, Karpenter's CRDs work with ArgoCD and Flux like any other Kubernetes resource.
Is Karpenter production-ready?
On AWS, yes. Karpenter reached v1.0 GA in late 2024 and is now at v1.1. AWS uses Karpenter internally, and it powers node scaling for thousands of production EKS clusters. The CNCF incubating status provides additional governance and community oversight. On Azure, the provider is in beta and should be evaluated with caution for production workloads.
The Bottom Line
If you are running Kubernetes on AWS, Karpenter is the better choice for new clusters and a worthwhile migration for existing ones. Its group-less provisioning model, sub-60-second scaling, native Spot and ARM support, and continuous cost consolidation represent a genuine generational improvement over Cluster Autoscaler. On GKE, use Node Auto-Provisioning for similar benefits. On AKS, evaluate the Karpenter beta but default to Cluster Autoscaler until the provider reaches GA. The right autoscaler is the one that matches your cloud, your constraints, and your operational maturity -- but the direction of the ecosystem is clearly toward Karpenter's approach.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.