# karpenter-autoscaling

🎯 Skill from adaptationio/skrillz
Automates intelligent Kubernetes node autoscaling on EKS, optimizing instance selection and reducing costs by 20-70% through Spot integration and consolidation.
## Installation

```
/plugin marketplace add adaptationio/Skrillz
/plugin install skrillz@adaptationio-Skrillz
/plugin enable skrillz@adaptationio-Skrillz
/plugin marketplace add /path/to/skrillz
/plugin install skrillz@local
```
## Skill Details
Karpenter for intelligent Kubernetes node autoscaling on EKS. Use when configuring node provisioning, optimizing costs with Spot instances, replacing Cluster Autoscaler, implementing consolidation, or achieving 20-70% cost savings.
# Karpenter Autoscaling for Amazon EKS
Intelligent, high-performance node autoscaling for Amazon EKS that provisions nodes in seconds, automatically selects optimal instance types, and reduces costs by 20-70% through Spot integration and consolidation.
## Overview
Karpenter is the recommended autoscaler for production EKS workloads (2025), replacing Cluster Autoscaler with:
- Speed: Provisions nodes in seconds (vs minutes with Cluster Autoscaler)
- Intelligence: Automatically selects optimal instance types based on pod requirements
- Flexibility: No need to configure node groups - direct EC2 instance provisioning
- Cost Optimization: 20-70% cost reduction through better bin-packing and Spot integration
- Consolidation: Automatic node consolidation when underutilized or empty
Real-World Results:
- 20% overall AWS bill reduction
- Up to 90% savings for CI/CD workloads
- 70% reduction in monthly compute costs
- 15-30% waste reduction with faster scale-up
## When to Use
- Replacing Cluster Autoscaler with faster, smarter autoscaling
- Optimizing EKS cluster costs (target: 20%+ savings)
- Implementing Spot instance strategies (30-70% Spot mix)
- Need sub-minute node provisioning (seconds vs minutes)
- Workloads with variable resource requirements
- Multi-instance-type flexibility without node group management
- GPU or specialized instance provisioning
- Consolidating underutilized nodes automatically
## Prerequisites
- EKS cluster running Kubernetes 1.23+
- Terraform or Helm for installation
- IRSA or EKS Pod Identity enabled
- Small node group for Karpenter controller (2-3 nodes)
- VPC subnets and security groups tagged for Karpenter discovery
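The discovery tags in the last item can be applied with the AWS CLI. A minimal sketch, assuming a cluster named `my-cluster` and that the cluster's own subnets and security group should be made discoverable:

```bash
CLUSTER=my-cluster

# Tag the cluster's subnets so Karpenter's subnetSelectorTerms can find them
SUBNETS=$(aws eks describe-cluster --name "$CLUSTER" \
  --query 'cluster.resourcesVpcConfig.subnetIds' --output text)
aws ec2 create-tags --resources $SUBNETS \
  --tags "Key=karpenter.sh/discovery,Value=$CLUSTER"

# Tag the cluster security group for securityGroupSelectorTerms
SG=$(aws eks describe-cluster --name "$CLUSTER" \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)
aws ec2 create-tags --resources "$SG" \
  --tags "Key=karpenter.sh/discovery,Value=$CLUSTER"
```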
---
## Quick Start

### 1. Install Karpenter (Helm)
```bash
# Install Karpenter v1.0+ from the official OCI registry
# (the legacy https://charts.karpenter.sh Helm repo is deprecated and does not host v1 charts)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster \
  --set settings.interruptionQueue=my-cluster \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
```
See: [references/installation.md](references/installation.md) for complete setup including IRSA/Pod Identity
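Before creating NodePools, it is worth confirming the controller came up cleanly; a quick check, assuming the kube-system install above:

```bash
# Controller pods should be Running
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter

# The v1 CRDs should be present
kubectl get crd nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws
```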
### 2. Create NodePool and EC2NodeClass
NodePool (defines scheduling requirements and limits):
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]  # Compute, general-purpose, memory-optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]  # Gen 5+
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 name (was WhenUnderutilized in v1beta1)
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023  # Amazon Linux 2023
  amiSelectorTerms:
    - alias: al2023@latest  # Required in v1
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
```
```bash
kubectl apply -f nodepool.yaml
```
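Before deploying workloads, check that both resources were accepted; a small sketch that reads the v1 status conditions (both objects should report a Ready condition):

```bash
# Inspect readiness of the NodePool and EC2NodeClass
kubectl get nodepool default -o jsonpath='{.status.conditions}'; echo
kubectl get ec2nodeclass default -o jsonpath='{.status.conditions}'; echo
```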
See: [references/nodepools.md](references/nodepools.md) for advanced NodePool patterns
### 3. Deploy Workload and Watch Autoscaling
```bash
# Deploy test workload
kubectl create deployment inflate --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 \
  --replicas=0
# Scale up to trigger node provisioning
kubectl scale deployment inflate --replicas=10
# Watch Karpenter provision nodes (seconds!)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
# Verify nodes
kubectl get nodes -l karpenter.sh/nodepool=default
# Scale down to trigger consolidation
kubectl scale deployment inflate --replicas=0
# Watch Karpenter consolidate (30s after scale-down)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
```
### 4. Monitor and Optimize
```bash
# Check NodePool status
kubectl get nodepools
# View disruption metrics
kubectl describe nodepool default
# Monitor provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i "launched\|terminated"
# Cost optimization metrics
kubectl top nodes
```
See: [references/optimization.md](references/optimization.md) for cost optimization strategies
---
## Core Concepts

### Karpenter v1.0 Architecture
Key Resources (v1.0+):
- NodePool: Defines node scheduling requirements, limits, and disruption policies
- EC2NodeClass: AWS-specific configuration (AMIs, instance types, subnets, security groups)
- NodeClaim: Karpenter's representation of a node request (auto-created)
How It Works:
1. A pod becomes unschedulable
2. Karpenter evaluates the pod's requirements (CPU, memory, affinity, taints/tolerations)
3. Karpenter selects the optimal instance type from 600+ options
4. Karpenter provisions an EC2 instance directly (no node groups)
5. The node joins the cluster in 30-60 seconds
6. The pod is scheduled to the new node
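One way to watch this sequence end to end, using standard kubectl watches:

```bash
# Terminal 1: watch NodeClaims get created and registered
kubectl get nodeclaims -w

# Terminal 2: watch pending pods get scheduled as nodes join
kubectl get pods --field-selector=status.phase=Pending -w
```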
Consolidation:
- Continuously monitors node utilization
- Consolidates underutilized nodes (bin-packing)
- Drains and deletes empty nodes
- Replaces nodes with cheaper alternatives
- Respects Pod Disruption Budgets
### NodePool vs Cluster Autoscaler Node Groups
| Feature | Karpenter NodePool | Cluster Autoscaler |
|---------|-------------------|-------------------|
| Provisioning Speed | 30-60 seconds | 2-5 minutes |
| Instance Selection | Automatic (600+ types) | Manual (pre-defined) |
| Bin-Packing | Intelligent | Limited |
| Spot Integration | Built-in, intelligent | Requires node groups |
| Consolidation | Automatic | Manual |
| Configuration | Single NodePool | Multiple node groups |
| Cost Savings | 20-70% | 10-20% |
---
## Common Workflows

### Workflow 1: Install Karpenter with Terraform
Use case: Production-grade installation with infrastructure as code
```hcl
# Karpenter module
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name           = module.eks.cluster_name
  irsa_oidc_provider_arn = module.eks.oidc_provider_arn

  # Enable Pod Identity (2025 recommended)
  enable_pod_identity = true

  # Additional IAM policies
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
  }
}

# Helm release
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.0.0"

  set {
    name  = "settings.clusterName"
    value = module.eks.cluster_name
  }

  set {
    name  = "settings.interruptionQueue"
    value = module.karpenter.queue_name
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.karpenter.iam_role_arn
  }
}
```
Steps:
1. Review [references/installation.md](references/installation.md)
2. Configure the Terraform module with cluster details
3. Apply infrastructure: `terraform apply`
4. Verify installation: `kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter`
5. Tag subnets and security groups for discovery
6. Deploy a NodePool and EC2NodeClass

See: [references/installation.md](references/installation.md) for complete Terraform setup
---
### Workflow 2: Configure Spot/On-Demand Mix (30/70)
Use case: Optimize costs while maintaining availability (recommended: 30% On-Demand, 70% Spot)
Critical NodePool (On-Demand only):
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: "critical"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "200"
  weight: 100  # Higher priority
```
Flexible NodePool (Spot preferred):
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: flexible
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "800"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "20%"
  weight: 10  # Lower priority (used after critical)
```
Pod tolerations for critical workloads:
```yaml
spec:
  tolerations:
    - key: "critical"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
```
Steps:
1. Create a critical NodePool for databases and stateful apps (On-Demand)
2. Create a flexible NodePool for stateless apps (Spot preferred)
3. Use taints/tolerations to separate critical workloads
4. Monitor Spot interruptions: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i interrupt`
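To verify the resulting mix, surface the capacity-type label that Karpenter sets on each node:

```bash
# Show each node's capacity type and owning NodePool
kubectl get nodes -L karpenter.sh/capacity-type,karpenter.sh/nodepool
```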
See: [references/nodepools.md](references/nodepools.md) for Spot strategies
---
### Workflow 3: Enable Consolidation for Cost Savings
Use case: Reduce costs by automatically consolidating underutilized nodes
Aggressive consolidation (development/staging):
```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s  # Consolidate quickly
    budgets:
      - nodes: "50%"  # Allow disrupting 50% of nodes
```
Conservative consolidation (production):
```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m  # Wait 5 minutes before consolidating
    budgets:
      - nodes: "10%"  # Limit disruption to 10% of nodes at a time
      - schedule: "0 9 * * mon-fri"  # Business hours (09:00-17:00)
        duration: 8h
        nodes: "20%"
      - schedule: "0 18 * * *"  # Off-hours (18:00-09:00)
        duration: 15h
        nodes: "5%"
```
Pod Disruption Budget (protect critical pods):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
```
Steps:
1. Review [references/optimization.md](references/optimization.md)
2. Set the consolidation policy (WhenEmpty or WhenEmptyOrUnderutilized in v1)
3. Configure the consolidateAfter delay (30s-5m)
4. Set disruption budgets (% of nodes)
5. Create PodDisruptionBudgets for critical apps
6. Monitor consolidation: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep consolidat`

Expected savings: 15-30% additional reduction beyond Spot savings
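A rough way to see how much headroom consolidation can reclaim, using only standard kubectl output:

```bash
# Requested vs allocatable resources per node; low "Requests" percentages
# indicate candidates for consolidation
kubectl describe nodes | grep -A 8 "Allocated resources"
```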
See: [references/optimization.md](references/optimization.md) for consolidation best practices
---
### Workflow 4: Migrate from Cluster Autoscaler
Use case: Upgrade from Cluster Autoscaler to Karpenter for better performance and cost savings
Migration strategy (zero-downtime):
1. Install Karpenter (runs alongside Cluster Autoscaler)
```bash
helm install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace kube-system
```
2. Create a NodePool with distinct labels
```yaml
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
```
3. Migrate workloads gradually
```yaml
# Add a node selector to new deployments
spec:
  nodeSelector:
    provisioner: karpenter
```
4. Monitor both autoscalers
```bash
# Watch Karpenter
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter
# Watch Cluster Autoscaler
kubectl logs -f -n kube-system -l app=cluster-autoscaler
```
5. Gradually scale down CA node groups
```bash
# Reduce desired size of CA node groups
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name ca-nodes \
  --scaling-config desiredSize=1,minSize=0,maxSize=3
```
6. Remove Cluster Autoscaler tags
```bash
# Remove the autoscaler discovery tags from the node groups:
#   k8s.io/cluster-autoscaler/enabled
#   k8s.io/cluster-autoscaler/<cluster-name>
```
7. Uninstall Cluster Autoscaler
```bash
helm uninstall cluster-autoscaler -n kube-system
```
Testing checklist:
- [ ] Karpenter provisions nodes successfully
- [ ] Pods schedule on Karpenter nodes
- [ ] Consolidation works as expected
- [ ] Spot interruptions handled gracefully
- [ ] No unschedulable pods
- [ ] Cost metrics show improvement
Rollback plan: Keep CA node groups at min size until confident in Karpenter
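Once workloads carry the `provisioner: karpenter` node selector, the remaining CA-managed nodes can be drained so their pods reschedule onto Karpenter capacity; a sketch assuming the managed node group `ca-nodes` used above:

```bash
# Drain Cluster Autoscaler nodes one by one (respects PodDisruptionBudgets)
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ca-nodes -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```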
---
### Workflow 5: GPU Node Provisioning
Use case: Automatically provision GPU instances for ML workloads
GPU NodePool:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # GPU capacity typically runs On-Demand
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g4dn", "g5", "p3", "p4d"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      taints:
        - key: "nvidia.com/gpu"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "1000"
    nvidia.com/gpu: "8"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiFamily: AL2  # AL2 GPU variant ships with NVIDIA drivers
  amiSelectorTerms:
    - alias: al2@latest  # Latest GPU-enabled AMI
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster
```
GPU workload:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```
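For pods to request `nvidia.com/gpu`, the NVIDIA device plugin DaemonSet must be running on the GPU nodes; the upstream manifest can be applied directly (the version pin below is an assumption, check the project's releases):

```bash
# Deploy the NVIDIA device plugin so nodes advertise nvidia.com/gpu
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```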
See: [references/nodepools.md](references/nodepools.md) for GPU configuration details
---
## Key Configuration

### NodePool Resource Limits
Prevent runaway scaling:
```yaml
spec:
  limits:
    cpu: "1000"          # Max 1000 CPUs across all nodes in this pool
    memory: "1000Gi"     # Max 1000Gi memory
    nvidia.com/gpu: "8"  # Max 8 GPUs
```
### Disruption Controls
Balance cost savings with stability:
```yaml
spec:
  template:
    spec:
      # Node expiration (security patching); lives under template.spec in v1
      expireAfter: 720h  # 30 days
  disruption:
    # When to consolidate (v1 values)
    consolidationPolicy: WhenEmpty | WhenEmptyOrUnderutilized
    # Delay before consolidating (prevents flapping)
    consolidateAfter: 30s  # Default: 30s
    # Disruption budgets (rate limiting)
    budgets:
      - nodes: "10%"  # Max 10% of nodes disrupted at once
        reasons:
          - Underutilized
          - Empty
      - schedule: "0 0 * * *"  # Off-hours: more aggressive
        duration: 8h
        nodes: "50%"
```
### Instance Type Flexibility
Maximize Spot availability and cost savings:
```yaml
spec:
  template:
    spec:
      requirements:
        # Architecture
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]  # Include ARM for savings
        # Instance categories (c=compute, m=general, r=memory)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        # Instance generation (5+ for best performance/cost)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
        # Instance size (exclude large sizes if not needed)
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["metal", "32xlarge", "24xlarge"]
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```
Result: Karpenter selects from 600+ instance types, maximizing Spot availability
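To see which instance types Karpenter actually picked under these requirements, surface the standard node labels:

```bash
# Instance type, zone, and capacity type for every Karpenter-provisioned node
kubectl get nodes -l karpenter.sh/nodepool \
  -L node.kubernetes.io/instance-type,topology.kubernetes.io/zone,karpenter.sh/capacity-type
```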
---
## Monitoring and Troubleshooting

### Key Metrics
```bash
# NodePool status
kubectl get nodepools
# NodeClaim status (pending provisions)
kubectl get nodeclaims
# Node events
kubectl get events --field-selector involvedObject.kind=Node
# Karpenter controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller --tail=100
# Filter for provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "launched instance"
# Filter for consolidation events
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "consolidating"
# Spot interruption warnings
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "interrupt"
```
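The controller also exposes Prometheus metrics; a hedged sketch, assuming the Helm chart's default metrics port of 8000:

```bash
# Port-forward the controller's metrics endpoint and sample Karpenter metrics
kubectl port-forward -n kube-system deploy/karpenter 8000:8000 &
curl -s localhost:8000/metrics | grep '^karpenter_' | head
```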
### Common Issues
1. Nodes not provisioning:
```bash
# Check NodePool status
kubectl describe nodepool default
# Check for unschedulable pods
kubectl get pods -A --field-selector=status.phase=Pending
# Review Karpenter logs for errors
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i error
```
Common causes:
- Insufficient IAM permissions
- Subnet/security group tags missing
- Resource limits exceeded
- No instance types match requirements
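The tag-related causes can be ruled out quickly from the CLI; a sketch assuming the `my-cluster` discovery tag value used earlier:

```bash
# Both queries should return at least one ID
aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=my-cluster" \
  --query 'Subnets[].SubnetId'
aws ec2 describe-security-groups \
  --filters "Name=tag:karpenter.sh/discovery,Values=my-cluster" \
  --query 'SecurityGroups[].GroupId'
```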
2. Excessive consolidation (pod restarts):
```yaml
# Increase the consolidateAfter delay
spec:
  disruption:
    consolidateAfter: 5m  # Increased from 30s
```
3. Spot interruptions causing issues:
```yaml
# Reduce the Spot ratio
- key: karpenter.sh/capacity-type
  operator: In
  values: ["on-demand"]  # Use more On-Demand
```
---
## Best Practices

### Cost Optimization

- ✅ Use 30% On-Demand, 70% Spot for an optimal cost/stability balance
- ✅ Enable consolidation (WhenEmptyOrUnderutilized)
- ✅ Include ARM instances (Graviton) for ~20% additional savings
- ✅ Set instance generation > 4 for the best price/performance
- ✅ Use multiple instance families (c, m, r) for Spot diversity

### Reliability

- ✅ Set Pod Disruption Budgets for critical applications
- ✅ Use multiple availability zones
- ✅ Configure disruption budgets (10-20% for production)
- ✅ Test Spot interruption handling
- ✅ Use On-Demand for stateful workloads (databases)

### Security

- ✅ Use IRSA or Pod Identity (not node IAM roles)
- ✅ Enable EBS encryption in EC2NodeClass
- ✅ Set expireAfter for regular node rotation (720h/30 days)
- ✅ Use Amazon Linux 2023 (AL2023) AMIs
- ✅ Tag resources for cost allocation

### Performance

- ✅ Run the Karpenter controller on a small dedicated node group (On-Demand, not Karpenter-managed)
- ✅ Set appropriate resource limits to prevent runaway scaling
- ✅ Monitor provisioning latency (should be <60s; see the sketch below)
- ✅ Use topology spread constraints for pod distribution
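Provisioning latency can be approximated from object creation timestamps; a rough sketch using standard kubectl custom columns:

```bash
# Compare when NodeClaims were created vs when their nodes appeared
kubectl get nodeclaims -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
kubectl get nodes -l karpenter.sh/nodepool \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp
```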
---
## Reference Documentation

Detailed Guides (load on-demand):
- [references/installation.md](references/installation.md) - Complete installation with Helm, Terraform, IRSA, Pod Identity
- [references/nodepools.md](references/nodepools.md) - NodePool and EC2NodeClass configuration patterns
- [references/optimization.md](references/optimization.md) - Cost optimization, consolidation, disruption budgets
Official Resources:
- [Karpenter Documentation](https://karpenter.sh/)
- [AWS Karpenter Best Practices](https://aws.github.io/aws-eks-best-practices/karpenter/)
- [Karpenter GitHub](https://github.com/aws/karpenter)
Community Examples:
- [Terraform EKS Karpenter Module](https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/karpenter)
- [Karpenter Blueprints](https://github.com/aws-samples/karpenter-blueprints)
---
## Quick Reference

### Installation
```bash
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster
```
### Basic NodePool
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```
### Monitor
```bash
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter
```
### Cost Savings Formula
- Spot instances: 70-80% savings vs On-Demand
- Consolidation: Additional 15-30% reduction
- Better bin-packing: 10-20% waste reduction
- Total: 20-70% overall cost reduction
---
Next Steps: Install Karpenter using [references/installation.md](references/installation.md), then configure NodePools with [references/nodepools.md](references/nodepools.md)