karpenter-autoscaling

Skill from adaptationio/skrillz

What it does

Automates intelligent Kubernetes node autoscaling on EKS, optimizing instance selection and reducing costs by 20-70% through Spot integration and consolidation.

Part of adaptationio/skrillz (191 items)

Installation

Add marketplace to Claude Code:
/plugin marketplace add adaptationio/Skrillz

Install plugin from marketplace:
/plugin install skrillz@adaptationio-Skrillz

Enable plugin in Claude Code:
/plugin enable skrillz@adaptationio-Skrillz

Local install (from a checkout):
/plugin marketplace add /path/to/skrillz
/plugin install skrillz@local



Skill Details

SKILL.md

Karpenter for intelligent Kubernetes node autoscaling on EKS. Use when configuring node provisioning, optimizing costs with Spot instances, replacing Cluster Autoscaler, implementing consolidation, or achieving 20-70% cost savings.


# Karpenter Autoscaling for Amazon EKS

Intelligent, high-performance node autoscaling for Amazon EKS that provisions nodes in seconds, automatically selects optimal instance types, and reduces costs by 20-70% through Spot integration and consolidation.

Overview

Karpenter is the recommended autoscaler for production EKS workloads (2025), replacing Cluster Autoscaler with:

  • Speed: Provisions nodes in seconds (vs minutes with Cluster Autoscaler)
  • Intelligence: Automatically selects optimal instance types based on pod requirements
  • Flexibility: No node groups to configure; Karpenter provisions EC2 instances directly
  • Cost Optimization: 20-70% cost reduction through better bin-packing and Spot integration
  • Consolidation: Automatic node consolidation when underutilized or empty

Real-World Results:

  • 20% overall AWS bill reduction
  • Up to 90% savings for CI/CD workloads
  • 70% reduction in monthly compute costs
  • 15-30% waste reduction with faster scale-up

When to Use

  • Replacing Cluster Autoscaler with faster, smarter autoscaling
  • Optimizing EKS cluster costs (target: 20%+ savings)
  • Implementing Spot instance strategies (30-70% Spot mix)
  • Need sub-minute node provisioning (seconds vs minutes)
  • Workloads with variable resource requirements
  • Multi-instance-type flexibility without node group management
  • GPU or specialized instance provisioning
  • Consolidating underutilized nodes automatically

Prerequisites

  • EKS cluster running Kubernetes 1.23+
  • Terraform or Helm for installation
  • IRSA or EKS Pod Identity enabled
  • Small node group for Karpenter controller (2-3 nodes)
  • VPC subnets and security groups tagged for Karpenter discovery (see the tagging example below)
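
For the discovery tags, a minimal AWS CLI sketch (the subnet and security-group IDs are placeholders; the tag value must match your cluster name):

```bash
# Tag the subnets and security groups Karpenter should discover
aws ec2 create-tags \
  --resources subnet-0abc123 subnet-0def456 sg-0abc123 \
  --tags Key=karpenter.sh/discovery,Value=my-cluster
```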

---

Quick Start

1. Install Karpenter (Helm)

```bash
# Karpenter v1.0+ charts are published to the public.ecr.aws OCI registry;
# the legacy charts.karpenter.sh repo only hosts older, pre-v1 releases
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster \
  --set settings.interruptionQueue=my-cluster \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
```

See: [references/installation.md](references/installation.md) for complete setup including IRSA/Pod Identity

2. Create NodePool and EC2NodeClass

NodePool (defines scheduling requirements and limits):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]  # Compute, general-purpose, memory-optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]  # Generation 5+
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
    memory: "1000Gi"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # v1 API name (formerly WhenUnderutilized)
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # Amazon Linux 2023 (the v1 API selects AMIs via amiSelectorTerms)
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
```

```bash
kubectl apply -f nodepool.yaml
```

See: [references/nodepools.md](references/nodepools.md) for advanced NodePool patterns

3. Deploy Workload and Watch Autoscaling

```bash
# Deploy a test workload with zero replicas
kubectl create deployment inflate \
  --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 \
  --replicas=0

# Scale up to trigger node provisioning
kubectl scale deployment inflate --replicas=10

# Watch Karpenter provision nodes (seconds, not minutes)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller

# Verify the new nodes
kubectl get nodes -l karpenter.sh/nodepool=default

# Scale down to trigger consolidation
kubectl scale deployment inflate --replicas=0

# Watch Karpenter consolidate (about 30s after scale-down)
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
```

4. Monitor and Optimize

```bash
# Check NodePool status
kubectl get nodepools

# View disruption details for a pool
kubectl describe nodepool default

# Monitor provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i "launched\|terminated"

# Node resource usage (requires metrics-server)
kubectl top nodes
```

See: [references/optimization.md](references/optimization.md) for cost optimization strategies

---

Core Concepts

Karpenter v1.0 Architecture

Key Resources (v1.0+):

  1. NodePool: Defines node scheduling requirements, limits, and disruption policies
  2. EC2NodeClass: AWS-specific configuration (AMIs, instance types, subnets, security groups)
  3. NodeClaim: Karpenter's representation of a node request (auto-created)

How It Works:

  1. Pod becomes unschedulable
  2. Karpenter evaluates pod requirements (CPU, memory, affinity, taints/tolerations)
  3. Karpenter selects optimal instance type from 600+ options
  4. Karpenter provisions EC2 instance directly (no node groups)
  5. Node joins cluster in 30-60 seconds
  6. Pod scheduled to new node
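
In step 2, the pod's resource requests are the primary signal. A minimal sketch (the requests are illustrative) that lets Karpenter pick the cheapest instance type that fits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
      resources:
        requests:
          cpu: "2"      # Karpenter bin-packs against these requests
          memory: 4Gi
```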

Consolidation:

  • Continuously monitors node utilization
  • Consolidates underutilized nodes (bin-packing)
  • Drains and deletes empty nodes
  • Replaces nodes with cheaper alternatives
  • Respects Pod Disruption Budgets
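
Pods that must not be moved by consolidation can opt out with Karpenter's karpenter.sh/do-not-disrupt annotation, as in this sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
  annotations:
    karpenter.sh/do-not-disrupt: "true"  # Blocks voluntary disruption of the node while this pod runs
spec:
  containers:
    - name: job
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
```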

NodePool vs Cluster Autoscaler Node Groups

| Feature | Karpenter NodePool | Cluster Autoscaler |
|---------|-------------------|-------------------|
| Provisioning Speed | 30-60 seconds | 2-5 minutes |
| Instance Selection | Automatic (600+ types) | Manual (pre-defined) |
| Bin-Packing | Intelligent | Limited |
| Spot Integration | Built-in, intelligent | Requires node groups |
| Consolidation | Automatic | Manual |
| Configuration | Single NodePool | Multiple node groups |
| Cost Savings | 20-70% | 10-20% |

---

Common Workflows

Workflow 1: Install Karpenter with Terraform

Use case: Production-grade installation with infrastructure as code

```hcl
# Karpenter module
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name           = module.eks.cluster_name
  irsa_oidc_provider_arn = module.eks.oidc_provider_arn

  # Enable Pod Identity (2025 recommended)
  enable_pod_identity = true

  # Additional IAM policies for nodes
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
  }
}

# Helm release
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.0.0"

  set {
    name  = "settings.clusterName"
    value = module.eks.cluster_name
  }

  set {
    name  = "settings.interruptionQueue"
    value = module.karpenter.queue_name
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.karpenter.iam_role_arn
  }
}
```

Steps:

  1. Review [references/installation.md](references/installation.md)
  2. Configure the Terraform module with your cluster details
  3. Apply the infrastructure: `terraform apply`
  4. Verify the installation: `kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter`
  5. Tag subnets and security groups for discovery (see the tagging example under Prerequisites)
  6. Deploy the NodePool and EC2NodeClass

See: [references/installation.md](references/installation.md) for complete Terraform setup

---

Workflow 2: Configure Spot/On-Demand Mix (30/70)

Use case: Optimize costs while maintaining availability (recommended: 30% On-Demand, 70% Spot)

Critical NodePool (On-Demand only):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: "critical"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "200"
  weight: 100  # Higher priority
```

Flexible NodePool (Spot preferred):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: flexible
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "800"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "20%"
  weight: 10  # Lower priority (used after critical)
```

Pod tolerations for critical workloads:

```yaml
spec:
  tolerations:
    - key: "critical"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
```

Steps:

  1. Create the critical NodePool for databases and stateful apps (On-Demand)
  2. Create the flexible NodePool for stateless apps (Spot preferred); see the deployment sketch below
  3. Use taints/tolerations to keep critical workloads separate
  4. Monitor Spot interruptions: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i interrupt`
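
A deployment sketch for step 2 (names are illustrative), pinning a stateless app to the flexible pool via the karpenter.sh/nodepool node label:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        karpenter.sh/nodepool: flexible  # Schedule onto the Spot-preferred pool
      containers:
        - name: web
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```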

See: [references/nodepools.md](references/nodepools.md) for Spot strategies

---

Workflow 3: Enable Consolidation for Cost Savings

Use case: Reduce costs by automatically consolidating underutilized nodes

Aggressive consolidation (development/staging):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s  # Consolidate quickly
    budgets:
      - nodes: "50%"  # Allow disrupting up to 50% of nodes at once
```

Conservative consolidation (production):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m  # Wait 5 minutes before consolidating
    budgets:
      # When budgets overlap, the most restrictive active budget applies
      - nodes: "10%"  # Default: limit disruption to 10% of nodes at a time
      - schedule: "0 9 * * mon-fri"  # Business hours
        duration: 8h
        nodes: "20%"
      - schedule: "0 18 * * *"  # Off-hours (evenings and overnight)
        duration: 15h
        nodes: "5%"
```

Pod Disruption Budget (protect critical pods):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
```

Steps:

  1. Review [references/optimization.md](references/optimization.md)
  2. Set the consolidation policy (WhenEmpty or WhenEmptyOrUnderutilized in the v1 API)
  3. Configure the consolidateAfter delay (30s-5m)
  4. Set disruption budgets (% of nodes)
  5. Create PodDisruptionBudgets for critical apps
  6. Monitor consolidation: `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep consolidat`

Expected savings: 15-30% additional reduction beyond Spot savings

See: [references/optimization.md](references/optimization.md) for consolidation best practices

---

Workflow 4: Migrate from Cluster Autoscaler

Use case: Upgrade from Cluster Autoscaler to Karpenter for better performance and cost savings

Migration strategy (zero-downtime):

  1. Install Karpenter (it runs alongside Cluster Autoscaler during migration)

```bash
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster
```

  2. Create a NodePool with distinct labels

```yaml
spec:
  template:
    metadata:
      labels:
        provisioner: karpenter
```

  3. Migrate workloads gradually

```yaml
# Add a node selector to new deployments
spec:
  nodeSelector:
    provisioner: karpenter
```

  4. Monitor both autoscalers

```bash
# Watch Karpenter
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter

# Watch Cluster Autoscaler
kubectl logs -f -n kube-system -l app=cluster-autoscaler
```

  5. Gradually scale down CA node groups

```bash
# Reduce the desired size of CA-managed node groups
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name ca-nodes \
  --scaling-config desiredSize=1,minSize=0,maxSize=3
```

  6. Remove Cluster Autoscaler tags

```bash
# Remove the autoscaler discovery tags from the node groups:
#   k8s.io/cluster-autoscaler/enabled
#   k8s.io/cluster-autoscaler/<cluster-name>
```

  7. Uninstall Cluster Autoscaler

```bash
helm uninstall cluster-autoscaler -n kube-system
```

Testing checklist:

  • [ ] Karpenter provisions nodes successfully
  • [ ] Pods schedule on Karpenter nodes
  • [ ] Consolidation works as expected
  • [ ] Spot interruptions handled gracefully
  • [ ] No unschedulable pods
  • [ ] Cost metrics show improvement

Rollback plan: Keep CA node groups at min size until confident in Karpenter
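
To actively push workloads off CA-managed nodes instead of waiting for attrition, the standard cordon-and-drain flow applies. A sketch, assuming the managed node group labels its nodes with eks.amazonaws.com/nodegroup=ca-nodes:

```bash
# Stop new pods from landing on CA-managed nodes
kubectl cordon -l eks.amazonaws.com/nodegroup=ca-nodes

# Evict pods; they reschedule onto Karpenter-provisioned capacity
kubectl drain -l eks.amazonaws.com/nodegroup=ca-nodes \
  --ignore-daemonsets --delete-emptydir-data
```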

---

Workflow 5: GPU Node Provisioning

Use case: Automatically provision GPU instances for ML workloads

GPU NodePool:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]  # GPU workloads typically run On-Demand
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g4dn", "g5", "p3", "p4d"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      taints:
        - key: "nvidia.com/gpu"
          value: "true"
          effect: "NoSchedule"
  limits:
    cpu: "1000"
    nvidia.com/gpu: "8"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiSelectorTerms:
    - alias: al2@latest  # AL2 accelerated AMI variant ships NVIDIA GPU drivers
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Karpenter generates the bootstrap user data itself; deploy the NVIDIA
  # device plugin separately (as a DaemonSet) to expose GPUs to pods.
```

GPU workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

See: [references/nodepools.md](references/nodepools.md) for GPU configuration details

---

Key Configuration

NodePool Resource Limits

Prevent runaway scaling:

```yaml
spec:
  limits:
    cpu: "1000"          # Max 1000 CPUs across all nodes in the pool
    memory: "1000Gi"     # Max 1000Gi of memory
    nvidia.com/gpu: "8"  # Max 8 GPUs
```

Disruption Controls

Balance cost savings with stability:

```yaml
spec:
  template:
    spec:
      # Node expiration for regular rotation/patching
      # (in the v1 API this lives on the template, not under disruption)
      expireAfter: 720h  # 30 days
  disruption:
    # When to consolidate (v1 API): WhenEmpty | WhenEmptyOrUnderutilized
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Delay before consolidating (prevents flapping)
    consolidateAfter: 30s  # Default: 30s
    # Disruption budgets (rate limiting)
    budgets:
      - nodes: "10%"  # Max 10% of nodes disrupted at once
        reasons:
          - Underutilized
          - Empty
      - schedule: "0 0 * * *"  # Off-hours: more aggressive
        duration: 8h
        nodes: "50%"
```

Instance Type Flexibility

Maximize Spot availability and cost savings:

```yaml
spec:
  template:
    spec:
      requirements:
        # Architecture (include ARM for savings)
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Instance categories (c = compute, m = general purpose, r = memory-optimized)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        # Instance generation (5+ for best performance/cost)
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
        # Instance size (exclude large sizes if not needed)
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["metal", "32xlarge", "24xlarge"]
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

Result: Karpenter selects from 600+ instance types, maximizing Spot availability
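
To preview how many instance types a set of constraints matches before encoding them in a NodePool, AWS's separate amazon-ec2-instance-selector CLI can help. A sketch assuming the tool is installed (flag spellings may differ by version):

```bash
# List current-generation x86_64 types with 4-16 vCPUs (illustrative constraints)
ec2-instance-selector --vcpus-min 4 --vcpus-max 16 \
  --cpu-architecture x86_64 --current-generation
```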

---

Monitoring and Troubleshooting

Key Metrics

```bash
# NodePool status
kubectl get nodepools

# NodeClaim status (pending provisions)
kubectl get nodeclaims

# Node events
kubectl get events --field-selector involvedObject.kind=Node

# Karpenter controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller --tail=100

# Filter for provisioning decisions
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "launched instance"

# Filter for consolidation events
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "consolidating"

# Spot interruption warnings
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep "interrupt"
```

Common Issues

1. Nodes not provisioning:

```bash
# Check NodePool status
kubectl describe nodepool default

# Check for unschedulable pods
kubectl get pods -A --field-selector=status.phase=Pending

# Review Karpenter logs for errors
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter | grep -i error
```

Common causes:

  • Insufficient IAM permissions
  • Subnet/security group tags missing (verification commands below)
  • Resource limits exceeded
  • No instance types match requirements
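
A quick way to check the tag-related causes (the tag value must match the cluster name used in your selectors):

```bash
# Should return the subnets Karpenter may use
aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=my-cluster" \
  --query "Subnets[].SubnetId"

# Same check for security groups
aws ec2 describe-security-groups \
  --filters "Name=tag:karpenter.sh/discovery,Values=my-cluster" \
  --query "SecurityGroups[].GroupId"
```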

2. Excessive consolidation (pod restarts):

```yaml
# Increase the consolidateAfter delay
spec:
  disruption:
    consolidateAfter: 5m  # Increased from 30s
```

3. Spot interruptions causing issues:

```yaml
# Reduce the Spot ratio: require On-Demand capacity
- key: karpenter.sh/capacity-type
  operator: In
  values: ["on-demand"]
```

---

Best Practices

Cost Optimization

  • ✅ Use 30% On-Demand, 70% Spot for an optimal cost/stability balance
  • ✅ Enable consolidation (consolidationPolicy: WhenEmptyOrUnderutilized)
  • ✅ Include ARM instances (Graviton) for roughly 20% additional savings
  • ✅ Require instance generation > 4 for the best price/performance
  • ✅ Use multiple instance families (c, m, r) for Spot diversity

Reliability

  • ✅ Set Pod Disruption Budgets for critical applications
  • ✅ Use multiple availability zones (see the topology spread sketch below)
  • ✅ Configure disruption budgets (10-20% for production)
  • ✅ Test Spot interruption handling
  • ✅ Use On-Demand for stateful workloads (databases)
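
For the multi-AZ point, a topology spread constraint (sketch; the app label is illustrative) keeps replicas balanced across zones as Karpenter adds nodes:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone  # Spread across availability zones
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: critical-app
```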

Security

  • ✅ Use IRSA or Pod Identity (not node IAM roles)
  • ✅ Enable EBS encryption in the EC2NodeClass
  • ✅ Set expireAfter for regular node rotation (720h / 30 days)
  • ✅ Use Amazon Linux 2023 (AL2023) AMIs
  • ✅ Tag resources for cost allocation

Performance

  • ✅ Use a dedicated node group for the Karpenter controller (On-Demand, no consolidation; see the affinity sketch below)
  • ✅ Set appropriate resource limits to prevent runaway scaling
  • ✅ Monitor provisioning latency (should be <60s)
  • ✅ Use topology spread constraints for pod distribution
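
For the controller-placement point, a Helm values sketch (this mirrors the chart's default node affinity) that keeps the controller off nodes Karpenter itself manages:

```yaml
# values.yaml excerpt: pin the controller to static (non-Karpenter) capacity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.sh/nodepool
              operator: DoesNotExist  # Never a Karpenter-provisioned node
```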

---

Reference Documentation

Detailed Guides (load on-demand):

  • [references/installation.md](references/installation.md) - Complete installation with Helm, Terraform, IRSA, Pod Identity
  • [references/nodepools.md](references/nodepools.md) - NodePool and EC2NodeClass configuration patterns
  • [references/optimization.md](references/optimization.md) - Cost optimization, consolidation, disruption budgets

Official Resources:

  • [Karpenter Documentation](https://karpenter.sh/)
  • [AWS Karpenter Best Practices](https://aws.github.io/aws-eks-best-practices/karpenter/)
  • [Karpenter GitHub](https://github.com/aws/karpenter)

Community Examples:

  • [Terraform EKS Karpenter Module](https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/karpenter)
  • [Karpenter Blueprints](https://github.com/aws-samples/karpenter-blueprints)

---

Quick Reference

Installation

```bash
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --set settings.clusterName=my-cluster
```

Basic NodePool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

Monitor

```bash
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter
```

Cost Savings Formula

  • Spot instances: 70-80% savings vs On-Demand
  • Consolidation: Additional 15-30% reduction
  • Better bin-packing: 10-20% waste reduction
  • Total: 20-70% overall cost reduction
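
A hypothetical worked example (all figures illustrative) of how these factors compound:

```bash
# Baseline: $10,000/month of On-Demand compute
# Move 70% of capacity to Spot at ~70% discount: 10000 * 0.70 * 0.70 = $4,900 saved
# Consolidation trims ~20% of the remainder:     (10000 - 4900) * 0.20 = $1,020 saved
# New bill: 10000 - 4900 - 1020 = $4,080  (~59% total reduction)
```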

---

Next Steps: Install Karpenter using [references/installation.md](references/installation.md), then configure NodePools with [references/nodepools.md](references/nodepools.md)