karpenter
Skill from cosmix/claude-loom
Dynamically provisions and optimizes Kubernetes nodes using intelligent instance selection, spot management, and cost-efficient scaling strategies.
Installation
curl -fsSL https://raw.githubusercontent.com/cosmix/loom/main/install.sh | bash
git clone https://github.com/cosmix/loom.git
bash install.sh
Skill Details
Kubernetes node autoscaling and cost optimization with Karpenter. Use when implementing node provisioning, spot instance management, cluster right-sizing, node consolidation, or reducing compute costs. Covers NodePool configuration, EC2NodeClass setup, disruption budgets, spot/on-demand mix strategies, multi-architecture support, and capacity-type selection.
Overview
Karpenter is a Kubernetes node autoscaler that provisions right-sized compute resources in response to changing application load. Unlike the Cluster Autoscaler, which scales predefined node groups, Karpenter provisions nodes based on aggregate pod resource requirements, enabling better bin-packing and cost optimization.
Key Differences from Cluster Autoscaler
- Direct provisioning: Talks directly to cloud provider APIs (no node groups required)
- Fast scaling: Provisions nodes in seconds vs minutes
- Flexible instance selection: Chooses from all available instance types automatically
- Consolidation: Actively replaces nodes with cheaper alternatives
- Spot instance optimization: First-class support with automatic fallback
When to Use Karpenter
- Running workloads with diverse resource requirements
- Need for fast scaling (sub-minute response)
- Cost optimization with spot instances and Graviton (ARM64)
- Consolidation to reduce cluster waste and over-provisioning
- Clusters with unpredictable or bursty workloads
- Right-sizing infrastructure to actual usage patterns
- Managing mixed capacity types (spot/on-demand) automatically
Instructions
1. Installation and Setup
- Install the Karpenter controller in the cluster (see the Helm values sketch after this list)
- Configure cloud provider credentials (IAM roles)
- Set up instance profiles and security groups
- Create NodePools for different workload types
- Define EC2NodeClass (AWS) or equivalent for your provider
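A minimal sketch of Helm values for the Karpenter chart, assuming an EKS cluster and an IRSA controller role that already exist; the cluster name, endpoint, queue, and role ARN below are placeholders, and the Terraform example later in this document wires up the same settings:
```yaml
# values.yaml for the Karpenter Helm chart (oci://public.ecr.aws/karpenter/karpenter)
# All names below are placeholders; substitute your own cluster, queue, and role.
settings:
  clusterName: my-cluster
  clusterEndpoint: https://EXAMPLE.gr7.us-west-2.eks.amazonaws.com
  interruptionQueue: Karpenter-my-cluster        # SQS queue for spot interruption events
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole-my-cluster
controller:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 1Gi
```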
2. Design NodePool Strategy
- Separate NodePools for different workload classes
- Define instance type families and sizes
- Configure spot/on-demand mix
- Set resource limits per NodePool
- Plan for multi-AZ distribution
3. Configure Disruption Management
- Set disruption budgets to control churn
- Configure consolidation policies
- Define expiration windows for node lifecycle
- Handle workload-specific disruption constraints
- Test disruption scenarios
4. Optimize for Cost and Performance
- Enable consolidation for cost savings
- Use spot instances with fallback strategies
- Set appropriate resource requests on pods (Karpenter depends on accurate requests)
- Monitor node utilization and waste
- Adjust instance type restrictions based on usage
- Leverage Graviton (ARM64) instances, typically around 20% cheaper for comparable workloads
- Configure capacity-type weighting to prefer spot over on-demand (see the sketch after this list)
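One way to express that weighting is to split capacity types across two NodePools and let weight decide the order in which Karpenter considers them; the on-demand pool then acts as a fallback when the spot pool cannot satisfy pending pods. A minimal sketch, assuming an existing EC2NodeClass named default (pool names are illustrative):
```yaml
# Preferred pool: spot only, higher weight so it is considered first
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-preferred        # illustrative name
spec:
  weight: 50                  # higher weight = preferred when multiple NodePools match
  template:
    spec:
      nodeClassRef:
        name: default         # assumed existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
# Fallback pool: on-demand only, lower weight
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback    # illustrative name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```
Keeping both capacity types in a single NodePool (as in Example 1) is the simpler alternative, since Karpenter already prefers spot when both are allowed.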
5. Cost Optimization Strategies
- Spot instances: Configure 70-90% spot mix for fault-tolerant workloads
- Graviton (ARM64): Use c7g, m7g, r7g families for lower costs
- Consolidation: Enable WhenUnderutilized policy to replace expensive nodes
- Instance diversity: Wide instance family selection improves spot availability
- Right-sizing: Let Karpenter bin-pack efficiently instead of over-provisioning
6. Spot Instance Management
- Use wide instance type selection (10+ families) for better spot availability
- Configure automatic fallback to on-demand when spot unavailable
- Implement Pod Disruption Budgets to control blast radius
- Set graceful termination handlers in applications (preStop hooks)
- Monitor spot interruption rates and adjust instance selection
- Use diverse availability zones to reduce correlated failures
7. Node Consolidation
- WhenUnderutilized: Replaces nodes with cheaper/smaller alternatives actively
- WhenEmpty: Only consolidates completely empty nodes (conservative)
- Configure a consolidateAfter delay to prevent churn (30s-600s typical; in the v1beta1 API this field applies only with WhenEmpty)
- Use disruption budgets to limit consolidation rate (5-20% per window)
- Respect Pod Disruption Budgets during consolidation
- Set expiration windows to force periodic node refresh
Best Practices
- Start Conservative: Begin with restrictive instance types, expand based on observation
- Use Disruption Budgets: Prevent too many nodes from being disrupted simultaneously
- Set Pod Resource Requests: Karpenter relies on accurate requests for scheduling
- Enable Consolidation: Let Karpenter optimize node utilization automatically
- Separate Workload Classes: Use multiple NodePools for different requirements
- Monitor Provisioning: Track provisioning latency and failures
- Test Spot Interruptions: Ensure graceful handling of spot instance terminations
- Use Topology Spread: Combine with pod topology constraints for availability
Examples
Example 1: Basic NodePool with Multiple Instance Types
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
# Template for nodes created by this NodePool
template:
spec:
# Reference to EC2NodeClass (AWS-specific configuration)
nodeClassRef:
name: default
# Requirements that constrain instance selection
requirements:
# Use amd64 or arm64 architectures
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# Allow multiple instance families
- key: karpenter.k8s.aws/instance-family
operator: In
values:
["c6a", "c6i", "c7i", "m6a", "m6i", "m7i", "r6a", "r6i", "r7i"]
# Allow a range of instance sizes
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["large", "xlarge", "2xlarge", "4xlarge"]
# Allow both spot and on-demand; Karpenter prefers spot when both are permitted
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
# Spread across availability zones
- key: topology.kubernetes.io/zone
operator: In
values: ["us-west-2a", "us-west-2b", "us-west-2c"]
# Kubelet configuration
kubelet:
# Set max pods based on instance size
maxPods: 110
# Resources reserved for system components
systemReserved:
cpu: 100m
memory: 100Mi
ephemeral-storage: 1Gi
# Eviction thresholds
evictionHard:
memory.available: 5%
nodefs.available: 10%
# Image garbage collection
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
# Taints and labels
taints:
- key: workload-type
value: general
effect: NoSchedule
# Metadata applied to nodes
metadata:
labels:
workload-type: general
managed-by: karpenter
# Limits for this NodePool
limits:
cpu: 1000
memory: 1000Gi
# Disruption controls
disruption:
# Consolidation policy: actively replace underutilized nodes with cheaper alternatives
consolidationPolicy: WhenUnderutilized
# Note: consolidateAfter may only be set with consolidationPolicy: WhenEmpty in the v1beta1 API
# Budgets control the rate of disruptions
budgets:
- nodes: 10%
duration: 5m
# NodePool weight: when multiple NodePools can satisfy a pod, higher weight is preferred
weight: 10
```
Example 2: EC2NodeClass for AWS-Specific Configuration
```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
# AMI selection
amiFamily: AL2
# Alternative: Use specific AMI selector
# amiSelectorTerms:
# - id: ami-0123456789abcdef0
# - tags:
# karpenter.sh/discovery: my-cluster
# IAM role for nodes (instance profile)
role: KarpenterNodeRole-my-cluster
# Subnet selection - use tags to identify subnets
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
kubernetes.io/role/internal-elb: "1"
# Security group selection
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
- name: my-cluster-node-security-group
# User data for node initialization
userData: |
#!/bin/bash
echo "Custom node initialization"
# Configure container runtime
# Set up logging
# Install monitoring agents
# Block device mappings for EBS volumes
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 3000
throughput: 125
encrypted: true
deleteOnTermination: true
# Metadata options for IMDS
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 2
httpTokens: required
# Detailed monitoring
detailedMonitoring: true
# Tags applied to EC2 instances
tags:
Name: karpenter-node
Environment: production
ManagedBy: karpenter
ClusterName: my-cluster
```
Example 3: Specialized NodePools for Different Workloads
```yaml
---
# GPU workload NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: gpu-workloads
spec:
template:
spec:
nodeClassRef:
name: gpu-nodes
requirements:
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["g5", "g6", "p4", "p5"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # GPU instances typically on-demand
- key: karpenter.k8s.aws/instance-gpu-count
operator: Gt
values: ["0"]
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
metadata:
labels:
workload-type: gpu
nvidia.com/gpu: "true"
limits:
cpu: 500
memory: 2000Gi
nvidia.com/gpu: 16
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 300s
---
# Batch/Spot-heavy NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: batch-workloads
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"] # Only spot instances
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6a", "c6i", "c7i", "m6a", "m6i"] # Compute-optimized
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["2xlarge", "4xlarge", "8xlarge"]
taints:
- key: workload-type
value: batch
effect: NoSchedule
metadata:
labels:
workload-type: batch
spot-interruption-handler: enabled
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 60s
budgets:
- nodes: 20% # Allow more aggressive disruption for batch
---
# Stateful workload NodePool (on-demand only)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: stateful-workloads
spec:
template:
spec:
nodeClassRef:
name: stateful-nodes
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # Only on-demand for stability
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["r6i", "r7i"] # Memory-optimized
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["xlarge", "2xlarge", "4xlarge"]
- key: topology.kubernetes.io/zone
operator: In
values: ["us-west-2a", "us-west-2b"]
kubelet:
maxPods: 50 # Lower density for stateful workloads
taints:
- key: workload-type
value: stateful
effect: NoSchedule
metadata:
labels:
workload-type: stateful
storage-optimized: "true"
limits:
cpu: 200
memory: 800Gi
disruption:
consolidationPolicy: WhenEmpty # Only consolidate when completely empty
consolidateAfter: 600s # Wait 10 minutes
budgets:
- nodes: 1 # Very conservative disruption
duration: 30m
```
Example 4: Disruption Budgets and Consolidation Policies
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: production-apps
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6i", "m6i", "r6i"]
# Advanced disruption configuration
disruption:
# Consolidation policy options:
# - WhenUnderutilized: Replace nodes with cheaper/smaller nodes
# - WhenEmpty: Only replace completely empty nodes
consolidationPolicy: WhenUnderutilized
# consolidateAfter (the delay before an eligible node is consolidated) may only be
# set with consolidationPolicy: WhenEmpty in the v1beta1 API, so it is omitted here
# Expiration settings - force node replacement after time period
expireAfter: 720h # 30 days
# Multiple budget windows for different times/scenarios
budgets:
# During business hours: conservative disruption
- nodes: 5%
duration: 8h
schedule: "0 8 MON-FRI"
# During off-hours: more aggressive consolidation
- nodes: 20%
duration: 16h
schedule: "0 18 MON-FRI"
# Weekends: most aggressive
- nodes: 30%
duration: 48h
schedule: "0 0 SAT"
# Default budget (always active)
- nodes: 10%
```
Example 5: Pod Scheduling with Karpenter
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-application
spec:
replicas: 5
selector:
matchLabels:
app: my-application
template:
metadata:
labels:
app: my-application
spec:
# Tolerations to allow scheduling on Karpenter nodes
tolerations:
- key: workload-type
operator: Equal
value: general
effect: NoSchedule
# Node selector to target specific NodePool
nodeSelector:
workload-type: general
karpenter.sh/capacity-type: spot # Hard constraint: spot only (use a nodeAffinity preference to merely prefer spot)
# Affinity rules for better placement
affinity:
# Spread across zones for availability
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: my-application
topologyKey: topology.kubernetes.io/zone
# Node affinity for instance type preferences
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
# Prefer ARM instances (cheaper)
- weight: 50
preference:
matchExpressions:
- key: kubernetes.io/arch
operator: In
values: ["arm64"]
# Prefer larger instances (better bin-packing)
- weight: 30
preference:
matchExpressions:
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["2xlarge", "4xlarge"]
# Topology spread constraints
topologySpreadConstraints:
# Spread across zones
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: my-application
# Spread across nodes
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: my-application
containers:
- name: app
image: my-app:latest
# CRITICAL: Accurate resource requests for Karpenter
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# Graceful shutdown for spot interruptions
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- sleep 15 # Allow time for deregistration
# Termination grace period for spot interruptions
terminationGracePeriodSeconds: 30
```
Example 6: Spot Instance Handling and Fallback
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: spot-with-fallback
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
# Prioritize spot, but allow on-demand as fallback
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
# Wide instance type selection for better spot availability
- key: karpenter.k8s.aws/instance-family
operator: In
values:
- "c5a"
- "c6a"
- "c6i"
- "c7i"
- "m5a"
- "m6a"
- "m6i"
- "m7i"
- "r5a"
- "r6a"
- "r6i"
- "r7i"
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["large", "xlarge", "2xlarge", "4xlarge"]
# Support both architectures for more spot options
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# Metadata to track spot usage
metadata:
labels:
spot-enabled: "true"
annotations:
karpenter.sh/spot-to-spot-consolidation: "true"
disruption:
consolidationPolicy: WhenUnderutilized
# consolidateAfter omitted: it may only be set with WhenEmpty in the v1beta1 API
# More aggressive for spot since they can be interrupted anyway
budgets:
- nodes: 25%
# Weight influences Karpenter's NodePool selection
# Higher weight = more preferred
# Use lower weight so other NodePools are tried first
weight: 5
```
Example 7: Karpenter with Pod Disruption Budget
```yaml
# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: critical-service
spec:
replicas: 6
selector:
matchLabels:
app: critical-service
template:
metadata:
labels:
app: critical-service
spec:
tolerations:
- key: workload-type
operator: Equal
value: general
effect: NoSchedule
containers:
- name: app
image: critical-service:latest
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
---
# Pod Disruption Budget to protect during consolidation
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-service-pdb
spec:
minAvailable: 4 # Always keep at least 4 replicas running
selector:
matchLabels:
app: critical-service
# Karpenter respects PDBs during consolidation
# It will not disrupt nodes if doing so would violate the PDB
```
Example 8: Multi-Architecture NodePool
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: multi-arch
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
# Support both AMD64 and ARM64
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# ARM instances (Graviton) - typically 20% cheaper
- key: karpenter.k8s.aws/instance-family
operator: In
values:
# ARM (Graviton2)
- "c6g"
- "m6g"
- "r6g"
# ARM (Graviton3)
- "c7g"
- "m7g"
- "r7g"
# AMD64 alternatives
- "c6i"
- "m6i"
- "r6i"
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
metadata:
labels:
multi-arch: "true"
disruption:
consolidationPolicy: WhenUnderutilized
# consolidateAfter omitted: it may only be set with WhenEmpty in the v1beta1 API
---
# EC2NodeClass with multi-architecture AMI support
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
# AL2 automatically selects the right AMI for architecture
amiFamily: AL2
# Alternative: Explicit AMI selection by architecture
# amiSelectorTerms:
# - tags:
# karpenter.sh/discovery: my-cluster
# kubernetes.io/arch: amd64
# - tags:
# karpenter.sh/discovery: my-cluster
# kubernetes.io/arch: arm64
role: KarpenterNodeRole-my-cluster
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
```
Monitoring and Troubleshooting
Key Metrics to Monitor
```text
# Provisioning metrics
karpenter_nodes_created_total
karpenter_nodes_terminated_total
karpenter_provisioner_scheduling_duration_seconds
# Disruption metrics
karpenter_disruption_replacement_node_initialized_seconds
karpenter_disruption_consolidation_actions_performed_total
karpenter_disruption_budgets_allowed_disruptions
# Cost metrics
karpenter_provisioner_instance_type_price_estimate
karpenter_cloudprovider_instance_type_offering_price_estimate
# Pod metrics
karpenter_pods_state (pending, running, etc.)
```
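If the Prometheus Operator is installed, these metrics can back simple alerts. A hedged sketch using the metric names above; the alert names, thresholds, and namespace are illustrative assumptions, not Karpenter defaults:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts          # illustrative name
  namespace: monitoring           # assumes a "monitoring" namespace
spec:
  groups:
    - name: karpenter
      rules:
        # Fires if nodes are being terminated unusually fast (possible consolidation churn)
        - alert: KarpenterHighNodeChurn
          expr: increase(karpenter_nodes_terminated_total[30m]) > 20   # threshold is illustrative
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Karpenter is terminating nodes at a high rate; check consolidation settings and budgets"
        # Fires if no consolidation happened for a long period despite consolidation being enabled
        - alert: KarpenterNoConsolidation
          expr: increase(karpenter_disruption_consolidation_actions_performed_total[6h]) == 0
          for: 1h
          labels:
            severity: info
          annotations:
            summary: "No consolidation actions in the last 6h; verify disruption policy and budgets"
```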
Common Issues and Solutions
#### Issue: Pods stuck in Pending
- Check NodePool requirements match pod node selectors/tolerations
- Verify cloud provider limits not exceeded
- Check instance type availability in selected zones
- Ensure subnet capacity available
#### Issue: Excessive node churn
- Adjust consolidation delay (consolidateAfter)
- Review disruption budgets
- Check if pod resource requests are accurate
- Consider using WhenEmpty instead of WhenUnderutilized
#### Issue: High costs despite using Karpenter
- Enable consolidation if not already active
- Verify spot instances are being used
- Check if pods have unnecessarily large resource requests
- Review instance type selection (allow more variety)
#### Issue: Spot interruptions causing service disruption
- Implement Pod Disruption Budgets
- Use diverse instance types for better spot availability
- Configure appropriate replica counts
- Implement graceful shutdown in applications
Integration with Terraform
```hcl
# Install Karpenter via Terraform
resource "helm_release" "karpenter" {
namespace = "karpenter"
create_namespace = true
name = "karpenter"
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
version = "v0.33.0"
values = [
<<-EOT
settings:
clusterName: ${var.cluster_name}
clusterEndpoint: ${var.cluster_endpoint}
interruptionQueue: ${var.interruption_queue_name}
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: ${var.karpenter_irsa_arn}
controller:
resources:
requests:
cpu: 1
memory: 1Gi
limits:
cpu: 2
memory: 2Gi
EOT
]
depends_on = [
aws_iam_role_policy_attachment.karpenter_controller
]
}
# Deploy default NodePool
resource "kubectl_manifest" "karpenter_nodepool_default" {
yaml_body = <<-YAML
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
nodeClassRef:
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6i", "m6i", "r6i"]
limits:
cpu: 1000
memory: 1000Gi
disruption:
consolidationPolicy: WhenUnderutilized
# consolidateAfter omitted: only valid with WhenEmpty in the v1beta1 API
YAML
depends_on = [helm_release.karpenter]
}
```
Migration from Cluster Autoscaler
1. Plan the migration
- Identify current node groups and their characteristics
- Map workloads to new NodePool configurations
- Plan for a coexistence period
2. Deploy Karpenter alongside Cluster Autoscaler
- Install Karpenter in the cluster
- Create NodePools with distinct labels (see the sketch after these steps)
- Test with non-critical workloads first
3. Migrate workloads incrementally
- Update pod specs with Karpenter tolerations/node selectors
- Monitor provisioning and consolidation behavior
- Validate cost and performance metrics
4. Remove Cluster Autoscaler
- Once all workloads are migrated, scale down CA node groups
- Remove the Cluster Autoscaler deployment
- Clean up CA-specific resources
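During the coexistence period, a distinct node label makes it easy to steer workloads onto Karpenter capacity one deployment at a time. A minimal sketch; the label, NodePool name, application name, and image are placeholders, and the EC2NodeClass named default is assumed to exist:
```yaml
# NodePool whose nodes carry a distinct label for the coexistence period
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: migration-pool              # placeholder name
spec:
  template:
    metadata:
      labels:
        provisioned-by: karpenter   # label used to opt workloads in
    spec:
      nodeClassRef:
        name: default               # assumed existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
---
# Workload opted in to Karpenter-provisioned nodes via nodeSelector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: migrated-app                # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: migrated-app
  template:
    metadata:
      labels:
        app: migrated-app
    spec:
      nodeSelector:
        provisioned-by: karpenter   # lands only on nodes from the NodePool above
      containers:
        - name: app
          image: migrated-app:latest   # placeholder image
          resources:
            requests:                  # accurate requests so Karpenter can right-size nodes
              cpu: 250m
              memory: 512Mi
```
Workloads that have not yet been updated keep landing on the Cluster Autoscaler node groups, so both systems can run side by side until the migration completes.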