# karpenter

A skill from the cosmix/claude-loom collection.

## What it does

Dynamically provisions and optimizes Kubernetes nodes using intelligent instance selection, spot management, and cost-efficient scaling strategies.

## Installation

Run the install script directly:

```bash
curl -fsSL https://raw.githubusercontent.com/cosmix/loom/main/install.sh | bash
```

Or clone the repository and run the install script locally:

```bash
git clone https://github.com/cosmix/loom.git
bash install.sh
```
πŸ“– Extracted from docs: cosmix/claude-loom
1Installs
-
AddedFeb 4, 2026

## Skill Details

### SKILL.md

Kubernetes node autoscaling and cost optimization with Karpenter. Use when implementing node provisioning, spot instance management, cluster right-sizing, node consolidation, or reducing compute costs. Covers NodePool configuration, EC2NodeClass setup, disruption budgets, spot/on-demand mix strategies, multi-architecture support, and capacity-type selection.

# Karpenter

## Overview

Karpenter is a Kubernetes node autoscaler that provisions right-sized compute resources in response to changing application load. Unlike Cluster Autoscaler, which scales predefined node groups, Karpenter provisions nodes based on the aggregate resource requirements of pending pods, enabling better bin-packing and cost optimization.

### Key Differences from Cluster Autoscaler

  • Direct provisioning: Talks directly to cloud provider APIs (no node groups required)
  • Fast scaling: Provisions nodes in seconds vs minutes
  • Flexible instance selection: Chooses from all available instance types automatically
  • Consolidation: Actively replaces nodes with cheaper alternatives
  • Spot instance optimization: First-class support with automatic fallback

### When to Use Karpenter

  • Running workloads with diverse resource requirements
  • Need for fast scaling (sub-minute response)
  • Cost optimization with spot instances and Graviton (ARM64)
  • Consolidation to reduce cluster waste and over-provisioning
  • Clusters with unpredictable or bursty workloads
  • Right-sizing infrastructure to actual usage patterns
  • Managing mixed capacity types (spot/on-demand) automatically

## Instructions

### 1. Installation and Setup

  • Install the Karpenter controller in the cluster (e.g. via Helm; see the values sketch after this list)
  • Configure cloud provider credentials (IAM roles)
  • Set up instance profiles and security groups
  • Create NodePools for different workload types
  • Define EC2NodeClass (AWS) or equivalent for your provider
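
For the Helm-based controller install, a minimal values sketch might look like the following. This assumes an EKS cluster with IRSA, and every name here (cluster, queue, account ID, role) is a placeholder to substitute; the same chart settings appear in the Terraform example later in this document.

```yaml
# Hypothetical values.yaml for the Karpenter Helm chart; all names below
# are placeholders for your environment.
settings:
  clusterName: my-cluster                 # EKS cluster Karpenter manages
  interruptionQueue: my-cluster-karpenter # SQS queue for spot interruption events
serviceAccount:
  annotations:
    # IAM role assumed by the controller via IRSA
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole-my-cluster
controller:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
```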

### 2. Design NodePool Strategy

  • Separate NodePools for different workload classes
  • Define instance type families and sizes
  • Configure spot/on-demand mix
  • Set resource limits per NodePool
  • Plan for multi-AZ distribution

### 3. Configure Disruption Management

  • Set disruption budgets to control churn
  • Configure consolidation policies
  • Define expiration windows for node lifecycle
  • Handle workload-specific disruption constraints (for example, the pod-level opt-out sketched after this list)
  • Test disruption scenarios
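
For pods that must not be moved by consolidation (in-flight batch work, singleton leaders), Karpenter's v1beta1 API honors a pod-level opt-out annotation. A minimal sketch; the job name and image are placeholders:

```yaml
# Sketch: opt a pod out of voluntary disruption (consolidation, expiration)
# for its lifetime via Karpenter's do-not-disrupt annotation.
apiVersion: v1
kind: Pod
metadata:
  name: nightly-batch-job # placeholder name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: batch-worker:latest # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```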

### 4. Optimize for Cost and Performance

  • Enable consolidation for cost savings
  • Use spot instances with fallback strategies
  • Set appropriate resource requests on pods (Karpenter depends on accurate requests)
  • Monitor node utilization and waste
  • Adjust instance type restrictions based on usage
  • Leverage Graviton (ARM64) instances for 20% cost reduction
  • Configure capacity-type weighting to prefer spot over on-demand (see the weighted NodePool sketch after this list)
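
One way to express that weighting is a pair of NodePools: a high-weight spot-only pool, and a low-weight on-demand pool as fallback. This is a sketch, not the only approach; a single NodePool that allows both capacity types (as in Example 6 below) also lets Karpenter prefer spot. It reuses the `default` EC2NodeClass from Example 2.

```yaml
# Sketch: prefer spot via NodePool weight. Karpenter considers
# higher-weight pools first and can fall back to the on-demand pool
# when spot capacity cannot be provisioned.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-preferred
spec:
  weight: 100 # evaluated before lower-weight pools
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 10 # used when the spot pool cannot satisfy demand
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```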

### 5. Cost Optimization Strategies

  • Spot instances: Configure 70-90% spot mix for fault-tolerant workloads
  • Graviton (ARM64): Use c7g, m7g, r7g families for lower costs
  • Consolidation: Enable WhenUnderutilized policy to replace expensive nodes
  • Instance diversity: Wide instance family selection improves spot availability
  • Right-sizing: Let Karpenter bin-pack efficiently instead of over-provisioning

### 6. Spot Instance Management

  • Use wide instance type selection (10+ families) for better spot availability
  • Configure automatic fallback to on-demand when spot unavailable
  • Implement Pod Disruption Budgets to control blast radius
  • Set graceful termination handlers in applications (preStop hooks)
  • Monitor spot interruption rates and adjust instance selection
  • Use diverse availability zones to reduce correlated failures

### 7. Node Consolidation

  • WhenUnderutilized: Replaces nodes with cheaper/smaller alternatives actively
  • WhenEmpty: Only consolidates completely empty nodes (conservative)
  • Configure consolidateAfter delay to prevent churn (30s-600s typical)
  • Use disruption budgets to limit consolidation rate (5-20% per window)
  • Respect Pod Disruption Budgets during consolidation
  • Set expiration windows to force periodic node refresh

## Best Practices

  1. Start Conservative: Begin with restrictive instance types, expand based on observation
  2. Use Disruption Budgets: Prevent too many nodes from being disrupted simultaneously
  3. Set Pod Resource Requests: Karpenter relies on accurate requests for scheduling
  4. Enable Consolidation: Let Karpenter optimize node utilization automatically
  5. Separate Workload Classes: Use multiple NodePools for different requirements
  6. Monitor Provisioning: Track provisioning latency and failures
  7. Test Spot Interruptions: Ensure graceful handling of spot instance terminations
  8. Use Topology Spread: Combine with pod topology constraints for availability

## Examples

### Example 1: Basic NodePool with Multiple Instance Types

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  # Template for nodes created by this NodePool
  template:
    # Metadata applied to nodes
    metadata:
      labels:
        workload-type: general
        managed-by: karpenter
    spec:
      # Reference to the EC2NodeClass (AWS-specific configuration)
      nodeClassRef:
        name: default
      # Requirements that constrain instance selection
      requirements:
        # Use amd64 or arm64 architectures
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Allow multiple instance families
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6a", "c6i", "c7i", "m6a", "m6i", "m7i", "r6a", "r6i", "r7i"]
        # Allow a range of instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        # Allow both spot and on-demand capacity (Karpenter favors the
        # cheaper option that satisfies the pending pods)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Spread across availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b", "us-west-2c"]
      # Kubelet configuration
      kubelet:
        # Maximum pods per node
        maxPods: 110
        # Resources reserved for system components
        systemReserved:
          cpu: 100m
          memory: 100Mi
          ephemeral-storage: 1Gi
        # Eviction thresholds
        evictionHard:
          memory.available: 5%
          nodefs.available: 10%
        # Image garbage collection
        imageGCHighThresholdPercent: 85
        imageGCLowThresholdPercent: 80
      # Taints applied to nodes
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule
  # Limits for this NodePool
  limits:
    cpu: 1000
    memory: 1000Gi
  # Disruption controls
  disruption:
    # Consolidation policy
    consolidationPolicy: WhenUnderutilized
    # Delay before a node becomes eligible for consolidation
    consolidateAfter: 30s
    # Budgets control the rate of disruptions
    budgets:
      - nodes: 10%
  # NodePool weight for provisioning decisions (higher = preferred)
  weight: 10
```

### Example 2: EC2NodeClass for AWS-Specific Configuration

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AMI selection
  amiFamily: AL2
  # Alternative: use a specific AMI selector
  # amiSelectorTerms:
  #   - id: ami-0123456789abcdef0
  #   - tags:
  #       karpenter.sh/discovery: my-cluster
  # IAM role for nodes (instance profile)
  role: KarpenterNodeRole-my-cluster
  # Subnet selection - use tags to identify subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        kubernetes.io/role/internal-elb: "1"
  # Security group selection
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
    - name: my-cluster-node-security-group
  # User data for node initialization
  userData: |
    #!/bin/bash
    echo "Custom node initialization"
    # Configure container runtime
    # Set up logging
    # Install monitoring agents
  # Block device mappings for EBS volumes
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
        deleteOnTermination: true
  # Metadata options for IMDS
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  # Detailed monitoring
  detailedMonitoring: true
  # Tags applied to EC2 instances
  tags:
    Name: karpenter-node
    Environment: production
    ManagedBy: karpenter
    ClusterName: my-cluster
```

### Example 3: Specialized NodePools for Different Workloads

```yaml
---
# GPU workload NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: gpu
        nvidia.com/gpu: "true"
    spec:
      nodeClassRef:
        name: gpu-nodes
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6", "p4", "p5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # GPU instances typically on-demand
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 500
    memory: 2000Gi
    nvidia.com/gpu: 16
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
---
# Batch/spot-heavy NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: batch
        spot-interruption-handler: enabled
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"] # Only spot instances
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6a", "c6i", "c7i", "m6a", "m6i"] # Compute and general-purpose families
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["2xlarge", "4xlarge", "8xlarge"]
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
    budgets:
      - nodes: 20% # Allow more aggressive disruption for batch
---
# Stateful workload NodePool (on-demand only)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: stateful-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: stateful
        storage-optimized: "true"
    spec:
      nodeClassRef:
        name: stateful-nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"] # Only on-demand for stability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["r6i", "r7i"] # Memory-optimized
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
      kubelet:
        maxPods: 50 # Lower density for stateful workloads
      taints:
        - key: workload-type
          value: stateful
          effect: NoSchedule
  limits:
    cpu: 200
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmpty # Only consolidate completely empty nodes
    consolidateAfter: 600s # Wait 10 minutes
    budgets:
      - nodes: 1 # Very conservative disruption
```

### Example 4: Disruption Budgets and Consolidation Policies

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: production-apps
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "m6i", "r6i"]
  # Advanced disruption configuration
  disruption:
    # Consolidation policy options:
    # - WhenUnderutilized: replace nodes with cheaper/smaller nodes
    # - WhenEmpty: only remove completely empty nodes
    consolidationPolicy: WhenUnderutilized
    # How soon a node becomes eligible for consolidation
    consolidateAfter: 30s
    # Expiration - force node replacement after this period
    expireAfter: 720h # 30 days
    # Multiple budget windows for different times/scenarios
    budgets:
      # During business hours: conservative disruption
      - nodes: 5%
        schedule: "0 8 * * mon-fri"
        duration: 8h
      # During off-hours: more aggressive consolidation
      - nodes: 20%
        schedule: "0 18 * * mon-fri"
        duration: 16h
      # Weekends: most aggressive
      - nodes: 30%
        schedule: "0 0 * * sat"
        duration: 48h
      # Default budget (always active)
      - nodes: 10%
```

### Example 5: Pod Scheduling with Karpenter

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      # Tolerations to allow scheduling on Karpenter nodes
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule
      # Node selector to target a specific NodePool
      nodeSelector:
        workload-type: general
        karpenter.sh/capacity-type: spot # Hard requirement: spot capacity only
      # Affinity rules for better placement
      affinity:
        # Spread replicas across zones for availability
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-application
                topologyKey: topology.kubernetes.io/zone
        # Node affinity for instance type preferences
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # Prefer ARM instances (cheaper)
            - weight: 50
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]
            # Prefer larger instances (better bin-packing)
            - weight: 30
              preference:
                matchExpressions:
                  - key: karpenter.k8s.aws/instance-size
                    operator: In
                    values: ["2xlarge", "4xlarge"]
      # Topology spread constraints
      topologySpreadConstraints:
        # Spread across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application
        # Spread across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-application
      containers:
        - name: app
          image: my-app:latest
          # CRITICAL: accurate resource requests drive Karpenter's provisioning
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi
          # Graceful shutdown for spot interruptions
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - sleep 15 # Allow time for deregistration
      # Termination grace period for spot interruptions
      terminationGracePeriodSeconds: 30
```

### Example 6: Spot Instance Handling and Fallback

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-with-fallback
spec:
  template:
    # Metadata to track spot usage
    metadata:
      labels:
        spot-enabled: "true"
      annotations:
        karpenter.sh/spot-to-spot-consolidation: "true"
    spec:
      nodeClassRef:
        name: default
      requirements:
        # Prioritize spot, but allow on-demand as fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Wide instance type selection for better spot availability
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - "c5a"
            - "c6a"
            - "c6i"
            - "c7i"
            - "m5a"
            - "m6a"
            - "m6i"
            - "m7i"
            - "r5a"
            - "r6a"
            - "r6i"
            - "r7i"
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        # Support both architectures for more spot options
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
    # More aggressive budget, since spot nodes can be interrupted anyway
    budgets:
      - nodes: 25%
  # Weight influences Karpenter's NodePool selection (higher = more
  # preferred); use a lower weight so other NodePools are tried first
  weight: 5
```

### Example 7: Karpenter with Pod Disruption Budget

```yaml
# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule
      containers:
        - name: app
          image: critical-service:latest
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 4Gi
---
# Pod Disruption Budget to protect the service during consolidation.
# Karpenter respects PDBs: it will not disrupt a node if doing so
# would violate the budget.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 4 # Always keep at least 4 replicas running
  selector:
    matchLabels:
      app: critical-service
```

### Example 8: Multi-Architecture NodePool

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    metadata:
      labels:
        multi-arch: "true"
    spec:
      nodeClassRef:
        name: default
      requirements:
        # Support both AMD64 and ARM64
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # ARM instances (Graviton) - typically 20% cheaper
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            # ARM (Graviton2)
            - "c6g"
            - "m6g"
            - "r6g"
            # ARM (Graviton3)
            - "c7g"
            - "m7g"
            - "r7g"
            # AMD64 alternatives
            - "c6i"
            - "m6i"
            - "r6i"
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 60s
---
# EC2NodeClass with multi-architecture AMI support
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AL2 automatically selects the right AMI for each architecture
  amiFamily: AL2
  # Alternative: explicit AMI selection by architecture
  # amiSelectorTerms:
  #   - tags:
  #       karpenter.sh/discovery: my-cluster
  #       kubernetes.io/arch: amd64
  #   - tags:
  #       karpenter.sh/discovery: my-cluster
  #       kubernetes.io/arch: arm64
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

## Monitoring and Troubleshooting

### Key Metrics to Monitor

```text
# Provisioning metrics
karpenter_nodes_created_total
karpenter_nodes_terminated_total
karpenter_provisioner_scheduling_duration_seconds

# Disruption metrics
karpenter_disruption_replacement_node_initialized_seconds
karpenter_disruption_consolidation_actions_performed_total
karpenter_disruption_budgets_allowed_disruptions

# Cost metrics
karpenter_provisioner_instance_type_price_estimate
karpenter_cloudprovider_instance_type_offering_price_estimate

# Pod metrics
karpenter_pods_state (pending, running, etc.)
```
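
As a sketch of wiring these metrics into alerting, assuming the Prometheus Operator CRDs are installed; the threshold is illustrative, and label names on Karpenter metrics vary between versions, so verify expressions against your controller's `/metrics` endpoint:

```yaml
# Hypothetical PrometheusRule built on the metrics listed above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-alerts
  namespace: karpenter
spec:
  groups:
    - name: karpenter
      rules:
        # Sustained node creation may indicate consolidation thrash;
        # the 0.1 nodes/sec threshold is a placeholder to tune.
        - alert: KarpenterHighNodeChurn
          expr: sum(rate(karpenter_nodes_created_total[30m])) > 0.1
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: Karpenter is creating nodes unusually often; review consolidateAfter and disruption budgets.
```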

### Common Issues and Solutions

#### Issue: Pods stuck in Pending

  • Check NodePool requirements match pod node selectors/tolerations
  • Verify cloud provider limits not exceeded
  • Check instance type availability in selected zones
  • Ensure subnet capacity available

#### Issue: Excessive node churn

  • Adjust consolidation delay (consolidateAfter)
  • Review disruption budgets
  • Check if pod resource requests are accurate
  • Consider using WhenEmpty instead of WhenUnderutilized

#### Issue: High costs despite using Karpenter

  • Enable consolidation if not already active
  • Verify spot instances are being used
  • Check if pods have unnecessarily large resource requests
  • Review instance type selection (allow more variety)

#### Issue: Spot interruptions causing service disruption

  • Implement Pod Disruption Budgets
  • Use diverse instance types for better spot availability
  • Configure appropriate replica counts
  • Implement graceful shutdown in applications

## Integration with Terraform

```hcl
# Install Karpenter via Terraform
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "v0.33.0"

  values = [
    <<-EOT
    settings:
      clusterName: ${var.cluster_name}
      clusterEndpoint: ${var.cluster_endpoint}
      interruptionQueue: ${var.interruption_queue_name}
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: ${var.karpenter_irsa_arn}
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 2
          memory: 2Gi
    EOT
  ]

  depends_on = [
    aws_iam_role_policy_attachment.karpenter_controller
  ]
}

# Deploy a default NodePool
resource "kubectl_manifest" "karpenter_nodepool_default" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          nodeClassRef:
            name: default
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["c6i", "m6i", "r6i"]
      limits:
        cpu: 1000
        memory: 1000Gi
      disruption:
        consolidationPolicy: WhenUnderutilized
        consolidateAfter: 30s
  YAML

  depends_on = [helm_release.karpenter]
}
```

## Migration from Cluster Autoscaler

  1. Plan the migration
     - Identify current node groups and their characteristics
     - Map workloads to new NodePool configurations
     - Plan for a coexistence period
  2. Deploy Karpenter alongside Cluster Autoscaler
     - Install Karpenter in the cluster
     - Create NodePools with distinct labels
     - Test with non-critical workloads first (see the canary sketch after this list)
  3. Migrate workloads incrementally
     - Update pod specs with Karpenter tolerations/node selectors
     - Monitor provisioning and consolidation behavior
     - Validate cost and performance metrics
  4. Remove Cluster Autoscaler
     - Once all workloads are migrated, scale down CA node groups
     - Remove the Cluster Autoscaler deployment
     - Clean up CA-specific resources
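
During the coexistence phase, one way to steer a single non-critical workload onto Karpenter-managed capacity is to match the labels and taints from the NodePool template in Example 1. A sketch with a placeholder Deployment:

```yaml
# Sketch: a canary Deployment pinned to Karpenter nodes during migration.
# The name and image are placeholders; the label and toleration values
# match the NodePool template in Example 1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-canary # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: karpenter-canary
  template:
    metadata:
      labels:
        app: karpenter-canary
    spec:
      nodeSelector:
        managed-by: karpenter # label applied by the NodePool in Example 1
      tolerations:
        - key: workload-type
          operator: Equal
          value: general
          effect: NoSchedule
      containers:
        - name: app
          image: karpenter-canary:latest # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
```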
