eks-troubleshooting

Skill from adaptationio/skrillz

What it does

eks-troubleshooting skill from adaptationio/skrillz


Part of

adaptationio/skrillz (191 items)


Installation

Add marketplace to Claude Code
/plugin marketplace add adaptationio/Skrillz
Install plugin from marketplace
/plugin install skrillz@adaptationio-Skrillz
Enable plugin in Claude Code
/plugin enable skrillz@adaptationio-Skrillz
Add local marketplace to Claude Code
/plugin marketplace add /path/to/skrillz
Install plugin from local marketplace
/plugin install skrillz@local


Skill Details

SKILL.md

EKS troubleshooting and debugging guide covering pod failures, cluster issues, networking problems, and performance diagnostics. Use when diagnosing cluster issues, debugging pod failures (CrashLoopBackOff, Pending, OOMKilled), resolving networking problems, investigating performance issues, troubleshooting IAM/IRSA permissions, fixing image pull errors, or analyzing EKS cluster health.

# EKS Troubleshooting Guide

Overview

Comprehensive troubleshooting guide for Amazon EKS clusters covering control plane issues, node problems, pod failures, networking, storage, security, and performance diagnostics. Based on 2025 AWS best practices.

Keywords: EKS, Kubernetes, kubectl, debugging, troubleshooting, pod failure, node issues, networking, DNS, AWS, diagnostics

When to Use This Skill

Pod Issues:

  • Pods stuck in Pending, CrashLoopBackOff, ImagePullBackOff
  • OOMKilled containers or CPU throttling
  • Pod evictions or unexpected terminations
  • Application errors in container logs

Cluster Issues:

  • Nodes showing NotReady status
  • Control plane API server timeouts
  • Cluster autoscaling failures
  • etcd performance problems

Networking Problems:

  • DNS resolution failures
  • Service connectivity issues
  • Load balancer provisioning errors
  • Cross-AZ traffic problems
  • VPC CNI IP exhaustion

Security & Permissions:

  • IAM/IRSA permission denied errors
  • RBAC access issues
  • Image pull authentication failures
  • Service account problems

Performance Issues:

  • Slow pod startup times
  • High memory/CPU usage
  • Resource contention
  • Network latency

Quick Diagnostic Workflow

1. Initial Triage (First 60 Seconds)

```bash
# Check cluster accessibility
kubectl cluster-info

# Get overall cluster status
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Check recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Check control plane components
kubectl get --raw /healthz
kubectl get componentstatuses # Deprecated in 1.19+ but still useful
```

2. Identify Problem Area

Pod Issues:

```bash
# Get pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container logs
```

Node Issues:

```bash
# Check node status
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl top nodes # Requires metrics-server
```

Cluster-Wide Issues:

```bash
# Check all resources
kubectl get all --all-namespaces
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check EKS cluster health (AWS CLI)
aws eks describe-cluster --name <cluster-name> --query 'cluster.health'
```

3. Deep Dive (See Reference Guides)

  • Pod/Workload Issues → references/workload-issues.md
  • Node/Cluster Issues → references/cluster-issues.md
  • Networking Issues → references/networking-issues.md

Common Pod Failure Patterns

Pending Pods

Symptoms:

  • Pod stuck in Pending state indefinitely
  • No containers running

Quick Check:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
```

Common Causes:

  1. Insufficient Resources: No nodes with available CPU/memory
  2. Node Selector Mismatch: Pod requires node labels that don't exist
  3. PVC Not Available: Persistent volume claim not bound
  4. Taints/Tolerations: Pod can't tolerate node taints

Quick Fixes:

```bash
# Check resource requests
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 resources

# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# For Karpenter clusters - check provisioner logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
```
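
If the quick fixes above don't explain it, the scheduler is often blocked by a node selector or taint mismatch. A minimal sketch for comparing the two sides (pod name, namespace, and label values are placeholders):

```bash
# What the pod demands
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'

# What the nodes actually offer
kubectl get nodes --show-labels
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
```

Any mismatch between the two usually also shows up verbatim in the Events output from the Quick Check above.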

CrashLoopBackOff

Symptoms:

  • Pod repeatedly crashing and restarting
  • Restart count keeps increasing

Quick Check:

```bash
# View current logs
kubectl logs <pod-name> -n <namespace>

# View previous container logs (most useful)
kubectl logs <pod-name> -n <namespace> --previous

# Get exit code
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Common Exit Codes:

  • 0: Success (shouldn't crash)
  • 1: Application error
  • 137: SIGKILL (OOMKilled - out of memory)
  • 139: SIGSEGV (segmentation fault)
  • 143: SIGTERM (terminated)

Quick Fixes:

```bash
# Check for OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -i oom

# Increase memory limit
kubectl set resources deployment <deployment-name> \
  -c <container-name> \
  --limits=memory=512Mi

# Check liveness/readiness probes
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 Probe
```
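
When the container dies too fast to exec into, one option is to clone the pod with its entrypoint replaced by a shell so you can inspect the filesystem and environment at leisure. A sketch assuming a single-container pod; all names are placeholders:

```bash
# Copy the pod, override its command with a shell, and attach
kubectl debug <pod-name> -n <namespace> -it \
  --copy-to=<pod-name>-debug \
  --container=<container-name> \
  -- sh

# Clean up the copy afterwards
kubectl delete pod <pod-name>-debug -n <namespace>
```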

ImagePullBackOff

Symptoms:

  • Pod can't pull container image
  • ErrImagePull or ImagePullBackOff status

Quick Check:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Failed to pull image"
```

Common Causes:

  1. ECR Authentication: Service account lacks ECR pull permissions
  2. Image Doesn't Exist: Wrong repository, tag, or region
  3. Rate Limiting: Docker Hub rate limits exceeded
  4. Registry Unreachable: Network connectivity issues

Quick Fixes:

```bash
# Check if image exists (for ECR)
aws ecr describe-images --repository-name <repo-name> --image-ids imageTag=<tag>

# Verify IRSA role has ECR permissions
kubectl describe serviceaccount <sa-name> -n <namespace> | grep Annotations

# For ECR - ensure IAM role has this policy:
# AmazonEC2ContainerRegistryReadOnly or ecr:GetAuthorizationToken, ecr:BatchGetImage

# Test image pull manually on node
kubectl debug node/<node-name> -it --image=busybox
# Then: docker pull <image>
```
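
If the failure is ECR authorization rather than a missing image, attaching the managed read-only policy to the node's instance role (or to the IRSA role the pod assumes) is usually the fix. A hedged sketch; the role name is a placeholder for whatever role your nodes or service account actually use:

```bash
# Attach the AWS-managed ECR read-only policy to the role
aws iam attach-role-policy \
  --role-name <node-instance-role> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

# Verify the attachment
aws iam list-attached-role-policies --role-name <node-instance-role>
```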

OOMKilled

Symptoms:

  • Container killed due to out of memory
  • Exit code 137
  • Last State shows "OOMKilled"

Quick Check:

```bash
# Check memory limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 3 "limits:"

# Check actual memory usage
kubectl top pod <pod-name> -n <namespace>
```

Quick Fix:

```bash
# Increase memory limit
kubectl set resources deployment <deployment-name> \
  --limits=memory=1Gi \
  --requests=memory=512Mi
```
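
To confirm the kill really was memory-related (and not a probe failure or node-pressure eviction), check the last terminated state and compare live usage against the new limit. Placeholders as above:

```bash
# Should print "OOMKilled" if the kernel killed the container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Per-container usage vs. the limit you just set (requires metrics-server)
kubectl top pod <pod-name> -n <namespace> --containers
```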

Common Node Issues

Node NotReady

Quick Check:

```bash
kubectl get nodes
kubectl describe node <node-name> | grep -A 10 Conditions
```

Common Causes:

  1. Disk Pressure: Node running out of disk space
  2. Memory Pressure: Node running out of memory
  3. Network Issues: kubelet can't reach API server
  4. Kubelet Issues: kubelet service crashed or unhealthy

Quick Fixes:

```bash
# Check node conditions
kubectl describe node <node-name> | grep -E "MemoryPressure|DiskPressure|PIDPressure"

# For EKS managed nodes - check EC2 instance health
aws ec2 describe-instance-status --instance-ids <instance-id>

# Drain and delete node (if managed node group - ASG will replace)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```

IP Address Exhaustion

Symptoms:

  • Pods can't get IP addresses
  • Nodes showing high ENI usage
  • CNI errors in logs

Quick Check:

```bash
# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Check IP addresses per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ADDRESSES:.status.addresses[*].address

# Check ENI usage
aws ec2 describe-instances --instance-ids <instance-id> \
  --query 'Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses'
```

Quick Fixes:

```bash
# Enable prefix delegation (for new nodes)
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true

# Increase warm pool targets
kubectl set env daemonset aws-node \
  -n kube-system \
  WARM_IP_TARGET=5
```
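
Prefix delegation only affects nodes launched after the setting changes, so it's worth confirming both that the env vars landed and that newly launched nodes report a higher pod capacity. A quick sketch using the standard kube-system objects:

```bash
# Confirm the daemonset picked up the new environment
kubectl describe daemonset aws-node -n kube-system | grep -E 'ENABLE_PREFIX_DELEGATION|WARM_IP_TARGET'

# Allocatable pods per node - new nodes should show the higher figure
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
```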

Essential kubectl Commands

Information Gathering

```bash
# Pod debugging
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100 -f
kubectl logs <pod-name> -n <namespace> -c <container-name> # Multi-container pods
kubectl logs <pod-name> -n <namespace> --previous # Previous crash logs

# Node debugging
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl top nodes
kubectl top pods -n <namespace>

# Events (VERY useful)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50

# Resource usage
kubectl top pods -n <namespace> --containers
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
```

Advanced Debugging

```bash
# Execute command in running pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Port forward for local testing
kubectl port-forward <pod-name> -n <namespace> 8080:80

# Debug with ephemeral container (K8s 1.23+)
kubectl debug <pod-name> -n <namespace> -it --image=busybox

# Debug node issues
kubectl debug node/<node-name> -it --image=ubuntu

# Copy files from pod
kubectl cp <namespace>/<pod-name>:/path/to/file ./local-file

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

# Get all resources in namespace
kubectl get all -n <namespace>
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
```

Filtering & Formatting

```bash
# Get pods not running
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Get pods by label
kubectl get pods -n <namespace> -l app=myapp

# Custom columns
kubectl get pods -n <namespace> -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.phase,\
NODE:.spec.nodeName,\
IP:.status.podIP

# JSONPath queries
kubectl get pods -n <namespace> -o jsonpath='{.items[*].metadata.name}'

# Get pod restart count
kubectl get pods -n <namespace> -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
```

AWS-Specific Troubleshooting

EKS Cluster Health

```bash
# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.{Status:status,Health:health,Version:version}'

# List all EKS clusters
aws eks list-clusters

# Check addon status
aws eks list-addons --cluster-name <cluster-name>
aws eks describe-addon --cluster-name <cluster-name> --addon-name vpc-cni

# Update kubeconfig
aws eks update-kubeconfig --name <cluster-name> --region <region>
```

IAM/IRSA Troubleshooting

```bash
# Check service account IRSA annotation
kubectl get sa <sa-name> -n <namespace> -o yaml | grep eks.amazonaws.com/role-arn

# Verify pod has correct service account
kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName

# Check if pod has AWS credentials
kubectl exec <pod-name> -n <namespace> -- env | grep AWS

# Test IAM permissions from pod
kubectl exec <pod-name> -n <namespace> -- aws sts get-caller-identity
kubectl exec <pod-name> -n <namespace> -- aws s3 ls # Test S3 access
```
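
If the annotation is missing or points at the wrong role, wiring it up looks roughly like this. The role ARN, account ID, and workload names are placeholders, and pods only pick up IRSA credentials at startup, hence the restart:

```bash
# Point the service account at the IAM role it should assume
kubectl annotate serviceaccount <sa-name> -n <namespace> \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<role-name> --overwrite

# Restart the workload so new pods mount the web identity token
kubectl rollout restart deployment <deployment-name> -n <namespace>
```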

ECR Authentication

```bash
# Get ECR login password
aws ecr get-login-password --region <region>

# Test ECR access
aws ecr describe-repositories --region <region>

# Check if IAM role/user has ECR permissions
aws iam get-role --role-name <role-name>
aws iam list-attached-role-policies --role-name <role-name>
```
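
For registries the node role can't reach (cross-account ECR, for example), a pull secret built from the ECR token is one workaround. A sketch with placeholder account and region; note the token expires after roughly 12 hours, so this is best automated rather than done by hand:

```bash
# Create a docker-registry secret from the ECR token
kubectl create secret docker-registry ecr-pull \
  --docker-server=<account-id>.dkr.ecr.<region>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <region>)" \
  -n <namespace>

# Make the namespace's default service account use it
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets":[{"name":"ecr-pull"}]}'
```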

Karpenter Troubleshooting

```bash
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100 -f

# Check provisioner configuration
kubectl get provisioner -o yaml

# Check Karpenter controller status
kubectl get pods -n karpenter

# Debug why nodes not provisioning
kubectl describe provisioner default
kubectl get events -n karpenter --sort-by='.lastTimestamp'
```
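
To see what Karpenter is (or isn't) reacting to, look for unschedulable pods and their FailedScheduling events, since those are the triggers for provisioning. A rough sketch:

```bash
# Scheduling failures Karpenter should be responding to
kubectl get events --all-namespaces --field-selector reason=FailedScheduling \
  --sort-by='.lastTimestamp' | tail -20

# Pods still waiting for a node
kubectl get pods --all-namespaces --field-selector status.phase=Pending -o wide
```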

Performance Diagnostics

Resource Contention

```bash
# Check node resource usage
kubectl top nodes

# Check pod resource usage
kubectl top pods -n <namespace> --sort-by=memory
kubectl top pods -n <namespace> --sort-by=cpu

# Check resource requests vs limits
kubectl get pods -n <namespace> -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
CPU_LIM:.spec.containers[*].resources.limits.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory,\
MEM_LIM:.spec.containers[*].resources.limits.memory
```

Application Performance

```bash
# Check pod readiness/liveness probes
kubectl get pods -n <namespace> -o yaml | grep -A 10 Probe

# Check pod startup time
kubectl describe pod <pod-name> -n <namespace> | grep Started

# Profile application in pod
kubectl exec <pod-name> -n <namespace> -- top
kubectl exec <pod-name> -n <namespace> -- netstat -tuln
```

Log Analysis

CloudWatch Container Insights

```bash
# Enable control plane logging to CloudWatch (if not enabled)
# Via AWS CLI:
aws eks update-cluster-config \
  --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# Check control plane and Container Insights logs in CloudWatch
# Log groups:
# - /aws/eks/<cluster-name>/cluster
# - /aws/containerinsights/<cluster-name>/application
# - /aws/containerinsights/<cluster-name>/host
# - /aws/containerinsights/<cluster-name>/dataplane
```
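
Once logs are flowing, you can tail a Container Insights log group straight from the CLI instead of the console. A sketch, assuming the standard log group name and that Fluent Bit is shipping application logs:

```bash
# Follow recent application log entries containing ERROR
aws logs tail /aws/containerinsights/<cluster-name>/application \
  --follow --since 15m --filter-pattern ERROR
```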

Fluent Bit Logs

```bash
# Check Fluent Bit daemonset
kubectl get pods -n amazon-cloudwatch -l k8s-app=fluent-bit

# Check Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit --tail=50
```

Emergency Procedures

Pod Stuck Terminating

```bash
# Force delete pod
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

# If still stuck, remove finalizers
kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'
```

Node Stuck Draining

```bash
# Force drain
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

# If still stuck, delete pods directly
kubectl delete pods -n <namespace> --field-selector spec.nodeName=<node-name> --force --grace-period=0
```

Cluster Unresponsive

```bash
# Check API server health
kubectl get --raw /healthz

# Check control plane logs (CloudWatch)
aws logs tail /aws/eks/<cluster-name>/cluster --follow

# Restart coredns if DNS issues
kubectl rollout restart deployment coredns -n kube-system

# Check etcd health (EKS manages this, but can check API responsiveness)
time kubectl get nodes # Should be < 1 second
```

Prevention Best Practices

Resource Management

```yaml
# Always set resource requests and limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m" # CPU limits optional, can cause throttling
```
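
To avoid depending on every manifest getting this right, a namespace-level LimitRange can supply defaults for containers that omit requests and limits. The values below are illustrative only; tune them per workload:

```yaml
# Namespace-wide container defaults (placeholder namespace)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: <namespace>
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "256Mi"
        cpu: "250m"
      default:
        memory: "512Mi"
        cpu: "500m"
```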

Health Checks

```yaml
# Implement probes for all applications
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
startupProbe: # For slow-starting apps
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```

High Availability

```yaml
# Pod disruption budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
```
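
A PDB only helps if the replicas aren't all on one node or in one availability zone. A complementary, illustrative pod-template snippet that spreads replicas across zones:

```yaml
# Spread replicas across AZs so a single-AZ failure or drain doesn't take the app down
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: myapp
```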

Reference Guides

For detailed troubleshooting of specific areas:

  • [Cluster & Node Issues](references/cluster-issues.md) - Control plane, nodes, autoscaling, etcd
  • [Workload Issues](references/workload-issues.md) - Pods, deployments, jobs, statefulsets
  • [Networking Issues](references/networking-issues.md) - DNS, connectivity, load balancers, CNI

Additional Resources

AWS Documentation:

  • [EKS Best Practices Guide](https://aws.github.io/aws-eks-best-practices/)
  • [EKS User Guide](https://docs.aws.amazon.com/eks/latest/userguide/)
  • [EKS Troubleshooting](https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html)

Kubernetes Documentation:

  • [Kubernetes Debugging Guide](https://kubernetes.io/docs/tasks/debug/)
  • [Application Introspection and Debugging](https://kubernetes.io/docs/tasks/debug/debug-application/)

Tools:

  • [kubectl-debug](https://github.com/aylei/kubectl-debug) - Advanced debugging
  • [Popeye](https://github.com/derailed/popeye) - Cluster sanitizer
  • [kube-bench](https://github.com/aquasecurity/kube-bench) - CIS benchmark checker

---

Quick Start: Use the diagnostic workflow above → identify the issue type → jump to the relevant reference guide

Last Updated: November 27, 2025 (2025 AWS Best Practices)