🎯

ecs-troubleshooting

🎯Skill

from adaptationio/skrillz

What it does

ecs-troubleshooting skill from adaptationio/skrillz

📦

Part of

adaptationio/skrillz(191 items)

ecs-troubleshooting

Installation

Add MarketplaceAdd marketplace to Claude Code

/plugin marketplace add adaptationio/Skrillz

Install PluginInstall plugin from marketplace

/plugin install skrillz@adaptationio-Skrillz

Claude CodeAdd plugin in Claude Code

/plugin enable skrillz@adaptationio-Skrillz

Add MarketplaceAdd marketplace to Claude Code

/plugin marketplace add /path/to/skrillz

Install PluginInstall plugin from marketplace

/plugin install skrillz@local

+ 4 more commands

📖 Extracted from docs: adaptationio/skrillz

Need more details? View full documentation on GitHub →

1Installs

Last UpdatedJan 16, 2026

View on GitHub Back to Skills

Skill Details

SKILL.md

ECS troubleshooting and debugging guide covering task failures, service issues, networking problems, and performance diagnostics. Use when diagnosing ECS issues, debugging task failures (STOPPED, PENDING), resolving networking problems, investigating IAM/permissions errors, troubleshooting container health checks, or analyzing ECS service health.

Overview

# ECS Troubleshooting Guide

Complete guide to diagnosing and resolving common ECS issues.

Quick Diagnostic Commands

```bash

# Check service status

aws ecs describe-services \

--cluster production \

--services my-service \

--query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'

# List stopped tasks (failures)

aws ecs list-tasks \

--cluster production \

--service-name my-service \

--desired-status STOPPED

# Describe stopped task

aws ecs describe-tasks \

--cluster production \

--tasks \

--query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# View recent logs

aws logs tail /ecs/my-app --since 1h --follow

# Execute into container (debug)

aws ecs execute-command \

--cluster production \

--task \

--container my-app \

--interactive \

--command "/bin/sh"

```

Task Failures

Task Status: STOPPED

#### Symptom

Tasks immediately stop after starting or fail to start.

#### Diagnostic Steps

```python

import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):

"""Diagnose why a task stopped"""

response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])

task = response['tasks'][0]

print(f"Task Status: {task['lastStatus']}")

print(f"Stop Code: {task.get('stopCode', 'N/A')}")

print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")

for container in task['containers']:

print(f"\nContainer: {container['name']}")

print(f" Status: {container['lastStatus']}")

print(f" Exit Code: {container.get('exitCode', 'N/A')}")

print(f" Reason: {container.get('reason', 'N/A')}")

```

#### Common Causes & Solutions

1. Essential container failed

```

stoppedReason: "Essential container in task exited"

```

Solution: Check container logs for application errors

```bash

aws logs tail /ecs/my-app --since 30m

```

2. Task failed to start

```

stoppedReason: "Task failed to start"

```

Solution: Check execution role permissions

```bash

# Verify execution role can pull image

aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name ecr-access

```

3. CannotPullContainerError

```

reason: "CannotPullContainerError: Error response from daemon"

```

Solutions:

Check ECR permissions in execution role
Verify image exists: aws ecr describe-images --repository-name my-app
Check VPC endpoints or NAT gateway for private subnets

4. OutOfMemoryError

```

reason: "OutOfMemoryError: Container killed due to memory usage"

exitCode: 137

```

Solution: Increase memory in task definition

```hcl

memory = 2048 # Increase from current value

```

5. Exit Code 1 (Application Error)

```

exitCode: 1

```

Solution: Check application logs for errors

```bash

aws logs filter-events \

--log-group-name /ecs/my-app \

--filter-pattern "ERROR"

```

Task Status: PENDING

#### Symptom

Tasks stuck in PENDING state, not transitioning to RUNNING.

#### Diagnostic Steps

```python

def diagnose_pending_tasks(cluster: str, service: str):

"""Check why tasks are stuck in PENDING"""

# List pending tasks

pending = ecs.list_tasks(

cluster=cluster,

serviceName=service,

desiredStatus='RUNNING'

)

for task_arn in pending['taskArns']:

task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]

if task['lastStatus'] == 'PENDING':

print(f"Task {task_arn.split('/')[-1]} is PENDING")

# Check attachments for ENI issues

for attachment in task.get('attachments', []):

print(f" Attachment: {attachment['type']} - {attachment['status']}")

for detail in attachment.get('details', []):

print(f" {detail['name']}: {detail['value']}")

```

#### Common Causes & Solutions

1. No available capacity

```

Service my-service was unable to place a task because no container instance met all of its requirements

```

Solutions for Fargate:

Check capacity provider limits
Verify subnet has available IPs
Check if region/AZ has Fargate capacity

2. ENI provisioning issues

```

Attachment status: PRECREATED

```

Solutions:

Check security group allows required traffic
Verify subnet has available IPs
Check ENI limits for EC2 instances

3. Image pull taking too long

```

Container image: pulling

```

Solutions:

Check image size (use smaller base images)
Verify network connectivity to ECR
Use VPC endpoints for faster pulls

Service Issues

Service Not Starting Tasks

#### Diagnostic

```bash

# Check service events

aws ecs describe-services \

--cluster production \

--services my-service \

--query 'services[0].events[:10]'

```

#### Common Events & Solutions

1. "service my-service is unable to place a task"

Check task placement constraints and capacity.

2. "service my-service has reached a steady state"

Service is healthy - tasks are running as expected.

3. "service my-service was unable to place a task because no container instance met all requirements"

For Fargate: Check CPU/memory configurations are valid combinations.

Deployment Stuck

#### Symptom

Deployment never reaches COMPLETED state.

#### Diagnostic

```python

def check_deployment_status(cluster: str, service: str):

"""Check deployment progress"""

response = ecs.describe_services(cluster=cluster, services=[service])

svc = response['services'][0]

for deployment in svc['deployments']:

print(f"\nDeployment: {deployment['id']}")

print(f" Status: {deployment['status']}")

print(f" Rollout State: {deployment['rolloutState']}")

print(f" Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")

if deployment['rolloutState'] == 'IN_PROGRESS':

reason = deployment.get('rolloutStateReason', '')

print(f" Reason: {reason}")

```

#### Common Causes

1. Health check failures

```

rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"

```

Solutions:

Check target group health check settings
Increase healthCheckGracePeriodSeconds
Verify application responds on health check path

2. Insufficient capacity

```

rolloutStateReason: "Service my-service was unable to place a task"

```

Solutions:

Check subnet IP availability
Reduce maximumPercent to allow more headroom

Networking Issues

Tasks Cannot Connect to Internet

#### Symptoms

Cannot pull images
Cannot reach external APIs
Timeouts on external calls

#### Solutions

For private subnets:

```hcl

# Option 1: NAT Gateway

resource "aws_nat_gateway" "main" {

allocation_id = aws_eip.nat.id

subnet_id = aws_subnet.public.id

}

# Option 2: VPC Endpoints (recommended)

resource "aws_vpc_endpoint" "ecr_api" {

vpc_id = aws_vpc.main.id

service_name = "com.amazonaws.us-east-1.ecr.api"

vpc_endpoint_type = "Interface"

subnet_ids = aws_subnet.private[*].id

}

```

Tasks Cannot Connect to Each Other

#### Symptom

Service-to-service communication fails.

#### Diagnostic

```bash

# Check security group rules

aws ec2 describe-security-groups \

--group-ids sg-12345 \

--query 'SecurityGroups[0].IpPermissions'

```

#### Solutions

```hcl

# Allow traffic between ECS tasks

resource "aws_security_group_rule" "ecs_to_ecs" {

type = "ingress"

from_port = 8080

to_port = 8080

protocol = "tcp"

security_group_id = aws_security_group.ecs_tasks.id

source_security_group_id = aws_security_group.ecs_tasks.id

}

```

Load Balancer Health Checks Failing

#### Symptom

```

Target group app-tg: 0 healthy, 3 unhealthy

```

#### Diagnostic

```bash

# Check target health

aws elbv2 describe-target-health \

--target-group-arn

```

#### Common Causes & Solutions

1. Wrong health check path

```hcl

health_check {

path = "/health" # Must match application endpoint

}

```

2. Container not listening on expected port

```bash

# Verify inside container

aws ecs execute-command --cluster production --task \

--container my-app --interactive --command "netstat -tlnp"

```

3. Security group blocking ALB

```hcl

# Allow ALB to reach ECS tasks

resource "aws_security_group_rule" "alb_to_ecs" {

type = "ingress"

from_port = 8080

to_port = 8080

protocol = "tcp"

security_group_id = aws_security_group.ecs_tasks.id

source_security_group_id = aws_security_group.alb.id

}

```

IAM & Permissions Issues

CannotPullContainerError

#### Symptom

```

CannotPullContainerError: Error response from daemon: pull access denied

```

#### Solution: Task Execution Role

```hcl

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {

role = aws_iam_role.ecs_task_execution.name

policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"

}

# For cross-account ECR

resource "aws_iam_role_policy" "cross_account_ecr" {

role = aws_iam_role.ecs_task_execution.id

policy = jsonencode({

Version = "2012-10-17"

Statement = [{

Effect = "Allow"

Action = [

"ecr:GetDownloadUrlForLayer",

"ecr:BatchGetImage"

]

Resource = "arn:aws:ecr::OTHER_ACCOUNT:repository/"

}]

})

}

```

Secrets Access Denied

#### Symptom

```

ResourceInitializationError: unable to pull secrets

```

#### Solution

```hcl

resource "aws_iam_role_policy" "secrets_access" {

role = aws_iam_role.ecs_task_execution.id

policy = jsonencode({

Version = "2012-10-17"

Statement = [

{

Effect = "Allow"

Action = ["secretsmanager:GetSecretValue"]

Resource = "arn:aws:secretsmanager:::secret:my-app/*"

{

Effect = "Allow"

Action = ["ssm:GetParameters"]

Resource = "arn:aws:ssm:::parameter/my-app/*"

{

Effect = "Allow"

Action = ["kms:Decrypt"]

Resource = aws_kms_key.secrets.arn

}

]

})

}

```

Execute Command Not Working

#### Symptom

```

SessionManagerPlugin is not found

```

Execute command is disabled

```

#### Solutions

1. Enable execute command on service

```hcl

resource "aws_ecs_service" "app" {

enable_execute_command = true

}

```

2. Add SSM permissions to task role

```hcl

resource "aws_iam_role_policy" "ssm_exec" {

role = aws_iam_role.ecs_task.id

policy = jsonencode({

Version = "2012-10-17"

Statement = [{

Effect = "Allow"

Action = [

"ssmmessages:CreateControlChannel",

"ssmmessages:CreateDataChannel",

"ssmmessages:OpenControlChannel",

"ssmmessages:OpenDataChannel"

]

Resource = "*"

}]

})

}

```

Performance Issues

High CPU/Memory Usage

#### Diagnostic

```python

import boto3

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):

"""Get CPU and memory metrics"""

response = cloudwatch.get_metric_statistics(

Namespace='AWS/ECS',

MetricName='CPUUtilization',

Dimensions=[

{'Name': 'ClusterName', 'Value': cluster},

{'Name': 'ServiceName', 'Value': service}

StartTime=datetime.utcnow() - timedelta(hours=1),

EndTime=datetime.utcnow(),

Period=300,

Statistics=['Average', 'Maximum']

)

for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):

print(f"{point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")

```

#### Solutions

1. Right-size tasks

```hcl

# Increase resources

cpu = "1024" # from 512

memory = "2048" # from 1024

```

2. Enable auto-scaling

```hcl

resource "aws_appautoscaling_policy" "cpu" {

target_tracking_scaling_policy_configuration {

target_value = 70.0

}

```

Slow Task Startup

#### Causes & Solutions

1. Large container image

Use smaller base images (alpine, distroless)
Enable image caching with Fargate Platform 1.4.0

2. Slow application startup

Increase startPeriod in health check
Optimize application initialization

3. Slow secret/config loading

Use VPC endpoints for faster access
Cache configuration at startup

Log Analysis

CloudWatch Logs Queries

```bash

# Find errors in last hour

aws logs filter-events \

--log-group-name /ecs/my-app \

--start-time $(date -d '-1 hour' +%s000) \

--filter-pattern "ERROR"

# Find OOM kills

aws logs filter-events \

--log-group-name /ecs/my-app \

--filter-pattern "OutOfMemory"

# Find slow requests

aws logs filter-events \

--log-group-name /ecs/my-app \

--filter-pattern "[timestamp, level, duration>1000, ...]"

```

CloudWatch Insights

```sql

-- Top errors by count

fields @timestamp, @message

| filter @message like /ERROR/

| stats count(*) as errorCount by @message

| sort errorCount desc

| limit 10

-- Average response time

fields @timestamp, responseTime

| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)

```

Related Skills

boto3-ecs: SDK patterns
terraform-ecs: Infrastructure as Code
ecs-fargate: Fargate specifics
ecs-deployment: Deployment strategies

Quick Reference

| Symptom | First Check | Common Cause |

|---------|-------------|--------------|

| Task STOPPED | stoppedReason | Container crash, OOM |

| Task PENDING | Attachments | ENI/network issues |

| Deployment stuck | Health checks | ALB health check failing |

| Cannot pull image | Execution role | Missing ECR permissions |

| Cannot connect | Security groups | Wrong SG rules |

More from this repository10

🎯

xai-stock-sentiment🎯Skill

xai-stock-sentiment skill from adaptationio/skrillz

🎯

ralph-prompt-single-task🎯Skill

Generates Ralph-compatible prompts for single implementation tasks with clear completion criteria and automatic verification.

🎯

autonomous-loop🎯Skill

Orchestrates continuous autonomous coding sessions, managing feature implementation, testing, and progress tracking with intelligent checkpointing and recovery mechanisms.

🎯

auto-claude-troubleshooting🎯Skill

auto-claude-troubleshooting skill from adaptationio/skrillz

🎯

auto-claude-setup🎯Skill

auto-claude-setup skill from adaptationio/skrillz

🎯

observability-analyzer🎯Skill

Analyzes Claude Code observability data to generate insights on performance, costs, errors, tool usage, sessions, conversations, and subagents through advanced metrics and log querying.

🎯

xai-financial-integration🎯Skill

xai-financial-integration skill from adaptationio/skrillz

🎯

xai-agent-tools🎯Skill

xai-agent-tools skill from adaptationio/skrillz

🎯

xai-crypto-sentiment🎯Skill

xai-crypto-sentiment skill from adaptationio/skrillz

🎯

auto-claude-memory🎯Skill

auto-claude-memory skill from adaptationio/skrillz