eks-observability

from adaptationio/skrillz

Installation

Add Marketplace: Add marketplace to Claude Code

/plugin marketplace add adaptationio/Skrillz

Install Plugin: Install plugin from marketplace

/plugin install skrillz@adaptationio-Skrillz

Claude Code: Add plugin in Claude Code

/plugin enable skrillz@adaptationio-Skrillz

Add Marketplace (from a local path): Add marketplace to Claude Code

/plugin marketplace add /path/to/skrillz

Install Plugin (from a local path): Install plugin from marketplace

/plugin install skrillz@local

Last Updated: Jan 16, 2026

EKS observability with metrics, logging, and tracing. Use when setting up monitoring, configuring logging pipelines, implementing distributed tracing, building production dashboards, troubleshooting EKS issues, optimizing observability costs, or establishing SLOs.

# EKS Observability

Overview

Complete observability solution for Amazon EKS using AWS-native managed services and open-source tools. This skill implements the three-pillar approach (metrics, logs, traces) with 2025 best practices including ADOT, Amazon Managed Prometheus, Fluent Bit, and OpenTelemetry.

Keywords: EKS monitoring, CloudWatch Container Insights, Prometheus, Grafana, ADOT, Fluent Bit, X-Ray, OpenTelemetry, distributed tracing, log aggregation, metrics collection, observability stack

Status: Production-ready with 2025 best practices

When to Use This Skill

Setting up monitoring for EKS clusters
Implementing centralized logging pipelines
Configuring distributed tracing
Building production dashboards in Grafana
Troubleshooting application performance
Establishing SLOs and error budgets
Optimizing observability costs
Migrating from X-Ray SDKs to OpenTelemetry
Correlating metrics, logs, and traces
Setting up alerting and on-call runbooks

The Three-Pillar Approach (2025 Recommendation)

1. Metrics

CloudWatch Container Insights + Amazon Managed Prometheus (AMP)

Dual monitoring provides complete visibility
CloudWatch for AWS-native integration and quick setup
Prometheus for advanced queries and community dashboards
Amazon Managed Grafana for visualization

2. Logs

Fluent Bit → CloudWatch Logs

Lightweight log forwarder (AWS deprecated FluentD in Feb 2025)
DaemonSet deployment for automatic collection
Structured logging with JSON parsing
Optional aggregation to OpenSearch for analytics

3. Traces

ADOT → AWS X-Ray

OpenTelemetry standard (X-Ray SDKs entering maintenance mode 2026)
ADOT Collector converts OTLP to X-Ray format
Distributed tracing across microservices
Integration with CloudWatch ServiceLens

Quick Start Workflow

Step 1: Enable CloudWatch Container Insights

Using EKS Add-on (Recommended):

```bash
# CloudWatchAgentServerPolicy is an AWS managed policy, so no custom policy is needed.
# Create only the IAM role for IRSA; the add-on creates the service account itself.
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster my-cluster \
  --role-name CloudWatchAgentRole \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve

# Install Container Insights add-on, using the role created above
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/CloudWatchAgentRole
```

Verify Installation:

```bash
# Check add-on status
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

# Verify pods running
kubectl get pods -n amazon-cloudwatch
```

What You Get:

Node-level metrics (CPU, memory, disk, network)
Pod-level metrics (resource usage, restart counts)
Namespace-level aggregations
Automatic CloudWatch Logs integration
Pre-built CloudWatch dashboards

Step 2: Deploy Amazon Managed Prometheus

Create AMP Workspace:

```bash
# Create workspace
aws amp create-workspace \
  --alias my-cluster-metrics \
  --region us-west-2

# Get workspace ID
WORKSPACE_ID=$(aws amp list-workspaces \
  --alias my-cluster-metrics \
  --query 'workspaces[0].workspaceId' \
  --output text)

# Create IRSA for AMP ingestion
eksctl create iamserviceaccount \
  --name amp-ingest \
  --namespace prometheus \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve
```

Deploy kube-prometheus-stack:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with AMP remote write, reusing the amp-ingest service account
# (and the IAM role bound to it) that eksctl created in the previous step
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write" \
  --set "prometheus.prometheusSpec.remoteWrite[0].sigv4.region=us-west-2" \
  --set prometheus.serviceAccount.create=false \
  --set prometheus.serviceAccount.name=amp-ingest
```
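Long `--set` flags with brackets are easy to mis-quote, so the remote-write settings can instead live in a values file. The following is a minimal sketch, not part of the original skill: the helper name `amp_remote_write_values` and the `ws-EXAMPLE` workspace ID are hypothetical, and the key paths simply mirror the remote-write flags in the command above. Helm accepts a JSON values file because JSON is a subset of YAML.

```python
import json


def amp_remote_write_values(workspace_id: str, region: str) -> dict:
    """Build kube-prometheus-stack values for AMP remote write (illustrative helper)."""
    url = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/remote_write"
    )
    return {
        "prometheus": {
            "prometheusSpec": {
                # SigV4 signing lets Prometheus authenticate to AMP with its IAM role
                "remoteWrite": [{"url": url, "sigv4": {"region": region}}]
            }
        }
    }


if __name__ == "__main__":
    # "ws-EXAMPLE" is a placeholder; substitute the value of $WORKSPACE_ID
    print(json.dumps(amp_remote_write_values("ws-EXAMPLE", "us-west-2"), indent=2))
```

Redirect the output to a file (for example `amp-values.json`) and pass it to `helm install` with `-f` in place of the two remote-write `--set` flags.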

What You Get:

Prometheus Operator for CRD-based monitoring
Node Exporter for hardware metrics
kube-state-metrics for cluster state
Alertmanager for alert routing
Pre-built Grafana dashboards

Step 3: Deploy Fluent Bit for Logging

Create IRSA for Fluent Bit:

```bash
eksctl create iamserviceaccount \
  --name fluent-bit \
  --namespace logging \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve
```

Deploy Fluent Bit:

```bash
# The cloudWatch.* values below are defined by the aws-for-fluent-bit chart
# (eks-charts), not by the upstream fluent/fluent-bit chart.
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Reuse the fluent-bit service account (and its IAM role) created by eksctl
helm install fluent-bit eks/aws-for-fluent-bit \
  --namespace logging \
  --create-namespace \
  --set serviceAccount.create=false \
  --set serviceAccount.name=fluent-bit \
  --set cloudWatch.enabled=true \
  --set cloudWatch.region=us-west-2 \
  --set cloudWatch.logGroupName=/aws/eks/my-cluster/logs \
  --set cloudWatch.autoCreateGroup=true
```

What You Get:

Automatic log collection from all pods
Structured JSON log parsing
CloudWatch Logs integration
Multi-line log support
Kubernetes metadata enrichment

Step 4: Deploy ADOT for Distributed Tracing

Install ADOT Operator:

```bash
# Create IRSA for the ADOT Collector (service account plus IAM role)
eksctl create iamserviceaccount \
  --name adot-collector \
  --namespace adot \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve

# Install ADOT add-on (the operator). Check the add-on prerequisites,
# such as cert-manager, for your add-on version before installing.
# The collector gets its IAM role from the adot-collector service account above.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name adot
```

Deploy ADOT Collector:

```yaml
# adot-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: adot
spec:
  mode: deployment
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 30s
        send_batch_size: 50
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      awsxray:
        region: us-west-2
      awsemf:
        region: us-west-2
        namespace: EKS/Observability
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsemf]
```

```bash
kubectl apply -f adot-collector.yaml
```

What You Get:

OTLP receiver for OpenTelemetry traces
Automatic X-Ray integration
Service map visualization
Trace sampling and filtering
CloudWatch ServiceLens integration

Step 5: Setup Amazon Managed Grafana

Create AMG Workspace:

```bash
# Create workspace (via AWS Console recommended)
# Or use AWS CLI:
aws grafana create-workspace \
  --workspace-name my-cluster-grafana \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED
```

Add Data Sources:

1. Navigate to the AMG workspace URL
2. Configuration → Data Sources → Add data source
3. Add Amazon Managed Service for Prometheus
   - Region: us-west-2
   - Workspace: Select your AMP workspace
4. Add CloudWatch
   - Default region: us-west-2
   - Namespaces: ContainerInsights, EKS/Observability
5. Add AWS X-Ray
   - Default region: us-west-2

Import Dashboards:

```
# EKS Container Insights Dashboard
Dashboard ID: 16028

# Node Exporter Full Dashboard
Dashboard ID: 1860

# Kubernetes Cluster Monitoring
Dashboard ID: 15760
```

Production Deployment Checklist

Infrastructure

[ ] CloudWatch Container Insights enabled (EKS add-on)
[ ] Amazon Managed Prometheus workspace created
[ ] kube-prometheus-stack deployed with remote write
[ ] Fluent Bit DaemonSet running on all nodes
[ ] ADOT Collector deployed (deployment or daemonset)
[ ] Amazon Managed Grafana workspace created
[ ] All IRSA roles configured with least-privilege policies

Configuration

[ ] Prometheus scrape configs include all targets
[ ] Fluent Bit log groups created and structured
[ ] ADOT sampling configured (5-10% for high traffic)
[ ] Grafana data sources connected (AMP, CloudWatch, X-Ray)
[ ] Log retention policies set (7-90 days typical)
[ ] Metric retention configured (AMP default 150 days)

Dashboards

[ ] Cluster overview dashboard (nodes, pods, namespaces)
[ ] Application performance dashboard (latency, errors, throughput)
[ ] Resource utilization dashboard (CPU, memory, disk)
[ ] Cost monitoring dashboard (resource waste, right-sizing)
[ ] Network performance dashboard (CNI metrics)

Alerting

[ ] Critical alerts: Pod crash loops, node not ready
[ ] Performance alerts: High latency, error rate spikes
[ ] Resource alerts: CPU/memory pressure, disk full
[ ] Cost alerts: Budget thresholds, waste detection
[ ] SNS topics configured for notifications
[ ] PagerDuty/Opsgenie integration (optional)

Application Instrumentation

[ ] OpenTelemetry SDK integrated in applications
[ ] Trace context propagation configured
[ ] Custom metrics exported via OTLP
[ ] Structured logging with JSON format
[ ] Log correlation with trace IDs
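
The last two checklist items (JSON-structured logs carrying trace IDs) can be sketched with the standard library alone. This is an illustrative example, not the skill's own implementation: `JsonFormatter` and `build_logger` are hypothetical names, and in a real service the `trace_id` and `span_id` values would come from the active OpenTelemetry span rather than being passed by hand.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # trace_id / span_id arrive via `extra=`; missing values become null
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Attach the JSON formatter to a stream handler (stderr by default)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "Database connection failed",
        extra={"trace_id": "1-67a2f3b1-123456789abcdef012345678", "span_id": "abcdef0123456789"},
    )
```

Because each record is one JSON object per line, a log forwarder with a JSON parser can lift `trace_id` into a searchable field.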

Modern Observability Stack (2025)

```
┌─────────────────────────────────────────────────────────────┐
│                         EKS Cluster                         │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │ Application  │   │ Application  │   │ Application  │     │
│  │  + OTel SDK  │   │  + OTel SDK  │   │  + OTel SDK  │     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┴──────────────────┘             │
│                            │                                │
│                   ┌────────▼────────┐                       │
│                   │ ADOT Collector  │                       │
│                   │     (OTel)      │                       │
│                   └────────┬────────┘                       │
│                            │                                │
│         ┌──────────────────┼──────────────────┐             │
│         │                  │                  │             │
│    ┌────▼─────┐       ┌────▼─────┐       ┌────▼─────┐       │
│    │Prometheus│       │Fluent Bit│       │Container │       │
│    │ (local)  │       │DaemonSet │       │ Insights │       │
│    └────┬─────┘       └────┬─────┘       └────┬─────┘       │
└─────────┼──────────────────┼──────────────────┼─────────────┘
          │                  │                  │
    ┌─────▼─────┐       ┌────▼─────┐       ┌────▼─────┐
    │    AMP    │       │CloudWatch│       │  X-Ray   │
    │(Managed   │       │   Logs   │       │          │
    │Prometheus)│       └────┬─────┘       └────┬─────┘
    └─────┬─────┘            │                  │
          │                  │                  │
          └──────────────────┴──────────────────┘
                             │
                    ┌────────▼────────┐
                    │ Amazon Managed  │
                    │     Grafana     │
                    └─────────────────┘
```

Detailed Documentation

For comprehensive guides on each observability component:

Metrics Collection: [references/metrics.md](references/metrics.md)

- CloudWatch Container Insights setup

- Amazon Managed Prometheus configuration

- kube-prometheus-stack deployment

- Custom metrics and ServiceMonitors

- Cost optimization strategies

Centralized Logging: [references/logging.md](references/logging.md)

- Fluent Bit configuration and parsers

- CloudWatch Logs integration

- OpenSearch aggregation (optional)

- Log retention and lifecycle policies

- Troubleshooting log collection

Distributed Tracing: [references/tracing.md](references/tracing.md)

- ADOT Collector deployment patterns

- OpenTelemetry SDK instrumentation

- X-Ray integration and migration

- Trace sampling strategies

- ServiceLens and trace analysis

Cost Optimization

Metrics

Sample high-cardinality metrics (5-10% of labels)
Use metric relabeling to drop unnecessary labels
Aggregate metrics before remote write to AMP
Set appropriate retention periods (30-90 days typical)

Logs

Implement log sampling for verbose applications
Use CloudWatch Logs Insights instead of exporting to S3
Set aggressive retention for debug logs (7 days)
Keep audit logs longer (90+ days)

Traces

Sample traces based on traffic (5-10% default)
Increase sampling for errors (100%)
Use tail-based sampling for important transactions
Clean up old X-Ray traces (default 30 days)
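
The sampling policy in this list (a fixed share of normal traffic, every error) can be illustrated as a small decision function. This is a sketch under stated assumptions, not the ADOT Collector's implementation: `keep_trace` is a hypothetical name, and the error flag is assumed to be known when the decision is made, which in practice means deciding after the trace completes (tail-based) rather than at its first span.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, sample_rate: float = 0.10) -> bool:
    """Keep every error trace, plus a deterministic share of the rest."""
    if is_error:
        return True  # 100% sampling for errors
    # Hash the trace ID so every service makes the same decision for a given trace
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)        # true for ~sample_rate of IDs


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", is_error=False) for i in range(10_000))
    print(f"kept {kept} of 10000 non-error traces")
```

Hashing the trace ID, instead of drawing a random number per span, keeps the decision consistent across services so that sampled traces stay complete.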

Typical Monthly Costs:

Small cluster (10 nodes): $50-150/month
Medium cluster (50 nodes): $200-500/month
Large cluster (200+ nodes): $1000-2000/month

Integration Patterns

Correlation Between Pillars

Metrics → Logs:

```promql
# Find pods with high error rates
rate(http_requests_total{status=~"5.."}[5m]) > 0.1

# Then search CloudWatch Logs for those pod names
```

Logs → Traces:

Include the trace ID in structured logs:

```json
{
  "timestamp": "2025-01-27T10:30:00Z",
  "level": "error",
  "message": "Database connection failed",
  "trace_id": "1-67a2f3b1-123456789abcdef012345678",
  "span_id": "abcdef0123456789"
}
```
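
OpenTelemetry SDKs report trace IDs as 32 hexadecimal characters, while X-Ray displays them as `1-<8 hex>-<24 hex>`. To search X-Ray for a trace ID found in a log line (or the reverse), the two forms can be converted as in the sketch below. The function name `otel_to_xray_trace_id` is hypothetical; the example assumes the application generates X-Ray-compatible IDs, where the first 8 hex characters encode the start time.

```python
def otel_to_xray_trace_id(otel_trace_id: str) -> str:
    """Convert a 32-hex-character OpenTelemetry trace ID to X-Ray's display format."""
    hex_id = otel_trace_id.lower()
    if len(hex_id) != 32 or any(c not in "0123456789abcdef" for c in hex_id):
        raise ValueError("expected a 32-character hexadecimal trace ID")
    # Version prefix "1", then the 8-hex timestamp part, then the 24-hex unique part
    return f"1-{hex_id[:8]}-{hex_id[8:]}"


if __name__ == "__main__":
    print(otel_to_xray_trace_id("67a2f3b1123456789abcdef012345678"))
    # prints 1-67a2f3b1-123456789abcdef012345678
```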

Traces → Metrics:

Use trace data to identify slow endpoints
Create SLIs from trace latency percentiles
Alert on trace error rates

CloudWatch ServiceLens

Unified view combining:

X-Ray traces (request flow)
CloudWatch metrics (performance)
CloudWatch Logs (detailed context)

```bash
# ServiceLens has no dedicated CLI command; the console view is populated
# automatically once Container Insights and X-Ray are enabled.
# To inspect the underlying service graph from the CLI, query X-Ray:
aws xray get-service-graph \
  --start-time 2025-01-27T10:00:00Z \
  --end-time 2025-01-27T11:00:00Z
```

Troubleshooting Quick Reference

| Issue | Cause | Fix |
|-------|-------|-----|
| No metrics in AMP | Missing IRSA or remote write config | Check Prometheus pod logs, verify IAM role |
| Logs not appearing | Fluent Bit not running or wrong IAM | `kubectl logs -n logging fluent-bit-xxx` |
| Traces not in X-Ray | ADOT not deployed or app not instrumented | Verify ADOT pods, check OTel SDK setup |
| High costs | Too much data ingestion | Enable sampling, reduce log verbosity |
| Missing pod metrics | kube-state-metrics not running | Check kube-prometheus-stack installation |
| Grafana can't connect | Data source IAM permissions | Add CloudWatch/AMP read policies to AMG role |

Production Runbooks

Incident Response

Check Grafana overview dashboard - Identify affected services
Review X-Ray service map - Find bottleneck in request flow
Query CloudWatch Logs Insights - Get detailed error messages
Correlate with metrics spike - Understand timeline and scope
Execute remediation - Scale, restart, or rollback

Performance Investigation

Start with RED metrics (Rate, Errors, Duration)
Check USE metrics (Utilization, Saturation, Errors) for infrastructure
Analyze trace percentiles (p50, p95, p99)
Review log patterns during slow periods
Identify optimization opportunities
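
For the percentile step, the sketch below shows how p50, p95, and p99 can be derived from a list of request durations, for example span durations exported from a trace query. `latency_percentiles` is a hypothetical helper, and the linear-interpolation ("inclusive") method is one of several reasonable choices; monitoring backends may use a different estimator, so values can differ slightly.

```python
import statistics


def latency_percentiles(durations_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 of the given durations (needs at least two samples)."""
    # n=100 yields 99 cut points; cut point k-1 is the k-th percentile
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    samples = [12.0, 15.0, 14.0, 250.0, 16.0, 13.0, 18.0, 900.0, 17.0, 11.0]
    for name, value in latency_percentiles(samples).items():
        print(f"{name}: {value:.1f} ms")
```

A large gap between p50 and p99 usually points at a slow minority of requests, which is where individual traces are most worth inspecting.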

SLO Implementation

Define SLIs (Service Level Indicators):

```yaml
# Availability SLI
- metric: probe_success
  target: "99.9%"
  window: 30d

# Latency SLI
- metric: http_request_duration_seconds
  percentile: p99
  target: "< 500ms"
  window: 30d

# Error Rate SLI
- metric: 'http_requests_total{status=~"5.."}'
  target: "< 0.1%"
  window: 30d
```

Calculate Error Budget:

```
Error Budget = 100% - SLO Target

Example: 99.9% SLO = 0.1% error budget
                   = 43.2 minutes downtime/month (30-day month)
```
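
The same arithmetic as a small function, useful for comparing candidate SLO targets. This is a minimal sketch; `error_budget_minutes` is a hypothetical name and the 30-day window is an assumption that matches the SLI windows used in this section.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    budget_fraction = (100.0 - slo_target_percent) / 100.0   # e.g. 99.9 -> 0.001
    window_minutes = window_days * 24 * 60                   # 30 days -> 43,200 minutes
    return budget_fraction * window_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% SLO -> {error_budget_minutes(target):.1f} minutes per 30 days")
```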

Burn Rate Alerts:

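```promql
# Thresholds assume a 99.9% SLO (0.1% error budget) over a 30-day (720 h) window.
# Error-ratio threshold = burn rate × error budget.

# Fast burn (5% of budget in 1 hour): burn rate = 0.05 × 720 / 1 = 36
(1 - slo:availability:ratio_rate_1h) > (36 * 0.001)

# Slow burn (10% of budget in 6 hours): burn rate = 0.10 × 720 / 6 = 12
(1 - slo:availability:ratio_rate_6h) > (12 * 0.001)
```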
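
How an alert threshold follows from "X% of the budget in Y hours" can be worked through in a few lines. The sketch below assumes a 30-day SLO window; `burn_rate` and `error_ratio_threshold` are hypothetical helper names, not part of any library.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Burn-rate multiple that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(slo_target: float, budget_fraction: float,
                          alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Error ratio above which the alert should fire (burn rate times error budget)."""
    error_budget = 1.0 - slo_target   # e.g. 0.999 -> 0.001
    return burn_rate(budget_fraction, alert_window_hours, slo_window_days) * error_budget


if __name__ == "__main__":
    fast = error_ratio_threshold(0.999, budget_fraction=0.05, alert_window_hours=1)
    slow = error_ratio_threshold(0.999, budget_fraction=0.10, alert_window_hours=6)
    print(f"fast burn threshold: {fast:.3f}")   # error ratio over the 1h window
    print(f"slow burn threshold: {slow:.3f}")   # error ratio over the 6h window
```

A burn rate of 1 spends the budget exactly over the full SLO window, so higher multiples correspond to faster exhaustion and warrant more urgent alerts.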

Best Practices Summary

Use Dual Monitoring: CloudWatch Container Insights + Prometheus
Standardize on OpenTelemetry: Future-proof instrumentation
Enable IRSA for Everything: No node IAM roles
Deploy ADOT Collector: Vendor-neutral observability
Sample Intelligently: 5-10% traces, 100% errors
Structure Your Logs: JSON format with trace correlation
Set Retention Policies: Balance cost and compliance
Build Actionable Dashboards: Focus on SLIs and anomalies
Implement Progressive Alerting: Warn before critical
Regularly Review Costs: Optimize based on actual usage

---

Stack: CloudWatch Container Insights, AMP, Fluent Bit, ADOT, AMG, X-Ray

Standards: OpenTelemetry, IRSA, EKS Add-ons

Last Updated: January 2025 (2025 Best Practices)
