incident-runbook-templates

Skill from rmyndharis/antigravity-skills

What it does

Generates structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions for service outages and critical incidents.

Part of rmyndharis/antigravity-skills (289 items)


Installation

npm run build:catalog
npx @rmyndharis/antigravity-skills search <query>
npx @rmyndharis/antigravity-skills search kubernetes
npx @rmyndharis/antigravity-skills list
npx @rmyndharis/antigravity-skills install <skill-name>

+ 15 more commands

Extracted from docs: rmyndharis/antigravity-skills

Added Feb 4, 2026

Skill Details

SKILL.md

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

Overview

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

Do not use this skill when

  • The task is unrelated to incident runbook templates
  • The task requires a domain or tool outside this skill's scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

2. Runbook Structure

```

  1. Overview & Impact
  2. Detection & Alerts
  3. Initial Triage
  4. Mitigation Steps
  5. Root Cause Investigation
  6. Resolution Procedures
  7. Verification & Rollback
  8. Communication Templates
  9. Escalation Matrix

```
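
A runbook skeleton with these nine sections can be scaffolded directly from the shell when starting a new service runbook. A minimal sketch, assuming a `runbooks/` directory and a service name argument (both illustrative, not part of this skill):

```bash
#!/usr/bin/env bash
# scaffold-runbook.sh <service-name> -- hypothetical helper; path and naming are assumptions
set -euo pipefail

SERVICE="${1:?usage: scaffold-runbook.sh <service-name>}"
FILE="runbooks/${SERVICE}-outage.md"
mkdir -p runbooks

# Write the nine-section skeleton from the structure above
cat > "$FILE" <<EOF
# ${SERVICE} Outage Runbook

## Overview & Impact
## Detection & Alerts
## Initial Triage
## Mitigation Steps
## Root Cause Investigation
## Resolution Procedures
## Verification & Rollback
## Communication Templates
## Escalation Matrix
EOF

echo "Created $FILE"
```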

Runbook Templates

Template 1: Service Outage Runbook

```markdown

# [Service Name] Outage Runbook

Overview

Service: Payment Processing Service

Owner: Platform Team

Slack: #payments-incidents

PagerDuty: payments-oncall

Impact Assessment

  • [ ] Which customers are affected?
  • [ ] What percentage of traffic is impacted?
  • [ ] Are there financial implications?
  • [ ] What's the blast radius?

Detection

Alerts

  • payment_error_rate > 5% (PagerDuty)
  • payment_latency_p99 > 2s (Slack)
  • payment_success_rate < 95% (PagerDuty)
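
During an active incident it helps to confirm which of these alerts are actually firing rather than relying on notifications alone. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090, matching the queries used later in this runbook:

```bash
# List currently firing alerts whose names mention "payment"
# (the prometheus:9090 address is an assumption, matching other queries in this runbook)
curl -s "http://prometheus:9090/api/v1/alerts" | \
  jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' | \
  grep -i payment
```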

Dashboards

  • [Payment Service Dashboard](https://grafana/d/payments)
  • [Error Tracking](https://sentry.io/payments)
  • [Dependency Status](https://status.stripe.com)

Initial Triage (First 5 Minutes)

1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

2. Quick Health Checks

  • [ ] Can you reach the service? curl -I https://api.company.com/payments/health
  • [ ] Database connectivity? Check connection pool metrics
  • [ ] External dependencies? Check Stripe, bank API status
  • [ ] Recent changes? Check deploy history
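
These checks can be run as a single pass so none are skipped under pressure. A minimal sketch, reusing the endpoints and commands referenced in this runbook; the Stripe status check is illustrative:

```bash
# One-pass quick health check (hosts and endpoints assumed, matching those in this runbook)
echo "== Service reachability =="
curl -sS -o /dev/null -w "HTTP %{http_code}\n" -I https://api.company.com/payments/health

echo "== Recent deploys =="
kubectl rollout history deployment/payment-service -n payments | tail -5

echo "== DB connection pool metrics =="
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool || true

echo "== Stripe status page reachability (illustrative) =="
curl -sS -o /dev/null -w "HTTP %{http_code}\n" https://status.stripe.com
```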

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

Mitigation Procedures

4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (use a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```

4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
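
After mitigation, the error-rate query above can be polled until it stays low instead of being checked once. A minimal sketch reusing the same Prometheus query; the 0.5 req/s threshold is an illustrative value, not part of this runbook:

```bash
# Poll the 5xx rate every 30 seconds until it drops below an illustrative threshold
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m]))'
while true; do
  RATE=$(curl -s "http://prometheus:9090/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[0].value[1] // "0"')
  echo "$(date -u +%H:%M:%S) 5xx rate: ${RATE} req/s"
  awk -v r="$RATE" 'BEGIN { exit (r < 0.5 ? 0 : 1) }' && break
  sleep 30
done
```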

Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
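
Escalation can also be triggered programmatically when paging the payments on-call. A minimal sketch using the PagerDuty Events API v2; the PAGERDUTY_ROUTING_KEY variable is an assumption tied to the payments-oncall service referenced in the Overview:

```bash
# Page the payments on-call via the PagerDuty Events API v2
# PAGERDUTY_ROUTING_KEY is an assumed environment variable for the payments-oncall service
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "'"${PAGERDUTY_ROUTING_KEY}"'",
    "event_action": "trigger",
    "payload": {
      "summary": "SEV1: Payment Processing Service outage unresolved > 15 min",
      "source": "payment-service",
      "severity": "critical"
    }
  }'
```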

Communication Templates

Initial Notification (Internal)

```

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2

Status: Investigating

Impact: ~20% of payment requests failing

Start Time: [TIME]

Incident Commander: [NAME]

Current Actions:

  • Investigating root cause
  • Scaling up service
  • Monitoring dashboards

Updates in #payments-incidents

```
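
To broadcast this notification to #payments-incidents automatically, a Slack incoming webhook can be used. A minimal sketch, assuming a SLACK_WEBHOOK_URL environment variable pointing at a webhook configured for that channel:

```bash
# Post the initial incident notification to Slack (the webhook URL variable is an assumption)
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": ":rotating_light: INCIDENT: Payment Service Degradation\nSeverity: SEV2\nStatus: Investigating\nImpact: ~20% of payment requests failing\nIncident Commander: [NAME]\nUpdates in #payments-incidents"}'
```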

Status Update

```

📊 UPDATE: Payment Service Incident

Status: Mitigating

Impact: Reduced to ~5% failure rate

Duration: 25 minutes

Actions Taken:

  • Rolled back deployment v2.3.4 → v2.3.3
  • Scaled service from 5 → 10 replicas

Next Steps:

  • Continuing to monitor
  • Root cause analysis in progress

ETA to Resolution: ~15 minutes

```

Resolution Notification

```

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes

Impact: ~5,000 affected transactions

Root Cause: Memory leak in v2.3.4

Resolution:

  • Rolled back to v2.3.3
  • Transactions auto-retried successfully

Follow-up:

  • Postmortem scheduled for [DATE]
  • Bug fix in progress

```

```

Template 2: Database Incident Runbook

```markdown

# Database Incident Runbook

Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | SELECT count(*) FROM pg_stat_activity; |
| Kill query | SELECT pg_terminate_backend(pid); |
| Check replication lag | SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp())); |
| Check locks | SELECT * FROM pg_locks WHERE NOT granted; |

Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
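
While working the issue, the connection count can be watched continuously instead of re-running the query by hand. A minimal sketch, reusing the DB_HOST and DB_USER variables used elsewhere in this skill:

```bash
# Watch total connections every 5 seconds (DB_HOST/DB_USER as used in the service outage runbook)
watch -n 5 "psql -h $DB_HOST -U $DB_USER -tAc 'SELECT count(*) FROM pg_stat_activity;'"
```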

Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
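
The same lag check can be run from the shell on a loop while deciding whether to fail over. A minimal sketch; the REPLICA_HOST variable is an assumption, and DB_USER matches the variable used elsewhere in this skill:

```bash
# Poll replica lag every 10 seconds (REPLICA_HOST is an assumed variable)
while true; do
  LAG=$(psql -h "$REPLICA_HOST" -U "$DB_USER" -tAc \
    "SELECT COALESCE(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0);")
  echo "$(date -u +%H:%M:%S) replication lag: ${LAG}s"
  sleep 10
done
```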

Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space (note: VACUUM FULL takes an exclusive lock on the table)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```

```

Best Practices

Do's

  • Keep runbooks updated - Review after every incident
  • Test runbooks regularly - Game days, chaos engineering
  • Include rollback steps - Always have an escape hatch
  • Document assumptions - What must be true for steps to work
  • Link to dashboards - Quick access during stress

Don'ts

  • Don't assume knowledge - Write for 3 AM brain
  • Don't skip verification - Confirm each step worked
  • Don't forget communication - Keep stakeholders informed
  • Don't work alone - Escalate early
  • Don't skip postmortems - Learn from every incident

Resources

  • [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
  • [PagerDuty Incident Response](https://response.pagerduty.com/)
  • [Atlassian Incident Management](https://www.atlassian.com/incident-management)