incident-runbook-templates

Skill from rmyndharis/antigravity-skills

What it does

Generates structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions for service outages and critical incidents.

Part of rmyndharis/antigravity-skills (289 items)


Installation

npm run build:catalog
npx @rmyndharis/antigravity-skills search <query>
npx @rmyndharis/antigravity-skills search kubernetes
npx @rmyndharis/antigravity-skills list
npx @rmyndharis/antigravity-skills install <skill-name>

+ 15 more commands

Extracted from docs: rmyndharis/antigravity-skills

Added Feb 4, 2026

Skill Details

SKILL.md

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

Overview

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

Do not use this skill when

  • The task is unrelated to incident runbook templates
  • The task requires a domain or tool outside this skill's scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |

2. Runbook Structure

```

  1. Overview & Impact
  2. Detection & Alerts
  3. Initial Triage
  4. Mitigation Steps
  5. Root Cause Investigation
  6. Resolution Procedures
  7. Verification & Rollback
  8. Communication Templates
  9. Escalation Matrix

```
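
A runbook skeleton with these nine sections can be scaffolded directly from the shell when starting a new service runbook. A minimal sketch, assuming a `runbooks/` directory and a service name argument (both illustrative, not part of this skill):

```bash
#!/usr/bin/env bash
# scaffold-runbook.sh <service-name> -- hypothetical helper; path and naming are assumptions
set -euo pipefail

SERVICE="${1:?usage: scaffold-runbook.sh <service-name>}"
FILE="runbooks/${SERVICE}-outage.md"
mkdir -p runbooks

# Write the nine-section skeleton from the structure above
cat > "$FILE" <<EOF
# ${SERVICE} Outage Runbook

## Overview & Impact
## Detection & Alerts
## Initial Triage
## Mitigation Steps
## Root Cause Investigation
## Resolution Procedures
## Verification & Rollback
## Communication Templates
## Escalation Matrix
EOF

echo "Created $FILE"
```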

Runbook Templates

Template 1: Service Outage Runbook

```markdown

# [Service Name] Outage Runbook

Overview

Service: Payment Processing Service

Owner: Platform Team

Slack: #payments-incidents

PagerDuty: payments-oncall

Impact Assessment

  • [ ] Which customers are affected?
  • [ ] What percentage of traffic is impacted?
  • [ ] Are there financial implications?
  • [ ] What's the blast radius?

Detection

Alerts

  • payment_error_rate > 5% (PagerDuty)
  • payment_latency_p99 > 2s (Slack)
  • payment_success_rate < 95% (PagerDuty)
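
During an active incident it helps to confirm which of these alerts are actually firing rather than relying on notifications alone. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090, matching the queries used later in this runbook:

```bash
# List currently firing alerts whose names mention "payment"
# (the prometheus:9090 address is an assumption, matching other queries in this runbook)
curl -s "http://prometheus:9090/api/v1/alerts" | \
  jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' | \
  grep -i payment
```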

Dashboards

  • [Payment Service Dashboard](https://grafana/d/payments)
  • [Error Tracking](https://sentry.io/payments)
  • [Dependency Status](https://status.stripe.com)

Initial Triage (First 5 Minutes)

1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

2. Quick Health Checks

  • [ ] Can you reach the service? curl -I https://api.company.com/payments/health
  • [ ] Database connectivity? Check connection pool metrics
  • [ ] External dependencies? Check Stripe, bank API status
  • [ ] Recent changes? Check deploy history
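
These checks can be run as a single pass so none are skipped under pressure. A minimal sketch, reusing the endpoints and commands referenced in this runbook; the Stripe status check is illustrative:

```bash
# One-pass quick health check (hosts and endpoints assumed, matching those in this runbook)
echo "== Service reachability =="
curl -sS -o /dev/null -w "HTTP %{http_code}\n" -I https://api.company.com/payments/health

echo "== Recent deploys =="
kubectl rollout history deployment/payment-service -n payments | tail -5

echo "== DB connection pool metrics =="
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool || true

echo "== Stripe status page reachability (illustrative) =="
curl -sS -o /dev/null -w "HTTP %{http_code}\n" https://status.stripe.com
```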

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

Mitigation Procedures

4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (use a pid from Step 2)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```

4.4 Traffic Surge

```bash
# Step 1: Check current request rate
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
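
After mitigation, the error-rate query above can be polled until it stays low instead of being checked once. A minimal sketch reusing the same Prometheus query; the 0.5 req/s threshold is an illustrative value, not part of this runbook:

```bash
# Poll the 5xx rate every 30 seconds until it drops below an illustrative threshold
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m]))'
while true; do
  RATE=$(curl -s "http://prometheus:9090/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[0].value[1] // "0"')
  echo "$(date -u +%H:%M:%S) 5xx rate: ${RATE} req/s"
  awk -v r="$RATE" 'BEGIN { exit (r < 0.5 ? 0 : 1) }' && break
  sleep 30
done
```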

Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
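
Escalation can also be triggered programmatically when paging the payments on-call. A minimal sketch using the PagerDuty Events API v2; the PAGERDUTY_ROUTING_KEY variable is an assumption tied to the payments-oncall service referenced in the Overview:

```bash
# Page the payments on-call via the PagerDuty Events API v2
# PAGERDUTY_ROUTING_KEY is an assumed environment variable for the payments-oncall service
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "'"${PAGERDUTY_ROUTING_KEY}"'",
    "event_action": "trigger",
    "payload": {
      "summary": "SEV1: Payment Processing Service outage unresolved > 15 min",
      "source": "payment-service",
      "severity": "critical"
    }
  }'
```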

Communication Templates

Initial Notification (Internal)

```

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2

Status: Investigating

Impact: ~20% of payment requests failing

Start Time: [TIME]

Incident Commander: [NAME]

Current Actions:

  • Investigating root cause
  • Scaling up service
  • Monitoring dashboards

Updates in #payments-incidents

```
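
To broadcast this notification to #payments-incidents automatically, a Slack incoming webhook can be used. A minimal sketch, assuming a SLACK_WEBHOOK_URL environment variable pointing at a webhook configured for that channel:

```bash
# Post the initial incident notification to Slack (the webhook URL variable is an assumption)
curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": ":rotating_light: INCIDENT: Payment Service Degradation\nSeverity: SEV2\nStatus: Investigating\nImpact: ~20% of payment requests failing\nIncident Commander: [NAME]\nUpdates in #payments-incidents"}'
```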

Status Update

```

📊 UPDATE: Payment Service Incident

Status: Mitigating

Impact: Reduced to ~5% failure rate

Duration: 25 minutes

Actions Taken:

  • Rolled back deployment v2.3.4 → v2.3.3
  • Scaled service from 5 → 10 replicas

Next Steps:

  • Continuing to monitor
  • Root cause analysis in progress

ETA to Resolution: ~15 minutes

```

Resolution Notification

```

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes

Impact: ~5,000 affected transactions

Root Cause: Memory leak in v2.3.4

Resolution:

  • Rolled back to v2.3.3
  • Transactions auto-retried successfully

Follow-up:

  • Postmortem scheduled for [DATE]
  • Bug fix in progress

```

```

Template 2: Database Incident Runbook

```markdown

# Database Incident Runbook

Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | SELECT count(*) FROM pg_stat_activity; |
| Kill query | SELECT pg_terminate_backend(pid); |
| Check replication lag | SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp())); |
| Check locks | SELECT * FROM pg_locks WHERE NOT granted; |

Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
```
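
While working the issue, the connection count can be watched continuously instead of re-running the query by hand. A minimal sketch, reusing the DB_HOST and DB_USER variables used elsewhere in this skill:

```bash
# Watch total connections every 5 seconds (DB_HOST/DB_USER as used in the service outage runbook)
watch -n 5 "psql -h $DB_HOST -U $DB_USER -tAc 'SELECT count(*) FROM pg_stat_activity;'"
```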

Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
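
The same lag check can be run from the shell on a loop while deciding whether to fail over. A minimal sketch; the REPLICA_HOST variable is an assumption, and DB_USER matches the variable used elsewhere in this skill:

```bash
# Poll replica lag every 10 seconds (REPLICA_HOST is an assumed variable)
while true; do
  LAG=$(psql -h "$REPLICA_HOST" -U "$DB_USER" -tAc \
    "SELECT COALESCE(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0);")
  echo "$(date -u +%H:%M:%S) replication lag: ${LAG}s"
  sleep 10
done
```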

Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space (note: VACUUM FULL takes an exclusive lock on the table)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```

```

Best Practices

Do's

  • Keep runbooks updated - Review after every incident
  • Test runbooks regularly - Game days, chaos engineering
  • Include rollback steps - Always have an escape hatch
  • Document assumptions - What must be true for steps to work
  • Link to dashboards - Quick access during stress

Don'ts

  • Don't assume knowledge - Write for 3 AM brain
  • Don't skip verification - Confirm each step worked
  • Don't forget communication - Keep stakeholders informed
  • Don't work alone - Escalate early
  • Don't skip postmortems - Learn from every incident

Resources

  • [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
  • [PagerDuty Incident Response](https://response.pagerduty.com/)
  • [Atlassian Incident Management](https://www.atlassian.com/incident-management)