runbook-creator

from nik-kale/sre-skills

What it does

Generates standardized, actionable operational runbooks and incident response playbooks with executable steps and best practices.

Part of

nik-kale/sre-skills (5 items)

runbook-creator

Installation

Install with npx:
npx skills add nik-kale/sre-skills
npx skills add nik-kale/sre-skills/incident-response
Extracted from docs: nik-kale/sre-skills
Installs: 1
Added: Feb 4, 2026

Skill Details

SKILL.md

Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks or playbooks, writing operational documentation, or documenting procedures for on-call teams.

Overview

# Runbook Creator

Templates and best practices for creating effective operational runbooks.

When to Use This Skill

  • Creating runbooks for new services
  • Documenting incident response procedures
  • Writing operational playbooks
  • Standardizing on-call documentation
  • Automating common procedures

Runbook Principles

  1. Actionable: Every step should be executable
  2. Testable: Verify each step works
  3. Current: Update when systems change
  4. Accessible: Available during incidents (not reachable only through a VPN)
  5. Linked: Referenced from alerts

Standard Runbook Template

Copy and customize this template:

```markdown

# [Service Name] - [Issue Type]

Overview

Brief description of what this runbook addresses.

Last Updated: YYYY-MM-DD

Owner: [Team/Person]

Related Alerts: [Alert names that link here]

Symptoms

What indicates this issue is occurring:

  • [ ] Symptom 1
  • [ ] Symptom 2
  • [ ] Symptom 3

Impact

  • Users Affected: [Description]
  • Severity: [SEV1/SEV2/SEV3/SEV4]
  • Business Impact: [Description]

Prerequisites

  • Access to [system/tool]
  • Permissions: [required permissions]
  • Tools: [required CLI tools]

Diagnostic Steps

Step 1: [Verify the Issue]

```bash

# Command to run

kubectl get pods -n production | grep -v Running

```

Expected Output: [What you should see]

If Different: [What to do]

Step 2: [Gather Information]

```bash

# Command to run

kubectl logs deployment/my-service -n production --tail=100

```

Look For: [What to look for in output]

Resolution Steps

Option A: [Quick Fix - e.g., Restart]

Use when: [conditions]

```bash

# Step 1: Restart the service

kubectl rollout restart deployment/my-service -n production

# Step 2: Verify pods are coming up

kubectl get pods -n production -w

```

Verification: [How to confirm fix worked]

Option B: [Rollback]

Use when: [conditions]

```bash

# Step 1: Check rollout history

kubectl rollout history deployment/my-service -n production

# Step 2: Rollback to previous version

kubectl rollout undo deployment/my-service -n production

```

Verification: [How to confirm fix worked]

Verification

How to confirm the issue is resolved:

  • [ ] Error rate returned to normal
  • [ ] Latency within SLO
  • [ ] No related alerts firing
  • [ ] User-facing functionality working

Escalation

If this runbook doesn't resolve the issue:

  1. First: Contact [Team/Person] via [Slack/Phone]
  2. Then: Page [Escalation contact]
  3. Finally: [Further escalation path]

Related Resources

  • [Dashboard Link](https://grafana/d/xxx)
  • [Architecture Diagram](link)
  • [Related Runbook](link)

Revision History

| Date | Author | Change |

|------|--------|--------|

| YYYY-MM-DD | Name | Initial version |

```
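To keep new runbooks consistent with the template, the header fields can be stamped out by a small helper. The following is only a sketch: the function name `new_runbook_header` and its three arguments are hypothetical and not part of the skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): prints the header of the
# standard template with title, owner, and today's date filled in.
new_runbook_header() {
  local service="$1" issue="$2" owner="$3"
  local today
  today="$(date +%Y-%m-%d)"
  cat <<EOF
# ${service} - ${issue}

Overview

Brief description of what this runbook addresses.

Last Updated: ${today}

Owner: ${owner}

Related Alerts: [Alert names that link here]
EOF
}
```

For example, `new_runbook_header "Checkout API" "High Error Rate" "Payments SRE" > checkout-high-error-rate.md` would start a new file; the remaining sections still need to be copied from the template and filled in by hand.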

Quick Runbook Templates

Service Restart

```markdown

# [Service] - Restart Procedure

When to Use

  • Service unresponsive
  • Memory leak suspected
  • After configuration change

Steps

  1. Notify team

```

Post in #incidents: "Restarting [service] due to [reason]"

```

  2. Restart service

```bash

kubectl rollout restart deployment/[service] -n [namespace]

```

  3. Monitor rollout

```bash

kubectl rollout status deployment/[service] -n [namespace]

```

  4. Verify health

```bash

kubectl get pods -n [namespace] | grep [service]

# All pods should be Running, 1/1 Ready

```

  5. Check metrics

- Error rate: [dashboard link]

- Latency: [dashboard link]

Rollback

If restart makes things worse:

```bash

kubectl rollout undo deployment/[service] -n [namespace]

```

```
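The restart, monitor, and rollback steps in this template can be chained so that a rollout that never becomes healthy is undone automatically. This is a minimal sketch under assumptions: the function name `restart_with_rollback` and the 120-second default timeout are illustrative choices, not part of the skill.

```bash
#!/usr/bin/env bash
# Sketch: restart a deployment, wait for the rollout, and undo it if the
# rollout does not complete in time. Name and default timeout are assumptions.
restart_with_rollback() {
  local service="$1" namespace="$2" timeout="${3:-120s}"

  kubectl rollout restart "deployment/${service}" -n "${namespace}" || return 1

  # `rollout status` exits non-zero if the rollout fails or the timeout expires.
  if kubectl rollout status "deployment/${service}" -n "${namespace}" --timeout="${timeout}"; then
    echo "OK: ${service} restarted"
  else
    echo "FAILED: rolling back ${service}" >&2
    kubectl rollout undo "deployment/${service}" -n "${namespace}"
    return 1
  fi
}
```

A call such as `restart_with_rollback my-service production 180s` returns non-zero whenever it had to roll back, so it can gate later steps in a larger script.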

Database Failover

```markdown

# [Database] - Failover Procedure

When to Use

  • Primary database unresponsive
  • Planned maintenance
  • Primary showing errors

Prerequisites

  • Database admin access
  • Verify replica is in sync

Pre-Failover Checks

  1. Check replication status

```sql

SELECT * FROM pg_stat_replication;

```

Verify: state = 'streaming' and replay_lag is close to zero (compare sent_lsn with replay_lsn)

  2. Check replica health

```bash

pg_isready -h replica-host -p 5432

```

Failover Steps

  1. Stop writes to primary (if possible)

```sql

ALTER SYSTEM SET default_transaction_read_only = on;

SELECT pg_reload_conf();

```

  2. Promote replica

```bash

# Run on the replica host

pg_ctl promote -D /var/lib/postgresql/data

```

  3. Update connection strings

- Update DNS/load balancer to point to new primary

- Or update application config

  4. Verify applications reconnected

```sql

SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

```

Post-Failover

  • [ ] Monitor error rates
  • [ ] Set up new replica from old primary
  • [ ] Update documentation

```
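The pre-failover checks can be turned into a go/no-go gate so that nobody promotes a replica that is still catching up. The sketch below assumes PostgreSQL 10 or later (for `sent_lsn`, `replay_lsn`, and `pg_wal_lsn_diff`), a single replica, and connection credentials supplied through the environment; the function name and the 16 MiB default threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: refuse to fail over unless the replica is streaming and its replay
# lag in bytes is under a threshold. Name and default threshold are assumptions.
replica_ready_for_failover() {
  local primary_host="$1" max_lag_bytes="${2:-16777216}"
  local row state lag

  # -A: unaligned output, -t: rows only, -F: field separator.
  row="$(psql -h "${primary_host}" -At -F '|' -c \
    "SELECT state, pg_wal_lsn_diff(sent_lsn, replay_lsn)::bigint FROM pg_stat_replication LIMIT 1;")" || return 2

  state="${row%%|*}"
  lag="${row##*|}"

  # An empty row (no replica connected) or a NULL lag falls through to NO-GO.
  if [ "${state}" = "streaming" ] && [ -n "${lag}" ] && [ "${lag}" -le "${max_lag_bytes}" ]; then
    echo "GO: state=${state} lag_bytes=${lag}"
  else
    echo "NO-GO: state=${state:-none} lag_bytes=${lag:-unknown}" >&2
    return 1
  fi
}
```

The query runs against the primary, so this gate only helps for planned failovers or a degraded primary that still answers queries.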

Cache Clear

```markdown

# [Service] - Cache Clear Procedure

When to Use

  • Stale data being served
  • Cache corruption suspected
  • After data migration

Impact Assessment

  • Cache clear will cause temporary latency spike
  • Database load will increase temporarily

Steps

  1. Notify team

```

Post in #incidents: "Clearing [cache] cache due to [reason]"

```

  2. Clear cache

Redis - All keys:

```bash

redis-cli -h [host] FLUSHALL

```

Redis - Specific pattern:

```bash

redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL

```

Application cache:

```bash

curl -X POST http://[service]/admin/cache/clear

```

  3. Monitor

- Watch cache hit rate recover

- Monitor database load

- Check latency

Verification

  • Cache hit rate returning to normal
  • No errors from cache operations
  • Latency stabilizing

```
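Pattern-based deletion is easy to get subtly wrong, for example by scanning one host and deleting on another. The following sketch keeps both calls on the same host and deletes one key per call, which is slower than batching but tolerates keys containing spaces. The function name and the `REDIS_CLI` override variable are hypothetical; `UNLINK` requires Redis 4.0 or later, so swap in `DEL` on older servers.

```bash
#!/usr/bin/env bash
# Sketch: delete keys matching a pattern, one key per call, on a single host.
# REDIS_CLI is an assumed override hook; it defaults to the real redis-cli.
delete_keys_by_pattern() {
  local host="$1" pattern="$2" keys key count=0
  local cli="${REDIS_CLI:-redis-cli}"

  keys="$("${cli}" -h "${host}" --scan --pattern "${pattern}")" || return 1

  while IFS= read -r key; do
    [ -n "${key}" ] || continue
    # </dev/null keeps the client from consuming the key list on stdin.
    "${cli}" -h "${host}" UNLINK "${key}" </dev/null >/dev/null && count=$((count + 1))
  done <<EOF
${keys}
EOF

  echo "processed=${count}"
}
```

Called as `delete_keys_by_pattern cache-host "user:*"`, it prints how many keys it issued a delete for. On a Redis Cluster the scan would need to run against each node, which this sketch does not attempt.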

Runbook Checklist

Before publishing a runbook, verify:

```

Runbook Quality Checklist:

  • [ ] Title clearly describes the issue/procedure
  • [ ] Symptoms section helps readers identify when to use the runbook
  • [ ] All commands are copy-pasteable
  • [ ] Expected output documented for each command
  • [ ] Verification steps confirm success
  • [ ] Escalation path is clear
  • [ ] Links to dashboards work
  • [ ] Tested by someone other than author
  • [ ] Linked from relevant alerts

```
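Parts of this checklist can be checked mechanically before review. As a rough sketch, the function below confirms that a runbook file contains the main sections of the standard template and that the date placeholder was replaced. The function name `lint_runbook` is hypothetical, and the check covers structure only, not whether the commands actually work.

```bash
#!/usr/bin/env bash
# Sketch: verify that a runbook has the standard template's sections and no
# leftover date placeholder. Returns non-zero if anything is missing.
lint_runbook() {
  local file="$1" missing=0 section
  for section in "Symptoms" "Impact" "Prerequisites" "Diagnostic Steps" \
                 "Resolution Steps" "Verification" "Escalation"; do
    # Match the section name as a whole line, with or without "#" markers.
    if ! grep -Eq "^#* *${section} *$" "${file}"; then
      echo "MISSING: ${section}"
      missing=$((missing + 1))
    fi
  done
  # A leftover placeholder means "Last Updated" was never filled in.
  if grep -q "YYYY-MM-DD" "${file}"; then
    echo "PLACEHOLDER: date not filled in"
    missing=$((missing + 1))
  fi
  [ "${missing}" -eq 0 ]
}
```

Running `lint_runbook runbooks/my-service-restart.md` in CI (the path is illustrative) would fail the build for runbooks that skip a section, leaving the human reviewer to focus on the items that need judgment.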

Automation Integration

Runbook with Automation Hooks

```markdown

# [Service] - Automated Recovery

Automatic Actions

The following actions run automatically:

  1. Pod restart on OOMKilled (Kubernetes)
  2. Scale-up on high CPU (HPA)

Manual Steps (if auto-recovery fails)

Check why auto-recovery failed

```bash

kubectl describe hpa [service] -n [namespace]

kubectl get events -n [namespace] --sort-by='.lastTimestamp'

```

Manual intervention

[Steps here]

```
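When checking why auto-recovery failed, it often helps to see which pods were actually OOMKilled. The sketch below lists them using a JSONPath query over each pod's last container termination reason; the function name is hypothetical, and a pod with several containers is listed if any of them was OOMKilled.

```bash
#!/usr/bin/env bash
# Sketch: print the names of pods whose containers were last terminated
# with reason OOMKilled in the given namespace.
list_oomkilled_pods() {
  local namespace="$1"
  kubectl get pods -n "${namespace}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

An empty result from `list_oomkilled_pods production` suggests the restarts had another cause, which the events output above may explain.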

Script-Backed Runbook

```markdown

# [Service] - Diagnostic Script

Quick Diagnosis

Run the diagnostic script:

```bash

./scripts/diagnose-service.sh [service-name]

```

This script checks:

  • Pod status
  • Recent logs
  • Resource usage
  • Dependency health

Interpreting Results

| Result | Meaning | Action |

|--------|---------|--------|

| HEALTHY | All checks pass | No action needed |

| DEGRADED | Some issues | Follow specific recommendations |

| CRITICAL | Major issues | Escalate immediately |

```
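The template does not show what `diagnose-service.sh` contains. One way such a script might roll individual check results up into the three statuses in the table is sketched below. Everything here is an assumption: the function name, the `name=ok|warn|fail` input format, and the rule that any failure means CRITICAL while any warning means DEGRADED.

```bash
#!/usr/bin/env bash
# Hypothetical roll-up for a diagnostic script: each argument is "check=result"
# where result is ok, warn, or fail. The worst result decides the status.
summarize_checks() {
  local status="HEALTHY" result
  for result in "$@"; do
    case "${result#*=}" in
      fail) status="CRITICAL" ;;
      warn) [ "${status}" = "CRITICAL" ] || status="DEGRADED" ;;
    esac
  done
  echo "${status}"
}
```

For instance, `summarize_checks pods=ok logs=warn resources=ok deps=ok` prints `DEGRADED`. The individual checks (pod status, recent logs, resource usage, dependency health) would each need their own logic to produce those results.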

Common Runbook Categories

Every service should have runbooks for:

```

Essential Runbooks:

  • [ ] Service restart
  • [ ] Rollback deployment
  • [ ] Scale up/down
  • [ ] Clear cache
  • [ ] Database failover (if applicable)
  • [ ] Dependency failure response
  • [ ] High error rate investigation
  • [ ] High latency investigation

```

Additional Resources

  • [Example Runbooks](references/example-runbooks.md)
  • [Runbook Automation](references/automation.md)