🎯

planning-disaster-recovery

🎯Skill

from ancoleman/ai-design-components

VibeIndex|
What it does

planning-disaster-recovery skill from ancoleman/ai-design-components

πŸ“¦

Part of

ancoleman/ai-design-components(71 items)

planning-disaster-recovery

Installation

git cloneClone repository
git clone https://github.com/ancoleman/ai-design-components.git
Shell ScriptRun shell script
./install.sh
Shell ScriptRun shell script
./install.sh # Interactive mode
Shell ScriptRun shell script
./install.sh plugins list # See all available plugins
Shell ScriptRun shell script
./install.sh plugins install # Install all plugins

+ 1 more commands

πŸ“– Extracted from docs: ancoleman/ai-design-components
2Installs
158
-
Last UpdatedDec 11, 2025

Skill Details

SKILL.md

Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.

Overview

# Disaster Recovery

Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

When to Use This Skill

Invoke this skill when:

  • Defining recovery time objectives (RTO) and recovery point objectives (RPO)
  • Implementing database backups with point-in-time recovery (PITR)
  • Setting up Kubernetes cluster backup and restore workflows
  • Configuring cross-region replication for high availability
  • Testing disaster recovery procedures through chaos experiments
  • Meeting compliance requirements (GDPR, SOC 2, HIPAA)
  • Automating backup monitoring and alerting
  • Designing multi-cloud disaster recovery architectures

Core Concepts

RTO and RPO Fundamentals

Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

Criticality Tiers:

  • Tier 0 (Mission-Critical): RTO < 1 hour, RPO < 5 minutes
  • Tier 1 (Production): RTO 1-4 hours, RPO 15-60 minutes
  • Tier 2 (Important): RTO 4-24 hours, RPO 1-6 hours
  • Tier 3 (Standard): RTO > 24 hours, RPO > 6 hours

3-2-1 Backup Rule

Maintain 3 copies of data on 2 different media types with 1 copy offsite.

Example implementation:

  • Primary: Production database
  • Secondary: Local backup storage
  • Tertiary: Cloud backup (S3/GCS/Azure)

Backup Types

Full Backup: Complete copy of all data. Slowest to create, fastest to restore.

Incremental Backup: Only changes since last backup. Fastest to create, requires full + all incrementals to restore.

Differential Backup: Changes since last full backup. Balance between storage and restore speed.

Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.

Quick Decision Framework

Step 1: Map RTO/RPO to Strategy

```

RTO < 1 hour, RPO < 5 min

β†’ Active-Active replication, continuous archiving, automated failover

β†’ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR

β†’ Cost: Highest

RTO 1-4 hours, RPO 15-60 min

β†’ Warm standby, incremental backups, automated failover

β†’ Tools: pgBackRest, WAL-G, RDS Multi-AZ

β†’ Cost: High

RTO 4-24 hours, RPO 1-6 hours

β†’ Daily full + incremental, cross-region backup

β†’ Tools: pgBackRest, Velero, Restic

β†’ Cost: Medium

RTO > 24 hours, RPO > 6 hours

β†’ Weekly full + daily incremental, single region

β†’ Tools: pg_dump, mysqldump, S3 versioning

β†’ Cost: Low

```

Step 2: Select Backup Tools by Use Case

| Use Case | Primary Tool | Alternative | Key Feature |

|----------|-------------|-------------|-------------|

| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |

| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |

| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |

| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |

| File/object backup | Restic | Duplicity | Encryption, deduplication |

| Cross-region replication | Aurora Global DB | RDS Read Replica | Active-Active capable |

Database Backup Patterns

PostgreSQL with pgBackRest

Use Case: Production PostgreSQL with < 5 minute RPO

Quick Start: See examples/postgresql/pgbackrest-config/

Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule weekly full, daily differential backups. Enable PITR with pgbackrest --stanza=main --delta restore.

Detailed Guide: references/database-backups.md#postgresql

MySQL with Percona XtraBackup

Use Case: MySQL production requiring hot backups

Quick Start: See examples/mysql/xtrabackup/

Perform full (xtrabackup --backup --parallel=4) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.

Detailed Guide: references/database-backups.md#mysql

MongoDB Backup

Quick Start: Use mongodump --gzip --numParallelCollections=4 for logical backups or MongoDB Atlas for continuous backup with PITR.

Detailed Guide: references/database-backups.md#mongodb

Kubernetes Disaster Recovery

Velero for Cluster Backups

Quick Start: velero install --provider aws --bucket my-backups

Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with velero restore create --from-backup . Support selective restore (namespace mappings, storage class remapping).

Examples: examples/kubernetes/velero/

Detailed Guide: references/kubernetes-dr.md

etcd Backup

Quick Start: ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db

Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.

Examples: examples/kubernetes/etcd/

Cloud-Specific DR Patterns

AWS

Key Services:

  • RDS: Automated backups (30-day retention), PITR, Multi-AZ
  • Aurora Global DB: Cross-region active-passive with automatic failover
  • S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)

Examples: examples/cloud/aws/

Detailed Guide: references/cloud-dr-patterns.md#aws

GCP

Key Services:

  • Cloud SQL: PITR with 7-day transaction logs, 30-day retention
  • GCS Multi-Regional: Automatic replication across 100+ mile separation
  • Regional HA: Synchronous replication within region

Detailed Guide: references/cloud-dr-patterns.md#gcp

Azure

Key Services:

  • Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
  • Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
  • Geo-Redundant Storage: Automatic replication to secondary region

Detailed Guide: references/cloud-dr-patterns.md#azure

Cross-Region Replication Patterns

| Pattern | RTO | RPO | Cost | Use Case |

|---------|-----|-----|------|----------|

| Active-Active | < 1 min | < 1 min | High | Both regions serve traffic |

| Active-Passive | 15-60 min | 5-15 min | Medium | Standby for failover |

| Pilot Light | 10-30 min | 5-15 min | Low | Minimal secondary infra |

| Warm Standby | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |

Implementation Examples:

  • PostgreSQL streaming replication (Active-Passive)
  • Aurora Global Database (Active-Active)
  • ASG scale-up automation (Pilot Light)

Detailed Guide: references/cross-region-replication.md

Testing Disaster Recovery

Chaos Engineering

Purpose: Validate DR procedures through controlled failure injection.

Test Scenarios:

  • Database failover (stop primary, measure promotion time)
  • Region failure (block network, trigger DNS failover)
  • Kubernetes recovery (delete namespace, restore from Velero)

Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy

Examples: examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh

Detailed Guide: references/chaos-engineering.md

Automated DR Drills

Run Monthly Tests:

```bash

./scripts/dr-drill.sh --environment staging --test-type full

./scripts/test-restore.sh --backup latest --target staging-db

```

Compliance and Retention

| Regulation | Retention | Requirements |

|------------|-----------|--------------|

| GDPR | 1-7 years | EU data residency, right to erasure |

| SOC 2 | 1 year+ | Secure deletion, access controls |

| HIPAA | 6 years | Encryption, PHI protection |

| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |

Implement with S3/GCS lifecycle policies: 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive

Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.

Detailed Guide: references/compliance-retention.md

Monitoring and Alerting

Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization

Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend

Validation Scripts:

```bash

./scripts/validate-backup.sh --backup latest --verify-integrity

./scripts/check-retention.sh --report-violations

./scripts/generate-dr-report.sh --format pdf

```

Automation and Runbooks

Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)

DR Runbook Steps: Detect failure β†’ Verify secondary β†’ Promote β†’ Update DNS β†’ Notify β†’ Document

Detailed Guide: references/runbook-automation.md

Integration with Other Skills

Related Skills

Prerequisites:

  • infrastructure-as-code: Provision backup infrastructure, DR regions
  • kubernetes-operations: K8s cluster setup for Velero
  • secret-management: Backup encryption keys, credentials

Parallel Skills:

  • databases-postgresql: PostgreSQL configuration and operations
  • databases-mysql: MySQL configuration and operations
  • observability: Backup monitoring, alerting
  • security-hardening: Secure backup storage, access control

Consumer Skills:

  • incident-management: Invoke DR procedures during incidents
  • compliance-frameworks: Meet regulatory requirements

Skill Chaining Example

```

infrastructure-as-code β†’ secret-management β†’ disaster-recovery β†’ observability

↓ ↓ ↓ ↓

Create S3 buckets Store encryption Configure backups Monitor jobs

Provision databases keys in Vault Set up replication Alert failures

Setup VPCs Manage credentials Test DR drills Track metrics

```

Best Practices

Do

βœ“ Test restores regularly (monthly for critical systems)

βœ“ Automate backup monitoring and alerting

βœ“ Encrypt backups at rest and in transit

βœ“ Implement 3-2-1 backup rule

βœ“ Define and measure RTO/RPO

βœ“ Run chaos experiments to validate DR

βœ“ Document recovery procedures

βœ“ Store backups in different regions

βœ“ Use immutable backups for ransomware protection

βœ“ Automate DR testing in CI/CD

Don't

βœ— Assume backups work without testing

βœ— Store all backups in single region

βœ— Skip retention policy definition

βœ— Forget to encrypt sensitive data

βœ— Rely solely on cloud provider backups

βœ— Ignore backup monitoring

βœ— Perform backups only from primary database under high load

βœ— Store encryption keys with backups

Reference Documentation

  • RTO/RPO Planning: references/rto-rpo-planning.md
  • Database Backups: references/database-backups.md
  • Kubernetes DR: references/kubernetes-dr.md
  • Cloud DR Patterns: references/cloud-dr-patterns.md
  • Cross-Region Replication: references/cross-region-replication.md
  • Chaos Engineering: references/chaos-engineering.md
  • Compliance Requirements: references/compliance-retention.md
  • Runbook Automation: references/runbook-automation.md

Examples

  • Runbooks: examples/runbooks/database-failover.md, examples/runbooks/region-failover.md
  • PostgreSQL: examples/postgresql/pgbackrest-config/, examples/postgresql/walg-config/
  • MySQL: examples/mysql/xtrabackup/, examples/mysql/walg/
  • Kubernetes: examples/kubernetes/velero/, examples/kubernetes/etcd/
  • Cloud: examples/cloud/aws/, examples/cloud/gcp/, examples/cloud/azure/
  • Chaos: examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh

Scripts

  • scripts/validate-backup.sh: Verify backup integrity
  • scripts/test-restore.sh: Automated restore testing
  • scripts/dr-drill.sh: Run full DR drill
  • scripts/check-retention.sh: Verify retention policies
  • scripts/generate-dr-report.sh: Compliance reporting

More from this repository10

🎯
assembling-components🎯Skill

Helps Claude systematically combine and integrate software design components, ensuring architectural coherence and reducing redundant code during full-stack development.

🎯
managing-dns🎯Skill

Manages DNS configuration, record creation, and domain management tasks across cloud providers and local network environments.

🎯
implementing-compliance🎯Skill

implementing-compliance skill from ancoleman/ai-design-components

🎯
visualizing-data🎯Skill

Helps Claude generate professional data visualizations using Python libraries like Matplotlib, Seaborn, and Plotly, with expertise in creating charts, graphs, and interactive plots tailored to diff...

πŸͺ
ancoleman-ai-design-componentsπŸͺMarketplace

Comprehensive full-stack development skills for AI-assisted development covering UI/UX, backend, DevOps, infrastructure, security, and AI/ML

🎯
generating-documentation🎯Skill

Generates comprehensive, structured documentation for code repositories, projects, and software components by analyzing source code, identifying key structures, and creating clear, professional doc...

🎯
model-serving🎯Skill

Streamlines deployment and management of machine learning models with scalable, reproducible serving infrastructure across cloud and edge environments.

🎯
implementing-drag-drop🎯Skill

implementing-drag-drop skill from ancoleman/ai-design-components

🎯
implementing-navigation🎯Skill

implementing-navigation skill from ancoleman/ai-design-components

🎯
theming-components🎯Skill

theming-components skill from ancoleman/ai-design-components