incident-response
π―Skillfrom nik-kale/sre-skills
Systematically investigates and resolves production incidents through structured triage, data gathering, impact assessment, and root cause analysis.
Part of
nik-kale/sre-skills(5 items)
Installation
npx skills add nik-kale/sre-skillsnpx skills add nik-kale/sre-skills/incident-responseSkill Details
Guide systematic investigation of production incidents including triage, data gathering, impact assessment, and root cause analysis. Use when investigating outages, service degradation, production errors, alerts firing, or when the user mentions incident, outage, downtime, or production issues.
More from this repository4
Generates standardized, actionable operational runbooks and incident response playbooks with executable steps and best practices.
Systematically evaluates service readiness for production by assessing reliability, security, observability, and operational requirements through a comprehensive checklist.
Helps SREs diagnose and resolve Kubernetes cluster issues through comprehensive troubleshooting techniques and diagnostic commands.
Automates infrastructure observability setup by configuring monitoring, logging, and tracing tools for comprehensive system visibility.