1. Initial Triage
Use the Task tool (subagent_type="devops-troubleshooter") for AI-powered analysis:
- Error pattern recognition
- Stack trace analysis with probable causes
- Component dependency analysis
- Severity assessment
- Generate 3-5 ranked hypotheses
- Recommend debugging strategy
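A sketch of the kind of delegation prompt (illustrative wording, not a fixed API; fill in the bracketed context):

```text
Analyze this failure. Return: (1) error pattern classification,
(2) stack trace analysis with probable causes, (3) affected component
dependencies, (4) severity assessment, (5) 3-5 hypotheses ranked by
probability with supporting evidence, (6) recommended debugging strategy.
Context: [stack trace, recent deploy diff, error-tracker link]
```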
2. Observability Data Collection
For production/staging issues, gather:
- Error tracking (Sentry, Rollbar, Bugsnag)
- APM metrics (DataDog, New Relic, Dynatrace)
- Distributed traces (Jaeger, Zipkin, Honeycomb)
- Log aggregation (ELK, Splunk, Loki)
- Session replays (LogRocket, FullStory)
Query for:
- Error frequency/trends
- Affected user cohorts
- Environment-specific patterns
- Related errors/warnings
- Performance degradation correlation
- Deployment timeline correlation
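As a concrete example, a minimal sketch of pulling error frequency and trends from Sentry's REST API (the org slug, token, and release filter are placeholders; other trackers expose equivalent endpoints):

```python
import requests

SENTRY_API = "https://sentry.io/api/0"
ORG = "your-org-slug"      # placeholder
TOKEN = "your-auth-token"  # placeholder: Sentry auth token with event:read scope

resp = requests.get(
    f"{SENTRY_API}/organizations/{ORG}/issues/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "query": "is:unresolved",  # add e.g. a release: filter to correlate with deploys
        "statsPeriod": "24h",      # error trends over the last day
        "sort": "freq",            # most frequent issues first
    },
    timeout=30,
)
resp.raise_for_status()
for issue in resp.json():
    print(issue["shortId"], issue["count"], issue["title"])
```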
3. Hypothesis Generation
For each hypothesis, include:
- Probability score (0-100%)
- Supporting evidence from logs/traces/code
- Falsification criteria
- Testing approach
- Expected symptoms if true
Common categories:
- Logic errors (race conditions, null handling)
- State management (stale cache, incorrect transitions)
- Integration failures (API changes, timeouts, auth)
- Resource exhaustion (memory leaks, connection pools)
- Configuration drift (env vars, feature flags)
- Data corruption (schema mismatches, encoding)
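A minimal sketch of recording hypotheses as structured data so they can be ranked and falsified systematically (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    description: str          # e.g. "stale cache entry served after deploy"
    category: str             # one of the categories above
    probability: float        # 0-100, initial ranking from triage
    evidence: list[str] = field(default_factory=list)  # log lines, trace IDs, commits
    falsification: str = ""   # observation that would rule this out
    test_plan: str = ""       # cheapest experiment that discriminates it
    expected_symptoms: list[str] = field(default_factory=list)

hypotheses = [
    Hypothesis("stale cache after deploy", "state management", 70,
               evidence=["cache hit rate spiked at deploy time"]),
    Hypothesis("connection pool exhaustion", "resource exhaustion", 45),
]
hypotheses.sort(key=lambda h: h.probability, reverse=True)  # test most likely first
```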
4. Strategy Selection
Select based on issue characteristics:
Interactive Debugging: Reproducible locally → VS Code/Chrome DevTools, step-through execution
Observability-Driven: Production issues → Sentry/DataDog/Honeycomb, trace analysis
Time-Travel: Complex state issues → rr/Redux DevTools, record & replay
Chaos Engineering: Intermittent under load → Chaos Monkey/Gremlin, inject failures
Statistical: Failures in a small % of cases → Delta debugging, compare successful vs. failing runs
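For the statistical strategy, a simplified sketch of Zeller's ddmin delta-debugging algorithm, which shrinks a failing input to a near-minimal reproducer (the `fails` predicate is whatever check reproduces your bug):

```python
def ddmin(failing_input, fails):
    """Shrink a failing input (list or string) to a smaller subset that
    still triggers the bug. `fails(subset)` returns True if it still fails."""
    n = 2
    while len(failing_input) >= 2:
        chunk = len(failing_input) // n
        reduced = False
        for i in range(0, len(failing_input), chunk):
            complement = failing_input[:i] + failing_input[i + chunk:]
            if fails(complement):
                failing_input = complement  # failure persists without this chunk
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(failing_input):
                break
            n = min(n * 2, len(failing_input))  # retry with finer-grained chunks
    return failing_input

# Toy usage: the "bug" triggers whenever 'X' is present in the input.
print(ddmin(list("aXbc"), lambda s: "X" in s))  # -> ['X']
```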
5. Intelligent Instrumentation
AI suggests optimal breakpoint/logpoint locations:
- Entry points to affected functionality
- Decision nodes where behavior diverges
- State mutation points
- External integration boundaries
- Error handling paths
In production-like environments, prefer conditional breakpoints and non-suspending logpoints, as sketched below.
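A minimal sketch of logpoint-style conditional instrumentation at a decision node (the `Order` type and the negative-total condition are hypothetical stand-ins for your suspect code path):

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("debug.orders")

@dataclass
class Order:  # hypothetical stand-in for the real domain object
    id: str
    total: float

def process_order(order: Order) -> None:
    # Logpoint: record state without pausing execution, guarded so it
    # only fires on the suspected bad path.
    if order.total < 0:
        log.warning("negative total at decision point: id=%s total=%s",
                    order.id, order.total)
        # breakpoint()  # conditional breakpoint: uncomment when attached locally

process_order(Order(id="o-123", total=-5.0))
```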
6. Production-Safe Techniques
Dynamic Instrumentation: OpenTelemetry spans, non-invasive attributes
Feature-Flagged Debug Logging: Conditional logging for specific users
Sampling-Based Profiling: Continuous profiling with minimal overhead (Pyroscope)
Read-Only Debug Endpoints: Protected by auth, rate-limited state inspection
Gradual Traffic Shifting: Canary-deploy the instrumented debug build to a small slice (e.g., 10%) of traffic
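A minimal sketch of the first two techniques using the OpenTelemetry Python API: span attributes plus a feature-flagged debug cohort. `DEBUG_USERS` is a placeholder for a real flag source, and without an SDK configured the API calls are harmless no-ops:

```python
from opentelemetry import trace

tracer = trace.get_tracer("debug.instrumentation")
DEBUG_USERS = {"user-42"}  # placeholder: would come from a feature-flag service

def handle_request(user_id: str, payload: dict) -> None:
    # Wrap the suspect path in a span; attributes ride along with normal
    # trace export, so there is no redeploy-and-print cycle.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("debug.user_id", user_id)
        span.set_attribute("debug.payload_size", len(payload))
        if user_id in DEBUG_USERS:  # extra detail only for the flagged cohort
            span.set_attribute("debug.payload_keys", ",".join(payload))
        ...

handle_request("user-42", {"item": 1, "qty": 2})
```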
7. Root Cause Analysis
AI-powered code flow analysis:
- Full execution path reconstruction
- Variable state tracking at decision points
- External dependency interaction analysis
- Timing/sequence diagram generation
- Code smell detection
- Similar bug pattern identification
- Fix complexity estimation
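Locally, `sys.settrace` gives a crude version of execution-path reconstruction and variable-state tracking; a sketch, where `suspect` stands in for the function under investigation:

```python
import sys

def trace_lines(frame, event, arg):
    # Log each executed line and the local variable state in the target
    # function -- a manual stand-in for full execution-path reconstruction.
    if event == "line" and frame.f_code.co_name == "suspect":
        print(f"line {frame.f_lineno}: locals={frame.f_locals}")
    return trace_lines

def suspect(x):
    y = x * 2
    if y > 10:
        y -= 1
    return y

sys.settrace(trace_lines)
suspect(7)
sys.settrace(None)
```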
8. Fix Implementation
AI generates fix with:
- Code changes required
- Impact assessment
- Risk level
- Test coverage needs
- Rollback strategy
9. Validation
Post-fix verification:
- Run test suite
- Performance comparison (baseline vs fix)
- Canary deployment (monitor error rate)
- AI code review of fix
Success criteria:
- Tests pass
- No performance regression
- Error rate unchanged or decreased
- No new edge cases introduced
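A sketch of automating the error-rate criterion with a one-sided two-proportion z-test comparing canary against baseline (the threshold and sample counts are illustrative):

```python
from math import sqrt

def error_rate_regressed(base_err, base_total, canary_err, canary_total,
                         z_crit=1.64):
    """Flag the canary if its error rate is significantly higher than
    baseline (one-sided test at roughly the 95% level)."""
    p1 = base_err / base_total
    p2 = canary_err / canary_total
    pooled = (base_err + canary_err) / (base_total + canary_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if se == 0:
        return p2 > p1
    return (p2 - p1) / se > z_crit

# e.g. 40 errors in 100k baseline requests vs 9 errors in 10k canary requests
print(error_rate_regressed(40, 100_000, 9, 10_000))  # True -> roll back
```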
10. Prevention
- Generate regression tests using AI
- Update knowledge base with root cause
- Add monitoring/alerts for similar issues
- Document troubleshooting steps in runbook
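A sketch of what a generated regression test might look like in pytest (`billing.apply_discount` and the expected behavior are hypothetical; pin whatever the actual root cause was):

```python
import pytest

# Hypothetical module/function names: substitute the real fix location.
from billing import apply_discount

def test_discount_never_produces_negative_total():
    """Regression test: a discount larger than the cart total must clamp
    to zero rather than go negative (the original root cause)."""
    assert apply_discount(total=10.0, discount=25.0) == 0.0

@pytest.mark.parametrize("total,discount", [(0.0, 5.0), (10.0, 10.0), (10.0, 0.0)])
def test_discount_edge_cases(total, discount):
    assert apply_discount(total=total, discount=discount) >= 0.0
```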