testing-validator
Skill from adaptationio/skrillz
Part of adaptationio/skrillz (191 items)
Installation
python3 scripts/validate-examples.py /path/to/skill
python3 scripts/test-runner.py /path/to/skill --mode comprehensive
python3 scripts/generate-test-report.py test-results.json --output report.md
Skill Details
Comprehensive testing validation for Claude Code skills through functional testing, example validation, integration testing, regression testing, and edge case testing. Task-based testing operations with automated example execution, manual scenario testing, and test reporting. Use when testing skill functionality, validating examples execute correctly, ensuring integration works, preventing regressions, or conducting comprehensive functional quality assurance.
# Testing Validator
Overview
testing-validator provides comprehensive functional testing for Claude Code skills, validating that skills actually work correctly in practice through systematic testing operations.
Purpose: Functional validation - ensure skills work correctly, not just look good
The 5 Testing Operations:
- Functional Testing - Core skill functionality works as intended
- Example Validation - All code/command examples execute successfully
- Integration Testing - Skills work correctly with dependencies and compositions
- Regression Testing - Updates don't break existing functionality
- Edge Case Testing - Handles unusual scenarios and boundary conditions
Complement to review-multi:
- review-multi: Quality assessment (structure, content, patterns, usability) - "Is it good?"
- testing-validator: Functional validation (does it work, examples execute, integrations function) - "Does it work?"
- Together: Complete validation (quality + functionality)
Key Benefits:
- Automated example execution (catch broken examples)
- Integration validation (ensure skills compose correctly)
- Regression prevention (detect breaks from updates)
- Edge case coverage (handle unusual scenarios)
- Systematic testing (consistent, repeatable)
When to Use
Use testing-validator when:
- Pre-Deployment Testing - Validate functionality before release
- Example Validation - Ensure all examples execute correctly
- Integration Validation - Test workflow skills and dependencies
- Post-Update Testing - Regression testing after changes
- Comprehensive QA - Combined with review-multi for complete validation
- CI/CD Integration - Automated testing in pipelines
- Edge Case Validation - Test boundary conditions and unusual scenarios
- Functional Certification - Certify skills work correctly in practice
Prerequisites
- Skill to test
- Ability to execute examples (appropriate environment)
- Time allocation:
  - Quick Check: 15-30 minutes
  - Single Operation: 20-90 minutes
  - Comprehensive Testing: 2-4 hours
Operations
Operation 1: Functional Testing
Purpose: Validate core skill functionality works as intended
When to Use This Operation:
- Testing if skill achieves stated purpose
- Validating core functionality
- Checking if instructions lead to successful outcomes
- Pre-deployment functional validation
Automation Level: 30% automated (script checks), 70% manual (scenario execution)
Process:
1. Select Test Scenarios
   - Choose 2-3 scenarios from "When to Use" section
   - Prioritize: primary use case + common case + edge case
   - Ensure scenarios cover main functionality
2. Execute Scenarios
   - Actually follow skill instructions
   - Complete the intended task
   - Document results (success/partial/failure)
   - Note any issues encountered
3. Validate Outputs
   - Does skill produce expected outputs?
   - Are outputs useful and correct?
   - Do outputs match documentation?
4. Check Error Handling
   - What happens with errors?
   - Are error messages helpful?
   - Can users recover from errors?
5. Assess Functionality
   - Does skill achieve stated purpose?
   - Is functionality complete?
   - Are there functional gaps?
Validation Checklist:
- [ ] Primary use case tested (from "When to Use")
- [ ] Common use case tested
- [ ] Edge case tested (if applicable)
- [ ] All scenarios completed successfully
- [ ] Outputs correct and useful
- [ ] Error handling works (if errors encountered)
- [ ] Functionality complete (no gaps)
- [ ] Skill achieves stated purpose
Test Results:
- PASS: All scenarios succeed, functionality complete
- PARTIAL: Some scenarios succeed, minor issues
- FAIL: Scenarios fail, functionality broken
Outputs:
- Test result (PASS/PARTIAL/FAIL)
- Scenario execution results
- Functional issues identified (if any)
- Recommendations for fixes
Time Estimate: 30-90 minutes
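Before the worked example, here is a minimal sketch of how scenario outcomes could be recorded and rolled up into the PASS/PARTIAL/FAIL result defined above. The names ScenarioResult and overall_result are hypothetical and not part of this skill's scripts:

```python
# Illustrative sketch only: ScenarioResult and overall_result are hypothetical
# names, not part of this skill's scripts.
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    name: str        # e.g. "Primary: Research GitHub API integration patterns"
    status: str      # "success", "partial", or "failure"
    notes: str = ""  # issues encountered, observations about outputs

def overall_result(scenarios: list[ScenarioResult]) -> str:
    """Roll per-scenario outcomes up into PASS / PARTIAL / FAIL."""
    statuses = {s.status for s in scenarios}
    if statuses <= {"success"}:
        return "PASS"      # all scenarios succeed, functionality complete
    if "success" not in statuses and "failure" in statuses:
        return "FAIL"      # scenarios fail, functionality broken
    return "PARTIAL"       # mixed results, minor issues

if __name__ == "__main__":
    results = [
        ScenarioResult("Primary use case", "success"),
        ScenarioResult("Common use case", "success"),
        ScenarioResult("Edge case", "partial", "error message unclear"),
    ]
    print(overall_result(results))  # PARTIAL
```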
Example:
```
Functional Testing: skill-researcher
====================================
Test Scenarios:
- Primary: Research GitHub API integration patterns
- Common: Research for skill development planning
- Edge: Research with no results found
Scenario 1: GitHub API Integration Research
- Executed: Operation 2 (GitHub Repository Research)
- Result: ✅ SUCCESS
- Time: 25 minutes
- Output: Found 5 repositories, extracted patterns
- Functionality: Achieved purpose (research complete)
Scenario 2: Skill Development Research
- Executed: All 5 operations (Web, GitHub, Docs, Synthesis)
- Result: ✅ SUCCESS
- Time: 60 minutes
- Output: Research synthesis with 4 sources, 3 patterns
- Functionality: Fully achieved purpose
Scenario 3: No Results Edge Case
- Executed: Web search for obscure topic
- Result: ✅ HANDLED
- Time: 10 minutes
- Output: "No results found" with guidance to adjust search
- Error Handling: Good (helpful message, suggests alternatives)
Overall Functional Test: ✅ PASS
- All scenarios succeeded
- Functionality complete
- Error handling works
- Achieves stated purpose
```
---
Operation 2: Example Validation
Purpose: Verify all code/command examples in skill documentation execute correctly
When to Use This Operation:
- Validating documentation accuracy
- Ensuring examples are current and working
- Preventing broken example deployment
- Post-update example regression testing
Automation Level: 80% automated (example extraction and execution)
Process:
1. Extract All Examples
   - Scan SKILL.md for code blocks (```)
   - Extract examples with language tags
   - Identify executable vs informational examples
   - Count total examples
2. Categorize Examples
   - Shell/bash commands
   - Python code snippets
   - YAML/config samples
   - Informational (not executable)
3. Execute Examples Automatically (see the sketch after this list)
   ```bash
   python3 scripts/validate-examples.py /path/to/skill
   ```
   - Executes all bash/python examples
   - Captures output and errors
   - Compares to expected output (if documented)
   - Reports success/failure per example
4. Manual Validation (for non-automatable examples)
   - Configuration examples (check syntax)
   - Conceptual examples (check accuracy)
   - Workflow examples (check logic)
5. Generate Example Report
   - Total examples: X
   - Executable: Y (Z%)
   - Passed: A
   - Failed: B
   - Success rate: A / (A + B) × 100%
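The sketch below illustrates the extract-and-execute idea, assuming SKILL.md sits at the skill root and that only bash and python blocks are executable. The regex, timeout, and report format are illustrative assumptions, not the actual behavior of validate-examples.py:

```python
# Illustrative sketch of extract-and-execute; the real validate-examples.py
# may differ. Regex, timeout, and reporting here are assumptions.
import re
import subprocess
import sys
from pathlib import Path

FENCE = re.compile(r"`{3}(\w+)?\n(.*?)`{3}", re.DOTALL)

def extract_examples(skill_path: str):
    """Return (language, code) pairs for every fenced block in SKILL.md."""
    text = Path(skill_path, "SKILL.md").read_text()
    return [(lang or "text", code) for lang, code in FENCE.findall(text)]

def run_example(lang: str, code: str) -> bool:
    """Execute a bash or python example; other languages are informational."""
    if lang in ("bash", "sh"):
        cmd = ["bash", "-c", code]
    elif lang == "python":
        cmd = [sys.executable, "-c", code]
    else:
        return True  # not executable, validated manually
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return result.returncode == 0

if __name__ == "__main__":
    examples = extract_examples(sys.argv[1])
    passed = sum(run_example(lang, code) for lang, code in examples)
    total = len(examples)
    print(f"Success rate: {passed}/{total} ({100 * passed / max(total, 1):.0f}%)")
```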
Validation Checklist:
- [ ] All examples extracted and counted
- [ ] Executable examples identified
- [ ] Automated validation run (bash/python examples)
- [ ] Non-executable examples checked manually
- [ ] All examples execute successfully OR expected failures documented
- [ ] Broken examples identified with fixes
- [ ] Success rate β₯90% (for production)
Test Results:
- PASS: β₯90% of executable examples work correctly
- PARTIAL: 70-89% examples work, some broken
- FAIL: <70% examples work, many broken
Outputs:
- Example inventory (total, executable, non-executable)
- Execution results per example
- Success rate
- Broken examples list with error messages
- Recommendations for fixes
Time Estimate: 20-45 minutes (mostly automated)
Example:
```
Example Validation: review-multi
=================================
Extraction Results:
- Total examples: 18
- Executable (bash): 12
- Executable (python): 3
- Informational (YAML): 3
Automated Execution:
Bash Examples (12 total):
✅ PASS: python3 scripts/validate-structure.py
✅ PASS: python3 scripts/check-patterns.py
✅ PASS: python3 scripts/generate-review-report.py
✅ PASS: python3 scripts/review-runner.py
⚠️ WARNING: Example uses placeholder
- Success Rate: 12/12 (100%)
Python Examples (3 total):
✅ PASS: All 3 syntax-valid, execute correctly
- Success Rate: 3/3 (100%)
Manual Validation (3 YAML examples):
✅ PASS: All YAML examples valid syntax
✅ PASS: Frontmatter examples follow standards
Overall Example Validation: ✅ PASS
- Success Rate: 100% (18/18 examples work)
- Minor Note: Some examples use placeholders (acceptable with clear notes)
Recommendation: Examples excellent, all functional
```
---
Operation 3: Integration Testing
Purpose: Test skills work correctly with other skills, especially in workflows and compositions
When to Use This Operation:
- Testing workflow skills (that compose others)
- Validating dependencies work correctly
- Checking skill integration points
- Testing data flow between skills
Automation Level: 20% automated (dependency checking), 80% manual (actual integration testing)
Process:
1. Identify Integration Points
   - Does the skill depend on other skills?
   - Does the skill compose with others (workflow)?
   - Are there data flows between skills?
   - Are integration examples provided?
2. Test Skill Dependencies
   - Load required skills (can they be loaded?)
   - Execute dependent functionality
   - Verify dependency works as expected
   - Check version compatibility (if applicable)
3. Test Workflow Compositions
   - For workflow skills: execute the multi-skill workflow
   - Verify data flows correctly between steps
   - Check each component skill integration
   - Validate output-to-input transitions (see the sketch after this list)
4. Test Integration Examples
   - Execute documented integration examples
   - Verify skills compose as documented
   - Check that integration instructions are accurate
5. Assess Integration Quality
   - Are integrations smooth or problematic?
   - Does data flow correctly?
   - Is integration guidance clear?
   - Does error handling work across skill boundaries?
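One way to check output-to-input transitions automatically is sketched below. The artifact names come from the development-workflow example later in this operation; the check_data_flow helper is hypothetical and not part of the skill's scripts:

```python
# Hypothetical helper, not part of the skill's scripts. Artifact names follow
# the development-workflow example later in this operation.
from pathlib import Path

# Each tuple: (producing step, artifact it must hand to the next step)
TRANSITIONS = [
    ("skill-researcher",   "research-synthesis.md"),
    ("planning-architect", "skill-architecture-plan.md"),
    ("task-development",   "task-breakdown.md"),
    ("prompt-builder",     "prompts-collection.md"),
]

def check_data_flow(workspace: str) -> list[str]:
    """Return the artifacts missing from a workflow run's workspace."""
    missing = []
    for step, artifact in TRANSITIONS:
        if not Path(workspace, artifact).exists():
            missing.append(f"{step} did not produce {artifact}")
    return missing

if __name__ == "__main__":
    issues = check_data_flow("./workflow-run")
    print("PASS: data flow intact" if not issues else "\n".join(issues))
```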
Validation Checklist:
- [ ] Integration points identified
- [ ] Dependencies tested (if applicable)
- [ ] Workflow composition tested (if workflow skill)
- [ ] Data flow validated (inputs/outputs correct)
- [ ] Integration examples execute successfully
- [ ] Cross-skill error handling works
- [ ] Integration guidance accurate
- [ ] No integration issues found
Test Results:
- PASS: All integrations work smoothly
- PARTIAL: Integrations work with minor issues
- FAIL: Integration broken or major issues
- N/A: Standalone skill with no integrations
Outputs:
- Integration test results
- Workflow execution results (if applicable)
- Data flow validation
- Integration issues (if any)
- Recommendations
Time Estimate: 30-90 minutes (varies by integration complexity, N/A for standalone skills)
Example:
```
Integration Testing: development-workflow
==========================================
Integration Type: Workflow Composition (5 component skills)
Dependencies Identified:
- skill-researcher (Step 1)
- planning-architect (Step 2)
- task-development (Step 3, optional)
- prompt-builder (Step 4)
- todo-management (Step 5)
Integration Test Execution:
Step 1 → Step 2 Integration:
- Input to Step 2: research-synthesis.md from Step 1
- Test: Create research synthesis, feed to planning-architect
- Result: ✅ PASS (planning-architect correctly uses research findings)
- Data Flow: Smooth (outputs match expected inputs)
Step 2 → Step 3 Integration:
- Input to Step 3: skill-architecture-plan.md from Step 2
- Test: Create architecture plan, feed to task-development
- Result: ✅ PASS (task-development breaks down plan correctly)
- Data Flow: Smooth
Step 3 → Step 4 Integration:
- Input to Step 4: task-breakdown.md from Step 3
- Test: Create task breakdown, feed to prompt-builder
- Result: ✅ PASS (prompt-builder creates prompts for tasks)
- Data Flow: Smooth
Step 4 → Step 5 Integration:
- Input to Step 5: prompts-collection.md from Step 4
- Test: Create prompts, feed to todo-management
- Result: ✅ PASS (todo-management creates todos from tasks)
- Data Flow: Smooth
Workflow Execution Test:
- Executed: Complete workflow (all 5 steps)
- Result: ✅ SUCCESS (produced complete skill planning artifacts)
- Time: 4.5 hours (as documented)
- Quality: High (artifacts complete and usable)
Overall Integration Test: ✅ PASS
- All 5 integrations work smoothly
- Data flows correctly between steps
- Workflow achieves stated purpose
- No integration issues found
```
---
Operation 4: Regression Testing
Purpose: Ensure updates don't break existing functionality
When to Use This Operation:
- After skill updates or improvements
- Before deploying changes
- Validating skill-updater changes
- Post-auto-updater verification
Automation Level: 60% automated (comparison, example re-execution), 40% manual
Process:
1. Establish Baseline
   - Before changes: run tests, document results
   - Save baseline test results
   - Note which examples/scenarios work
2. Apply Changes
   - Make updates to skill
   - Document what changed
3. Re-Run Tests
   - Re-execute same tests as baseline
   - Run example validation again
   - Test same scenarios
4. Compare Results (see the sketch after this list)
   - Before vs After comparison
   - Which tests changed status?
   - New failures? (regressions)
   - New successes? (improvements)
   - Unchanged? (stable)
5. Identify Regressions
   - Tests that passed before but fail now
   - Functionality that worked but now broken
   - Examples that executed but now error
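A minimal sketch of the comparison step, assuming baseline and current results are stored as simple JSON mappings from test name to "PASS" or "FAIL" (the actual test-runner.py output format may differ):

```python
# Sketch only: assumes results files map test names to "PASS"/"FAIL".
import json
from pathlib import Path

def compare(baseline_path: str, current_path: str) -> dict:
    """Classify each test as regression, improvement, or unchanged."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    return {
        "regressions": [t for t, r in baseline.items()
                        if r == "PASS" and current.get(t) != "PASS"],
        "improvements": [t for t, r in current.items()
                         if r == "PASS" and baseline.get(t) != "PASS"],
        "unchanged": [t for t in baseline if baseline[t] == current.get(t)],
    }

if __name__ == "__main__":
    diff = compare("baseline-results.json", "current-results.json")
    if diff["regressions"]:
        print("REGRESSION:", ", ".join(diff["regressions"]))
    elif diff["improvements"]:
        print("IMPROVED: new passes, no regressions")
    else:
        print("PASS: no regressions")
```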
Validation Checklist:
- [ ] Baseline tests documented (before changes)
- [ ] Changes applied and documented
- [ ] All baseline tests re-executed
- [ ] Results compared (before vs after)
- [ ] No new failures (no regressions)
- [ ] If failures: identified and documented
- [ ] Regression fixes applied (if needed)
- [ ] Final validation: all tests pass
Test Results:
- PASS: No regressions (all baseline tests still pass)
- REGRESSION: Some tests failed that previously passed
- IMPROVED: Some tests pass that previously failed (plus no regressions)
Outputs:
- Regression test report
- Before/after comparison
- Identified regressions (if any)
- Regression fixes (if applicable)
- Final test status
Time Estimate: 30-60 minutes
Example:
```
Regression Testing: planning-architect (after Quick Ref addition)
==================================================================
Baseline (Before Quick Reference):
- Structure validation: 5/5 (PASS)
- Example count: 8 examples
- All examples: Execute successfully
- Scenarios tested: 2 scenarios (both PASS)
Changes Applied:
- Added Quick Reference section (96 lines)
- Added tables, checklists, decision tree
Re-Run Tests (After Quick Reference):
- Structure validation: 5/5 (PASS) ✅ No regression
- Example count: 8 examples ✅ No change
- All examples: Execute successfully ✅ No regression
- Scenarios tested: 2 scenarios (both PASS) ✅ No regression
- NEW: Quick Reference detected ✅ Improvement
Comparison:
✅ All baseline tests still pass (no regressions)
✅ New functionality added (Quick Reference)
✅ Quality maintained (5/5 score)
Overall Regression Test: ✅ PASS (No Regressions)
Additional: ✅ IMPROVEMENT (Quick Reference added)
Recommendation: Changes safe to deploy
```
---
Operation 5: Edge Case Testing
Purpose: Test skill handles unusual scenarios, boundary conditions, and edge cases correctly
When to Use This Operation:
- Testing robustness
- Validating error handling
- Checking boundary conditions
- Ensuring graceful degradation
Automation Level: 30% automated (known edge case checks), 70% manual (scenario thinking)
Process:
1. Identify Edge Cases
   - Empty inputs (what if no data?)
   - Maximum inputs (what if too much data?)
   - Invalid inputs (what if wrong format?)
   - Missing dependencies (what if skill not found?)
   - Boundary conditions (limits, thresholds)
2. Design Edge Case Tests (see the sketch after this list)
   - Create test scenarios for each edge case
   - Define expected behavior
   - Document pass criteria
3. Execute Edge Case Tests
   - Test with empty/minimal inputs
   - Test with maximum/excessive inputs
   - Test with invalid/malformed inputs
   - Test with missing dependencies
   - Test boundary conditions
4. Evaluate Handling
   - Does skill handle edge case gracefully?
   - Error messages clear and helpful?
   - No crashes or undefined behavior?
   - Appropriate fallbacks or defaults?
5. Document Edge Case Behavior
   - Which edge cases handled well?
   - Which edge cases cause issues?
   - Expected vs actual behavior
   - Recommendations for improvement
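Edge cases are easier to track when captured as data. The purely illustrative sketch below mirrors the categories listed in step 1, so the same list can drive both manual testing and any later automation:

```python
# Purely illustrative: edge cases captured as data, mirroring the categories
# identified in step 1 above.
EDGE_CASES = [
    {"case": "empty input",        "input": "no data",      "expect": "graceful empty-state handling"},
    {"case": "maximum input",      "input": "100+ items",   "expect": "completes with acceptable performance"},
    {"case": "invalid input",      "input": "wrong format", "expect": "clear, specific error message"},
    {"case": "missing dependency", "input": "skill absent", "expect": "helpful error, no crash"},
    {"case": "boundary condition", "input": "at threshold", "expect": "documented limit behavior"},
]

for case in EDGE_CASES:
    print(f"[{case['case']}] given {case['input']} -> expect {case['expect']}")
```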
Validation Checklist:
- [ ] Edge cases identified (minimum 3-5)
- [ ] Each edge case tested
- [ ] Error handling assessed
- [ ] No crashes or undefined behavior
- [ ] Error messages helpful (if applicable)
- [ ] Graceful degradation (if applicable)
- [ ] Edge case handling documented
- [ ] Critical edge cases handled correctly
Test Results:
- PASS: All critical edge cases handled correctly
- PARTIAL: Most edge cases handled, some issues
- FAIL: Critical edge cases cause errors or crashes
Outputs:
- Edge case test results
- Handling quality assessment
- Issues identified
- Recommendations for robustness
Time Estimate: 30-90 minutes
Example:
```
Edge Case Testing: todo-management
===================================
Edge Cases Identified:
- Empty task list (initialize with 0 tasks)
- Single task (minimal usage)
- 100+ tasks (maximum usage)
- Starting non-existent task
- Completing already completed task
Edge Case Tests:
Test 1: Empty Task List
- Scenario: Initialize with empty list
- Execution: todo-management Operation 1 with 0 tasks
- Result: ✅ PASS (handles gracefully, shows empty state)
- Error: None
Test 2: Single Task
- Scenario: List with 1 task only
- Execution: Complete workflow on 1 task
- Result: ✅ PASS (works correctly, minimal case handled)
Test 3: 100 Tasks
- Scenario: Large task list
- Execution: Report progress on 100-task list
- Result: ✅ PASS (handles large lists, performance acceptable)
- Note: Report generation ~5 seconds (good)
Test 4: Non-Existent Task
- Scenario: Start task #999 (doesn't exist)
- Execution: Operation 2 (Start Task 999)
- Result: ✅ PASS (clear error: "Task 999 not found")
- Error Handling: Excellent (specific error message)
Test 5: Double Complete
- Scenario: Complete task #5 twice
- Execution: Operation 3 twice on same task
- Result: ✅ PASS (second attempt shows "Already completed")
- Error Handling: Good (informative message)
Overall Edge Case Test: ✅ PASS
- All critical edge cases handled correctly
- Error messages clear and helpful
- No crashes or undefined behavior
- Graceful handling of unusual scenarios
Recommendation: Edge case handling excellent
```
---
Testing Modes
Comprehensive Testing Mode
Purpose: Complete functional validation across all 5 operations
When to Use:
- Pre-deployment (ensure everything works)
- Major updates (comprehensive regression testing)
- Quality certification (complete functional validation)
Process:
- Run all 5 testing operations
- Aggregate results
- Generate comprehensive test report
- Make deployment decision
Time Estimate: 2-4 hours
Output: Complete test report with PASS/FAIL for deployment
---
Quick Check Mode
Purpose: Fast functional validation (examples only)
When to Use:
- During development (continuous testing)
- Quick validation (examples work?)
- Pre-commit checks
Process:
- Run Operation 2 only (Example Validation)
- Automated execution of all examples
- Quick pass/fail report
Time Estimate: 15-30 minutes (automated)
Output: Example validation results
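For the pre-commit use case above, a small wrapper can run the quick check and block the commit on failure. This is a sketch only; it assumes validate-examples.py signals failure through a nonzero exit code, which is not documented here:

```python
# Hypothetical pre-commit wrapper; assumes validate-examples.py exits nonzero
# on failure (an assumption, not documented behavior).
import subprocess
import sys

result = subprocess.run(
    ["python3", "scripts/validate-examples.py", "."],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    print("Example validation failed; fix broken examples before committing.")
    sys.exit(1)
```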
---
Custom Testing Mode
Purpose: Select specific operations based on needs
When to Use:
- Targeted testing (only certain aspects)
- Time constraints (can't do comprehensive)
- Specific concerns (e.g., only integration testing)
Process:
- Select operations to run (1-5)
- Execute selected tests
- Generate targeted report
---
Best Practices
1. Test Early and Often
Practice: Run Quick Check during development, Comprehensive before deployment
Rationale: Early testing catches issues before they compound
Application: Quick Check daily, Comprehensive pre-deploy
2. Automate Example Validation
Practice: Use automated example validation (validate-examples.py)
Rationale: 80% automated, fast, catches broken examples instantly
Application: Run after any example changes
3. Test Real Scenarios
Practice: Use actual use cases for functional testing
Rationale: Real scenarios reveal issues documentation review misses
Application: Test scenarios from "When to Use" section
4. Regression Test After Updates
Practice: Always run regression tests after skill changes
Rationale: Prevents breaking existing functionality with improvements
Application: Before/after comparison for all updates
5. Document Test Results
Practice: Save test reports for comparison over time
Rationale: Track testing trends, identify patterns
Application: Generate test report for each comprehensive test
6. Fix Broken Examples Immediately
Practice: Don't deploy with broken examples
Rationale: Broken examples destroy user confidence
Application: Example validation must PASS before deploy
---
Common Mistakes
Mistake 1: Skipping Example Validation
Symptom: Users report broken examples after deployment
Cause: Not testing examples before release
Fix: Run Operation 2 (Example Validation) before every deployment
Prevention: Make example validation mandatory in deployment checklist
Mistake 2: Only Testing Happy Path
Symptom: Skills break with unusual inputs or edge cases
Cause: Not testing edge cases
Fix: Run Operation 5 (Edge Case Testing)
Prevention: Include edge case testing in comprehensive mode
Mistake 3: No Regression Testing
Symptom: Updates break previously working functionality
Cause: Not testing before/after updates
Fix: Run Operation 4 (Regression Testing) after changes
Prevention: Make regression testing mandatory for all updates
Mistake 4: Not Testing Integrations
Symptom: Workflow skills break when actually composing other skills
Cause: Testing skills individually, not integrated
Fix: Run Operation 3 (Integration Testing) for workflow skills
Prevention: Always test integrations for workflow/composition skills
Mistake 5: Manual Testing Only
Symptom: Testing takes too long, often skipped
Cause: Not using automation
Fix: Use validate-examples.py for automated example checking
Prevention: Automate where possible (examples, scripts, structure)
---
Quick Reference
The 5 Testing Operations
| Operation | Focus | Automation | Time | Pass Criteria |
|-----------|-------|------------|------|---------------|
| Functional | Core functionality works | 30% | 30-90m | Scenarios succeed |
| Example Validation | Examples execute correctly | 80% | 20-45m | β₯90% examples work |
| Integration | Skills work together | 20% | 30-90m | Integrations smooth |
| Regression | Updates don't break functionality | 60% | 30-60m | No new failures |
| Edge Case | Handles unusual scenarios | 30% | 30-90m | Critical edge cases handled |
Testing Modes
| Mode | Time | Operations | Use Case |
|------|------|------------|----------|
| Comprehensive | 2-4h | All 5 operations | Pre-deployment, certification |
| Quick Check | 15-30m | Example validation only | During development |
| Custom | Variable | Selected operations | Targeted testing |
Test Results
| Result | Meaning | Action |
|--------|---------|--------|
| PASS | All tests successful | Deploy with confidence |
| PARTIAL | Some issues, not critical | Fix issues, re-test, then deploy |
| FAIL | Critical issues | Fix before deployment |
Integration with review-multi
Use Both for Complete Validation:
```
review-multi (quality) + testing-validator (functionality) = Complete Validation
review-multi: Is it good? (structure, content, patterns, usability)
testing-validator: Does it work? (functional, examples, integration)
Together: Ready to deploy? (quality + functionality validated)
```
Automation Scripts
```bash
# Validate all examples automatically
python3 scripts/validate-examples.py /path/to/skill
# Run comprehensive test suite
python3 scripts/test-runner.py /path/to/skill --mode comprehensive
# Generate test report
python3 scripts/generate-test-report.py test-results.json --output report.md
```
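generate-test-report.py consumes a test-results.json file whose exact schema is not documented here; the sketch below only illustrates the kind of structure such a file might hold, with assumed field names:

```python
# Assumed shape only; the actual schema consumed by generate-test-report.py
# is not documented here.
import json

results = {
    "skill": "example-skill",
    "mode": "comprehensive",
    "operations": {
        "functional": {"result": "PASS", "scenarios_tested": 3},
        "example_validation": {"result": "PASS", "success_rate": 1.0},
        "integration": {"result": "N/A"},
        "regression": {"result": "PASS"},
        "edge_case": {"result": "PASS", "cases_tested": 5},
    },
}

with open("test-results.json", "w") as f:
    json.dump(results, f, indent=2)

# Then, per the commands above:
#   python3 scripts/generate-test-report.py test-results.json --output report.md
```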
For More Information
- Functional testing: references/functional-testing-guide.md
- Example validation: references/example-validation-guide.md
- Integration testing: references/integration-testing-guide.md
- Regression testing: references/regression-testing-guide.md
- Edge case testing: references/edge-case-testing-guide.md
- Test reports: references/test-report-template.md
---
testing-validator ensures skills work correctly through comprehensive functional testing, example validation, integration testing, regression testing, and edge case validation.