# Model Evaluation Benchmark Skill

**Purpose**: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.

**Auto-activates when**: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

## Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure four dimensions (a per-run record sketch follows the list):

- **Efficiency**: Duration, turns, cost, tool calls
- **Quality**: Code quality scores via reviewer agents
- **Workflow Adherence**: Subagent calls, skills used, workflow step compliance
- **Artifacts**: GitHub issues, PRs, documentation generated
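
For illustration only, a per-run record covering these four dimensions might look like the sketch below. The field names are hypothetical; they do not reflect the actual `result.json` schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkRun:
    """Hypothetical per-run record; field names are illustrative, not the real schema."""
    model: str                  # e.g. "opus" or "sonnet"
    task_id: int
    # Efficiency
    duration_s: float
    turns: int
    cost_usd: float
    tool_calls: int
    # Quality: 1-5 score assigned by a reviewer agent
    code_quality: int
    # Workflow adherence
    subagents_called: list[str] = field(default_factory=list)
    skills_used: list[str] = field(default_factory=list)
    # Artifacts
    issue_urls: list[str] = field(default_factory=list)
    pr_urls: list[str] = field(default_factory=list)
```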

The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.

## When to Use

βœ… Use when:

  • Comparing AI models (Opus vs Sonnet, etc.)
  • Measuring workflow adherence
  • Generating comprehensive benchmark reports
  • Need reproducible benchmarking

❌ **Don't use when**:

- Simple code reviews (use `reviewer`)
- Performance profiling (use `optimizer`)
- Architecture decisions (use `architect`)

## Execution Instructions

When this skill is invoked, follow these steps:

### Phase 1: Setup

1. Read `tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md` (a task-discovery sketch follows this list)
2. Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
3. Create a TodoWrite list with all phases
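
A minimal task-discovery sketch, assuming each task in `BENCHMARK_TASKS.md` is introduced by a `## Task N: ...` heading (the actual file layout may differ):

```python
import re
from pathlib import Path

# Assumption: tasks are listed under "## Task N: ..." headings.
tasks_md = Path("tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md").read_text()
task_ids = [int(n) for n in re.findall(r"^## Task (\d+):", tasks_md, flags=re.M)]
print(f"Tasks to benchmark: {task_ids}")  # e.g. [1, 2, 3, 4]
```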

### Phase 2: Execute Benchmarks

For each task Γ— model:

```bash
cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
```
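
A minimal driver for the task × model matrix might look like the following sketch. Only the command line above is documented; the per-model loop, model list, and error handling are assumptions.

```python
import subprocess

MODELS = ["opus", "sonnet"]  # defaults from Phase 1; adjust as needed
TASKS = "1,2,3,4"

for model in MODELS:
    # One runner invocation per model, covering all tasks (mirrors the command above).
    subprocess.run(
        ["python", "run_benchmarks.py", "--model", model, "--tasks", TASKS],
        cwd="tests/benchmarks/benchmark_suite_v3",
        check=True,  # stop the sweep if a run fails
    )
```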

### Phase 3: Analyze Results

1. Read all result files: `~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json` (a loading sketch follows this list)
2. Launch parallel Task tool calls with `subagent_type="reviewer"` to:
   - Analyze trace logs for tool/agent/skill usage
   - Score code quality (1-5 scale)
3. Synthesize findings
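
Step 1 can be scripted roughly as below. Only the glob path comes from this skill; the JSON keys (`model`, `cost_usd`) are hypothetical.

```python
import json
from pathlib import Path

suite_dir = Path.home() / ".amplihack/.claude/runtime/benchmarks/suite_v3"
runs = [json.loads(p.read_text()) for p in suite_dir.glob("*/result.json")]

# Hypothetical keys: total up cost per model for a quick cross-model comparison.
cost_by_model: dict[str, float] = {}
for run in runs:
    model = run.get("model", "unknown")
    cost_by_model[model] = cost_by_model.get(model, 0.0) + run.get("cost_usd", 0.0)
print(cost_by_model)
```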

### Phase 4: Generate Report

1. Create a markdown report following the `BENCHMARK_REPORT_V3.md` structure
2. Create a GitHub issue with the report (a `gh` sketch follows this list)
3. Archive artifacts to a GitHub release
4. Update the issue with the release link
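
Steps 2-3 could be scripted against the `gh` CLI as in this sketch; the issue title, report file, and artifact bundle name are placeholders, not the skill's actual values.

```python
import subprocess

# Step 2: file the report as a GitHub issue (title and body file are placeholders).
subprocess.run(
    ["gh", "issue", "create",
     "--title", "Benchmark Suite V3 Report",
     "--body-file", "BENCHMARK_REPORT.md"],
    check=True,
)

# Step 3: archive artifacts under the release tag used in the example URL below.
subprocess.run(
    ["gh", "release", "create", "benchmark-suite-v3-artifacts",
     "artifacts.tar.gz",  # placeholder artifact bundle
     "--title", "Benchmark Suite V3 Artifacts",
     "--notes", "Trace logs and result files from the benchmark run."],
    check=True,
)
```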

### Phase 5: Cleanup (MANDATORY)

1. Close all benchmark PRs: `gh pr close {numbers}`
2. Close all benchmark issues: `gh issue close {numbers}`
3. Remove worktrees: `git worktree remove worktrees/bench-*`
4. Verify cleanup is complete (see the verification sketch below)

See `tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md` for detailed cleanup instructions.
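
Step 4 can be automated with a check like this sketch, which fails loudly if any `bench-*` worktree survived:

```python
import subprocess

# List all worktrees and flag any benchmark leftovers.
out = subprocess.run(
    ["git", "worktree", "list", "--porcelain"],
    capture_output=True, text=True, check=True,
).stdout
leftovers = [line for line in out.splitlines()
             if line.startswith("worktree ") and "/bench-" in line]
assert not leftovers, f"Cleanup incomplete, worktrees remain: {leftovers}"
print("Cleanup verified: no bench-* worktrees remain.")
```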

## Example Usage

```
User: "Run model evaluation benchmark"

Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts
```

## References

- Reference Report: `tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md`
- Task Definitions: `tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md`
- Cleanup Guide: `tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md`
- Runner Script: `tests/benchmarks/benchmark_suite_v3/run_benchmarks.py`

---

**Last Updated**: 2025-11-26

**Reference Implementation**: Benchmark Suite V3

**GitHub Issue Example**: #1698