# eval-recipes Runner Skill

## Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

## When to Use

- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores

## Capabilities

I can run eval-recipes benchmarks to:

  1. Test specific amplihack branches
  2. Compare against baseline agents (codex, claude_code)
  3. Run specific tasks (linkedin_drafting, email_drafting, etc.)
  4. Compare before/after scores for PRs
  5. Generate reports with score improvements

## How It Works

### Setup (One-Time)

```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```

### Running Benchmarks

Test a specific branch:

```bash
# Update install.dockerfile to use the specific branch,
# then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```

Compare before/after:

```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to check out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores (see the sketch below)
```
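
To make the comparison concrete, here is a minimal sketch that prints both agents' scores side by side, assuming each run writes a score.txt under .benchmark_results/ as noted in the Notes and Automation sections (the exact directory layout is an assumption):

```bash
# Print the most recent score for each agent config
for agent in amplihack amplihack_pr1443; do
  echo -n "$agent: "
  cat .benchmark_results/*/"$agent"/*/score.txt 2>/dev/null | tail -n 1
done
```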

## Available Tasks

Common tasks from eval-recipes:

- linkedin_drafting - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- email_drafting - Create CLI tool for emails (scored 26/100 before)
- arxiv_paper_summarizer - Research tool
- github_docs_extractor - Documentation tool
- Many more in ~/eval-recipes/data/tasks/
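
Each subdirectory under data/tasks/ corresponds to a task name, so listing the directory shows the full set (assuming the clone location from Setup):

```bash
# Enumerate all runnable task names
ls ~/eval-recipes/data/tasks/
```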

## Typical Workflow

When the user says "test this change with eval-recipes":

  1. Identify the branch/PR to test
  2. Update agent config to use that branch:

```dockerfile
# In .claude/agents/eval-recipes/amplihack/install.dockerfile
RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .
```

  3. Copy to eval-recipes:

```bash
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
```

  4. Run benchmark:

```bash
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
```

  5. Report scores and compare with baseline

## Expected Scores

Baseline (main branch):

- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100

With PR #1443 (task classification):

- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates actual tool)
- Email: 45/100 (consistent execution)

## Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

  1. Update install.dockerfile to checkout feat/issue-1435-task-classification
  2. Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
  3. Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
  4. Report results: "Score: 35.2/100 (up from 6.5 baseline)"

## Prerequisites

- eval-recipes cloned to ~/eval-recipes
- API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
- Docker installed (for containerized runs)
- uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
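
A quick preflight check covering the items above (a minimal sketch; it only tests the paths and variables listed here):

```bash
# Verify prerequisites before starting a benchmark run
[ -d ~/eval-recipes ] || echo "missing: clone eval-recipes to ~/eval-recipes"
[ -n "$ANTHROPIC_API_KEY" ] || echo "missing: export ANTHROPIC_API_KEY"
command -v docker >/dev/null 2>&1 || echo "missing: docker"
command -v uv >/dev/null 2>&1 || echo "missing: uv"
```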

## Notes

- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results saved to .benchmark_results/ in eval-recipes repo
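
Since multiple trials give more reliable averages, averaging the saved scores is the usual last step; a sketch assuming each score.txt holds a single numeric score (the file format is an assumption):

```bash
# Average all trial scores for the amplihack agent
cat .benchmark_results/*/amplihack/*/score.txt |
  awk '{ sum += $1; n++ } END { if (n) printf "avg: %.1f over %d trials\n", sum / n, n }'
```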

## Automation

For fully autonomous testing:

```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
done

# Compare results (layout under .benchmark_results/ may vary)
cat .benchmark_results/*/amplihack/*/score.txt
```
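
The loop extends naturally to running the baseline and a PR agent config back to back; this sketch assumes an amplihack_pr1443-style config has already been copied into data/agents/ as in the Typical Workflow:

```bash
# Run every task against both the baseline and the PR agent config
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for agent in amplihack amplihack_pr1443; do
  for task in $tasks; do
    uv run eval_recipes/main.py --agent "$agent" --task "$task" --trials 3
  done
done
```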