Use these tools directly via the CLI without installation:
```bash
# Using built-in headless adapter (recommended - no extra install needed)
export ANTHROPIC_API_KEY=sk-...
bunx @plaited/agent-eval-harness capture prompts.jsonl \
--schema ./schemas/claude-headless.json \
-o results.jsonl
```
Prerequisite: Set your API key. The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it:
```bash
export ANTHROPIC_API_KEY=sk-... # For Claude
export GEMINI_API_KEY=... # For Gemini
```
Pre-built schemas are available in .plaited/skills/headless-adapters/schemas/ for Claude and Gemini.
Core Commands
| Command | Description |
|---------|-------------|
| capture --schema | Trajectory capture (full JSONL) |
| trials --schema | Multi-run with pass@k metrics |
| summarize | Derive compact views from results |
| calibrate | Sample failures for review |
| validate-refs | Check reference solutions |
| balance | Analyze test set coverage |
| schemas [name] | Export JSON schemas |
| headless --schema | Schema-driven adapter for any CLI agent |
Pipeline Commands (Unix-style)
| Command | Description |
|---------|-------------|
| run --schema | Execute prompts, output raw results |
| extract --schema | Parse raw output into trajectories |
| grade --grader | Apply grader to extracted results |
| format --style | Convert to markdown, csv, or jsonl |
| compare ... | Compare runs (aggregate report) |
Examples
```bash
# Capture trajectories using headless adapter (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl \
--schema ./schemas/claude-headless.json \
-o results.jsonl
# Run trials for pass@k analysis with debug mode
bunx @plaited/agent-eval-harness trials prompts.jsonl \
--schema ./schemas/claude-headless.json \
-k 5 --grader ./grader.ts --debug
# Summarize results
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl
# Export schemas
bunx @plaited/agent-eval-harness schemas CaptureResult --json
# Pipeline workflow (Unix-style composition)
cat prompts.jsonl | \
bunx @plaited/agent-eval-harness run -s ./schemas/claude-headless.json | \
bunx @plaited/agent-eval-harness extract -s ./schemas/claude-headless.json | \
bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
bunx @plaited/agent-eval-harness format -f markdown > report.md
# Compare runs (built-in strategies: weighted, statistical, custom)
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json
```