🎯

validate-skill

🎯Skill

from plaited/agent-eval-harness

What it does

Validates reference solutions for prompts to ensure their correctness and integrity before running agent evaluation trials.

validate-skill

Installation

Install skill:

npx skills add https://github.com/plaited/agent-eval-harness --skill validate-skill

Last UpdatedJan 26, 2026

View on GitHub Back to Skills

Skill Details

SKILL.md

Overview

# @plaited/agent-eval-harness

[![npm version](https://img.shields.io/npm/v/@plaited/agent-eval-harness.svg)](https://www.npmjs.com/package/@plaited/agent-eval-harness)

[![CI](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml)

[![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)

CLI tool for capturing agent trajectories from headless CLI agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.

CLI Tool

Use these tools directly via the CLI without installation:

```bash

# Using built-in headless adapter (recommended - no extra install needed)

export ANTHROPIC_API_KEY=sk-...

bunx @plaited/agent-eval-harness capture prompts.jsonl \

--schema ./schemas/claude-headless.json \

-o results.jsonl

```

Prerequisite: Set your API key. The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it:

```bash

export ANTHROPIC_API_KEY=sk-... # For Claude

export GEMINI_API_KEY=... # For Gemini

```

Pre-built schemas are available in .plaited/skills/headless-adapters/schemas/ for Claude and Gemini.

Core Commands

| Command | Description |

|---------|-------------|

| capture --schema | Trajectory capture (full JSONL) |

| trials --schema | Multi-run with pass@k metrics |

| summarize | Derive compact views from results |

| calibrate | Sample failures for review |

| validate-refs | Check reference solutions |

| balance | Analyze test set coverage |

| schemas [name] | Export JSON schemas |

| headless --schema | Schema-driven adapter for any CLI agent |

Pipeline Commands (Unix-style)

| Command | Description |

|---------|-------------|

| run --schema | Execute prompts, output raw results |

| extract --schema | Parse raw output into trajectories |

| grade --grader | Apply grader to extracted results |

| format --style