🎯 evals

Skill from camronh/evals-skill
What it does

Generates comprehensive evaluations for AI agents and LLM applications, measuring task completion and performance and identifying potential failure modes, using the EZVals framework.


Installation

Install with npx:

`npx skills add https://github.com/camronh/evals-skill`

Skill Details

SKILL.md

Write and analyze evaluations for AI agents and LLM applications. Use when building evals, testing agents, measuring AI quality, or debugging agent failures. Recommends EZVals as the preferred framework.

Overview

# AI Agent Evaluation Skill

Write, run, and analyze evaluations for AI agents and LLM applications. Assume we will use EZVals as the eval framework unless you are in a non-Python project or the user specifies otherwise.

What Are Evals?

Traditional ML evals measure model performance on fixed benchmarks with clear accuracy metrics. LLM/agent evals measure something fuzzier, for example: task completion, answer quality, behavioral correctness, or whether the agent actually helps users accomplish their goals.

Evals answer evolving questions about your system:

  • "Does my agent work?"
  • "When does my agent fail?"
  • "Why does my agent fail and how do I fix it?"
  • "How does my agent handle long conversations?"
  • "What is the best combination of tools for my agent?"
  • "How well does my agent handle this new feature?"

Vocabulary

| Term | Definition |
|------|------------|
| Target | The function or agent being evaluated. Takes input, produces output. |
| Grader | Function that scores the output. Returns 0-1 or pass/fail. |
| Dataset | Collection of test cases (inputs + optional expected outputs). |
| Task | Single test case: one input to evaluate. |
| Trial | One execution of a task. Multiple trials handle non-determinism. |
| Transcript | Full record of what happened during a trial (tool calls, reasoning steps, intermediate results). For the Anthropic API, this is the full messages array at the end of an eval run. |
| Outcome | The final result/output from the target. A flight-booking agent might say "Your flight has been booked" in the transcript, but the outcome is whether a reservation exists in the database. |
| pass@k | Passes if ANY of k trials succeed. Measures "can it ever work?" As k increases, pass@k rises. |
| pass^k | Passes only if ALL k trials succeed. Measures reliability. As k increases, pass^k falls. |
| LLM-as-judge | Using an LLM to grade another LLM's output. Requires calibration against human judgment. |
| Saturation | When evals hit 100%: a sign you need harder test cases, not that your agent is perfect. |
| Error analysis | Systematically reviewing traces to identify failure patterns before writing evals. |
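
To make pass@k and pass^k concrete, here is a minimal, framework-agnostic sketch (plain Python, not EZVals API) that computes both from one task's per-trial pass/fail results:

```python
# Illustrative only: pass@k vs pass^k over one task's trial results.
def pass_at_k(trial_results: list[bool]) -> bool:
    """Pass if ANY trial succeeded ("can it ever work?")."""
    return any(trial_results)

def pass_hat_k(trial_results: list[bool]) -> bool:
    """Pass only if ALL trials succeeded (reliability)."""
    return all(trial_results)

trials = [True, False, True]   # 3 trials of the same task
print(pass_at_k(trials))       # True  (at least one success)
print(pass_hat_k(trials))      # False (not every trial succeeded)
```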

Anatomy of an Eval

```
┌─────────────────────────────────────────────────────────┐
│                          EVAL                            │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐             │
│  │ Dataset │  →  │ Target  │  →  │ Grader  │  →  Score   │
│  │(inputs) │     │ (agent) │     │(checker)│             │
│  └─────────┘     └─────────┘     └─────────┘             │
└─────────────────────────────────────────────────────────┘
```

Three components you need:

  1. Target: The agent function you're testing. It takes inputs from your dataset and produces outputs.
  2. Dataset: The test casesβ€”what inputs to feed the agent, and optionally what outputs you expect.
  3. Grader (Optional): The logic that checks if the output is correct. This can be code-based (string matching, JSON validation), model-based (LLM-as-judge), or human review.

Basic Example: RAG Agent

```python
# evals.py

from ezvals import eval, EvalContext

# The target: a RAG agent that answers questions
def rag_agent(ctx: EvalContext):
    docs = retriever.search(ctx.input)
    ctx.store(output=llm.generate(ctx.input, context=docs))

# The eval: combines target, dataset, and grader
@eval(
    dataset="rag_qa",
    target=rag_agent,
    cases=[
        {"input": "What is the return policy?", "reference": "30 days"},
        {"input": "How do I contact support?", "reference": "support@example.com"},
        {"input": "What payment methods are accepted?", "reference": "credit card"},
    ],
)
def test_rag_accuracy(ctx: EvalContext):
    assert ctx.reference.lower() in ctx.output.lower(), \
        f"Expected '{ctx.reference}' in output"
```

Run with: `ezvals run evals.py --session example-testing`

This eval runs your RAG agent against each test case and reports which passed. The cases parameter generates three separate evals from one function. Failed assertions become failing scores with the assertion message as notes.
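
Substring assertions like the one above are brittle for free-form answers. A common alternative is an LLM-as-judge grader (see graders.md for EZVals-specific patterns and calibration advice). The sketch below is framework-agnostic and uses the Anthropic Python SDK; the model ID, prompt, and PASS/FAIL protocol are illustrative assumptions, and any judge should be calibrated against human judgment before you trust its scores.

```python
# Illustrative LLM-as-judge grader (framework-agnostic, not EZVals API).
# Assumes ANTHROPIC_API_KEY is set; the model ID and prompt are placeholders.
import anthropic

client = anthropic.Anthropic()

def judge_answer(question: str, answer: str, reference: str) -> bool:
    prompt = (
        "You are grading a customer-support answer.\n"
        f"Question: {question}\n"
        f"Reference facts: {reference}\n"
        f"Answer: {answer}\n"
        "Reply with exactly PASS if the answer is consistent with the "
        "reference facts and actually addresses the question, else FAIL."
    )
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip().upper().startswith("PASS")
```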

Eval Planning Flow

When helping a user write new evals, you're designing an experiment. The plan needs to include all of the pieces that make a good eval experiment:

1. Consider the Question Being Asked

The user's problem statement may not always be a clear question that they want answered by the eval. It's your job to parse out the question they want answered.

For example, if the user says they want "evals for their customer service agent", what are the actual questions they want answered? Likely they are interested in how often their agent handles user queries successfully, and maybe in cost and latency analysis as well. So the underlying question could be "How helpful is my customer service agent for users' most common queries?"

Another example could be something more complex, like "I want evals comparing these 3 LLMs for powering my coding agent". The underlying question could be "How are code quality and latency affected by switching between models X, Y, and Z?"

Formulate one or more practical questions and a high-level problem statement for the eval. Use these as the north star for designing the experiment. If you're not confident, clarify things with the user to get a solid understanding of intent.

2. High-Level Planning

Next, plan out at a high level how you can answer the north star questions. You should assume you have access to datasets or can generate them synthetically, that you have targets available or can build them, and that you have graders available if you need them or can build them too.

Read through the other documentation available in this skill to make sure you're following best practices, but treat those docs as guides: deviate from them where needed to fit the user's needs and current setup.

Plan an experiment at a high level that, if run, would answer the north star questions. Read up on best practices and narrow your experiment down to the most robust methodology you can. Include in the plan a high-level description of the target, how and where you will get the dataset, and the evaluation logic if you plan on using an automated grader, as in the sketch below.
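
The plan doesn't need to be elaborate; a few labeled fields are enough. The following is an illustrative sketch only (the `handle_ticket` entrypoint and the specific dataset mix are hypothetical, not part of this skill):

```python
# Illustrative experiment plan jotted down as plain data before any code is written.
experiment_plan = {
    "question": "How helpful is my customer service agent for users' most common queries?",
    "target": "wrap the existing handle_ticket() entrypoint as an eval target",  # hypothetical entrypoint
    "dataset": "50 queries sampled from support logs plus 20 synthetic edge cases",
    "grader": "LLM-as-judge for helpfulness, plus a code check that cited articles exist",
    "metrics": "pass^3 for reliability, plus mean latency per trial",
}
```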

3. Take Inventory

Next, look at what the current codebase has for evals to see if there's anything you can reuse or expand. If they already have evals, you may find targets, graders, and maybe even datasets you can reuse. If they don't have evals, look around the core functionality of the project and decide what a good target may be.

Put together an inventory describing what you chose and why: target(s), dataset, and optionally graders.

4. Check Environment and Dependencies

Verify EZVals is available:

```bash

pip show ezvals

```

If not, include installation (`pip install ezvals`) in the plan. Also note any external dependencies you may need to get started, such as API keys, public datasets, or libraries to install, and include those in the plan.

5. Produce Plan

You should have everything you need to plan a good eval from here.

Guides

[targets.md](targets.md)

When to read: Writing or wrapping your agent/function as an eval target

  • Target function pattern with EZVals
  • Test outputs, not paths
  • Capturing metadata (sources, tool calls)
  • Common target patterns

[datasets.md](datasets.md)

When to read: Building test cases, sourcing data

  • Error analysis first principle
  • Dataset sizing (iteration vs analysis vs regression)
  • Sourcing test cases (manual, production, failure analysis)
  • Using EZVals cases= and input_loader
  • Building balanced datasets
  • Avoiding saturation

[synthetic-data.md](synthetic-data.md)

When to read: Generating synthetic test cases for the user

  • When to generate directly vs. suggest a script
  • Dimension-based generation for variety
  • Validation and mixing with real data

[graders.md](graders.md)

When to read: Scoring outputs, choosing grader types, calibrating LLM judges

  • Code vs model vs human graders
  • Assertions and ctx.store(scores=...)
  • LLM-as-judge patterns and calibration
  • Combining graders
  • Reducing flakiness

[running.md](running.md)

When to read: Running evals, managing sessions, serving results for review

  • ezvals run vs ezvals serve
  • Session and run naming best practices
  • Serving results for user review
  • Comparing runs and exporting results

Use Cases

[use-cases/rag-agents.md](use-cases/rag-agents.md)

When to read: Evaluating RAG systems, checking groundedness and retrieval quality

  • Hallucination detection
  • Correctness and coverage verification
  • Source quality checks
  • Full RAG eval example

[use-cases/coding-agents.md](use-cases/coding-agents.md)

When to read: Evaluating agents that write or modify code

  • Unit tests on generated code
  • Fail-to-pass tests for bug fixes
  • Static analysis (linting, types, security)
  • Handling non-determinism (pass@k, pass^k)

[use-cases/testing-internals.md](use-cases/testing-internals.md)

When to read: Testing tools, multi-agents, workflow nodes

  • When to test internals (and when not to)
  • Tool call verification
  • Multi-agent coordination
  • State verification
  • Environment isolation

Reference

[ezvals-docs/](ezvals-docs/)

When to read: EZVals API referenceβ€”decorators, scoring, CLI, web UI

  • quickstart.mdx - Getting started
  • decorators.mdx - The @eval decorator options
  • eval-context.mdx - EvalContext API
  • scoring.mdx - Scoring with assertions and ctx.store()
  • patterns.mdx - Common eval patterns
  • cli.mdx - Command line interface
  • web-ui.mdx - Interactive results exploration

Running Evals

```bash

# Run evals headlessly

ezvals run evals/ --session my-experiment --run-name baseline

# Serve results for user to review in browser

ezvals serve evals/ --session my-experiment

```

See [running.md](running.md) for session management, run naming best practices, and detailed CLI options.

Resources

  • [Anthropic: Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
  • [Hamel Husain: LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/)
  • [EZVals GitHub](https://github.com/camronh/EZVals)