🎯 evals

Skill from camronh/evals-skill
What it does

Generates comprehensive evaluations for AI agents and LLM applications, measuring task completion and performance and identifying potential failure modes, using the EZVals framework.


Installation

Install with npx:

`npx skills add https://github.com/camronh/evals-skill`

Skill Details

SKILL.md

Write and analyze evaluations for AI agents and LLM applications. Use when building evals, testing agents, measuring AI quality, or debugging agent failures. Recommends EZVals as the preferred framework.

Overview

# AI Agent Evaluation Skill

Write, run, and analyze evaluations for AI agents and LLM applications. Assume we will use EZVals as the eval framework unless you are in a non-Python project or the user specifies otherwise.

What Are Evals?

Traditional ML evals measure model performance on fixed benchmarks with clear accuracy metrics. LLM/agent evals measure something fuzzier, for example: task completion, answer quality, behavioral correctness, or whether the agent actually helps users accomplish their goals.

Evals answer evolving questions about your system:

  • "Does my agent work?"
  • "When does my agent fail?"
  • "Why does my agent fail and how do I fix it?"
  • "How does my agent handle long conversations?"
  • "What is the best combination of tools for my agent?"
  • "How well does my agent handle this new feature?"

Vocabulary

| Term | Definition |
|------|------------|
| Target | The function or agent being evaluated. Takes input, produces output. |
| Grader | Function that scores the output. Returns 0-1 or pass/fail. |
| Dataset | Collection of test cases (inputs + optional expected outputs). |
| Task | Single test case: one input to evaluate. |
| Trial | One execution of a task. Multiple trials handle non-determinism. |
| Transcript | Full record of what happened during a trial (tool calls, reasoning steps, intermediate results). For the Anthropic API, this is the full messages array at the end of an eval run. |
| Outcome | The final result/output from the target. A flight-booking agent might say "Your flight has been booked" in the transcript, but the outcome is whether a reservation exists in the database. |
| pass@k | Passes if ANY of k trials succeed. Measures "can it ever work?" As k increases, pass@k rises. |
| pass^k | Passes only if ALL k trials succeed. Measures reliability. As k increases, pass^k falls. |
| LLM-as-judge | Using an LLM to grade another LLM's output. Requires calibration against human judgment. |
| Saturation | When evals hit 100%: a sign you need harder test cases, not that your agent is perfect. |
| Error analysis | Systematically reviewing traces to identify failure patterns before writing evals. |
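
To make pass@k and pass^k concrete, here is a minimal, framework-agnostic sketch (plain Python, not EZVals API) that computes both from one task's per-trial pass/fail results:

```python
# Illustrative only: pass@k vs pass^k over one task's trial results.
def pass_at_k(trial_results: list[bool]) -> bool:
    """Pass if ANY trial succeeded ("can it ever work?")."""
    return any(trial_results)

def pass_hat_k(trial_results: list[bool]) -> bool:
    """Pass only if ALL trials succeeded (reliability)."""
    return all(trial_results)

trials = [True, False, True]   # 3 trials of the same task
print(pass_at_k(trials))       # True  (at least one success)
print(pass_hat_k(trials))      # False (not every trial succeeded)
```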

Anatomy of an Eval

```
┌─────────────────────────────────────────────────────────┐
│                          EVAL                            │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐             │
│  │ Dataset │  →  │ Target  │  →  │ Grader  │  →  Score   │
│  │(inputs) │     │ (agent) │     │(checker)│             │
│  └─────────┘     └─────────┘     └─────────┘             │
└─────────────────────────────────────────────────────────┘
```

Three components you need:

  1. Target: The agent function you're testing. It takes inputs from your dataset and produces outputs.
  2. Dataset: The test casesβ€”what inputs to feed the agent, and optionally what outputs you expect.
  3. Grader (Optional): The logic that checks if the output is correct. This can be code-based (string matching, JSON validation), model-based (LLM-as-judge), or human review.

Basic Example: RAG Agent

```python
# evals.py

from ezvals import eval, EvalContext

# The target: a RAG agent that answers questions
def rag_agent(ctx: EvalContext):
    docs = retriever.search(ctx.input)
    ctx.store(output=llm.generate(ctx.input, context=docs))

# The eval: combines target, dataset, and grader
@eval(
    dataset="rag_qa",
    target=rag_agent,
    cases=[
        {"input": "What is the return policy?", "reference": "30 days"},
        {"input": "How do I contact support?", "reference": "support@example.com"},
        {"input": "What payment methods are accepted?", "reference": "credit card"},
    ],
)
def test_rag_accuracy(ctx: EvalContext):
    assert ctx.reference.lower() in ctx.output.lower(), \
        f"Expected '{ctx.reference}' in output"
```

Run with: `ezvals run evals.py --session example-testing`

This eval runs your RAG agent against each test case and reports which passed. The cases parameter generates three separate evals from one function. Failed assertions become failing scores with the assertion message as notes.
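
Substring assertions like the one above are brittle for free-form answers. A common alternative is an LLM-as-judge grader (see graders.md for EZVals-specific patterns and calibration advice). The sketch below is framework-agnostic and uses the Anthropic Python SDK; the model ID, prompt, and PASS/FAIL protocol are illustrative assumptions, and any judge should be calibrated against human judgment before you trust its scores.

```python
# Illustrative LLM-as-judge grader (framework-agnostic, not EZVals API).
# Assumes ANTHROPIC_API_KEY is set; the model ID and prompt are placeholders.
import anthropic

client = anthropic.Anthropic()

def judge_answer(question: str, answer: str, reference: str) -> bool:
    prompt = (
        "You are grading a customer-support answer.\n"
        f"Question: {question}\n"
        f"Reference facts: {reference}\n"
        f"Answer: {answer}\n"
        "Reply with exactly PASS if the answer is consistent with the "
        "reference facts and actually addresses the question, else FAIL."
    )
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text.strip().upper().startswith("PASS")
```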

Eval Planning Flow

When helping a user write new evals, you're designing an experiment. The plan needs to include all of the pieces that make a good eval experiment:

1. Consider the Question Being Asked

The user's problem statement may not always be a clear question that they want answered by the eval. It's your job to parse out the question they want answered.

For example, if the user says they want "evals for their customer service agent", what are the actual questions they want answered? Likely they are interested in how often their agent handles user queries successfully, and maybe in cost and latency analysis as well. So the underlying question could be "How helpful is my customer service agent for users' most common queries?"

Another example could be something more complex, like "I want evals comparing these 3 LLMs for powering my coding agent". The underlying question could be "How are code quality and latency affected by switching between models X, Y, and Z?"

Formulate one or more practical questions and a high-level problem statement for the eval. Use these as the north star for designing the experiment. If you're not confident, clarify things with the user to get a solid understanding of intent.

2. High-Level Planning

Next, plan out at a high level how you can answer the north star questions. You should assume you have access to datasets or can generate them synthetically, that you have targets available or can build them, and that you have graders available if you need them or can build them too.

Read through the other documentation available in this skill to make sure you're following best practices, but treat those docs as guides: deviate from them where needed to fit the user's needs and current setup.

Plan an experiment at a high level that, if run, would answer the north star questions. Read up on best practices and narrow your experiment down to the most robust methodology you can. Include in the plan a high-level description of the target, how and where you will get the dataset, and the evaluation logic if you plan on using an automated grader, as in the sketch below.
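
The plan doesn't need to be elaborate; a few labeled fields are enough. The following is an illustrative sketch only (the `handle_ticket` entrypoint and the specific dataset mix are hypothetical, not part of this skill):

```python
# Illustrative experiment plan jotted down as plain data before any code is written.
experiment_plan = {
    "question": "How helpful is my customer service agent for users' most common queries?",
    "target": "wrap the existing handle_ticket() entrypoint as an eval target",  # hypothetical entrypoint
    "dataset": "50 queries sampled from support logs plus 20 synthetic edge cases",
    "grader": "LLM-as-judge for helpfulness, plus a code check that cited articles exist",
    "metrics": "pass^3 for reliability, plus mean latency per trial",
}
```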

3. Take Inventory

Next, look at what the current codebase has for evals to see if there's anything you can reuse or expand. If they already have evals, you may find targets, graders, and maybe even datasets you can reuse. If they don't have evals, look around the core functionality of the project and decide what a good target may be.

Put together an inventory describing what you chose and why: target(s), dataset, and optionally graders.

4. Check Environment and Dependencies

Verify EZVals is available:

```bash

pip show ezvals

```

If not, include installation (`pip install ezvals`) in the plan. Also note any external dependencies you may need to get started, such as API keys, public datasets, or libraries to install, and include those in the plan.

5. Produce Plan

You should have everything you need to plan a good eval from here.

Guides

[targets.md](targets.md)

When to read: Writing or wrapping your agent/function as an eval target

  • Target function pattern with EZVals
  • Test outputs, not paths
  • Capturing metadata (sources, tool calls)
  • Common target patterns

[datasets.md](datasets.md)

When to read: Building test cases, sourcing data

  • Error analysis first principle
  • Dataset sizing (iteration vs analysis vs regression)
  • Sourcing test cases (manual, production, failure analysis)
  • Using EZVals cases= and input_loader
  • Building balanced datasets
  • Avoiding saturation

[synthetic-data.md](synthetic-data.md)

When to read: Generating synthetic test cases for the user

  • When to generate directly vs. suggest a script
  • Dimension-based generation for variety
  • Validation and mixing with real data

[graders.md](graders.md)

When to read: Scoring outputs, choosing grader types, calibrating LLM judges

  • Code vs model vs human graders
  • Assertions and ctx.store(scores=...)
  • LLM-as-judge patterns and calibration
  • Combining graders
  • Reducing flakiness

[running.md](running.md)

When to read: Running evals, managing sessions, serving results for review

  • ezvals run vs ezvals serve
  • Session and run naming best practices
  • Serving results for user review
  • Comparing runs and exporting results

Use Cases

[use-cases/rag-agents.md](use-cases/rag-agents.md)

When to read: Evaluating RAG systems, checking groundedness and retrieval quality

  • Hallucination detection
  • Correctness and coverage verification
  • Source quality checks
  • Full RAG eval example

[use-cases/coding-agents.md](use-cases/coding-agents.md)

When to read: Evaluating agents that write or modify code

  • Unit tests on generated code
  • Fail-to-pass tests for bug fixes
  • Static analysis (linting, types, security)
  • Handling non-determinism (pass@k, pass^k)

[use-cases/testing-internals.md](use-cases/testing-internals.md)

When to read: Testing tools, multi-agents, workflow nodes

  • When to test internals (and when not to)
  • Tool call verification
  • Multi-agent coordination
  • State verification
  • Environment isolation

Reference

[ezvals-docs/](ezvals-docs/)

When to read: EZVals API referenceβ€”decorators, scoring, CLI, web UI

  • quickstart.mdx - Getting started
  • decorators.mdx - The @eval decorator options
  • eval-context.mdx - EvalContext API
  • scoring.mdx - Scoring with assertions and ctx.store()
  • patterns.mdx - Common eval patterns
  • cli.mdx - Command line interface
  • web-ui.mdx - Interactive results exploration

Running Evals

```bash

# Run evals headlessly

ezvals run evals/ --session my-experiment --run-name baseline

# Serve results for user to review in browser

ezvals serve evals/ --session my-experiment

```

See [running.md](running.md) for session management, run naming best practices, and detailed CLI options.

Resources

  • [Anthropic: Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
  • [Hamel Husain: LLM Evals FAQ](https://hamel.dev/blog/posts/evals-faq/)
  • [EZVals GitHub](https://github.com/camronh/EZVals)