hamelsmu

hamelsmu/evals-skills

7 resources in this repository


eval-audit (Skill)

Audits an LLM evaluation pipeline, reports the problems it finds in order of severity, and recommends other skills to fix them. Guards against common mistakes seen across 50+ companies.

write-judge-prompt (Skill)

Part of the Eval Skills collection, which guides AI coding agents in building LLM evaluations: eval auditing, error analysis, judge prompt writing, RAG evaluation, and more. The collection guards against common mistakes observed across 50+ companies.

evaluate-rag (Skill)

A skill from the Eval Skills plugin that helps AI coding agents build LLM evaluations, including eval auditing, error analysis, RAG evaluation, and judge prompt writing. Based on lessons learned from helping 50+ companies with their eval pipelines.

error-analysis (Skill)

Guides AI coding agents through reading LLM traces and systematically categorizing failures. Part of the Eval Skills collection, which helps build robust LLM evaluations based on lessons from 50+ companies.

generate-synthetic-data (Skill)

Creates diverse synthetic test inputs for LLM evaluations using dimension-based tuple generation. Part of the Eval Skills collection, which guards against common mistakes in eval pipeline construction.
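
The listing does not show how the skill implements dimension-based tuple generation, so the following is only a minimal sketch of one plausible reading: define a few dimensions, enumerate every combination as a tuple, optionally sample a subset, and turn each tuple into a generation prompt. The dimension names, values, prompt wording, and the helper names generate_tuples and to_prompt are all hypothetical and do not come from the repository.

```python
import itertools
import random

# Hypothetical dimensions for illustration only; not taken from the skill.
DIMENSIONS = {
    "persona": ["new user", "power user"],
    "intent": ["refund request", "billing question", "feature request"],
    "tone": ["polite", "frustrated"],
}


def generate_tuples(dimensions, sample_size=None, seed=0):
    """Return one dict per combination of dimension values.

    If sample_size is given and smaller than the number of combinations,
    a reproducible random subset (without replacement) is returned instead.
    """
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    combos = [dict(zip(names, values)) for values in itertools.product(*value_lists)]
    if sample_size is not None and sample_size < len(combos):
        combos = random.Random(seed).sample(combos, sample_size)
    return combos


def to_prompt(combo):
    """Render a tuple as an instruction for producing one synthetic input."""
    return (
        "Write a realistic user message from a {persona} "
        "with a {intent}, in a {tone} tone."
    ).format(**combo)


all_tuples = generate_tuples(DIMENSIONS)            # 2 * 3 * 2 = 12 combinations
subset = generate_tuples(DIMENSIONS, sample_size=5)  # 5 of the 12, fixed seed
prompts = [to_prompt(combo) for combo in subset]
```

Enumerating the full product before sampling keeps coverage explicit: it is easy to see which combinations were left out, which matters more for test diversity than the exact wording of any single prompt.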

validate-evaluator (Skill)

Part of a collection of skills that guide AI coding agents to build LLM evaluations. The collection includes an eval audit skill that catches common mistakes, error analysis for reading traces and categorizing failures, and tools for validating evaluator quality.

build-review-interface (Skill)

Builds custom browser-based annotation interfaces for reviewing LLM traces and collecting structured human feedback. Includes pass/fail labeling, keyboard shortcuts, domain-appropriate data formatting, and verification testing with Playwright.
