12 results for tag "llm-evaluation"
A production-ready plugin system with 112 AI agents, 146 skills, 16 workflow orchestrators, and 79 development tools organized into 73 focused plugins for Claude Code.
Evaluates LLM applications systematically using automated metrics, human feedback, and comparative techniques to measure performance and quality.
Evaluates LLM performance systematically using automated metrics, human feedback, and benchmarking techniques across various dimensions.
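To illustrate the kind of checks these evaluation agents describe, here is a small sketch (not taken from either plugin) of one automated metric, exact-match accuracy, and one comparative technique, a crude pairwise win rate, over a tiny labeled set; the test cases and model outputs are invented for illustration.

```typescript
// Illustrative only: exact-match accuracy plus a simple A-vs-B comparison.
// The cases and outputs below are assumptions, not data from the plugins.

interface EvalCase {
  input: string;
  expected: string; // reference answer
  outputA: string;  // candidate model A's answer (assumed already collected)
  outputB: string;  // candidate model B's answer
}

const cases: EvalCase[] = [
  { input: '2 + 2 = ?', expected: '4', outputA: '4', outputB: '4' },
  { input: 'Capital of France?', expected: 'Paris', outputA: 'Paris', outputB: 'Lyon' },
];

const normalize = (s: string) => s.trim().toLowerCase();

// Automated metric: fraction of cases where a model's answer matches the reference.
const accuracy = (pick: (c: EvalCase) => string) =>
  cases.filter((c) => normalize(pick(c)) === normalize(c.expected)).length / cases.length;

// Comparative technique: cases where A is correct and B is not.
const aOnlyWins = cases.filter(
  (c) =>
    normalize(c.outputA) === normalize(c.expected) &&
    normalize(c.outputB) !== normalize(c.expected),
).length;

console.log('accuracy A:', accuracy((c) => c.outputA));
console.log('accuracy B:', accuracy((c) => c.outputB));
console.log('cases won by A only:', aOnlyWins, 'of', cases.length);
```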
A skill for LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. Covers prompt regression testing, security testing (red teaming, jailbreaks), model performance comparison, and building evaluation suites for RAG, factuality, or safety.
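To make the Promptfoo-based workflow concrete, the sketch below shows a minimal evaluation suite using Promptfoo's Node.js API (`npm install promptfoo`); the same suite is typically written as a `promptfooconfig.yaml` and run with `promptfoo eval` in CI. The prompt text, provider IDs, test case, and assertion values here are illustrative assumptions, not values taken from the skill itself.

```typescript
// Minimal Promptfoo evaluation suite via the Node.js API.
// Prompts, providers, and assertions below are illustrative assumptions.
import promptfoo from 'promptfoo';

async function main() {
  const results = await promptfoo.evaluate(
    {
      // Prompt template under test; {{question}} is filled from each test case's vars.
      prompts: ['Answer concisely and only state facts you are sure of: {{question}}'],
      // Compare two models side by side (model IDs are assumptions; use any configured provider).
      providers: ['openai:gpt-4o-mini', 'anthropic:messages:claude-3-5-sonnet-20241022'],
      tests: [
        {
          vars: { question: 'What is the capital of France?' },
          assert: [
            // Deterministic check: output must mention Paris.
            { type: 'icontains', value: 'paris' },
            // Model-graded check (needs a grading provider / API key configured).
            { type: 'llm-rubric', value: 'Answer is factually correct and directly addresses the question.' },
          ],
        },
      ],
    },
    { maxConcurrency: 2 },
  );

  // Inspect results; the exact result shape depends on the installed Promptfoo version.
  console.log(results);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

In a CI quality gate, failing assertions from a run like this are what block a prompt or model change from merging.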