llm-evaluation
Skill from wshobson/agents
A production-ready plugin system with 112 AI agents, 146 skills, 16 workflow orchestrators, and 79 development tools organized into 73 focused plugins for Claude Code.
Overview
A Claude Code skill from the wshobson/agents plugin marketplace that provides specialized knowledge for evaluating and benchmarking large language models. It is part of a comprehensive system of 112 specialized AI agents and 146 agent skills organized into 73 focused plugins, optimized for minimal token usage.
Key Features
- AI/ML Domain Expertise - Backed by a specialized agent with deep knowledge in LLM evaluation methodologies, benchmarking, and quality assessment
- Granular Plugin Design - Loads only LLM evaluation-related components, keeping context focused on model assessment
- Progressive Disclosure - Evaluation knowledge activates only when needed, maintaining efficient context management
- Composable with AI Skills - Designed to work alongside other data/AI and machine learning plugins for comprehensive ML workflows
- Production-Ready Evaluation Patterns - Provides tested patterns for prompt evaluation, model comparison, and quality metrics
Who is this for?
This skill is designed for AI/ML engineers and researchers who need structured guidance on evaluating large language model performance and quality. It is particularly useful for teams selecting between models, building evaluation harnesses, or establishing quality benchmarks for their LLM-powered applications.
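As an illustration of what a small model-comparison harness can look like, here is a minimal Python sketch. It is not taken from the skill itself: `model_a`, `model_b`, `DATASET`, and the exact-match metric are all hypothetical stand-ins chosen for the example, and a real harness would call actual model APIs and likely use richer metrics.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (prompt, expected answer) pairs


# Hypothetical stand-ins for real model calls (assumption, not the skill's API).
def model_a(prompt: str) -> str:
    answers = {"2+2=": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {"2+2=": "4", "Capital of France?": "Lyon"}
    return answers.get(prompt, "")


def exact_match_accuracy(model: Callable[[str], str], dataset: Dataset) -> float:
    """Fraction of prompts whose output equals the expected answer,
    ignoring case and surrounding whitespace."""
    if not dataset:
        return 0.0
    hits = sum(
        1
        for prompt, expected in dataset
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(dataset)


def compare_models(
    models: Dict[str, Callable[[str], str]], dataset: Dataset
) -> Dict[str, float]:
    """Score every model on the same dataset so results are comparable."""
    return {name: exact_match_accuracy(fn, dataset) for name, fn in models.items()}


# Hypothetical two-item evaluation set.
DATASET: Dataset = [("2+2=", "4"), ("Capital of France?", "Paris")]

scores = compare_models({"model_a": model_a, "model_b": model_b}, DATASET)
print(scores)
```

Running every candidate against the same fixed dataset and metric is what makes the scores comparable; swapping in a different metric only requires replacing `exact_match_accuracy`.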
Same repository
wshobson/agents (234 items)
Installation
npx vibeindex add wshobson/agents --skill llm-evaluation
npx skills add wshobson/agents --skill llm-evaluation
~/.claude/skills/llm-evaluation/SKILL.md
More from this repository (10)
The ui-design plugin is part of the wshobson/agents marketplace for Claude Code, providing specialized AI agents for UI/UX design assistance within development workflows.
The data-validation-suite plugin is part of the wshobson/agents marketplace for Claude Code. It falls under the Data category, which includes two data-focused plugins: data engineering and data validation.
A Claude Code plugin from the wshobson/agents marketplace for deployment validation, providing specialized AI agents and tools to ensure reliable production deployments within a 73-plugin ecosystem.
Shell Scripting is a Claude Code plugin from the wshobson/agents marketplace that provides AI-powered assistance for writing and maintaining shell scripts.
An MLOps plugin from the wshobson/agents ecosystem providing Claude Code with specialized agents and skills for ML pipeline management, model deployment, experiment tracking, and production monitoring.
A Claude Code plugin with specialized AI agents for accessibility compliance auditing, WCAG standards verification, and remediation guidance in web and mobile applications.
The reverse-engineering plugin is part of the wshobson/agents marketplace for Claude Code, providing specialized AI agents for code analysis, binary examination, and system reverse engineering tasks.
A Claude Code plugin for CI/CD automation with 4 specialized skills covering pipeline design, GitHub Actions, GitLab CI, and secrets management, part of the wshobson/agents marketplace.
The functional-programming plugin is part of the wshobson/agents marketplace for Claude Code. It falls under the Languages category, which includes seven language-focused plugins covering Python, JavaScript/TypeScript, systems programming, JVM, sc...
Comprehensive Review is a Claude Code plugin from the wshobson/agents marketplace that provides multi-perspective code analysis covering architecture, security, and best practices.