23 results for tag "agent-evaluation"
A collection of 255+ universal agentic skills for AI coding assistants, including Claude Code, Gemini CLI, Codex CLI, Antigravity IDE, GitHub Copilot, and Cursor.
Evaluates AI agent performance by systematically testing and scoring agent capabilities across multiple predefined metrics and scenarios.
Agent evaluation skill using MLflow for systematically evaluating and improving LLM agent output quality. Covers tool selection accuracy, answer quality, cost reduction, and end-to-end evaluation with datasets, scorers, and tracing.
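The dataset/scorer pattern this kind of skill builds on can be sketched in plain Python. This is a hedged illustration of the concept only, not MLflow's actual API; the names `Example`, `exact_match`, and `evaluate` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch (not a real library API): an evaluation run pairs a
# dataset of labeled examples with one or more scorers, then averages the
# per-example scores into aggregate metrics.

@dataclass
class Example:
    question: str
    expected: str

def exact_match(output: str, example: Example) -> float:
    """Score 1.0 when the agent's answer matches the expected answer."""
    return 1.0 if output.strip().lower() == example.expected.strip().lower() else 0.0

def evaluate(agent: Callable[[str], str],
             dataset: list[Example],
             scorers: dict[str, Callable[[str, Example], float]]) -> dict[str, float]:
    """Run the agent over every example and average each scorer's results."""
    totals = {name: 0.0 for name in scorers}
    for ex in dataset:
        output = agent(ex.question)
        for name, scorer in scorers.items():
            totals[name] += scorer(output, ex)
    return {name: total / len(dataset) for name, total in totals.items()}

# Toy agent with canned answers, wrong on the second example.
dataset = [Example("capital of France?", "Paris"), Example("2+2?", "4")]
agent = lambda q: {"capital of France?": "Paris", "2+2?": "5"}[q]
scores = evaluate(agent, dataset, {"exact_match": exact_match})
print(scores)  # → {'exact_match': 0.5}
```

Real frameworks such as MLflow layer tracing and cost accounting on top of this loop, but the core shape (dataset in, named scorers applied, aggregate metrics out) is the same.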
An agent evaluation skill from the ClawFu collection of 175 expert marketing methodologies, providing structured, named-expert frameworks for assessing AI agent quality, performance, and outputs.
Evaluates AI agent performance with structured assessment frameworks, benchmarks, and improvement tracking for context engineering workflows.
Agent evaluation skill from ltk, a personal development toolkit for Claude Code with 35 skills, 16 commands, 7 agents, 4 hooks, and 3 MCP servers. Provides extensible, per-project tooling with auto-loading domain knowledge.
Evaluates AI agent performance across multiple dimensions, generating comprehensive metrics and insights for benchmarking and improvement strategies.