22 results for tag "agent-evaluation"
Evaluates AI agent performance, capabilities, and effectiveness through systematic assessment and scoring methodologies.
A collection of 255+ universal agentic skills for AI coding assistants including Claude Code, Gemini CLI, Codex CLI, Antigravity IDE, GitHub Copilot, and Cursor.
Evaluates AI agents by systematically testing and scoring their capabilities across multiple predefined metrics and scenarios.
Agent evaluation skill that uses MLflow to systematically evaluate and improve LLM agent output quality. Covers tool-selection accuracy, answer quality, cost reduction, and end-to-end evaluation with datasets, scorers, and tracing.
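To illustrate the kind of workflow this entry describes, here is a minimal sketch assuming MLflow 3's mlflow.genai.evaluate API and its @scorer decorator; the eval_data records, the exact_match scorer, and the agent_predict function are hypothetical stand-ins, not part of the skill itself.

```python
# Minimal sketch of an MLflow-based agent evaluation run.
# Assumes MLflow 3's GenAI evaluation API; the dataset and the
# agent_predict / exact_match names below are illustrative only.
import mlflow
from mlflow.genai.scorers import scorer

# Hypothetical evaluation dataset: each record pairs the agent's
# inputs with the expectations the scorers will check against.
eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_response": "Paris"},
    },
    {
        "inputs": {"question": "What is the capital of Japan?"},
        "expectations": {"expected_response": "Tokyo"},
    },
]

@scorer
def exact_match(outputs, expectations) -> bool:
    # Simple answer-quality scorer: does the agent's final answer
    # exactly match the expected response?
    return outputs.strip() == expectations["expected_response"]

def agent_predict(question: str) -> str:
    # Placeholder for the agent under test; in practice this would
    # invoke the LLM agent and return its final answer, which MLflow
    # can capture via tracing.
    return "Paris" if "France" in question else "unknown"

# Run the evaluation: predict_fn is called once per record with the
# fields of "inputs" as keyword arguments, and each scorer is applied
# to the resulting outputs.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=agent_predict,
    scorers=[exact_match],
)
```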
Agent evaluation skill from ltk, a personal development toolkit for Claude Code with 35 skills, 16 commands, 7 agents, 4 hooks, and 3 MCP servers. Provides extensible, per-project tooling with auto-loading domain knowledge.
Evaluates AI agent performance with structured assessment frameworks, benchmarks, and improvement tracking for context-engineering workflows.
Evaluates AI agent performance across multiple dimensions, generating comprehensive metrics and insights for benchmarking and improvement strategies.