These are manual evals that we run to verify that the skill improves the agent's performance. For each run, we check that the skill was invoked and that the result was correct.
## Methodology
- Model: Sonnet 4.5
- Agent: Claude Code
Ran each prompt with `claude $PROMPT`. Auto-accepted any obvious tool use but cancelled the run if the agent presented options.
Ran each test three times and recorded pass/fail.
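
The `n/3` cells in the result tables come straight from this procedure. As a minimal sketch of the bookkeeping (the helper names `tally` and `table_row` are hypothetical and not part of any tooling used for these evals), three recorded pass/fail outcomes per column can be reduced to a table row like this:

```python
def tally(outcomes: list[bool]) -> str:
    """Format recorded pass/fail outcomes as 'passes/runs', e.g. '2/3'."""
    return f"{sum(outcomes)}/{len(outcomes)}"


def table_row(without: list[bool], invoked: list[bool], successful: list[bool]) -> str:
    """Build one Markdown results row from three lists of outcomes."""
    return f"| {tally(without)} | {tally(invoked)} | {tally(successful)} |"


# Hypothetical outcomes for one prompt, three runs per column.
print(table_row([True, False, False], [True] * 3, [True] * 3))
# prints "| 1/3 | 3/3 | 3/3 |"
```

The outcomes themselves were still judged by hand; the sketch only covers turning them into the reported fractions.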
## dbc
- Installed Skills: dbc
- Setup: empty dir, dbc is not on system, no drivers installed with dbc, uv and pipx available
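
Before each run it can help to confirm that the starting state matches the setup described above. The following is a rough sketch, not the procedure actually used here. It assumes the executables are named `dbc`, `uv`, and `pipx` on `PATH`, and `check_setup` is a hypothetical helper. It does not check the "no drivers installed" condition, which still needs a manual look.

```python
import os
import shutil


def check_setup(directory: str = ".", which=shutil.which) -> dict[str, bool]:
    """Report whether each checkable precondition from the setup line holds."""
    return {
        "empty dir": not os.listdir(directory),
        # Assumption: the CLI is installed under the executable name "dbc".
        "dbc not on PATH": which("dbc") is None,
        "uv available": which("uv") is not None,
        "pipx available": which("pipx") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{'ok  ' if ok else 'FAIL'}  {name}")
```

The `which` parameter exists only so the lookup can be swapped out when trying the helper without changing the real environment.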
#### Prompt: "install dbc"
| Without | With: Skill Invoked? | With: Task Successful? |
| ------- | -------------------- | ---------------------- |
| 1/3 | 3/3 | 3/3 |
Notes: Without the skill, the agent only sometimes guesses correctly that we want to install the dbc PyPI package.
#### Prompt: "install the sqlite ADBC driver"
Criterion: Should use dbc to install the sqlite driver
| Without | With: Skill Invoked? | With: Task Successful? |
| ------- | -------------------- | ---------------------- |
| 0/3 | 3/3 | 3/3 |
Notes: Without the skill, Claude always installs the PyPI package instead of using dbc.
#### Prompt: "uninstall the sqlite ADBC driver"
Setup: dbc is available, sqlite is installed
| Without | With: Skill Invoked? | With: Task Successful? |
| ------- | -------------------- | ---------------------- |
| 0/3 | 3/3 | 3/3 |
#### Prompt: "create a dbc driver list with the sqlite and postgresql ADBC drivers"
| Without | With: Skill Invoked? | With: Task Successful? |
| ------- | -------------------- | ---------------------- |
| 0/3 | 3/3 | 3/3 |