Running Your First Evaluation
Ready to see your skill put through its paces? This guide walks you through the full journey, from a blank slate to a working eval loop, in just a few minutes.
Before you start
Make sure you've got the CLI installed and your global config set up:
go install github.com/matt-riley/skill-evaluator@latest
skill-eval init --global
💡 The global config lives at ~/.config/skill-eval/config.yaml. It's where you set your default agent and model, so you only need to do this once!
Step 1: Scaffold your eval directory
Navigate to your skill directory (the folder where your SKILL.md lives) and run:
skill-eval init
This creates an evals/ folder with a starter evals.json file. Open it up: it's your canvas!
Here's what a filled-out evals.json looks like:
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze sales_2025.csv and show me the top 3 months by revenue.",
      "expected_output": "A ranked list of the top 3 months with revenue totals.",
      "assertions": [
        "Output names exactly 3 months",
        "Months are sorted by revenue descending",
        "Revenue totals are shown"
      ]
    }
  ]
}
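Before running anything, it can help to sanity-check the file's shape. The following is a minimal sketch, not part of skill-eval itself: `validate_evals` is a hypothetical helper, and the set of required keys is an assumption inferred from the example above (the tool's real schema may differ). Assertions are treated as optional here.

```python
import json

# Assumption: these keys mirror the example above; the real schema may differ.
REQUIRED_EVAL_KEYS = ("id", "prompt", "expected_output")


def validate_evals(doc):
    """Return a list of human-readable problems found in a parsed evals.json."""
    problems = []
    if not isinstance(doc.get("skill_name"), str) or not doc["skill_name"]:
        problems.append("skill_name must be a non-empty string")
    evals = doc.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        return problems
    seen_ids = []
    for index, item in enumerate(evals):
        if not isinstance(item, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        for key in REQUIRED_EVAL_KEYS:
            if key not in item:
                problems.append(f"evals[{index}] is missing '{key}'")
        if "id" in item:
            if item["id"] in seen_ids:
                problems.append(f"evals[{index}] reuses id {item['id']}")
            seen_ids.append(item["id"])
        # Assertions are optional in this sketch, but must be strings if present.
        assertions = item.get("assertions", [])
        if not isinstance(assertions, list) or not all(
            isinstance(a, str) for a in assertions
        ):
            problems.append(f"evals[{index}].assertions must be a list of strings")
    return problems


sample = json.loads("""
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze sales_2025.csv and show me the top 3 months by revenue.",
      "expected_output": "A ranked list of the top 3 months with revenue totals.",
      "assertions": ["Output names exactly 3 months"]
    }
  ]
}
""")
print(validate_evals(sample))  # prints []
```

To check a real file, load it with `json.load(open("evals/evals.json"))` and pass the result to `validate_evals`.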
💡 Quick tips for great evals:
- Be specific! Vague prompts produce vague results.
- Start small. Two or three evals is plenty for your first loop.
- Add your assertions later: run once without them first so you know what "good" actually looks like!
Step 2: Check your config
Drop a .skill-eval.yaml next to your SKILL.md to tell skill-eval which agent and model to use:
agent: pi
model: claude-sonnet-4-5
This overrides your global defaults for just this skill. Handy when you're comparing across models!
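One way to picture the override is as a simple merge where per-skill values win. This is only a sketch of that precedence, not skill-eval's actual code: `resolve_config` is a hypothetical helper, the global values are placeholders, and the per-key fallback for unset values is an assumption.

```python
def resolve_config(global_config, skill_config):
    """Per-skill values win; anything unset falls back to the global default."""
    resolved = dict(global_config)
    for key, value in skill_config.items():
        if value is not None:
            resolved[key] = value
    return resolved


# Hypothetical global defaults (placeholder values, not real agent or model names).
global_config = {"agent": "default-agent", "model": "default-model"}

# Values from the .skill-eval.yaml shown above.
skill_config = {"agent": "pi", "model": "claude-sonnet-4-5"}

print(resolve_config(global_config, skill_config))
# {'agent': 'pi', 'model': 'claude-sonnet-4-5'}

# Assumed behavior: a partial per-skill file only replaces the keys it sets.
print(resolve_config(global_config, {"model": "claude-sonnet-4-5"}))
# {'agent': 'default-agent', 'model': 'claude-sonnet-4-5'}
```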
Step 3: Run the loop!
Here's the magic command:
skill-eval loop
Watch it go! This single command handles the whole cycle:
- Run: executes every eval twice, once with your skill active and once as a plain baseline.
- Grade: asks the judge agent to check each assertion, producing a grading.json with PASS/FAIL verdicts.
- Benchmark: rolls all the stats up into a benchmark.json so you can see the delta at a glance.
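The "delta" in the Benchmark step can be pictured as the difference between two pass rates. Here is a minimal sketch, assuming verdicts are plain PASS/FAIL strings. The real layout of grading.json and benchmark.json isn't shown in this guide, `pass_rate` is a hypothetical helper, and the sample verdicts are made up purely for illustration.

```python
def pass_rate(verdicts):
    """Fraction of verdicts equal to "PASS" (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(1 for verdict in verdicts if verdict == "PASS") / len(verdicts)


# Made-up verdicts for four assertions, graded once per run.
with_skill = ["PASS", "PASS", "PASS", "FAIL"]
baseline = ["PASS", "FAIL", "FAIL", "FAIL"]

delta = pass_rate(with_skill) - pass_rate(baseline)
print(f"with skill: {pass_rate(with_skill):.2f}")  # with skill: 0.75
print(f"baseline:   {pass_rate(baseline):.2f}")    # baseline:   0.25
print(f"delta:      {delta:+.2f}")                 # delta:      +0.50
```

A positive delta suggests the skill is helping on these assertions; a delta near zero suggests the baseline does about as well without it.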
Step 4: Peek at the workspace
After the loop finishes, your results land in <skill-name>-workspace/iteration-1/. Here's what you'll find:
| File | What it tells you |
|---|---|
| outputs/ | The actual agent responses. Open them up! |
| grading.json | Which assertions passed or failed, with the judge's reasoning. |
| benchmark.json | The headline stats: pass rates, timing, and the with-skill vs baseline delta. |
| feedback.json | Empty for now. This is where you leave notes for next time. |
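If you'd like a quick check that a run produced everything in the table, a few lines of Python will do. This is a sketch rather than a skill-eval feature: `missing_artifacts` is a hypothetical helper, and the `csv-analyzer-workspace` path assumes the workspace is named after the example skill.

```python
from pathlib import Path

# The four entries listed in the table above.
EXPECTED = ("outputs", "grading.json", "benchmark.json", "feedback.json")


def missing_artifacts(iteration_dir):
    """Return the expected entries that do not exist under iteration_dir."""
    root = Path(iteration_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


# Assumed path, following the <skill-name>-workspace/iteration-1/ pattern.
missing = missing_artifacts("csv-analyzer-workspace/iteration-1")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All expected files are present.")
```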
What's next?
- Head over to Reading Your Results to learn how to interpret benchmark.json.
- When something fails, Giving Feedback shows you exactly what to write and why it matters.
You're off to a great start. Keep iterating! ✨