Skill Eval

🌱 Running Your First Evaluation

Ready to see your skill put through its paces? This guide walks you through the full journey, from a blank slate to a working eval loop, in just a few minutes.


Before you start 🛠️

Make sure you’ve got the CLI installed and your global config set up:

go install github.com/matt-riley/skill-evaluator@latest
skill-eval init --global

💡 The global config lives at ~/.config/skill-eval/config.yaml. It’s where you set your default agent and model, and you only need to do this once!


Step 1: Scaffold your eval directory 📁

Navigate to your skill directory (the folder where your SKILL.md lives) and run:

skill-eval init

This creates an evals/ folder with a starter evals.json file. Open it up: it’s your canvas!

Here’s what a filled-out evals.json looks like:

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze sales_2025.csv and show me the top 3 months by revenue.",
      "expected_output": "A ranked list of the top 3 months with revenue totals.",
      "assertions": [
        "Output names exactly 3 months",
        "Months are sorted by revenue descending",
        "Revenue totals are shown"
      ]
    }
  ]
}
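
Want a quick sanity check before your first run? Here is a minimal Python sketch that looks for the fields shown in the example above. The helper names (check_evals, check_evals_file) are hypothetical and not part of skill-eval, and the set of required keys is only an assumption based on the sample file. Assertions are treated as optional here, since it is fine to add them after a first run.

```python
import json

# Assumption: these keys are required because they appear in the sample file.
# "assertions" is treated as optional so a first run can go without them.
REQUIRED_EVAL_KEYS = {"id", "prompt", "expected_output"}


def check_evals(doc):
    """Return a list of problems in a parsed evals.json (empty list = looks fine)."""
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not isinstance(doc.get("skill_name"), str) or not doc["skill_name"].strip():
        problems.append("skill_name must be a non-empty string")
    evals = doc.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        return problems
    seen_ids = set()
    for index, item in enumerate(evals):
        if not isinstance(item, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        missing = sorted(REQUIRED_EVAL_KEYS - item.keys())
        if missing:
            problems.append(f"evals[{index}] is missing {missing}")
        if "id" in item:
            key = repr(item["id"])  # repr keeps unhashable ids from raising
            if key in seen_ids:
                problems.append(f"evals[{index}] reuses id {key}")
            seen_ids.add(key)
        assertions = item.get("assertions", [])
        if not isinstance(assertions, list) or not all(
            isinstance(a, str) for a in assertions
        ):
            problems.append(f"evals[{index}].assertions must be a list of strings")
    return problems


def check_evals_file(path="evals/evals.json"):
    """Load the file from disk and run the same checks."""
    with open(path, encoding="utf-8") as handle:
        return check_evals(json.load(handle))
```

Run check_evals_file() from your skill directory. An empty list means the file has the same shape as the example.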

💡 Quick tips for great evals:

  • Be specific! Vague prompts produce vague results.
  • Start small. Two or three evals is plenty for your first loop.
  • Add your assertions later. Run once without them first so you know what “good” actually looks like!

Step 2: Check your config ⚙️

Drop a .skill-eval.yaml next to your SKILL.md to tell skill-eval which agent and model to use:

agent: pi
model: claude-sonnet-4-5

This overrides your global defaults for just this skill. Handy when you’re comparing across models!
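
To picture the precedence, here is a tiny Python sketch of a per-skill file layered over global defaults. It only illustrates the override idea and is not skill-eval's actual loader. The function name effective_config and the global values are hypothetical, and the key-by-key merge is an assumption.

```python
def effective_config(global_defaults, skill_overrides):
    """Overlay per-skill settings on the global defaults, key by key (assumed behavior)."""
    merged = dict(global_defaults)   # copy so the defaults stay untouched
    merged.update(skill_overrides)   # per-skill values win where both are set
    return merged


# Hypothetical global defaults; the per-skill values come from the example above.
global_defaults = {"agent": "default-agent", "model": "default-model"}
skill_overrides = {"agent": "pi", "model": "claude-sonnet-4-5"}

print(effective_config(global_defaults, skill_overrides))
# {'agent': 'pi', 'model': 'claude-sonnet-4-5'}
```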


Step 3: Run the loop! 🔄

Here’s the magic command:

skill-eval loop

Watch it go! This single command handles the whole cycle:

  1. Run: executes every eval twice, once with your skill active and once as a plain baseline.
  2. Grade: asks the judge agent to check each assertion, producing a grading.json with PASS/FAIL verdicts.
  3. Benchmark: rolls all the stats up into a benchmark.json so you can see the delta at a glance.
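
Curious what “the delta” boils down to? Here is a rough Python sketch of the arithmetic: a pass rate for each run, then the difference between them. The verdict lists are made up, the function names are hypothetical, and the real benchmark.json layout may differ.

```python
def pass_rate(verdicts):
    """Fraction of verdicts equal to "PASS"; 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v == "PASS") / len(verdicts)


def skill_delta(with_skill, baseline):
    """Positive means the with-skill run passed more assertions than the baseline."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Made-up verdicts for four assertions, purely for illustration.
with_skill = ["PASS", "PASS", "PASS", "FAIL"]
baseline = ["PASS", "FAIL", "FAIL", "FAIL"]

print(pass_rate(with_skill))              # 0.75
print(pass_rate(baseline))                # 0.25
print(skill_delta(with_skill, baseline))  # 0.5
```

A positive delta suggests the skill is helping on these assertions. A delta near zero means the baseline does about as well on its own.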


Step 4: Peek at the workspace 👀

After the loop finishes, your results land in <skill-name>-workspace/iteration-1/. Here’s what you’ll find:

File            What it tells you
outputs/        The actual agent responses. Open them up!
grading.json    Which assertions passed or failed, with the judge’s reasoning.
benchmark.json  The headline stats: pass rates, timing, and the with-skill vs baseline delta.
feedback.json   Empty for now. This is where you leave notes for next time.
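
If you want to confirm that everything landed, here is a small Python sketch that checks an iteration folder for the four entries in the table above. The helper name missing_entries is hypothetical, and the example path assumes a skill named csv-analyzer, so swap in your own.

```python
from pathlib import Path

# Entries listed in the table above; a trailing slash marks a directory.
EXPECTED_ENTRIES = ["outputs/", "grading.json", "benchmark.json", "feedback.json"]


def missing_entries(iteration_dir):
    """Return the expected entries that are absent from one iteration directory."""
    root = Path(iteration_dir)
    missing = []
    for name in EXPECTED_ENTRIES:
        target = root / name.rstrip("/")
        present = target.is_dir() if name.endswith("/") else target.is_file()
        if not present:
            missing.append(name)
    return missing


# "csv-analyzer" stands in for <skill-name>; use your own skill's name.
print(missing_entries("csv-analyzer-workspace/iteration-1"))
```

An empty list means all four entries are present. Anything it prints is worth a look before you start reading results.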

What’s next? 🚀

You’re off to a great start. Keep iterating! ✨