🎢 Your Eval Workflow
1. Design your test cases 🎨
Running skill-eval init creates a handy evals/evals.json file in your skill directory. Pop open that file and add 2-3 realistic prompts. Don’t worry about making it perfect yet—just get a feel for how it works!
{
"skill_name": "csv-analyzer",
"evals": [
{
"id": 1,
"prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
"expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
"files": ["evals/files/sales_2025.csv"]
}
]
}
💡 Tips for awesome prompts:
- Mix it up! Try casual phrasing (“clean this up”) and precise phrasing.
- Edge cases matter! Throw in a tricky edge-case prompt to see how your skill handles it.
- Keep it real! Mention real file paths, column names, and context.
- Start small! Just 2-3 test cases are plenty for your first loop.
2. Write your assertions ✅
Assertions are simple pass/fail statements about what your output should look like. It’s best to add these after your first run, so you know what “good” actually looks like!
"assertions": [
"The output includes a bar chart image file",
"The chart shows exactly 3 months",
"Both axes are labeled",
"The chart title or caption mentions revenue"
]
Keep your assertions specific and observable (like “a file named results.csv exists”). Try to avoid vague statements (“the output is good”) or super brittle ones (“it says exactly ‘Total Revenue: $X’”).
Deterministic assertion matchers 🤖
For common checks, you can use prefix-based matchers that are evaluated locally instead of being sent to the LLM judge. These are faster, cheaper, and give consistent verdicts:
"assertions": [
"file_exists: results.csv",
"contains_text: summary.txt:Total revenue",
"matches_text: output.md:^## Summary",
"The chart uses a sensible color palette and is visually clear"
]
Supported prefixes:
| Prefix | Example | What it checks |
|---|---|---|
file_exists: |
file_exists: results.csv |
A file was produced in the output directory. |
contains_text: |
contains_text: summary.txt:Total revenue |
A file contains the given literal text. |
matches_text: |
matches_text: output.md:^## Summary |
A file matches the given regular expression. |
Any assertion without a prefix is sent to the LLM judge unchanged, so you can mix deterministic checks with open-ended judgement in the same eval.
3. Run the loop! 🔄
Ready? Let’s go!
skill-eval loop
This magic command handles the whole cycle for you:
- Run: Executes every eval twice (once with your skill, once as a baseline). It saves the outputs to
outputs/and notes the timing. - Grade: Asks the judge agent to check your assertions against the outputs, generating a nice
grading.jsonwith PASS/FAIL verdicts. - Benchmark: Gathers all the stats into
benchmark.jsonso you can easily see if your skill is pulling its weight!
💡 Want the tool to fix failures automatically? Try
skill-eval loop --fixand watch it re-run failing evals with the judge’s feedback as a critique! It’ll keep refining until things pass or it hits the attempt limit. Perfect for those “almost there” situations.
🔬 Shipping to multiple runtimes?
skill-eval loop --models pi:claude-sonnet,claude:opus-4-8,copilotruns every eval against each agent and tells you which one your skill helps most. Great for runtime-agnostic skills!
4. Review the results 🔍
Take a peek inside your workspace. For each eval, you’ll find:
| Artifact | What it tells you |
|---|---|
outputs/ |
The actual files generated—open them up and take a look! |
grading.json |
Which assertions passed or failed, along with the judge’s reasoning. |
benchmark.json |
All your stats and deltas in one place. |
Don’t forget to add your feedback! You can leave notes in feedback.json:
{
"eval-1": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
"eval-2": ""
}
Specific feedback is super helpful; vague feedback, not so much. If you leave it empty, it means everything looked great!
5. Spot the patterns 🧩
After grading, dive into the details:
- Toss out assertions that always pass: They aren’t telling you anything new!
- Fix assertions that always fail: The test might be too hard, or the assertion might be a bit off.
- Celebrate assertions that pass only with your skill: This is where your skill shines! Figure out which instructions made the magic happen.
- Check the outliers: If something took way too long, read the transcript to see where it got stuck.
6. Keep iterating! 🚀
Now you have the three golden signals: failed assertions, your feedback, and execution transcripts. Use them to tweak your skill!
Once you’ve made updates, run the loop again comparing against your previous version:
skill-eval loop --baseline previous
This creates an iteration-2/ directory. Keep iterating until you’re thrilled with the results and your feedback is completely empty!