🎢 Your Eval Workflow

1. Design your test cases 🎨

Running skill-eval init creates an evals/evals.json file in your skill directory. Open it up and add 2-3 realistic prompts. Don’t worry about making it perfect yet; just get a feel for how it works!

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Can you find the top 3 months by revenue and make a bar chart?",
      "expected_output": "A bar chart image showing the top 3 months by revenue, with labeled axes and values.",
      "files": ["evals/files/sales_2025.csv"]
    }
  ]
}
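Before your first run, it can help to sanity-check the file. Here is a minimal sketch, not part of skill-eval itself: the function name `check_evals` is made up, the field list is inferred from the example above, and it assumes paths in "files" are relative to the skill directory. The real tool may enforce different rules.

```python
import json
from pathlib import Path

# Field names come from the example above; treating them as required is an
# assumption of this sketch, not a documented rule of the tool.
EXPECTED_FIELDS = ("id", "prompt", "expected_output")


def check_evals(doc: dict, skill_dir: Path) -> list[str]:
    """Return a list of human-readable problems found in an evals.json document."""
    problems = []
    if not doc.get("skill_name"):
        problems.append("missing skill_name")
    evals = doc.get("evals") or []
    if not evals:
        problems.append("no evals defined")
    seen_ids = set()
    for index, case in enumerate(evals):
        label = f"evals[{index}]"
        for field in EXPECTED_FIELDS:
            if field not in case:
                problems.append(f"{label}: missing {field}")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"{label}: duplicate id {case['id']}")
            seen_ids.add(case["id"])
        # Assumption: paths in "files" are relative to the skill directory.
        for rel_path in case.get("files", []):
            if not (skill_dir / rel_path).is_file():
                problems.append(f"{label}: file not found: {rel_path}")
    return problems


if __name__ == "__main__":
    skill_dir = Path(".")
    evals_path = skill_dir / "evals" / "evals.json"
    if evals_path.is_file():
        doc = json.loads(evals_path.read_text(encoding="utf-8"))
        for problem in check_evals(doc, skill_dir):
            print(problem)
    else:
        print(f"{evals_path} not found; run skill-eval init first")
```

Run it from your skill directory; no output means the sketch found nothing to complain about.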

💡 Tips for awesome prompts:

Mix it up! Try casual phrasing (“clean this up”) and precise phrasing.
Edge cases matter! Throw in a tricky edge-case prompt to see how your skill handles it.
Keep it real! Mention real file paths, column names, and context.
Start small! Just 2-3 test cases are plenty for your first loop.

2. Write your assertions ✅

Assertions are simple pass/fail statements about what your output should look like. It’s best to add these after your first run (step 3), so you know what “good” actually looks like!

"assertions": [
  "The output includes a bar chart image file",
  "The chart shows exactly 3 months",
  "Both axes are labeled",
  "The chart title or caption mentions revenue"
]

Keep your assertions specific and observable (like “a file named results.csv exists”). Try to avoid vague statements (“the output is good”) or super brittle ones (“it says exactly ‘Total Revenue: $X’”).

Deterministic assertion matchers 🤖

For common checks, you can use prefix-based matchers that are evaluated locally instead of being sent to the LLM judge. These are faster, cheaper, and give consistent verdicts:

"assertions": [
  "file_exists: results.csv",
  "contains_text: summary.txt:Total revenue",
  "matches_text: output.md:^## Summary",
  "The chart uses a sensible color palette and is visually clear"
]

Supported prefixes:

| Prefix | Example | What it checks |
| --- | --- | --- |
| `file_exists:` | `file_exists: results.csv` | A file was produced in the output directory. |
| `contains_text:` | `contains_text: summary.txt:Total revenue` | A file contains the given literal text. |
| `matches_text:` | `matches_text: output.md:^## Summary` | A file matches the given regular expression. |

Any assertion without a prefix is sent to the LLM judge unchanged, so you can mix deterministic checks with open-ended judgement in the same eval.
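To make the routing concrete, here is a rough sketch of how prefix-based matchers like these could be evaluated locally. It is an illustration only, not skill-eval’s actual implementation: the function name `check_deterministic` is hypothetical, and two details are assumptions (the first colon after the prefix separates file name from text, and `matches_text:` patterns may match anywhere in the file, with `^` anchoring to line starts).

```python
import re
from pathlib import Path
from typing import Optional


def check_deterministic(assertion: str, output_dir: Path) -> Optional[bool]:
    """Evaluate a prefixed assertion locally.

    Returns True/False for a recognised prefix, or None when the assertion
    has no known prefix and should go to the LLM judge instead.
    """
    prefix, sep, rest = assertion.partition(":")
    if not sep:
        return None
    prefix = prefix.strip()
    rest = rest.strip()

    if prefix == "file_exists":
        return (output_dir / rest).is_file()

    if prefix in ("contains_text", "matches_text"):
        # Assumption: the first colon after the prefix separates the file
        # name from the text or pattern.
        file_name, sep2, needle = rest.partition(":")
        target = output_dir / file_name.strip()
        if not sep2 or not target.is_file():
            return False
        content = target.read_text(encoding="utf-8", errors="replace")
        if prefix == "contains_text":
            return needle in content
        # Assumption: the pattern may match anywhere, with ^ and $ anchored
        # to line boundaries.
        return re.search(needle, content, flags=re.MULTILINE) is not None

    # Unknown prefix (or a plain sentence that happens to contain a colon).
    return None
```

A `None` result is the signal to hand the assertion to the judge, which is how one list can mix both kinds of checks.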

3. Run the loop! 🔄

Ready? Let’s go!

skill-eval loop

This magic command handles the whole cycle for you:

Run: Executes every eval twice (once with your skill, once as a baseline). It saves the outputs to outputs/ and notes the timing.
Grade: Asks the judge agent to check your assertions against the outputs, generating a nice grading.json with PASS/FAIL verdicts.
Benchmark: Gathers all the stats into benchmark.json so you can easily see if your skill is pulling its weight!
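Curious what “pulling its weight” boils down to? One simple way to think about it is the difference in pass rate between the with-skill run and the baseline run. The sketch below is an assumption about the idea, not the actual layout of benchmark.json; `pass_rate` and `skill_delta` are hypothetical helper names.

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of assertions that passed; 0.0 when there are none."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def skill_delta(with_skill: list[bool], baseline: list[bool]) -> float:
    """Positive when the skill run passes more assertions than the baseline."""
    return pass_rate(with_skill) - pass_rate(baseline)


if __name__ == "__main__":
    # Made-up verdicts for one eval with four assertions.
    with_skill = [True, True, True, False]
    baseline = [True, False, False, False]
    print(f"delta: {skill_delta(with_skill, baseline):+.0%}")  # prints "delta: +50%"
```

A delta near zero suggests the skill isn’t changing outcomes for that eval; a negative one is worth a closer look.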

💡 Want the tool to fix failures automatically? Try skill-eval loop --fix and watch it re-run failing evals with the judge’s feedback as a critique! It’ll keep refining until things pass or it hits the attempt limit. Perfect for those “almost there” situations.

🔬 Shipping to multiple runtimes? skill-eval loop --models pi:claude-sonnet,claude:opus-4-8,copilot runs every eval against each agent and tells you which one your skill helps most. Great for runtime-agnostic skills!

4. Review the results 🔍

Take a peek inside your workspace. For each eval, you’ll find:

| Artifact | What it tells you |
| --- | --- |
| `outputs/` | The actual files generated. Open them up and take a look! |
| `grading.json` | Which assertions passed or failed, along with the judge’s reasoning. |
| `benchmark.json` | All your stats and deltas in one place. |

Don’t forget to add your feedback! You can leave notes in feedback.json:

{
  "eval-1": "The chart is missing axis labels and the months are in alphabetical order instead of chronological.",
  "eval-2": ""
}

Specific feedback is super helpful; vague feedback, not so much. Leaving an entry empty means that eval’s output looked great!
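If you want a quick list of what still needs attention, a few lines of Python will do. This is a small sketch based only on the feedback.json shape shown above; the helper name `open_feedback` and the file location are assumptions, so adjust the path to wherever your workspace keeps it.

```python
import json
from pathlib import Path


def open_feedback(feedback: dict[str, str]) -> dict[str, str]:
    """Keep only the evals whose feedback note is non-empty."""
    return {eval_id: note.strip() for eval_id, note in feedback.items() if note.strip()}


if __name__ == "__main__":
    path = Path("feedback.json")  # assumed location; adjust to your workspace
    if path.is_file():
        notes = open_feedback(json.loads(path.read_text(encoding="utf-8")))
        for eval_id, note in notes.items():
            print(f"{eval_id}: {note}")
        print(f"{len(notes)} eval(s) still have open feedback")
```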

5. Spot the patterns 🧩

After grading, dive into the details:

Toss out assertions that always pass: If they pass both with your skill and in the baseline, they aren’t telling you anything new!
Fix assertions that always fail: If they fail in both runs, the test might be too hard, or the assertion might be a bit off.
Celebrate assertions that pass only with your skill: This is where your skill shines! Figure out which instructions made the magic happen.
Check the outliers: If something took way too long, read the transcript to see where it got stuck.
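The first three patterns above can be sketched as a tiny classifier. Treat it as an illustration: the input format (one row per assertion with a with-skill verdict and a baseline verdict) is made up for this example and is not the real grading.json schema, and the fourth label, for assertions that pass only in the baseline, is an extra case added here for completeness.

```python
from typing import Iterable


def classify_assertions(rows: Iterable[tuple[str, bool, bool]]) -> dict[str, str]:
    """Label each assertion from (text, passed_with_skill, passed_in_baseline) rows."""
    labels = {}
    for text, with_skill, baseline in rows:
        if with_skill and baseline:
            labels[text] = "always passes: consider dropping"
        elif not with_skill and not baseline:
            labels[text] = "always fails: check the test or the assertion"
        elif with_skill:
            labels[text] = "skill-only pass: the skill is adding value"
        else:
            # Not one of the patterns listed above; included so no row is unlabeled.
            labels[text] = "baseline-only pass: the skill may be hurting"
    return labels


if __name__ == "__main__":
    # Made-up verdicts reusing the assertions from step 2.
    rows = [
        ("The output includes a bar chart image file", True, True),
        ("The chart shows exactly 3 months", True, False),
        ("Both axes are labeled", False, False),
    ]
    for text, label in classify_assertions(rows).items():
        print(f"{label} <- {text}")
```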

6. Keep iterating! 🚀

Now you have the three golden signals: failed assertions, your feedback, and execution transcripts. Use them to tweak your skill!

Once you’ve made updates, run the loop again comparing against your previous version:

skill-eval loop --baseline previous

This creates an iteration-2/ directory. Keep iterating until you’re thrilled with the results and your feedback is completely empty!