🪄 Auto-Fixing Failing Evals
Sometimes your skill is almost there — one assertion keeps failing, and you know exactly what the agent got wrong. Instead of tweaking your SKILL.md and re-running the whole loop by hand, let skill-eval handle it for you!
Meet the --fix flag. It turns skill-eval loop from a single-pass pipeline into an iterative refinement loop. The judge’s feedback becomes the agent’s critique, and the agent gets another crack at it — automatically!
How it works 🔄
Normally, skill-eval loop does run → grade → benchmark and stops. With --fix, it adds an extra phase:
run → grade → fix (repeat until passes) → benchmark
Here’s the play-by-play:
- Run & grade as usual: all evals execute, and the judge scores them.
- Find the failures: any with-skill eval that didn’t pass every assertion gets flagged.
- Extract the critique: the judge’s evidence from each FAIL becomes the agent’s “here’s what you got wrong” note.
- Re-run with feedback: the agent tries again, this time with the critique in its prompt.
- Re-grade: the judge evaluates the new attempt.
- Keep going until…
  - All assertions pass ✅
  - The same assertions fail twice in a row (plateau detected, time to try a different approach!)
  - The attempt budget runs out (default: 3 attempts)
The best attempt wins, and its grading replaces the original!
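To make the three stopping conditions concrete, here is a minimal Python sketch of the loop described above. It is not skill-eval's actual implementation: `fix_loop`, `rerun_and_grade`, and the `Grading` shape are hypothetical names made up for illustration. It also assumes that the initial grading counts toward the plateau check and that ties go to the earliest attempt, neither of which is stated here.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical shape: assertion text -> True (PASS) or False (FAIL).
Grading = Dict[str, bool]


def failed(grading: Grading) -> FrozenSet[str]:
    """Return the assertions that did not pass."""
    return frozenset(name for name, ok in grading.items() if not ok)


def fix_loop(
    initial: Grading,
    rerun_and_grade: Callable[[FrozenSet[str]], Grading],
    max_fix_attempts: int = 3,
) -> Tuple[int, Grading, str]:
    """Refine one eval; return (best attempt number, its grading, stop reason).

    The initial grading is attempt 1, so re-runs are numbered from 2.
    """
    best_attempt, best = 1, initial
    previous_failures = failed(initial)
    if not previous_failures:
        return best_attempt, best, "already passing"

    reason = "budget exhausted"
    for attempt in range(2, max_fix_attempts + 2):
        # Stand-in for: re-run the agent with the critique, then re-grade.
        grading = rerun_and_grade(previous_failures)
        failures = failed(grading)

        # Assumption: fewer failing assertions means a better attempt,
        # and ties keep the earlier attempt.
        if len(failures) < len(failed(best)):
            best_attempt, best = attempt, grading

        if not failures:
            reason = "all assertions pass"
            break
        if failures == previous_failures:
            # Same assertions failed twice in a row.
            reason = "plateau"
            break
        previous_failures = failures

    return best_attempt, best, reason
```

Calling `fix_loop({"a": False}, lambda failures: {"a": False})` stops after a single re-run with the reason `"plateau"`, because the same assertion failed on two consecutive attempts.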
Using it 🚀
It couldn’t be simpler:
skill-eval loop --fix
That’s it! skill-eval will refine each failing eval up to 3 times by default. Want more patience?
skill-eval loop --fix --max-fix-attempts 5
The baseline path is never fixed — it’s your control group. Only the with-skill side gets the auto-refinement treatment.
What you’ll see 👀
During the fix phase, skill-eval reports every attempt:
[3/4] Auto-fixing failed evals...
eval 1: already passing, skipping
eval 2: 1/2 failed — fixing...
attempt 1: 1/2 passed
attempt 2: 2/2 passed ✓
best: attempt 2 (100% pass)
And inside your workspace, each fix attempt gets its own directory:
iteration-1/
  eval-2/
    with_skill/
      outputs/ ← initial attempt
      grading.json ← best result (promoted from fix-2)
      fix-2/
        outputs/ ← first fix attempt
        grading.json
        timing.json
      fix-3/
        outputs/ ← second fix attempt (the winner!)
        grading.json
        timing.json
      fix-results.json ← full trajectory log
💡 The initial grading is recorded as attempt 1, so fix directories start at fix-2.
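If you script against the workspace, the numbering rule can be captured in a tiny helper. This is only a sketch: `attempt_dir` is a hypothetical function, not part of skill-eval, and it assumes the fix-N directories sit directly inside with_skill/.

```python
from pathlib import PurePosixPath


def attempt_dir(with_skill_dir: str, attempt: int) -> PurePosixPath:
    """Map an attempt number to the directory that holds its outputs.

    Assumption: attempt 1 (the initial run) lives in with_skill/ itself,
    and every later attempt N lives in with_skill/fix-N/.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    base = PurePosixPath(with_skill_dir)
    return base if attempt == 1 else base / f"fix-{attempt}"
```

Under that assumption, `attempt_dir("iteration-1/eval-2/with_skill", 3)` points at `iteration-1/eval-2/with_skill/fix-3`.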
When to use --fix vs doing it yourself 🤔
| Use --fix when… | Iterate manually when… |
|---|---|
| The fix is obvious — the agent just needs a nudge | The core skill instructions are wrong |
| You’re iterating fast and want a quick improvement | You need to rethink your assertions |
| One or two assertions keep failing predictably | The whole approach needs re-architecting |
| You want to see if a small prompt tweak resolves it | You’re adding new test cases or changing expected output |
Think of --fix as your fast-feedback shortcut. It’s not a replacement for careful skill design — it’s a turbo boost for those “almost there” moments! 🏎️
Where the critique comes from 🔍
The critique is extracted directly from the judge’s evidence field. If your grading prompt produces clear, specific evidence (like “Chart shows 5 months instead of the requested 3”), the agent gets actionable feedback. Vague evidence (“Output was incorrect”) leads to vague fixes.
That’s another reason to write good assertions and a solid grading prompt — the fix loop is only as smart as your judge!
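As a rough picture of what that extraction could look like, here is a short Python sketch that turns failing assertions into a critique note. Only the evidence field comes from the description above. The `text` and `passed` keys, the list-of-dicts shape, and the `build_critique` name are assumptions for illustration and may not match the real grading.json.

```python
from typing import Dict, List


def build_critique(assertions: List[Dict[str, object]]) -> str:
    """Collect the judge's evidence for every failing assertion.

    Hypothetical input shape: each item has "text", "passed", "evidence".
    """
    lines = []
    for item in assertions:
        if item.get("passed"):
            continue  # only FAILs feed the critique
        evidence = str(item.get("evidence", "")).strip() or "(no evidence given)"
        lines.append(f"- {item['text']}: {evidence}")
    return "\n".join(lines)


grading = [
    {
        "text": "Chart covers the requested 3 months",
        "passed": False,
        "evidence": "Chart shows 5 months instead of the requested 3",
    },
    {"text": "Chart has a title", "passed": True, "evidence": "Title present"},
]
print(build_critique(grading))
```

The more specific the evidence string, the more the resulting note tells the agent. An item whose evidence is just “Output was incorrect” produces a critique line that is equally unhelpful.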
What’s next?
- Giving Feedback — For when you want to add human notes to the judge’s context.
- Eval Workflow — The full picture of how all the pieces connect.