SE
Skill Eval

🪄 Auto-Fixing Failing Evals

Sometimes your skill is almost there — one assertion keeps failing, and you know exactly what the agent got wrong. Instead of tweaking your SKILL.md and re-running the whole loop by hand, let skill-eval handle it for you!

Meet the --fix flag. It turns skill-eval loop from a single-pass pipeline into an iterative refinement loop. The judge’s feedback becomes the agent’s critique, and the agent gets another crack at it — automatically!


How it works 🔄

Normally, skill-eval loop does run → grade → benchmark and stops. With --fix, it adds an extra phase:

run → grade → fix (repeat until passes) → benchmark

Here’s the play-by-play:

  1. Run & grade as usual — all evals execute, the judge scores them.
  2. Find the failures — any with-skill eval that didn’t pass every assertion gets flagged.
  3. Extract the critique — the judge’s evidence from each FAIL becomes the agent’s “here’s what you got wrong” note.
  4. Re-run with feedback — the agent tries again, this time with the critique in its prompt.
  5. Re-grade — the judge evaluates the new attempt.
  6. Keep going until…
    • All assertions pass ✅
    • The same assertions fail twice in a row (plateau detected — time to try a different approach!)
    • The attempt budget runs out (default: 3 attempts)

The best attempt wins, and its grading replaces the original!


Using it 🚀

It couldn’t be simpler:

skill-eval loop --fix

That’s it! skill-eval will refine each failing eval up to 3 times by default. Want more patience?

skill-eval loop --fix --max-fix-attempts 5

The baseline path is never fixed — it’s your control group. Only the with-skill side gets the auto-refinement treatment.


What you’ll see 👀

During the fix phase, skill-eval reports every attempt:

[3/4] Auto-fixing failed evals...
  eval 1: already passing, skipping
  eval 2: 1/2 failed — fixing...
    attempt 1: 1/2 passed
    attempt 2: 2/2 passed ✓
    best: attempt 2 (100% pass)

And inside your workspace, each fix attempt gets its own directory:

iteration-1/
  eval-2/
    with_skill/
      outputs/          ← initial attempt
      grading.json      ← best result (promoted from fix-2)
      fix-2/
        outputs/        ← first fix attempt
        grading.json
        timing.json
      fix-3/
        outputs/        ← second fix attempt (the winner!)
        grading.json
        timing.json
      fix-results.json  ← full trajectory log

💡 The initial grading is recorded as attempt 1, so fix directories start at fix-2.


When to use --fix vs doing it yourself 🤔

Use --fix when… Iterate manually when…
The fix is obvious — the agent just needs a nudge The core skill instructions are wrong
You’re iterating fast and want a quick improvement You need to rethink your assertions
One or two assertions keep failing predictably The whole approach needs re-architecting
You want to see if a small prompt tweak resolves it You’re adding new test cases or changing expected output

Think of --fix as your fast-feedback shortcut. It’s not a replacement for careful skill design — it’s a turbo boost for those “almost there” moments! 🏎️


Where the critique comes from 🔍

The critique is extracted directly from the judge’s evidence field. If your grading prompt produces clear, specific evidence (like “Chart shows 5 months instead of the requested 3”), the agent gets actionable feedback. Vague evidence (“Output was incorrect”) leads to vague fixes.

That’s another reason to write good assertions and a solid grading prompt — the fix loop is only as smart as your judge!


What’s next?

  • Giving Feedback — For when you want to add human notes to the judge’s context.
  • Eval Workflow — The full picture of how all the pieces connect.