Skill Eval

📊 Reading Your Results

After skill-eval loop finishes, two kinds of file hold your results: benchmark.json for the big picture, and a grading.json per eval for the detail. Here’s how to read them like a pro!


The headline: benchmark.json 🏆

This is your scoreboard. Open it with:

cat <skill-name>-workspace/iteration-1/benchmark.json | jq .

For a single-model run, you’ll see something like this:

{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.89, "stddev": 0.05 },
      "time_seconds": { "mean": 18.4, "stddev": 2.1 },
      "tokens": { "mean": 1250, "stddev": 120 }
    },
    "baseline": {
      "pass_rate": { "mean": 0.56, "stddev": 0.08 },
      "time_seconds": { "mean": 14.2, "stddev": 1.5 },
      "tokens": { "mean": 980, "stddev": 90 }
    },
    "delta": {
      "pass_rate": 0.33,
      "time_seconds": 4.2,
      "tokens": 270
    }
  },
  "generated_at": "2026-06-25T16:30:00Z"
}

What each field means

Field | What it’s telling you
run_summary.with_skill | Average stats across all evals when your skill was active
run_summary.baseline | Same stats, but without your skill (the control group)
run_summary.delta | The with_skill mean minus the baseline mean for each metric. The pass_rate delta is the number you’re trying to maximise!
pass_rate | Fraction of assertions that passed (mean and stddev)
time_seconds | Wall-clock time per eval, in seconds (mean and stddev)
tokens | Total tokens consumed per eval (mean and stddev)

💡 A positive pass_rate delta is the goal. If it’s zero or negative, your skill isn’t helping (or it’s actively hurting). Time to revisit your instructions!
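
As a quick sanity check, you can recompute the delta block from the two summaries yourself. The sketch below is a minimal illustration, assuming benchmark.json has exactly the shape of the sample above; compute_delta is a made-up helper name for this example, not part of skill-eval.

```python
import json

# Sample data copied from the benchmark.json example above.
BENCHMARK_JSON = """
{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.89, "stddev": 0.05 },
      "time_seconds": { "mean": 18.4, "stddev": 2.1 },
      "tokens": { "mean": 1250, "stddev": 120 }
    },
    "baseline": {
      "pass_rate": { "mean": 0.56, "stddev": 0.08 },
      "time_seconds": { "mean": 14.2, "stddev": 1.5 },
      "tokens": { "mean": 980, "stddev": 90 }
    }
  }
}
"""


def compute_delta(run_summary):
    """Hypothetical helper: with_skill mean minus baseline mean, per metric."""
    with_skill = run_summary["with_skill"]
    baseline = run_summary["baseline"]
    return {
        # Round to hide floating-point noise such as 0.32999999999999996.
        metric: round(with_skill[metric]["mean"] - baseline[metric]["mean"], 4)
        for metric in with_skill
    }


delta = compute_delta(json.loads(BENCHMARK_JSON)["run_summary"])
print(delta)
```

If the recomputed pass_rate value is positive, the skill is ahead of the baseline on this run.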

Tracking progress across iterations 🔄

When a previous iteration exists, benchmark.json also includes a previous_iteration field and an iteration_delta:

{
  "previous_iteration": 1,
  "iteration_delta": {
    "pass_rate": 0.05,
    "time_seconds": 0.2,
    "tokens": -15
  }
}

iteration_delta is the current iteration’s delta minus the previous iteration’s delta. A positive pass_rate means your skill’s advantage over the baseline grew since the last run. A negative value means the advantage shrank, which is a signal that a recent change may have hurt performance. Keep an eye on this to make sure you’re moving in the right direction! 📈
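
The subtraction can be sketched in a few lines of Python. This is only an illustration of the definition above: the function name and the previous-iteration values are hypothetical, chosen so the result matches the sample iteration_delta shown earlier.

```python
def iteration_delta(current_delta, previous_delta):
    """Current iteration's delta minus the previous iteration's delta, per metric."""
    return {
        # Round to hide floating-point noise in the subtraction.
        metric: round(current_delta[metric] - previous_delta[metric], 4)
        for metric in current_delta
    }


# Hypothetical delta blocks from two consecutive iterations (illustrative values only).
previous = {"pass_rate": 0.28, "time_seconds": 4.0, "tokens": 285}
current = {"pass_rate": 0.33, "time_seconds": 4.2, "tokens": 270}

change = iteration_delta(current, previous)
improved = change["pass_rate"] > 0  # advantage over baseline grew
print(change, "improved" if improved else "not improved")
```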


Per-eval breakdown 🔬

For the verdict on every individual eval, dig into the per-eval directories rather than benchmark.json. Each eval directory has its own grading.json:

cat <skill-name>-workspace/iteration-1/eval-1/with_skill/grading.json | jq .

Look for these patterns:

Pattern | What it means
With-skill PASS, baseline FAIL | 🌟 Your skill turned a failure into a success. This is what you’re looking for!
Both PASS or both FAIL | Neither better nor worse. The skill made no visible difference here.
With-skill FAIL, baseline PASS | ⚠️ The skill may have over-constrained the agent. Your instructions may be too prescriptive.
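
If you have many evals, the comparison in the table can be automated. The sketch below is a minimal illustration that takes the two overall verdicts as plain strings; it assumes the baseline run is graded with the same PASS/FAIL values as the with-skill run, and classify_eval is a hypothetical name, not a skill-eval function.

```python
def classify_eval(with_skill_overall, baseline_overall):
    """Map a pair of overall verdicts ("PASS" or "FAIL") to a pattern from the table."""
    if with_skill_overall == "PASS" and baseline_overall == "FAIL":
        return "skill helped"
    if with_skill_overall == "FAIL" and baseline_overall == "PASS":
        return "skill hurt"
    # Both PASS or both FAIL.
    return "no visible difference"


for pair in [("PASS", "FAIL"), ("FAIL", "FAIL"), ("FAIL", "PASS")]:
    print(pair, "->", classify_eval(*pair))
```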

The detail: grading.json 🧑‍⚖️

benchmark.json tells you that something failed. grading.json tells you why:

cat <skill-name>-workspace/iteration-1/eval-1/with_skill/grading.json | jq .

{
  "overall": "FAIL",
  "assertions": [
    {
      "assertion": "Output includes a bar chart image",
      "result": "PASS"
    },
    {
      "assertion": "Chart shows exactly 3 months",
      "result": "FAIL",
      "reasoning": "Got 5 months — top-3 filter was not applied"
    },
    {
      "assertion": "Both axes are labeled",
      "result": "PASS"
    }
  ]
}

The reasoning field is gold. It’s the judge’s explanation for the verdict — read it carefully before updating your skill instructions or your feedback.
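
To pull out just the failures and their explanations, something like the following Python sketch would work. It assumes grading.json has the shape of the sample above and that reasoning may be absent on some assertions; failed_assertions is a hypothetical helper name.

```python
import json

# Sample data copied from the grading.json example above.
GRADING_JSON = """
{
  "overall": "FAIL",
  "assertions": [
    { "assertion": "Output includes a bar chart image", "result": "PASS" },
    {
      "assertion": "Chart shows exactly 3 months",
      "result": "FAIL",
      "reasoning": "Got 5 months — top-3 filter was not applied"
    },
    { "assertion": "Both axes are labeled", "result": "PASS" }
  ]
}
"""


def failed_assertions(grading):
    """Return (assertion, reasoning) pairs for every assertion that failed."""
    return [
        (a["assertion"], a.get("reasoning", "(no reasoning recorded)"))
        for a in grading["assertions"]
        if a["result"] == "FAIL"
    ]


failures = failed_assertions(json.loads(GRADING_JSON))
for assertion, reasoning in failures:
    print(f"FAIL: {assertion}\n  why: {reasoning}")
```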


Common patterns to look for 🧩

All evals fail, with and without skill
Your assertions might be too strict, or your eval prompts might need more context. Try relaxing one assertion at a time to isolate the issue.

Baseline always passes, with-skill always fails
Your skill instructions are constraining the agent in a way that’s breaking something. Look for overly prescriptive steps in your SKILL.md.

With-skill passes, baseline fails (the dream!)
This is exactly the signal you want. Note which instructions drove the improvement, because that’s your skill earning its keep.

Skill is slower but passes more
A timing trade-off. Usually fine! But if the extra time is significant, check whether your skill is prompting unnecessary extra steps.
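
The first three patterns can be spotted across a whole run with a small script. The sketch below is only an illustration under stated assumptions: it takes one (with-skill, baseline) pair of overall verdicts per eval, already collected by you, and diagnose is a hypothetical name rather than anything skill-eval provides.

```python
def diagnose(verdicts):
    """verdicts: list of (with_skill_overall, baseline_overall) pairs, one per eval."""
    if not verdicts:
        raise ValueError("need at least one eval to diagnose")
    with_skill = [w for w, _ in verdicts]
    baseline = [b for _, b in verdicts]
    if all(v == "FAIL" for v in with_skill + baseline):
        return "all fail: check assertions and eval prompts"
    if all(v == "PASS" for v in baseline) and all(v == "FAIL" for v in with_skill):
        return "skill hurts: look for overly prescriptive steps"
    if all(v == "PASS" for v in with_skill) and all(v == "FAIL" for v in baseline):
        return "skill helps: note which instructions drove it"
    # Anything else needs a per-eval look at the reasoning fields.
    return "mixed: read each grading.json"


print(diagnose([("PASS", "FAIL"), ("PASS", "FAIL")]))
```

A "mixed" result is the common case, and the per-eval reasoning fields are where to look next.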


Iterating from here 🔄

Once you’ve read your results:

  1. Add feedback for any failing evals — see Giving Feedback for how to write notes that actually help.
  2. Tweak your skill instructions based on what you learned from the reasoning fields.
  3. Run again to compare: skill-eval loop --baseline previous creates an iteration-2/ alongside iteration-1/ so you can track progress over time.

The numbers will get better. Keep going! 🚀