📊 Reading Your Results
After skill-eval loop finishes, two files tell you everything you need to know: benchmark.json for the big picture, and grading.json for the detail. Here’s how to read them like a pro!
The headline: benchmark.json 🏆
This is your scoreboard. Open it with:
cat <skill-name>-workspace/iteration-1/benchmark.json | jq .
For a single-model run, you’ll see something like this:
{
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.89, "stddev": 0.05 },
      "time_seconds": { "mean": 18.4, "stddev": 2.1 },
      "tokens": { "mean": 1250, "stddev": 120 }
    },
    "baseline": {
      "pass_rate": { "mean": 0.56, "stddev": 0.08 },
      "time_seconds": { "mean": 14.2, "stddev": 1.5 },
      "tokens": { "mean": 980, "stddev": 90 }
    },
    "delta": {
      "pass_rate": 0.33,
      "time_seconds": 4.2,
      "tokens": 270
    }
  },
  "generated_at": "2026-06-25T16:30:00Z"
}
What each field means
| Field | What it’s telling you |
|---|---|
| run_summary.with_skill | Average stats across all evals when your skill was active |
| run_summary.baseline | Same stats, but without your skill (the control group) |
| run_summary.delta | The difference: this is the number you’re trying to maximise! |
| pass_rate | Fraction of assertions that passed (mean and stddev) |
| time_seconds | Average wall-clock time per eval |
| tokens | Average total tokens consumed per eval |
💡 A positive pass_rate delta is the goal. If it’s +0% or negative, your skill isn’t helping (or it’s actively hurting). Time to revisit your instructions!
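If you want to sanity-check the scoreboard yourself: in the sample above, each delta value lines up with the with_skill mean minus the baseline mean. Here is a minimal Python sketch of that arithmetic. The compute_delta helper is hypothetical (it is not part of skill-eval), and the subtraction rule is inferred from the sample numbers rather than from a spec.

```python
import json

METRICS = ("pass_rate", "time_seconds", "tokens")


def compute_delta(run_summary: dict) -> dict:
    """Recompute with_skill mean minus baseline mean for each metric.

    Assumption: delta is a plain difference of means, as the sample suggests.
    """
    return {
        metric: round(
            run_summary["with_skill"][metric]["mean"]
            - run_summary["baseline"][metric]["mean"],
            4,  # rounding hides floating-point noise such as 0.32999...
        )
        for metric in METRICS
    }


# Values copied from the benchmark.json sample above.
sample = json.loads("""
{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.89, "stddev": 0.05},
      "time_seconds": {"mean": 18.4, "stddev": 2.1},
      "tokens": {"mean": 1250, "stddev": 120}
    },
    "baseline": {
      "pass_rate": {"mean": 0.56, "stddev": 0.08},
      "time_seconds": {"mean": 14.2, "stddev": 1.5},
      "tokens": {"mean": 980, "stddev": 90}
    }
  }
}
""")

print(compute_delta(sample["run_summary"]))
```

For a real run, you would load your own benchmark.json with json.load instead of the inline sample.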
Tracking progress across iterations 🔄
When a previous iteration exists, benchmark.json also includes a previous_iteration field and an iteration_delta:
{
  "previous_iteration": 1,
  "iteration_delta": {
    "pass_rate": 0.05,
    "time_seconds": 0.2,
    "tokens": -15
  }
}
iteration_delta is the current iteration’s delta minus the previous iteration’s delta. A positive pass_rate means your skill improved relative to baseline since the last run. A negative value means it got worse — a signal that a recent change may have hurt performance. Keep an eye on this to make sure you’re moving in the right direction! 📈
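Here is a worked example of that subtraction as a small Python sketch. The current delta comes from the benchmark.json sample earlier on this page; the previous iteration's delta is made up for illustration (chosen so the result matches the iteration_delta sample), and the iteration_delta function name is hypothetical.

```python
def iteration_delta(current_delta: dict, previous_delta: dict) -> dict:
    """Current iteration's delta minus the previous iteration's delta, per metric."""
    return {
        metric: round(current_delta[metric] - previous_delta[metric], 4)
        for metric in current_delta
    }


# Delta from the current iteration (taken from the benchmark.json sample).
current = {"pass_rate": 0.33, "time_seconds": 4.2, "tokens": 270}

# Hypothetical delta from the previous iteration, for illustration only.
previous = {"pass_rate": 0.28, "time_seconds": 4.0, "tokens": 285}

result = iteration_delta(current, previous)
print(result)
```

With these numbers, the skill's pass-rate advantage over baseline grew by 0.05, while its token overhead shrank by 15.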
Per-eval breakdown 🔬
For the verdict on every individual eval, dig into the per-eval directories rather than benchmark.json. Each eval directory has its own grading.json:
cat <skill-name>-workspace/iteration-1/eval-1/with_skill/grading.json | jq .
Look for these patterns:
| Pattern | What it means |
|---|---|
| With-skill PASS, baseline FAIL | 🌟 Your skill turned a failure into a success — this is what you’re looking for! |
| Both PASS or both FAIL | Neither better nor worse — skill made no visible difference here. |
| With-skill FAIL, baseline PASS | ⚠️ Skill over-constrained the agent. Your instructions may be too prescriptive. |
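If you have a lot of evals, you might prefer to tally these patterns in a script. The sketch below is one way to do it in Python, assuming you have already collected the overall verdict ("PASS" or "FAIL") for each eval from its with-skill and baseline grading. The classify_eval helper, the label names, and the sample verdicts are all hypothetical.

```python
from collections import Counter


def classify_eval(with_skill: str, baseline: str) -> str:
    """Map a pair of overall verdicts to one of the three patterns in the table."""
    if with_skill == "PASS" and baseline == "FAIL":
        return "improved"      # skill turned a failure into a success
    if with_skill == "FAIL" and baseline == "PASS":
        return "regressed"     # skill may be over-constraining the agent
    return "no_change"         # both PASS or both FAIL


# Hypothetical verdicts: eval name -> (with-skill overall, baseline overall).
verdicts = {
    "eval-1": ("PASS", "FAIL"),
    "eval-2": ("PASS", "PASS"),
    "eval-3": ("FAIL", "PASS"),
    "eval-4": ("FAIL", "FAIL"),
}

summary = Counter(classify_eval(ws, base) for ws, base in verdicts.values())
print(dict(summary))
```

Any eval that lands in the "regressed" bucket is worth opening first.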
The detail: grading.json 🧑‍⚖️
benchmark.json tells you that something failed. grading.json tells you why:
cat <skill-name>-workspace/iteration-1/eval-1/with_skill/grading.json | jq .
{
  "overall": "FAIL",
  "assertions": [
    {
      "assertion": "Output includes a bar chart image",
      "result": "PASS"
    },
    {
      "assertion": "Chart shows exactly 3 months",
      "result": "FAIL",
      "reasoning": "Got 5 months — top-3 filter was not applied"
    },
    {
      "assertion": "Both axes are labeled",
      "result": "PASS"
    }
  ]
}
The reasoning field is gold. It’s the judge’s explanation for the verdict — read it carefully before updating your skill instructions or your feedback.
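To jump straight to the failures, you could pull out just the failing assertions and their reasoning. This is a minimal Python sketch, assuming the grading.json shape shown above. The failed_assertions helper is hypothetical, and it treats reasoning as optional because the sample only includes it on the failing assertion.

```python
def failed_assertions(grading: dict) -> list[tuple[str, str]]:
    """Return (assertion, reasoning) pairs for every assertion that failed."""
    return [
        (item["assertion"], item.get("reasoning", ""))
        for item in grading.get("assertions", [])
        if item.get("result") == "FAIL"
    ]


# Same content as the grading.json sample above.
grading = {
    "overall": "FAIL",
    "assertions": [
        {"assertion": "Output includes a bar chart image", "result": "PASS"},
        {
            "assertion": "Chart shows exactly 3 months",
            "result": "FAIL",
            "reasoning": "Got 5 months — top-3 filter was not applied",
        },
        {"assertion": "Both axes are labeled", "result": "PASS"},
    ],
}

for assertion, reasoning in failed_assertions(grading):
    print(f"FAIL: {assertion}\n  why: {reasoning}")
```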
Common patterns to look for 🧩
All evals fail, with and without skill
Your assertions might be too strict, or your eval prompts might need more context. Try relaxing one assertion at a time to isolate the issue.
Baseline always passes, with-skill always fails
Your skill instructions are constraining the agent in a way that’s breaking something. Look for overly prescriptive steps in your SKILL.md.
With-skill passes, baseline fails (the dream!)
This is exactly the signal you want. Note which instructions drove the improvement: that’s your skill earning its keep.
Skill is slower but passes more
A timing trade-off. Usually fine! But if the extra time is significant, check whether your skill is prompting extra unnecessary steps.
Iterating from here 🔄
Once you’ve read your results:
- Add feedback for any failing evals — see Giving Feedback for how to write notes that actually help.
- Tweak your skill instructions based on what you learned from the reasoning fields.
- Run again to compare: skill-eval loop --baseline previous creates an iteration-2/ alongside iteration-1/ so you can track progress over time.
The numbers will get better. Keep going! 🚀