Skill Eval

🛠️ Handy Subcommands

Command    What it does
init       Sets up your evals/evals.json and workspace. Add --global for global config.
run        Executes all evals. Use --eval <id> for just one, --baseline previous to snapshot, or --resume to pick up where you left off.
grade      Asks the LLM to grade your assertions. Add --benchmark to auto-aggregate the stats.
benchmark  Wraps up all your grading results into a neat benchmark.json.
loop       Does it all: run → grade → benchmark! Add --fix to auto-refine, --models to compare agents.

The --models flag 🔬

Want to know if your skill helps every agent, not just your default? Pass --models with a comma-separated list of agent:model pairs and skill-eval will run every eval against each one:

skill-eval loop --models pi:claude-sonnet,claude,copilot

Each model gets its own with-skill + baseline run, producing per-model stats in benchmark.json with a best_model and worst_model. Spot which agent your skill helps most, and which one needs work! 🔍
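To make the agent:model format concrete, here is a minimal Python sketch of how such a list could be split into pairs. It is not skill-eval's actual parser: the AgentSpec type and parse_models function are hypothetical names, and treating a missing model as "use the agent's default" is an assumption.

```python
from typing import List, NamedTuple, Optional


class AgentSpec(NamedTuple):
    """One entry from a --models list (hypothetical type, for illustration only)."""
    agent: str
    model: Optional[str]  # None is assumed to mean "use the agent's default model"


def parse_models(spec: str) -> List[AgentSpec]:
    """Split a comma-separated list of agent[:model] entries into pairs."""
    result = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        agent, _, model = entry.partition(":")
        result.append(AgentSpec(agent, model or None))
    return result


for item in parse_models("pi:claude-sonnet,claude,copilot"):
    print(item.agent, item.model)
```

Entries without a colon, like claude and copilot above, come back with model set to None.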

💡 Evals run in batches of 2 to keep things snappy without hammering APIs. If you’ve got lots of evals × models, skill-eval warns you before firing off a barrage of agent invocations.

🛠️ Tired of typing it every time? Drop models: into your .skill-eval.yaml and --models becomes the default.

# .skill-eval.yaml
models:
  - agent: pi
    model: claude-sonnet-4-5
  - agent: claude
  - agent: copilot

The --fix flag 🪄

Tack --fix onto loop and skill-eval will automatically re-run any failing with-skill eval, feeding the judge’s feedback back to the agent as a critique. It’ll keep refining until every assertion passes, the score stops improving, or it hits the attempt limit.

skill-eval loop --fix                   # default: up to 3 fix attempts per eval
skill-eval loop --fix --max-fix-attempts 5   # crank it up if you're feeling ambitious!

Each fix attempt lands in fix-N/ inside the eval directory, with its own grading and timing. The best attempt wins and gets promoted to the main grading.json. If the same assertions fail twice in a row, it stops early: no point burning tokens on a plateau! 🏔️
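The stopping rules are easier to see in code. The Python sketch below is an illustration under stated assumptions, not skill-eval's implementation: a grading result is modelled as a (score, failing assertions) pair, and fix_loop and the attempt callback are hypothetical names.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical shape for a grading result: (score, names of failing assertions).
Grade = Tuple[float, FrozenSet[str]]


def fix_loop(
    initial: Grade,
    attempt: Callable[[int, Grade], Grade],
    max_attempts: int = 3,
) -> Tuple[int, Grade]:
    """Return (index, grade) of the best attempt; index 0 is the original run."""
    history: List[Grade] = [initial]
    best = 0
    for n in range(1, max_attempts + 1):
        previous = history[-1]
        if not previous[1]:
            break  # every assertion already passes
        current = attempt(n, previous)  # re-run, feeding back the last grading
        history.append(current)
        if current[0] > history[best][0]:
            best = n  # strictly better score: new best attempt
        if current[1] == previous[1]:
            break  # same assertions failed twice in a row: plateau
        if current[0] <= previous[0]:
            break  # score stopped improving
    return best, history[best]


# Canned gradings stand in for real agent runs.
canned = {
    1: (0.5, frozenset({"a", "b"})),
    2: (0.75, frozenset({"b"})),
    3: (0.75, frozenset({"b"})),
}
start = (0.25, frozenset({"a", "b", "c"}))
best_index, best_grade = fix_loop(start, lambda n, prev: canned[n])
print(best_index, best_grade[0])
```

In this run the third attempt fails the same assertion as the second, so the loop stops on the plateau and attempt 2 is the one that would be promoted.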

Pick up where you left off with --resume 🔄

Long runs sometimes get interrupted. skill-eval run writes a progress lockfile at <workspace>/iteration-N/.lock.json, so you can resume the latest unfinished iteration instead of starting over:

skill-eval run --resume
skill-eval loop --resume

If the latest iteration is already complete, --resume tells you there’s nothing to pick up. And skill-eval grade will refuse to grade an incomplete iteration, so finish it with --resume first!
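If you want to check from a script whether a workspace has unfinished work, something like the Python sketch below could do it. Only the <workspace>/iteration-N/.lock.json location comes from this README. The lockfile schema used here ({"evals": {"<id>": "done" | "pending"}}) is a made-up placeholder, and latest_iteration and pending_evals are hypothetical helpers, so inspect a real lockfile before relying on any field names.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import List, Optional


def latest_iteration(workspace: Path) -> Optional[Path]:
    """Return the iteration-N directory with the highest N, or None."""
    found = []
    for child in workspace.iterdir():
        match = re.fullmatch(r"iteration-(\d+)", child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    # Compare numerically so that iteration-10 sorts after iteration-2.
    return max(found, key=lambda pair: pair[0])[1] if found else None


def pending_evals(iteration: Path) -> List[str]:
    """List unfinished eval ids, using the HYPOTHETICAL schema described above."""
    lock = iteration / ".lock.json"
    if not lock.exists():
        return []
    state = json.loads(lock.read_text())
    return [eval_id for eval_id, status in state["evals"].items() if status != "done"]


# Demo against a throwaway workspace.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    for n in (1, 2, 10):
        (workspace / f"iteration-{n}").mkdir()
    lock_path = workspace / "iteration-10" / ".lock.json"
    lock_path.write_text(json.dumps({"evals": {"a": "done", "b": "pending"}}))
    latest = latest_iteration(workspace)
    print(latest.name, pending_evals(latest))
```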

Debug with --verbose 🐛

When something goes wrong, --verbose (or -v) prints the agent commands, durations, and raw output to stderr. It’s super handy for CI logs or when an agent fails mysteriously:

skill-eval run --verbose
skill-eval -v loop --models pi:claude-sonnet,claude

🔒 Verbose mode never prints secrets or the contents of your input files, just operational details.

Mix in deterministic matchers 🤖

For quick, repeatable checks, prefix assertions with file_exists:, contains_text:, or matches_text:. They run locally instead of burning tokens on the judge. See the Eval Workflow guide for examples!
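As a rough mental model, prefix dispatch could look like the Python sketch below. The three prefixes come from this README; everything else is an assumption: that contains_text: and matches_text: are checked against the agent's output, that matches_text: takes a regular expression, and that paths are relative to a working directory. check_assertion is a hypothetical function, not part of skill-eval.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def check_assertion(assertion: str, output: str, workdir: Path) -> Optional[bool]:
    """Return True/False for a deterministic matcher, or None to defer to the judge."""
    prefix, sep, arg = assertion.partition(":")
    arg = arg.strip()
    if not sep:
        return None  # no prefix at all: a free-text assertion
    if prefix == "file_exists":
        return (workdir / arg).exists()
    if prefix == "contains_text":
        return arg in output
    if prefix == "matches_text":
        return re.search(arg, output) is not None  # assumption: arg is a regex
    return None  # unknown prefix: treat as free text


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "report.md").write_text("# Summary\n")
    output = "Wrote report.md with 3 checks"
    for assertion in [
        "file_exists: report.md",
        "contains_text: report.md",
        r"matches_text: \d+ checks",
        "The report is well organised",
    ]:
        print(assertion, "->", check_assertion(assertion, output, workdir))
```

Anything that returns None here would still go to the LLM judge, so only the prefixed assertions skip the token cost.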