🛠️ Handy Subcommands

Command	What it does
`init`	Sets up your `evals/evals.json` and workspace. Add `--global` for global config.
`run`	Executes all evals. Use `--eval <id>` for just one, `--baseline previous` to snapshot, or `--resume` to pick up where you left off.
`grade`	Asks the LLM to grade your assertions. Add `--benchmark` to auto-aggregate the stats.
`benchmark`	Wraps up all your grading results into a neat `benchmark.json`.
`loop`	Does it all: run → grade → benchmark! Add `--fix` to auto-refine, `--models` to compare agents.

The `--model` flag 🔬

Want to know if your skill helps every agent, not just your default? Pass --models with a comma-separated list of agent:model pairs and skill-eval will run every eval against each one:

skill-eval loop --models pi:claude-sonnet,claude,copilot

Each model gets its own with-skill + baseline run, producing per-model stats in benchmark.json with a best_model and worst_model. Spot which agent your skill helps most — and which one needs work! 🔍

💡 Runs in batches of 2 to keep things snappy without hammering APIs. If you’ve got lots of evals × models, skill-eval warns you before firing off a barrage of agent invocations.

🛠️ Tired of typing it every time? Drop models: into your .skill-eval.yaml and --models becomes the default.

# .skill-eval.yaml
models:
  - agent: pi
    model: claude-sonnet-4-5
  - agent: claude
  - agent: copilot

The `--fix` flag 🪄

Tack --fix onto loop and skill-eval will automatically re-run any failing with-skill eval — feeding the judge’s feedback back to the agent as a critique. It’ll keep refining until every assertion passes, the score stops improving, or it hits the attempt limit.

skill-eval loop --fix                   # default: up to 3 fix attempts per eval
skill-eval loop --fix --max-fix-attempts 5   # crank it up if you're feeling ambitious!

Each fix attempt lands in fix-N/ inside the eval directory, with its own grading and timing. The best attempt wins and gets promoted to the main grading.json. If the same assertions fail twice in a row, it stops early — no point burning tokens on a plateau! 🏔️

Pick up where you left off with `--resume` 🔄

Long runs sometimes get interrupted. skill-eval run writes a progress lockfile at <workspace>/iteration-N/.lock.json, so you can resume the latest unfinished iteration instead of starting over:

skill-eval run --resume
skill-eval loop --resume

If the latest iteration is already complete, --resume tells you there’s nothing to pick up. And skill-eval grade will refuse to grade an incomplete iteration — finish it with --resume first!

Debug with `--verbose` 🐛

When something goes wrong, --verbose (or -v) prints the agent commands, durations, and raw output to stderr. It’s super handy for CI logs or when an agent fails mysteriously:

skill-eval run --verbose
skill-eval -v loop --models pi:claude-sonnet,claude

🔒 Verbose mode never prints secrets or the contents of your input files — just operational details.

Mix in deterministic matchers 🤖

For quick, repeatable checks, prefix assertions with file_exists:, contains_text:, or matches_text:. They run locally instead of burning tokens on the judge. See the Eval Workflow guide for examples!

🛠️ Handy Subcommands

The --model flag 🔬

The --fix flag 🪄

Pick up where you left off with --resume 🔄

Debug with --verbose 🐛

Mix in deterministic matchers 🤖

The `--model` flag 🔬

The `--fix` flag 🪄

Pick up where you left off with `--resume` 🔄

Debug with `--verbose` 🐛