🔬 Cross-Model Benchmarking
Your skill works great in pi. But does it help claude? What about copilot? If you’re building a skill meant to be runtime-agnostic, you need to know it pulls its weight everywhere — including across the models each agent supports.
Cross-model benchmarking runs every eval against every agent you specify — in one command. You get a clean per-model breakdown so you can spot exactly where to focus your tuning.
How it works 🎯
Pass --models with a comma-separated list of agent:model pairs:
skill-eval loop --models pi:claude-sonnet,claude:opus-4-8,copilot:gpt-5
skill-eval runs the full run → grade → benchmark cycle for each model. Each one gets its own with-skill and baseline run, and the results land side-by-side in benchmark.json.
What you get 📊
The benchmark tells you at a glance who’s winning:
{
"models": {
"pi-claude-sonnet": { "delta": { "pass_rate": 0.33 } },
"claude-opus-4-8": { "delta": { "pass_rate": 0.17 } },
"copilot-gpt-5": { "delta": { "pass_rate": 0.33 } }
},
"best_model": "pi-claude-sonnet",
"worst_model": "claude-opus-4-8"
}
Three signals jump out:
best_model/worst_model— no scanning required, skill-eval tells you- Per-model delta — is your skill helping pi more than claude? You’ll see it
- The gap — if one model’s delta is half the others, that’s where to tune your skill next
The iteration loop 🔄
Here’s the real power move. Run with --models across iterations:
Run 1: pi-claude-sonnet +33%, claude-opus-4-8 +17%, copilot-gpt-5 +33%
→ Claude's weak at chart-related assertions. Tweak your SKILL.md.
Run 2: pi-claude-sonnet +34%, claude-opus-4-8 +28%, copilot-gpt-5 +32%
→ Claude improved! Copilot slipped slightly. Another tweak.
Run 3: pi-claude-sonnet +34%, claude-opus-4-8 +31%, copilot-gpt-5 +36%
→ All above 30%. Converging nicely. Ship it! 🚀
Over time, you’ll spot patterns in how different models interpret your skill instructions — and learn to write skills that work well universally.
Composing with --fix 🪄
Cross-model and auto-fix work together seamlessly:
skill-eval loop --models pi:claude-sonnet,claude:opus-4-8 --fix --max-fix-attempts 2
Each model gets its own independent fix phase. If pi passes on the first attempt but claude needs two fix rounds before converging, you’ll see that in the output — and in the benchmark. That tells you claude’s baseline needs more attention than pi’s.
Config for persistence 🛠️
Tired of typing the same --models every time? Set it once in your config:
# .skill-eval.yaml
models:
- agent: pi
model: claude-sonnet-4-5
- agent: claude
model: opus-4-8
- agent: copilot
model: gpt-5
Config models become the default for every run, grade, and loop command. CLI --models always wins when both are present.
⚠️ Cost warning! skill-eval counts your total agent invocations ahead of time. If you’re about to launch 18 runs (3 evals × 3 models × 2 configs), it’ll ask for confirmation first. No surprise bills!
What’s next?
- Auto-Fixing — Combine cross-model with auto-refinement for the ultimate iteration loop.
- Reading Your Results — Make sense of all those deltas and per-eval breakdowns.
- Eval Workflow — The full picture of how everything connects.