Measuring what harness augmentations actually add.
Rigorous, reproducible benchmarks across real software engineering tasks. No vibes, no cherry-picked demos — just pass rates.
No benchmark results yet
Run your first benchmark to populate the leaderboard.
bun run bench --suite swe-polybench-500 \
  --filter language:TypeScript \
  --variants claude-vanilla,claude-superpowers