Measuring what harness augmentations actually add.

Rigorous, reproducible benchmarks across real software engineering tasks. No vibes, no cherry-picked demos — just pass rates.

No benchmark results yet

Run your first benchmark to populate the leaderboard.

bun run bench --suite swe-polybench-500 \
  --filter language:TypeScript \
  --variants claude-vanilla,claude-superpowers
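
Once a run finishes, the number that matters is the pass-rate gap between variants. A minimal sketch of that arithmetic follows, assuming a hypothetical `TaskResult` record with one entry per task and variant. The record shape, the `passRates` helper, and the sample data are illustrative assumptions, not the actual output format of the bench command.

```typescript
// Hypothetical result record: the real output format of the bench run is not shown here.
type TaskResult = { variant: string; taskId: string; passed: boolean };

// Pass rate per variant = tasks passed / tasks attempted.
function passRates(results: TaskResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.variant) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(r.variant, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, variant) => rates.set(variant, t.passed / t.total));
  return rates;
}

// Made-up sample data, for illustration only.
const sample: TaskResult[] = [
  { variant: "claude-vanilla", taskId: "task-1", passed: true },
  { variant: "claude-vanilla", taskId: "task-2", passed: false },
  { variant: "claude-superpowers", taskId: "task-1", passed: true },
  { variant: "claude-superpowers", taskId: "task-2", passed: true },
];

const rates = passRates(sample);
const vanilla = rates.get("claude-vanilla") ?? 0;
const augmented = rates.get("claude-superpowers") ?? 0;

// What the augmentation adds, in pass-rate points on the same task set.
const delta = augmented - vanilla;
console.log({ vanilla, augmented, delta }); // vanilla: 0.5, augmented: 1, delta: 0.5
```

Comparing variants on the same task set is what makes the difference attributable to the harness rather than to task selection.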