Methodology

How AugmentBench measures harness augmentation value.

01

Overview

AugmentBench evaluates AI coding harnesses on realistic software engineering tasks drawn from open-source repositories. Each harness is tested under identical conditions: same base model, same task specification, same grading pipeline. The benchmark isolates the contribution of augmentation layers — planning prompts, tool scaffolding, memory systems — from the base model's capability.

02

Harness Variants

Four harness configurations are benchmarked. Vanilla variants use the model with minimal system prompting and standard tool access. Augmented variants (Superpowers, OMO) layer on structured planning protocols, extended context management, and specialist tool routing. The base model is held constant across paired vanilla/augmented runs to isolate the augmentation delta.

03

Task Design

Tasks are sourced from real GitHub repositories across four engineering personas: backend, full-stack, ML, and DevOps. Each task is an atomic unit of work with a verifiable acceptance criterion — passing automated tests, correct API contract, or valid infrastructure configuration. Tasks are tiered by estimated complexity: T1 (1-3 hour human estimate), T2 (half-day), T3 (multi-day). Tier assignments are validated via human expert review.
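A task record of this shape can be sketched as a small Python dataclass. This is an illustrative model only, assuming the fields described above; the names (`Task`, `Persona`, `acceptance_cmd`) are ours, not AugmentBench's actual schema:

```python
from dataclasses import dataclass
from enum import Enum


class Persona(Enum):
    """The four engineering personas tasks are sourced across."""
    BACKEND = "backend"
    FULL_STACK = "full-stack"
    ML = "ml"
    DEVOPS = "devops"


@dataclass(frozen=True)
class Task:
    """One atomic unit of work with a verifiable acceptance criterion."""
    repo_url: str        # source GitHub repository
    persona: Persona
    tier: str            # "T1" (1-3 h), "T2" (half-day), "T3" (multi-day)
    acceptance_cmd: str  # command that must exit 0, e.g. a test suite run
```

Keeping the record frozen (immutable) reflects that a task's specification is fixed before evaluation and shared identically across all harness variants.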

04

Metrics

Pass@1 measures the probability a single sample passes all acceptance criteria. Pass@3 estimates success given three attempts, using the unbiased estimator from Chen et al. (2021). Recovery rate measures successful task completion after an initial failed attempt with error feedback. Planning rate measures whether the harness produces a structured plan before execution. Efficiency metrics (turns, tokens, cost, time) reflect median values across successful runs.
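The unbiased pass@k estimator from Chen et al. (2021) computes, from n total samples of which c passed, the probability that at least one of k drawn samples passes: pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch (the function name is ours):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn for the task
    c: samples that passed all acceptance criteria
    k: attempt budget being estimated
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-sample draw contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With the benchmark's 50 runs per task-harness pair, pass@1 is simply c/50, while pass@3 averages over all possible 3-sample draws rather than relying on one arbitrary triple.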

05

Statistical Rigor

Each task-harness combination is evaluated over 50 independent runs with fresh context. Confidence intervals are computed using Wilson score intervals at 95% confidence. Statistical significance between harnesses is assessed via paired permutation tests. Results are reported only when the sample size is sufficient to achieve power ≥ 0.8 at a minimum detectable effect of 5 percentage points.
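The two statistical procedures above can be sketched as follows. These are illustrative implementations under stated assumptions: Wilson intervals on per-run pass/fail counts, and a sign-flipping paired permutation test on per-task score differences between two harnesses; the function names and the permutation count are ours:

```python
import random
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)


def paired_permutation_test(a: list[float], b: list[float],
                            n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired permutation test on per-task differences a[i] - b[i].

    Under the null, each task's difference is equally likely to have either
    sign, so we randomly flip signs and count how often the permuted mean
    difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

The Wilson interval is preferred over the normal approximation because it behaves sensibly near 0% and 100% pass rates, which occur often at the T1 and T3 extremes; the permutation test makes no distributional assumption about the paired differences.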

06

Isolation

To prevent contamination, tasks are sourced from repositories created after the model training cutoff. Evaluation is fully automated with no human in the loop during grading. Each run is executed in an isolated environment with deterministic seeds where applicable. Model temperature is held constant at 0 for reproducibility comparisons; sampling runs use temperature 1.0 for pass@k estimation.