Rigor is What You Want
You just don't know it yet
Summary
The AI boom has created a rush to adopt tools and methodologies without measuring actual impact. 63% of IT leaders cite FOMO as their primary driver for AI investment (ABBYY 2024), yet 74% of companies can't demonstrate tangible value (BCG 2024). Rigorous evaluation breaks this cycle by measuring actual impact and enabling data-driven decisions.
The Question
In the rush to adopt AI-powered tools and methodologies, are we measuring actual impact — or just following the hype?
Motivation
The Hype-Driven Adoption Problem
The current AI boom shows a consistent pattern: organizations adopt tools based on hype, not evidence.
The perception-reality gap: A 2025 randomized controlled trial by METR found that experienced open-source developers using AI tools (Cursor Pro, Claude 3.5/3.7) were actually 19% slower at completing tasks — despite predicting a 24% speedup beforehand (Becker et al., 2025). Developers spent more time reviewing AI outputs, prompting, and waiting than they saved on coding.
The aggregate evidence: A California Management Review meta-analysis pooling 371 estimates (2019-2024) found no robust relationship between AI adoption and productivity once publication bias and methodological issues were controlled (CMR, 2025).
The FOMO effect: Despite this evidence gap, adoption continues. ABBYY's 2024 survey found 63% of IT leaders cite fear of being "left behind" as a driver of AI investment. Organizations invested an average of $879,000 in AI tools while only 4% reported meaningful innovation outcomes.
The Homogenization Risk
When everyone adopts the same tools without measuring actual outcomes, the field converges on whatever the current hype promotes — regardless of whether it works.
Zhang et al. (2025) found that LLM-generated content is systematically more homogeneous than human-generated content, masking underlying variation. The same pattern applies to methodology: when teams follow popular advice instead of measuring impact in their own context, the entire industry homogenizes around unverified practices.
Rigorous evaluation breaks this cycle. Different teams measuring actual outcomes will discover different optimal approaches based on their context — not what a vendor claimed or what a thought leader tweeted.
Why This Matters
Decisions compound. Bad methodology choices bake into everything built on them. When an industry collectively adopts based on hype rather than evidence:
- Suboptimal practices become "best practices"
- Contrary evidence gets dismissed as anecdotal
- The cost of being wrong scales with adoption
The Hypothesis
If we replace hype-driven adoption with rigorous evaluation, we can make methodology decisions based on measured impact rather than market pressure.
We need a way to produce hard data about what actually works — not what's popular.
Guide-level explanation
When facing a methodology question ("Does X improve outcomes?"), follow this process:
- Hypothesis — State a clear, falsifiable claim
- Experiment — Design controlled conditions with ground truth
- Metrics — Define what we measure and how
- Analysis — Let the data speak
- Decision — Act on evidence, not preference
This LEP validates the approach by testing a simple question: Does iteration improve outcomes?
Reference-level explanation
Experiment Design
Question: Does iteration (Ralph-style self-review) improve code generation outcomes?
Benchmark: HumanEval (164 Python coding problems)
- Ground truth via test suites
- Varied difficulty
- Standard, comparable benchmark
Conditions:
| Strategy | Description |
|---|---|
| Single-shot | One attempt, submit result |
| Ralph-style | Up to 3 iterations with test feedback |
Metrics:
- Pass rate (binary correctness)
- Token consumption
- Iterations used
Results (Full HumanEval, 164 problems, Haiku)
| Strategy | Pass Rate | Avg Tokens | Multiplier |
|---|---|---|---|
| Single-shot | 86.6% (142/164) | 232 | 1x |
| Ralph-style | 98.8% (162/164) | 2,264 | ~10x |
Iteration recovered 20 of 22 failures. Two problems remained unsolved — the model's capability ceiling. Pattern consistent across problem difficulty ranges.
Findings
Primary finding: Iteration improved success from 87% to 99% at 10x the token cost.
The insight: Iteration is insurance, not optimization. It only helps the 13% of tasks at the edge of capability — recovering 91% of those failures. For the 87% that succeed anyway, it just burns tokens.
The Core Pattern
Total problems: 164
Single-shot solved: 142 (87%)
Single-shot failed: 22 (13%)
Of those 22 failures:
Iteration recovered: 20 (91%)
Still failed: 2 (9%)
The Decision Framework
if (cost_of_failure > 10x_token_cost):
use_iteration()
else:
use_single_shot()
When to Use Each
Single-shot (87% success) — Use when:
- Speed matters more than perfection
- Token budget is constrained
- Task is well within model capability
- Occasional failure is acceptable
Iteration (99% success) — Use when:
- Correctness is non-negotiable
- Task has known edge cases or complexity
- Cost of failure exceeds 10x token cost
- Critical path or production code
What Iteration Can't Fix
Two problems (HumanEval/80, HumanEval/130) failed both strategies. This tells us:
- Iteration is recovery, not capability expansion
- There's a ceiling (98.8%, not 100%)
- Some problems require reasoning the model can't perform
- More iterations won't help beyond capability limits
Limitations
- Single model: Haiku only (pattern may differ for Sonnet, Opus)
- Single benchmark: HumanEval (coding tasks only)
- Functional correctness: Didn't measure security, maintainability, or quality
- Single run: No statistical variance analysis
Drawbacks
- Cost: Running experiments costs API tokens and time
- Scope: Results may not generalize beyond the tested benchmark
- Complexity: Requires infrastructure for reproducible experiments
Rationale and alternatives
Why this approach?
- Scientific method is proven over centuries
- Falsifiable hypotheses prevent self-deception
- Quantitative data enables objective comparison
Alternatives considered:
- Expert opinion — Subject to bias, not reproducible
- Case studies — Anecdotal, not systematic
- A/B testing in production — Expensive, slow feedback
Impact of not doing this: Methodology debates remain tribal. Decisions made on vibes compound into scaled mistakes.
Prior art
Research on AI Hype and Adoption
| Work | Finding |
|---|---|
| METR RCT (Becker et al., 2025) | Experienced developers 19% slower with AI tools despite expecting 24% speedup. arXiv:2507.09089 |
| CMR Meta-Analysis (2025) | No robust AI-productivity relationship in 371 pooled estimates once publication bias controlled. CMR |
| ABBYY AI Trust Barometer (2024) | 63% of IT leaders cite FOMO as adoption driver; avg $879K invested despite uncertain ROI. ABBYY |
| BCG AI Survey (2024) | 74% of companies unable to demonstrate tangible value from AI investments. BCG |
| Zhang et al. (2025) | LLM outputs systematically more homogeneous than human outputs, masking variation. SAGE |
Benchmarks and Methodology
| Work | Relevance |
|---|---|
| HumanEval (OpenAI) | Code generation benchmark we build on |
| SWE-bench (Princeton) | Real-world agent evaluation |
| LMSYS Arena | Model comparison methodology |
| Rust RFCs | Proposal process we adapt |
Unresolved questions
- Does the pattern hold for other models (Sonnet, Opus)?
- Does it hold for non-coding tasks?
- What's the optimal iteration limit before diminishing returns?
Future possibilities
With validated measurement infrastructure:
- REP-002: Spec-driven frameworks vs baseline
- REP-003: Prompt engineering patterns
- REP-004: Agent architectures (ReAct, CoT, etc.)
- REP-005: Multi-agent coordination patterns
Each builds on proven methodology.