Experimentation Lab

Research experiments in collaborative intelligence. We test hypotheses, measure outcomes, and publish findings.

Published Findings

REP-002 Published

Mandates vs Motivations

Click to preview

Mandates vs Motivations

Hypothesis

Does prescribing HOW help or hurt, compared to explaining WHY?

Findings

Status: Dual signal detected, framing clarified

Signal detection (n=10 per condition) revealed two different signals depending on how we measure outcomes:

Metric	Best Condition	Finding
LLM-as-judge quality	full-autonomy (0.891)	Autonomy produces verbose, well-documented code
Secure-by-construction	principle-guided (80%)	WHAT+WHY produces fundamentally safer implementations

Key insight: The original grader measured perceived quality (LLM preference for documentation, structure). The approach analysis measures engineering decisions (safe parser vs code execution).

Critical finding: Prescribing HOW (highly-structured) produces 2.5x worse security outcomes than explaining WHAT+WHY (principle-guided).

The Discriminating Task

safe-calculator: Implement a calculator that safely evaluates arithmetic expressions.

Why this discriminates: There are two fundamental approaches:

Safe parser: Build a lexer/parser, never execute arbitrary code
Code execution: Use code evaluation with restrictions (fundamentally insecure)

A restricted code evaluator passes runtime security tests but is insecure by construction — Python introspection bypasses any restriction.

Results

1. LLM-as-Judge Quality Scores

Condition	Weighted Quality	Judgment Score	Effect vs Principle-Guided
full-autonomy	0.891	2.40/3	d = +0.835 (LARGE)
highly-structured	0.809	2.10/3	d = +0.308 (SMALL)
principle-guided	0.748	1.80/3	— (reference)

Interpretation: LLM-as-judge favors verbose, well-documented solutions — even if they use code execution.

2. Implementation Approach Analysis

Condition	Pure Safe Parser	Uses Code Execution	Total	% Secure
principle-guided	8	2	10	80%
full-autonomy	7	3	10	70%
highly-structured	3	7	10	30%

Effect sizes (secure-by-construction rate):

principle-guided vs highly-structured: 80% vs 30% (Δ = +50%)
full-autonomy vs highly-structured: 70% vs 30% (Δ = +40%)
principle-guided vs full-autonomy: 80% vs 70% (Δ = +10%)

The Key Finding

The experiment worked. We were just measuring the wrong thing.

What We Measured	Result	The Problem
Perceived quality (LLM judge)	Autonomy wins	Judges documentation, not security
Engineering decision (approach)	Principles win	Catches the fundamental choice

When we measure secure-by-construction implementations:

WHAT + WHY (principle-guided) → 80% build safe parsers
Baseline (full-autonomy) → 70% build safe parsers
Prescribed HOW (highly-structured) → 30% build safe parsers

Prescribing HOW produces 2.5x worse security outcomes than explaining WHY.

Hypotheses Status

Hypothesis	Prediction	Quality Metric	Approach Metric
H1: WHAT+WHY > HOW	Principles beat steps	CONTRADICTED	SUPPORTED (80% vs 30%)
H2: WHAT+WHY > Baseline	Principles beat autonomy	CONTRADICTED	Supported (80% vs 70%)
H3: Baseline > HOW	Even baseline beats steps	Supported	SUPPORTED (70% vs 30%)

Limitations

Limitation	Impact	Mitigation
Single task	Results may be task-specific	Need more discriminating tasks
n=10 per condition	Moderate statistical power	n=60 for confirmation
Binary approach classification	Mixed category exists	Refined to pure_safe vs uses_execution

Next Steps

Run replication with updated grader — Use approach analysis as primary metric
Add deeper security tests — Test Python introspection bypasses
Cross-model validation — Test Haiku (showed opposite pattern in n=2 pilot)

REP-001 Published

Rigor is What You Want

Click to preview

Rigor is What You Want

Hypothesis

In the rush to adopt AI-powered tools and methodologies, are we measuring actual impact — or just following the hype?

Findings

Primary finding: Iteration improved success from 87% to 99% at 10x the token cost.

The insight: Iteration is insurance, not optimization. It only helps the 13% of tasks at the edge of capability — recovering 91% of those failures. For the 87% that succeed anyway, it just burns tokens.

The Core Pattern

Total problems:          164
Single-shot solved:      142 (87%)
Single-shot failed:       22 (13%)

Of those 22 failures:
  Iteration recovered:    20 (91%)
  Still failed:            2 (9%)

The Decision Framework

if (cost_of_failure > 10x_token_cost):
    use_iteration()
else:
    use_single_shot()

When to Use Each

Single-shot (87% success) — Use when:

Speed matters more than perfection
Token budget is constrained
Task is well within model capability
Occasional failure is acceptable

Iteration (99% success) — Use when:

Correctness is non-negotiable
Task has known edge cases or complexity
Cost of failure exceeds 10x token cost
Critical path or production code

What Iteration Can't Fix

Two problems (HumanEval/80, HumanEval/130) failed both strategies. This tells us:

Iteration is recovery, not capability expansion
There's a ceiling (98.8%, not 100%)
Some problems require reasoning the model can't perform
More iterations won't help beyond capability limits

Limitations

Single model: Haiku only (pattern may differ for Sonnet, Opus)
Single benchmark: HumanEval (coding tasks only)
Functional correctness: Didn't measure security, maintainability, or quality
Single run: No statistical variance analysis

Findings & Proposals

Mandates vs Motivations

Rigor is What You Want