Experimentation Lab

Research experiments in collaborative intelligence. We test hypotheses, measure outcomes, and publish findings.

Findings & Proposals

Published Findings

REP-002 Published

Mandates vs Motivations


Does prescribing HOW help or hurt, compared to explaining WHY?

Status: Dual signal detected, framing clarified

Signal detection (n=10 per condition) revealed two different signals depending on how we measure outcomes:

| Metric | Best Condition | Finding |
|---|---|---|
| LLM-as-judge quality | full-autonomy (0.891) | Autonomy produces verbose, well-documented code |
| Secure-by-construction | principle-guided (80%) | WHAT+WHY produces fundamentally safer implementations |

Key insight: The original grader measured perceived quality (LLM preference for documentation, structure). The approach analysis measures engineering decisions (safe parser vs code execution).

Critical finding: Prescribing HOW (highly-structured) produces roughly 2.5x worse security outcomes than explaining WHAT+WHY (principle-guided): 30% vs 80% secure-by-construction.

The Discriminating Task

safe-calculator: Implement a calculator that safely evaluates arithmetic expressions.

Why this discriminates: There are two fundamental approaches:

  1. Safe parser: Build a lexer/parser, never execute arbitrary code
  2. Code execution: Use code evaluation with restrictions (fundamentally insecure)

A restricted code evaluator passes runtime security tests but is insecure by construction — Python introspection bypasses any restriction.
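The contrast can be made concrete. Below is a minimal sketch (function names and payload are illustrative, not taken from the study's harness) of why a restricted evaluator is insecure by construction while an `ast`-based parser is not: stripping builtins from `eval` still leaves introspection paths from any literal back to arbitrary classes, whereas the parser only ever interprets a whitelist of arithmetic nodes and never executes anything.

```python
import ast
import operator

# Approach 2 from the list above (insecure by construction):
# eval() with builtins stripped out.
def restricted_eval(expr):
    return eval(expr, {"__builtins__": {}}, {})

# Introspection payload: walks from an empty tuple to object and its
# subclasses, despite the empty builtins. The restriction is bypassable.
bypass = "().__class__.__bases__[0].__subclasses__()"

# Approach 1 (secure by construction): parse with ast, interpret only
# whitelisted arithmetic node types, never execute code.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))
```

`restricted_eval(bypass)` happily returns the full subclass list; `safe_eval` rejects the same input because a function call is simply not in its grammar.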

Results

1. LLM-as-Judge Quality Scores

| Condition | Weighted Quality | Judgment Score | Effect vs Principle-Guided |
|---|---|---|---|
| full-autonomy | 0.891 | 2.40/3 | d = +0.835 (LARGE) |
| highly-structured | 0.809 | 2.10/3 | d = +0.308 (SMALL) |
| principle-guided | 0.748 | 1.80/3 | (reference) |

Interpretation: LLM-as-judge favors verbose, well-documented solutions — even if they use code execution.

2. Implementation Approach Analysis

| Condition | Pure Safe Parser | Uses Code Execution | Total | % Secure |
|---|---|---|---|---|
| principle-guided | 8 | 2 | 10 | 80% |
| full-autonomy | 7 | 3 | 10 | 70% |
| highly-structured | 3 | 7 | 10 | 30% |

Effect sizes (secure-by-construction rate):

  • principle-guided vs highly-structured: 80% vs 30% (Δ = +50%)
  • full-autonomy vs highly-structured: 70% vs 30% (Δ = +40%)
  • principle-guided vs full-autonomy: 80% vs 70% (Δ = +10%)
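The report gives raw percentage-point deltas; as a standardized complement (my addition, not a metric from the study), Cohen's h for two proportions puts the principle-guided vs highly-structured gap well past the conventional "large" threshold of 0.8:

```python
import math

def cohens_h(p1, p2):
    # Effect size for the difference between two proportions
    # (arcsine-transformed), analogous to Cohen's d for means.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

secure_rates = {
    "principle-guided": 0.80,
    "full-autonomy": 0.70,
    "highly-structured": 0.30,
}

h = cohens_h(secure_rates["principle-guided"],
             secure_rates["highly-structured"])  # ≈ 1.05, a large effect
```

With n=10 per condition this is a signal-detection estimate, not a confirmatory test, which is consistent with the limitations listed below.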

The Key Finding

The experiment worked. We were just measuring the wrong thing.

| What We Measured | Result | The Problem |
|---|---|---|
| Perceived quality (LLM judge) | Autonomy wins | Judges documentation, not security |
| Engineering decision (approach) | Principles win | Catches the fundamental choice |

When we measure secure-by-construction implementations:

  • WHAT + WHY (principle-guided) → 80% build safe parsers
  • Baseline (full-autonomy) → 70% build safe parsers
  • Prescribed HOW (highly-structured) → 30% build safe parsers

Prescribing HOW produces 2.5x worse security outcomes than explaining WHY.

Hypotheses Status

| Hypothesis | Prediction | Quality Metric | Approach Metric |
|---|---|---|---|
| H1: WHAT+WHY > HOW | Principles beat steps | CONTRADICTED | SUPPORTED (80% vs 30%) |
| H2: WHAT+WHY > Baseline | Principles beat autonomy | CONTRADICTED | Supported (80% vs 70%) |
| H3: Baseline > HOW | Even baseline beats steps | Supported | SUPPORTED (70% vs 30%) |

Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Single task | Results may be task-specific | Need more discriminating tasks |
| n=10 per condition | Moderate statistical power | n=60 for confirmation |
| Binary approach classification | Mixed category exists | Refined to pure_safe vs uses_execution |

Next Steps

  1. Run replication with updated grader — Use approach analysis as primary metric
  2. Add deeper security tests — Test Python introspection bypasses
  3. Cross-model validation — Test Haiku (showed opposite pattern in n=2 pilot)
REP-001 Published

Rigor is What You Want


In the rush to adopt AI-powered tools and methodologies, are we measuring actual impact — or just following the hype?

Primary finding: Iteration improved success from 87% to 99% at 10x the token cost.

The insight: Iteration is insurance, not optimization. It only helps the 13% of tasks at the edge of capability — recovering 91% of those failures. For the 87% that succeed anyway, it just burns tokens.

The Core Pattern

Total problems:          164
Single-shot solved:      142 (87%)
Single-shot failed:       22 (13%)

Of those 22 failures:
  Iteration recovered:    20 (91%)
  Still failed:            2 (9%)
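The headline numbers above all fall out of this breakdown. A quick sanity check of the arithmetic:

```python
total = 164
single_shot_solved = 142
recovered_by_iteration = 20

failures = total - single_shot_solved                    # 22
single_shot_rate = single_shot_solved / total            # ≈ 0.87
recovery_rate = recovered_by_iteration / failures        # ≈ 0.91
final_rate = (single_shot_solved + recovered_by_iteration) / total  # ≈ 0.988
```

The final rate of 98.8% (rounded to 99% in the primary finding) is the capability ceiling discussed below: two problems remain unsolved under either strategy.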

The Decision Framework

def choose_strategy(cost_of_failure, token_cost):
    # Iteration pays off only when a failure costs more
    # than the ~10x token overhead of iterating.
    if cost_of_failure > 10 * token_cost:
        return "iteration"
    return "single_shot"

When to Use Each

Single-shot (87% success) — Use when:

  • Speed matters more than perfection
  • Token budget is constrained
  • Task is well within model capability
  • Occasional failure is acceptable

Iteration (99% success) — Use when:

  • Correctness is non-negotiable
  • Task has known edge cases or complexity
  • Cost of failure exceeds 10x token cost
  • Critical path or production code
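When iteration is the right choice, the loop the finding assumes is generate, test, and feed failures back as context for the next attempt. A minimal sketch (the `generate` and `run_tests` interfaces are hypothetical placeholders, not the study's actual harness):

```python
def iterate_until_pass(generate, run_tests, max_attempts=5):
    """generate(feedback) -> candidate; run_tests(candidate) -> (ok, errors)."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, errors = run_tests(candidate)
        if ok:
            return candidate, attempt
        feedback = errors  # feed failures into the next attempt
    return None, max_attempts  # capability ceiling: give up
```

Note the cap on attempts: as the next section argues, beyond the model's capability limit more iterations only burn tokens.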

What Iteration Can't Fix

Two problems (HumanEval/80, HumanEval/130) failed both strategies. This tells us:

  • Iteration is recovery, not capability expansion
  • There's a ceiling (98.8%, not 100%)
  • Some problems require reasoning the model can't perform
  • More iterations won't help beyond capability limits

Limitations

  • Single model: Haiku only (pattern may differ for Sonnet, Opus)
  • Single benchmark: HumanEval (coding tasks only)
  • Functional correctness: Didn't measure security, maintainability, or quality
  • Single run: No statistical variance analysis