Status: Dual signal detected, framing clarified
Signal detection (n=10 per condition) revealed two different signals depending on how we measure outcomes:
| Metric |
Best Condition |
Finding |
| LLM-as-judge quality |
full-autonomy (0.891) |
Autonomy produces verbose, well-documented code |
| Secure-by-construction |
principle-guided (80%) |
WHAT+WHY produces fundamentally safer implementations |
Key insight: The original grader measured perceived quality (LLM preference for documentation, structure). The approach analysis measures engineering decisions (safe parser vs code execution).
Critical finding: Prescribing HOW (highly-structured) produces 2.5x worse security outcomes than explaining WHAT+WHY (principle-guided).
The Discriminating Task
safe-calculator: Implement a calculator that safely evaluates arithmetic expressions.
Why this discriminates: There are two fundamental approaches:
- Safe parser: Build a lexer/parser, never execute arbitrary code
- Code execution: Use code evaluation with restrictions (fundamentally insecure)
A restricted code evaluator passes runtime security tests but is insecure by construction — Python introspection bypasses any restriction.
Results
1. LLM-as-Judge Quality Scores
| Condition |
Weighted Quality |
Judgment Score |
Effect vs Principle-Guided |
| full-autonomy |
0.891 |
2.40/3 |
d = +0.835 (LARGE) |
| highly-structured |
0.809 |
2.10/3 |
d = +0.308 (SMALL) |
| principle-guided |
0.748 |
1.80/3 |
— (reference) |
Interpretation: LLM-as-judge favors verbose, well-documented solutions — even if they use code execution.
2. Implementation Approach Analysis
| Condition |
Pure Safe Parser |
Uses Code Execution |
Total |
% Secure |
| principle-guided |
8 |
2 |
10 |
80% |
| full-autonomy |
7 |
3 |
10 |
70% |
| highly-structured |
3 |
7 |
10 |
30% |
Effect sizes (secure-by-construction rate):
- principle-guided vs highly-structured: 80% vs 30% (Δ = +50%)
- full-autonomy vs highly-structured: 70% vs 30% (Δ = +40%)
- principle-guided vs full-autonomy: 80% vs 70% (Δ = +10%)
The Key Finding
The experiment worked. We were just measuring the wrong thing.
| What We Measured |
Result |
The Problem |
| Perceived quality (LLM judge) |
Autonomy wins |
Judges documentation, not security |
| Engineering decision (approach) |
Principles win |
Catches the fundamental choice |
When we measure secure-by-construction implementations:
- WHAT + WHY (principle-guided) → 80% build safe parsers
- Baseline (full-autonomy) → 70% build safe parsers
- Prescribed HOW (highly-structured) → 30% build safe parsers
Prescribing HOW produces 2.5x worse security outcomes than explaining WHY.
Hypotheses Status
| Hypothesis |
Prediction |
Quality Metric |
Approach Metric |
| H1: WHAT+WHY > HOW |
Principles beat steps |
CONTRADICTED |
SUPPORTED (80% vs 30%) |
| H2: WHAT+WHY > Baseline |
Principles beat autonomy |
CONTRADICTED |
Supported (80% vs 70%) |
| H3: Baseline > HOW |
Even baseline beats steps |
Supported |
SUPPORTED (70% vs 30%) |
Limitations
| Limitation |
Impact |
Mitigation |
| Single task |
Results may be task-specific |
Need more discriminating tasks |
| n=10 per condition |
Moderate statistical power |
n=60 for confirmation |
| Binary approach classification |
Mixed category exists |
Refined to pure_safe vs uses_execution |
Next Steps
- Run replication with updated grader — Use approach analysis as primary metric
- Add deeper security tests — Test Python introspection bypasses
- Cross-model validation — Test Haiku (showed opposite pattern in n=2 pilot)