REP-002 Interim

Mandates vs Motivations

Created: 2026-01-16 · Updated: 2026-01-22 · Authors: Collaborative Intelligence Session

Summary

Test whether explaining WHY (motivation) produces different outcomes than prescribing HOW (mandate) when working with Claude. The hypothesis: Claude is trained with Constitutional AI, on values rather than rigid rules, and may respond better to motivation than to mandate.

Motivation

REP-001 established that methodology effectiveness is measurable. This REP applies that capability to a foundational question: how should we interact with AI?

The Mental Model Gap

Two competing mental models for AI interaction:

| Model | Assumption | Interaction Style |
|---|---|---|
| Algorithm runner | AI executes instructions | Prescribe HOW in detail |
| Reasoning agent | AI exercises judgment | Explain WHY, let it decide HOW |

The algorithm runner model shaped early prompt engineering: more detail = more control = better outcomes. This led to frameworks prescribing exact artifacts, file structures, and personas.

The Constitutional AI Insight

Claude is trained differently. Constitutional AI (Anthropic 2022) uses values-based training:

"Values-based training produces understanding and judgment. Rule-based constraints produce compliance and brittleness."

This suggests a different interaction model might be effective:

  • Explain WHY, not just WHAT
  • Principles that generalize, not specs that enumerate
  • Agency retained, not removed

The Uncertainty

We don't know which produces better outcomes. Both have plausible arguments:

For mandate (prescribe HOW):

  • Removes ambiguity
  • Consistent structure
  • Explicit expectations

For motivation (explain WHY):

  • Enables judgment
  • Handles edge cases
  • Adapts to context
  • Aligns with Constitutional AI design

The discourse is tribal. This is testable.

The Question

Does prescribing HOW help or hurt, compared to explaining WHY?

Experiment Design

Isolating the Variable

The key insight: all non-baseline conditions share the same WHAT + WHY + CONSTRAINTS; the only difference between them is whether, and how, HOW is prescribed. Baseline receives WHAT alone, as a control.

This isolates the variable being tested.

Conditions

| Condition | Type | What Claude Receives |
|---|---|---|
| baseline | Control | WHAT only (pure Claude, no methodology) |
| motivation | Principles | WHAT + WHY + CONSTRAINTS (Claude decides HOW) |
| mandate-template | Mandate | Above + HOW via required template artifacts |
| mandate-structure | Mandate | Above + HOW via required file structure |
| mandate-role | Mandate | Above + HOW via prescribed expert persona |

Why These Specific Mandates?

Each tests a different flavor of "prescribing HOW":

| Mandate Type | Example Pattern | Tests Whether... |
|---|---|---|
| Template | "Produce these artifacts: spec.md, plan.md, impl.md" | Required outputs help or constrain |
| Structure | "Use this file layout: PROJECT/, PLAN/, STATE/" | Imposed organization helps or constrains |
| Role | "You are a senior architect who always..." | Prescribed persona helps or constrains |

Metrics

| Metric | What it Measures |
|---|---|
| pass@k | Task completion in k attempts |
| recovery_rate | Ability to fix own mistakes |
| tokens_used | Efficiency (cost) |
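
The REP does not spell out how pass@k is computed; assuming the standard unbiased estimator (Chen et al., 2021), where c of n sampled attempts pass, a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn from n total attempts of which c passed, is correct."""
    if n - c < k:  # every size-k draw must include a passing attempt
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 attempts, 4 passing: pass@1 = 0.4, pass@5 ~ 0.976
```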

Benchmark

A subset of SWE-bench Lite: real-world software engineering tasks with ground truth via test suites.

Hypotheses

H0 (null): No difference between approaches.

H1: Mandate outperforms motivation.

  • Explicit structure reduces errors
  • Prescribed process helps task completion

H2: Motivation outperforms mandate.

  • Agency enables judgment
  • Principles handle varied contexts
  • Constitutional AI responds to reasoning

H3: Effect is task-dependent.

  • Mandate wins on structured tasks
  • Motivation wins on judgment tasks

H4: Baseline wins.

  • Any methodology overhead hurts outcomes
  • Keep it simple

All outcomes are valuable findings.

What Would Falsify Each Hypothesis

| Hypothesis | Falsified By |
|---|---|
| H0 (no difference) | Statistically significant difference on any metric |
| H1 (mandate wins) | Motivation outperforms on pass@k or recovery_rate |
| H2 (motivation wins) | Any mandate outperforms on pass@k or recovery_rate |
| H3 (task-dependent) | One approach wins across all task types |
| H4 (baseline wins) | Any methodology beats baseline significantly |

Findings (Interim)

Status: Dual signal detected, framing clarified

Signal detection (n=10 per condition) revealed two different signals depending on how we measure outcomes. (Condition names here map back to the design: full-autonomy is the baseline control, principle-guided is the motivation condition, and highly-structured is a mandate condition.)

| Metric | Best Condition | Finding |
|---|---|---|
| LLM-as-judge quality | full-autonomy (0.891) | Autonomy produces verbose, well-documented code |
| Secure-by-construction | principle-guided (80%) | WHAT+WHY produces fundamentally safer implementations |

Key insight: The original grader measured perceived quality (LLM preference for documentation, structure). The approach analysis measures engineering decisions (safe parser vs code execution).

Critical finding: Prescribing HOW (highly-structured) produces markedly worse security outcomes than explaining WHAT+WHY (principle-guided): 30% vs 80% secure-by-construction.

The Discriminating Task

safe-calculator: Implement a calculator that safely evaluates arithmetic expressions.

Why this discriminates: There are two fundamental approaches:

  1. Safe parser: Build a lexer/parser, never execute arbitrary code
  2. Code execution: Use code evaluation with restrictions (fundamentally insecure)

A restricted code evaluator passes runtime security tests but is insecure by construction — Python introspection bypasses any restriction.
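
To make the two approaches concrete, here is a minimal sketch (illustrative only, not code produced in the experiment). The stdlib ast module stands in for a hand-written lexer/parser: the expression is parsed but never executed, and only whitelisted arithmetic nodes are evaluated. A comment shows the introspection bypass that defeats restricted eval.

```python
import ast
import operator

# Approach 2, "restricted eval", is insecure by construction: even with
# builtins stripped, introspection walks back to arbitrary capability.
#   eval("().__class__.__base__.__subclasses__()", {"__builtins__": {}})
# returns every subclass of object, a stepping stone to os.system et al.

# Approach 1, parse-don't-execute: whitelist arithmetic AST nodes only.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression without ever executing code."""
    def visit(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.operand))
        raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return visit(ast.parse(expr, mode="eval").body)

assert safe_eval("2 + 3 * (4 - 1)") == 11
```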

Results

1. LLM-as-Judge Quality Scores

| Condition | Weighted Quality | Judgment Score | Effect vs Principle-Guided |
|---|---|---|---|
| full-autonomy | 0.891 | 2.40/3 | d = +0.835 (LARGE) |
| highly-structured | 0.809 | 2.10/3 | d = +0.308 (SMALL) |
| principle-guided | 0.748 | 1.80/3 | (reference) |

Interpretation: LLM-as-judge favors verbose, well-documented solutions — even if they use code execution.
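
The per-run judge scores behind these d values are not reproduced in the report; for reference, Cohen's d with a pooled standard deviation (where d ≈ 0.2 reads as small and d ≈ 0.8 as large, matching the table's labels) would be computed along these lines:

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d: standardized mean difference with pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```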

2. Implementation Approach Analysis

| Condition | Pure Safe Parser | Uses Code Execution | Total | % Secure |
|---|---|---|---|---|
| principle-guided | 8 | 2 | 10 | 80% |
| full-autonomy | 7 | 3 | 10 | 70% |
| highly-structured | 3 | 7 | 10 | 30% |

Effect sizes (secure-by-construction rate):

  • principle-guided vs highly-structured: 80% vs 30% (Δ = +50 percentage points)
  • full-autonomy vs highly-structured: 70% vs 30% (Δ = +40 percentage points)
  • principle-guided vs full-autonomy: 80% vs 70% (Δ = +10 percentage points)
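
At n=10 per condition, one quick way to gauge whether the headline gap (8/10 vs 3/10 secure) could be chance is Fisher's exact test on the 2x2 counts (assuming scipy is available):

```python
from scipy.stats import fisher_exact

# Rows: condition; columns: (safe parser, code execution), n=10 each.
table = [[8, 2],   # principle-guided
         [3, 7]]   # highly-structured
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.3f}")
```

On these counts the test yields p ≈ 0.07: suggestive, but short of conventional significance, consistent with the moderate-power limitation noted below.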

The Key Finding

The experiment worked. We were just measuring the wrong thing.

| What We Measured | Result | The Problem |
|---|---|---|
| Perceived quality (LLM judge) | Autonomy wins | Judges documentation, not security |
| Engineering decision (approach) | Principles win | Catches the fundamental choice |

When we measure secure-by-construction implementations:

  • WHAT + WHY (principle-guided) → 80% build safe parsers
  • Baseline (full-autonomy) → 70% build safe parsers
  • Prescribed HOW (highly-structured) → 30% build safe parsers

Prescribing HOW yields secure-by-construction implementations only 30% of the time, against 80% for explaining WHY.

Hypotheses Status

(These interim hypotheses are restated against the findings conditions; they do not map one-to-one onto the design-stage H0–H4.)

| Hypothesis | Prediction | Quality Metric | Approach Metric |
|---|---|---|---|
| H1: WHAT+WHY > HOW | Principles beat steps | Contradicted | Supported (80% vs 30%) |
| H2: WHAT+WHY > Baseline | Principles beat autonomy | Contradicted | Supported (80% vs 70%) |
| H3: Baseline > HOW | Even baseline beats steps | Supported | Supported (70% vs 30%) |

Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Single task | Results may be task-specific | Need more discriminating tasks |
| n=10 per condition | Moderate statistical power | n=60 for confirmation |
| Binary approach classification | Mixed category exists | Refined to pure_safe vs uses_execution |
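
A sketch of how the refined binary classification could be automated, assuming submissions are Python source (the function and label names are hypothetical): parse each submission and flag any call to a code-execution primitive.

```python
import ast

EXECUTION_PRIMITIVES = {"eval", "exec", "compile", "__import__"}

def classify_approach(source: str) -> str:
    """Label a submission 'uses_execution' if it calls any
    code-execution primitive, else 'pure_safe'."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            # Handles both bare names (eval) and attributes (builtins.eval).
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", "")
            if name in EXECUTION_PRIMITIVES:
                return "uses_execution"
    return "pure_safe"
```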

Next Steps

  1. Run replication with updated grader — Use approach analysis as primary metric
  2. Add deeper security tests — Test Python introspection bypasses
  3. Cross-model validation — Test Haiku (showed opposite pattern in n=2 pilot)

Prior Art

| Work | Relevance |
|---|---|
| Constitutional AI (Anthropic 2022) | Values vs rules in AI training |
| Blaurock et al. 2024 | Transparency + control produce complementary outcomes (β = 0.415, 0.507) |
| Scott Spence eval | Forced language doesn't improve activation beyond threshold |
| REP-001 | Methodology measurement is possible |

Gap This Fills

No controlled experiments compare:

  • Motivation-based vs mandate-based prompting
  • The effect of prescribing HOW vs explaining WHY
  • Whether Constitutional AI training affects optimal interaction style

Open Questions

  1. Fair representation: How to crystallize each condition as a prompt without strawmanning?

  2. Shared context: What WHAT + WHY + CONSTRAINTS should all conditions share?

  3. Task selection: Which SWE-bench tasks best differentiate the approaches?

  4. Model dependency: Results may vary by model. Start with Sonnet, extend if findings warrant.

Why This Matters

The agentic era is scaling fast. Teams adopt interaction patterns based on intuition and tribal preference. Evidence would help.

If motivation wins: Validates the Constitutional AI interaction model. Stop prescribing HOW; explain WHY.

If mandate wins: Structure helps even reasoning agents. Invest in process specification.

If task-dependent: Provides decision framework for when to use which.

If baseline wins: Framework overhead hurts. Keep interaction minimal.

References

  • Anthropic. "Constitutional AI: Harmlessness from AI Feedback" (2022).
  • Blaurock, M., et al. "AI-Based Service Experience Contingencies." Journal of Service Research (2024).
  • Feynman, R. "Cargo Cult Science" (1974).