Rep 001

LEP-001: Rigor is What You Want

From LEP-001-Evals are non negotiable | IMP-001

Hypothesis

Primary: Does iteration (Ralph-style self-review) improve outcomes?

Secondary: Can methodology effectiveness be measured with hard data?

Experiment Design

Conditions

| Condition | Description |
|-----------|-------------|
| `single` | One API call, no iteration |
| `ralph-3` | Up to 3 iterations with self-review |
| `ralph-5` | Up to 5 iterations with self-review |
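
The three conditions differ only in their iteration budget, so they can be sketched as one loop with a cap. This is a minimal illustration, not the experiment's actual code: `generate`, `review`, `run_condition`, and `CONDITIONS` are hypothetical names, and it assumes the self-review step returns an approval flag plus feedback text.

```python
from typing import Callable, Optional, Tuple

# Hypothetical mapping of condition names to iteration budgets.
CONDITIONS = {"single": 1, "ralph-3": 3, "ralph-5": 5}


def run_condition(
    generate: Callable[[Optional[str], Optional[str]], str],
    review: Callable[[str], Tuple[bool, str]],
    max_iterations: int,
) -> Tuple[str, int]:
    """Return (final_solution, iterations_used).

    generate(previous_solution, feedback) -> new solution text (assumed interface)
    review(solution) -> (approved, feedback)                   (assumed interface)
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    # First attempt: no previous solution and no feedback yet.
    solution = generate(None, None)
    iterations_used = 1

    # Keep revising until the review approves or the budget is spent.
    while iterations_used < max_iterations:
        approved, feedback = review(solution)
        if approved:
            break  # self-review found nothing to fix, so stop early
        solution = generate(solution, feedback)
        iterations_used += 1

    return solution, iterations_used
```

With a budget of 1 the loop body never runs, which matches the `single` condition: one generation and no review. Because the loop can exit early, `iterations_used` can be lower than the budget, which is why the conditions say "up to" 3 or 5.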

Task: Interval Merging

Merge overlapping intervals - a LeetCode medium problem with many edge cases:

- Overlapping intervals
- Adjacent intervals (e.g., [1,2] and [2,3])
- Unsorted input
- Negative numbers
- Empty/single element cases

Why this task: A single-shot solution often misses edge cases that a review pass could catch.
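
For orientation, a standard sort-and-sweep solution to the task looks roughly like the sketch below. It is not the experiment's ground-truth implementation, and the function name `merge_intervals` is an assumption. It also assumes the usual LeetCode convention that intervals sharing an endpoint, such as [1,2] and [2,3], are merged.

```python
from typing import List


def merge_intervals(intervals: List[List[int]]) -> List[List[int]]:
    """Merge overlapping (and endpoint-touching) closed intervals."""
    if not intervals:
        return []  # empty input has nothing to merge

    # Sorting by start handles unsorted input and negative numbers alike.
    ordered = sorted(intervals, key=lambda iv: iv[0])

    # Copy the first interval so the caller's lists are not mutated.
    merged = [list(ordered[0])]
    for start, end in ordered[1:]:
        last = merged[-1]
        if start <= last[1]:
            # Overlapping or adjacent: extend the current interval.
            last[1] = max(last[1], end)
        else:
            merged.append([start, end])
    return merged
```

The `max` call matters when one interval is fully contained in another, and the `<=` comparison is the line that decides how adjacent intervals are treated. Both are easy to get wrong in a first attempt.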

Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `correctness` | binary | Passes all 20 test cases |
| `iterations_used` | count | How many iterations until completion |
| `tokens_total` | count | Total tokens consumed |
| `success_rate` | ratio | Passes / total runs per condition |
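
The relationship between the per-run metrics and the per-condition `success_rate` can be sketched as follows. The `RunResult` record and its field names other than the metric names above are hypothetical. The sketch treats `correctness` as true only when every test case passes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    """One run of one condition (hypothetical record layout)."""
    condition: str
    passed_tests: int
    total_tests: int
    iterations_used: int
    tokens_total: int

    @property
    def correctness(self) -> bool:
        # Binary metric: all test cases must pass.
        return self.total_tests > 0 and self.passed_tests == self.total_tests


def success_rate(runs: List[RunResult]) -> Dict[str, float]:
    """Return passes / total runs for each condition."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for run in runs:
        totals[run.condition] = totals.get(run.condition, 0) + 1
        passes[run.condition] = passes.get(run.condition, 0) + int(run.correctness)
    return {condition: passes[condition] / totals[condition] for condition in totals}
```

Keeping `correctness` binary means a run that passes 19 of 20 cases counts as a failure, so `success_rate` reflects fully correct solutions only.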

Observability

OTel instrumentation included for transparency:

- `experiment_run` span per condition/run
- `llm_call` span per API call
- `evaluation` span for ground truth verification

Run

# From experiment directory
uv run python -m lep_001_rigor_is_what_you_want

# Options
uv run python -m lep_001_rigor_is_what_you_want --runs 3      # Pilot (3 runs per condition)
uv run python -m lep_001_rigor_is_what_you_want --runs 5      # Full experiment
uv run python -m lep_001_rigor_is_what_you_want --dry-run     # Show config only
uv run python -m lep_001_rigor_is_what_you_want --condition single  # Run only the `single` condition

# Or via lab CLI (from lab-1337/)
lab-1337 run lep-001-rigor-is-what-you-want

Results

Results are written to results/ as JSON files with timestamps.
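
As a rough illustration of what a timestamped JSON result file could look like on disk, the sketch below writes one payload per file. The `write_results` helper, the `results_<UTC timestamp>.json` naming scheme, and the payload shape are all assumptions; the experiment's actual file names and schema may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def write_results(payload: Dict[str, Any], results_dir: Path = Path("results")) -> Path:
    """Write payload to a timestamped JSON file and return its path."""
    results_dir.mkdir(parents=True, exist_ok=True)

    # Assumed naming scheme: a UTC timestamp keeps files sortable by name.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = results_dir / f"results_{stamp}.json"

    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```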

View with:

lab-1337 results lep-001-rigor-is-what-you-want

No results yet for this experiment.

Run the experiment with lab-1337 run lep-001-rigor-is-what-you-want