
feat(autodata): powered accept-rate with CI — settle whether the loop reliably discriminates#44

Merged
drewstone merged 1 commit into main from autodata/powered-accept-rate
Jun 26, 2026

Conversation

@drewstone


What this settles

At n=3 the Autodata accept result was noise (one run 2/3, an independent re-run 0/3). This runs it at power and reports the accepted-rate with a confidence interval so n=1–2 luck can't fake it.

Verdict: WORKS AT POWER — not a coin-flip. Accept-rate 38%, Wilson 95% CI [23%, 55%] (12 of 32 slots cleared weak<0.5 ∧ strong≥0.65 ∧ gap≥0.2). The CI lower bound excludes ~0, and even the harder of the two docs (mixtral, 19% [7%, 43%]) excludes 0.
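As a sanity check on the headline interval, here is a minimal sketch of a Wilson score interval. It is a local illustration only: `wilsonInterval`, `Interval`, and `pct` are hypothetical names, and this is not agent-eval's `wilson` export, whose signature is not shown in this PR.

```typescript
// Illustrative Wilson score interval for a binomial proportion.
// NOT agent-eval's `wilson`; names and return shape are assumptions.
interface Interval {
  rate: number;
  lo: number;
  hi: number;
}

function wilsonInterval(successes: number, n: number, z = 1.96): Interval {
  // With no trials there is no information: return the widest interval.
  if (n <= 0) return { rate: 0, lo: 0, hi: 1 };
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return {
    rate: p,
    lo: Math.max(0, center - margin),
    hi: Math.min(1, center + margin),
  };
}

// Round a proportion to a whole percentage for display.
const pct = (x: number): number => Math.round(x * 100);

const pooled = wilsonInterval(12, 32);
console.log(`${pct(pooled.rate)}% CI [${pct(pooled.lo)}%, ${pct(pooled.hi)}%]`);
```

With z = 1.96 this reproduces the pooled 38% CI [23%, 55%] for 12/32. The per-doc intervals also match if the counts are 3/16 (mixtral) and 9/16 (deepseek-v3), which is an inference from the reported rates rather than something stated here.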

Design (powered, fixed-slots)

  • 32 independent slots (each = one full challenger→refine→accept cycle), samples=4 per solver, maxRetries=2.
  • 2 non-memorized grounding docs, 16 slots each: Mixtral (2401.04088) + DeepSeek-V3 (2412.19437), both post-llama-3.1-8b-cutoff MoE papers, prose chunks (LaTeX breaks the challenger's strict-JSON).
  • CIs are agent-eval's published wilson / pairedBootstrap — not hand-rolled. Stats recomputed from the on-disk per-attempt JSONL (interruption-safe).
  • Reuses the existing buildAutodataDataset loop unchanged — it already runs exactly target independent slots; only cross-slot aggregation + the two CIs are added.
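The fixed-slots denominator described above can be sketched as follows. This is a simplified illustration, not the code in `powered.ts`: the row shape (`slot`, `accept`) and the function names `parseTrail` and `acceptRate` are assumptions made for the example.

```typescript
// Hypothetical row shape: `slot` and `accept` are illustrative field names,
// not the actual fields written to the per-attempt JSONL.
interface AttemptRow {
  slot: number;
  accept: boolean;
}

// Parse a JSONL trail. An absent or empty trail is simply zero rows.
function parseTrail(jsonl: string | undefined): AttemptRow[] {
  if (!jsonl) return [];
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AttemptRow);
}

// Accept-rate over the REQUESTED slot count, so slots that never produced an
// attempt (for example challenger-stage failures) still count as non-accepts.
function acceptRate(rows: AttemptRow[], targetSlots: number) {
  const producing = new Set(rows.map((r) => r.slot));
  const accepted = new Set(rows.filter((r) => r.accept).map((r) => r.slot));
  return {
    accepted: accepted.size,
    producingSlots: producing.size,
    rateOverTarget: targetSlots > 0 ? accepted.size / targetSlots : 0,
    rateOverProducing: producing.size > 0 ? accepted.size / producing.size : 0,
  };
}
```

Because the statistics are a pure function of the rows on disk, re-running the aggregation after an interrupted run gives the same answer as running it at the end of a clean one.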

Numbers

| metric | value |
|---|---|
| accept-rate | 38% CI [23%, 55%] (12/32) |
| accept-rate (producing slots only) | 44% CI [28%, 63%] (12/27, excl. 5 LaTeX-in-JSON challenger failures) |
| accept-rate, mixtral | 19% CI [7%, 43%] |
| accept-rate, deepseek-v3 | 56% CI [33%, 77%] |
| gap-widening Δ (plain→refined) | mean +0.103, CI [+0.029, +0.193] (paired bootstrap, n=27), excludes 0 |
| binding gate (weak<0.5) | 39% of attempts (strong≥0.65: 88%, gap≥0.2: 52%) |
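For readers unfamiliar with the paired bootstrap behind the gap-widening row, the idea can be sketched as below. This is a generic percentile bootstrap over per-slot deltas, written for illustration: `pairedBootstrapMeanDelta` and `mulberry32` are local names, the resample count and seed are arbitrary, and agent-eval's `pairedBootstrap` may differ in signature and method.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean of paired differences (after - before).
// Illustrative only; NOT agent-eval's `pairedBootstrap`.
function pairedBootstrapMeanDelta(
  before: number[],
  after: number[],
  resamples = 2000,
  seed = 1,
): { meanDelta: number; lo: number; hi: number } {
  if (before.length !== after.length) {
    throw new Error("paired samples must have equal length");
  }
  const deltas = after.map((v, i) => v - before[i]);
  const n = deltas.length;
  if (n === 0) return { meanDelta: NaN, lo: NaN, hi: NaN };

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    // Resample whole pairs (here: their deltas) with replacement.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(resamples - 1, Math.floor(q * resamples))];
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  return { meanDelta, lo: at(0.025), hi: at(0.975) };
}
```

Resampling the deltas rather than the two columns independently is what keeps the pairing: each slot's plain gap stays matched to its own refined gap. An interval whose lower bound is above 0 is the sense in which the reported CI "excludes 0".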

Total live spend: $0.57 (32-slot run; ~$1.0 incl. pilots).

Two autopsied accepted examples (in docs/results/autodata-live.md) confirm real weak-fails-strong-derives discrimination on both docs — not a judge artifact or leakage.

The honest caveats (quantified, neither overturns the verdict)

  1. Doc-dependent: 19% (mixtral) → 56% (deepseek-v3); pooled 38% is a real two-doc average.
  2. The binding constraint is the weak model's competence, not the method: llama-3.1-8b struggles on only ~39% of attempts (weak median 0.55). The rate is a property of the tier gap — exactly what the discriminative reward should measure.

Verify

`pnpm typecheck && pnpm lint && pnpm build && pnpm test`: all green (203 tests pass, 12 live-gated skipped). New offline coverage in `src/autodata/powered.test.ts`.

Reproduce: `dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts`

… reliably discriminates

Run a fixed K=32 independent slots (16 each over two non-memorized MoE
papers, samples=4) through the causal-challenger -> refine -> accept loop
and report the accepted-rate with a Wilson 95% CI, the per-slot gap
distribution, and the plain-vs-refined gap-widening with a paired-bootstrap
CI. This settles the n=3 noise (one run 2/3, a re-run 0/3): acceptance is a
real 38% rate, CI [23%, 55%] (12/32), not a coin-flip ~0.

- powered.ts: a fixed-slots harness over the existing buildAutodataDataset
  (which already runs exactly `target` independent slots); only the
  cross-slot aggregation + the two CIs are added. CIs are agent-eval's
  published `wilson` / `pairedBootstrap`, never hand-rolled. Stats are
  recomputed from the on-disk per-attempt JSONL so an interrupted run loses
  no data. Surfaces challenger-stage (LaTeX-in-JSON) failures separately so
  they are not silently miscounted as discrimination rejects.
- powered.test.ts: offline unit coverage for analyzeTrails (denominator =
  requested target, per-slot best-gap pairing, the accept-rule decomposition,
  multi-doc aggregation).
- docs/results/autodata-live.md: replace the noisy n=3 section with the
  powered rate + CI, the gap distribution, and two autopsied accepted
  examples (real weak-fails-strong-derives on both docs).

@tangletools tangletools left a comment


✅ Auto-approved PR — 07ee88dd

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T14:49:39Z

@tangletools tangletools left a comment


🟢 Value Audit — sound

Verdict sound
Concerns 1 (1 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 137.8s (2 bridge agents)
Total 137.8s

💰 Value — sound

Adds a fixed-slots powered measurement harness over the existing autodata loop, reporting accept-rate with Wilson CIs and gap-widening with paired-bootstrap CIs so the n=3 coin-flip question can be settled at power.

  • What it does: Introduces src/autodata/powered.ts (and tests) that runs a configured number of independent autodata slots across two non-memorized grounding docs, writes per-attempt JSONL trails, and recomputes durable statistics from those trails: accept-rate with a Wilson 95% CI, per-slot best-gap distribution, plain→refined gap-widening with a paired-bootstrap CI, accept-rule decomposition, and per-doc brea
  • Goals it achieves: Settles whether the causal-challenger autodata loop reliably manufactures discriminating examples (accept-rate bounded away from zero) after the prior n=3 runs were too noisy to distinguish a real rate from ~0. It also makes the measurement interruption-safe and bounded-cost, with explicit confidence intervals and a cost gate.
  • Assessment: Good change, coherent and well-scoped. It layers on top of the existing loop without rewriting it, follows the repo's patterns (env knobs, smokeTestModels cost gate, JSONL trails, agent-eval stats, fail-loud), and adds offline unit tests for the aggregation edge cases (denominator = requested target, multi-doc pooling, condition decomposition).
  • Better / existing approach: none — this is the right approach. I checked: agent-eval exports wilson and pairedBootstrap but does not export a generic quantile/describe helper (its quantile is internal to matrix aggregation), so the local descriptive helpers are not duplication. The existing run.ts and calibrate.ts runners are single-doc / until-N-accepted and do no cross-slot CI aggregation, so there is no exis
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A well-fit powered-measurement runnable (third sibling of run.ts/calibrate.ts) that reuses the existing dataset loop + agent-eval's published CIs to settle whether the autodata loop reliably discriminates — reachable, on-pattern, interruption-safe.

  • Assessment: Net positive, not dead-end. It settles the actual question the prior n=3 run left open (accept-rate with a CI that excludes ~0) and emits a machine-readable RESULT_JSON (powered.ts:441) plus feeds the cited docs/results/autodata-live.md. The per-doc and per-condition decomposition is genuinely informative (identifies weak<0.5 as the binding gate), and analyzeTrails is reusable for any future re-
  • Integration: Reachable and wired correctly. powered.ts is a standalone runnable guarded at powered.ts:445 (process.argv[1].endsWith('powered.ts')), invoked exactly like the sibling pnpm tsx src/autodata/run.ts / calibrate.ts. analyzeTrails, DocTrail, PoweredStats are re-exported from src/autodata/index.ts:39 for programmatic re-use. It composes only existing pieces: buildAutodataDataset (build-
  • Fit with existing patterns: Excellent — fits the established grain precisely. It is a third sibling of run.ts and calibrate.ts: identical shape (env knobs via a local envInt, cost-gate smoke test, ground doc, call buildAutodataDataset, print a prose report). The inferential stats reuse agent-eval's published wilson/pairedBootstrap rather than hand-rolling, which matches the AGENTS.md layering rule that agent-knowledg
  • Real-world viability: Holds up off the happy path. Interruption-safe by design: final stats are recomputed from the on-disk JSONL via readTrail, which treats a missing trail file as an empty trail (powered.ts:302, ENOENT → []) so all-challenger-failed slots still count against the denominator; this is explicitly tested at powered.test.ts:42-52. analyzeTrails is pure + seeded-deterministic (powered.ts:180), so re-ru
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/autodata/powered.ts

  • console.log('\n══════════════════════════════════════════════════════════════════════════')

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260626T153836Z

@tangletools


✅ No Blockers — 07ee88dd

Readiness 80/100 · Confidence 70/100 · 10 findings (1 medium, 9 low)

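| | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 80 | 83 | 85 | 80 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 80 | 83 | 85 | 80 |
| Security | 80 | 83 | 85 | 80 |
| Testing | 80 | 83 | 85 | 80 |
| Architecture | 80 | 83 | 85 | 80 |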

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Condition decomposition thresholds hardcoded, diverge from configurable accept rule — src/autodata/powered.ts

Lines 171-173 hardcode minStrong=0.65, maxWeak=0.5, minGap=0.2 for the per-attempt condition decomposition (lines 238-241). But buildAutodataDataset accepts DiscriminativeThresholds (build-dataset.ts:22-26) that can override these thresholds. If a caller runs the loop with custom thresholds and then imports analyzeTrails to decompose the resulting trails, the strongHi/weakLo/gapWide/all fractions will be computed against the wrong gates. The per-slot accept-rate is unaffected (it uses decision.accept). Fix: accept optional thr
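One possible shape for a decomposition that honors caller-supplied gates is sketched below. It is illustrative only: the `Thresholds` and `Scored` interfaces and the `decompose` function are hypothetical, the field names merely echo those quoted in this finding, and the gap is assumed to be strong minus weak.

```typescript
// Illustrative shapes; not the repo's DiscriminativeThresholds or trail rows.
interface Thresholds {
  minStrong: number;
  maxWeak: number;
  minGap: number;
}

// Defaults taken from the values quoted in the finding above.
const DEFAULT_THRESHOLDS: Thresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 };

interface Scored {
  weak: number;
  strong: number;
}

// Fraction of attempts clearing each gate, using the caller's thresholds
// when provided and the defaults otherwise.
function decompose(attempts: Scored[], overrides: Partial<Thresholds> = {}) {
  const t: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  let strongHi = 0;
  let weakLo = 0;
  let gapWide = 0;
  let all = 0;
  for (const a of attempts) {
    const s = a.strong >= t.minStrong;
    const w = a.weak < t.maxWeak;
    const g = a.strong - a.weak >= t.minGap; // assumption: gap = strong - weak
    if (s) strongHi++;
    if (w) weakLo++;
    if (g) gapWide++;
    if (s && w && g) all++;
  }
  const n = attempts.length;
  const frac = (k: number): number => (n === 0 ? 0 : k / n);
  return {
    n,
    strongHi: frac(strongHi),
    weakLo: frac(weakLo),
    gapWide: frac(gapWide),
    all: frac(all),
  };
}
```

Passing the same overrides to both the loop and the decomposition keeps the reported gate fractions consistent with the accept decisions actually made.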

🟡 LOW Reproduce command does not export the env vars needed to match the headline config — docs/results/autodata-live.md

The Design section (line 66-69) and the reproduce comment (line 149) state the headline run used AUTODATA_SLOTS_PER_DOC=16 and AUTODATA_MAXRETRIES=2. But the reproduce command at line 148 (pnpm tsx src/autodata/powered.ts) runs with code defaults, which are slotsPerDoc=14 and maxRetries=3 (powered.ts:377-379). The knobs are listed only as an inert comment (# knobs: ...), not exported, so

🟡 LOW Unexplained n=33 attempt denominator — docs/results/autodata-live.md

Lines 82-89 report the weak/strong score distributions and accept-rule decomposition over '33 quality-clean attempts', while the experimental design just above describes 32 total slots (16 per doc) with 5 challenger-stage failures leaving 27 producing slots. The relationship between 32 slots, 27 producing slots, and 33 attempts is not stated, so a reader has to infer that a slot may emit multiple quality-clean candidates across the refine fold. Concrete: the condition table says 'weak < 0.50 = 39%' (≈13/33) and 'all-three = 36%' (≈12/33), while the headline accept-rate is 12/32 slots. Impact: minor friction reconciling denominators; not a numerical error.

🟡 LOW AUTODATA_DOCS tag interpolated into paths unsanitized — src/autodata/powered.ts

Lines 413, 417, and 432 build paths data/powered-${doc.tag}.jsonl and data/powered-${doc.tag}-attempts.jsonl from AUTODATA_DOCS segments without rejecting path separators or `..`. A tag like ../../tmp/x would cause writes/reads/deletes outside data/. Fix by validating tags against /^[a-zA-Z0-9_-]+$/ or sanitizing the path segment.
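A minimal sketch of the suggested validation, using the regex from this finding. `trailPaths` and `SAFE_TAG` are hypothetical names; only the two path templates come from the text above.

```typescript
// Allow only characters that cannot form a path separator or a `..` segment.
const SAFE_TAG = /^[a-zA-Z0-9_-]+$/;

// Build the two trail paths for a doc tag, rejecting unsafe tags up front.
function trailPaths(tag: string): { dataset: string; attempts: string } {
  if (!SAFE_TAG.test(tag)) {
    throw new Error(`unsafe doc tag ${JSON.stringify(tag)}: expected [a-zA-Z0-9_-]+`);
  }
  return {
    dataset: `data/powered-${tag}.jsonl`,
    attempts: `data/powered-${tag}-attempts.jsonl`,
  };
}
```

Rejecting rather than silently sanitizing fits the fail-loud style this PR is described as following elsewhere, and avoids two different tags collapsing onto the same file.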

🟡 LOW Biome lint warnings in new code — src/autodata/powered.ts

Lines 441 (console.log('RESULT_JSON ' + ...)) and 445 (if (process.argv[1] && process.argv[1].endsWith(...))) trigger lint/style/useTemplate and lint/complexity/useOptionalChain. Both are auto-fixable. Leaving them creates noise in the project's lint output. Fix by running biome check --write src/autodata/powered.ts for these two lines.

🟡 LOW Cost gate omits the judge model — src/autodata/powered.ts

Line 392 calls smokeTestModels with only [CHALLENGER_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL]. The loop still invokes the judge via buildAutodataDataset, so a dead or misconfigured AUTODATA_JUDGE_MODEL is not caught before the burn. Add JUDGE_MODEL to the smoke-test list so all four live roles are gated.

🟡 LOW Quality-failed plain drafts contaminate the paired widening CI — src/autodata/powered.ts

In analyzeTrails (lines ~270-275), earliest is selected from ALL rows via [...rows].sort((a,b) => a.iteration - b.iteration)[0], without filtering on qualityOk. Meanwhile bestGap (the refined side of the pair) is computed from quality-clean attempts only. So if the first (plain) draft fails the quality gate (leak/rubric) but a later refined draft is clean, the paired delta = cleanBestGap - failedPlainGap pairs an invalid baseline with a valid refined gap. The conditions decomposition correctly excludes quality-failed attempts (line ~283: if (!r.qualityOk) continue), but the widening pairing does not — an inconsistency. A quality-failed plain draft could have an artificially low gap (scores from a leaky example), inflating widening.meanDelta upward. Fix: filter earliest to q
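To make the inconsistency concrete, here is a small sketch in which both sides of the pair come from quality-clean rows. The `Row` shape and its `gap` field are assumptions for illustration, and `pairedGaps` is a hypothetical helper, not the code in `analyzeTrails`.

```typescript
// Hypothetical per-attempt row; `gap` is an illustrative field name.
interface Row {
  iteration: number;
  qualityOk: boolean;
  gap: number;
}

// Pair the earliest quality-clean gap with the best quality-clean gap.
// Returns null when a slot has no quality-clean attempt to pair.
function pairedGaps(rows: Row[]): { plain: number; refined: number } | null {
  const clean = rows.filter((r) => r.qualityOk);
  if (clean.length === 0) return null;
  const earliest = clean.reduce((a, b) => (b.iteration < a.iteration ? b : a));
  const best = clean.reduce((a, b) => (b.gap > a.gap ? b : a));
  return { plain: earliest.gap, refined: best.gap };
}
```

One caveat with this variant: if the first draft fails the quality gate, the earliest clean row is itself a refined draft, so an alternative is to drop such slots from the widening pairing entirely.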

🟡 LOW Two biome lint warnings in powered.ts — src/autodata/powered.ts

Line 441: console.log('RESULT_JSON ' + JSON.stringify(...)) triggers lint/style/useTemplate (prefer template literal). Line 445: process.argv[1] && process.argv[1].endsWith(...) triggers lint/complexity/useOptionalChain (prefer ?.). Both are auto-fixable nits, non-blocking.

🟡 LOW describe() emits NaN fields when every slot's challenger fails — src/autodata/powered.ts

If every slot produces zero attempts (all challenger failures), bestGapPerSlotAll and plainGapsPaired/refinedGapsPaired are empty. describe([]) returns {n:0, min:NaN, median:NaN, p90:NaN, max:NaN, mean:NaN}. The dist() formatter then prints 'NaN' values in the report. Not a crash (wilson(0,0) and pairedBootstrap([],[]) are safe per verified source), and unlikely given the cost gate, but the report degrades silently. Consider short-circuiting printReport with a 'no slots produced attempts' message when slotsWithAttempts === 0.
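A sketch of the suggested short-circuit. `describeOrNull` and `formatDist` are hypothetical stand-ins for the `describe()` and `dist()` helpers named above, and the linear-interpolation quantile is an assumption, since the actual quantile method is not shown here.

```typescript
interface Summary {
  n: number;
  min: number;
  median: number;
  p90: number;
  max: number;
  mean: number;
}

// Descriptive stats that return null for an empty sample instead of NaN fields.
function describeOrNull(xs: number[]): Summary | null {
  if (xs.length === 0) return null;
  const s = [...xs].sort((a, b) => a - b);
  // Linear-interpolation quantile (an assumed convention).
  const q = (p: number): number => {
    const pos = p * (s.length - 1);
    const lo = Math.floor(pos);
    const hi = Math.ceil(pos);
    return s[lo] + (s[hi] - s[lo]) * (pos - lo);
  };
  const mean = s.reduce((acc, x) => acc + x, 0) / s.length;
  return { n: s.length, min: s[0], median: q(0.5), p90: q(0.9), max: s[s.length - 1], mean };
}

// Report line that states the empty case explicitly rather than printing NaN.
function formatDist(label: string, xs: number[]): string {
  const d = describeOrNull(xs);
  if (d === null) return `${label}: no slots produced attempts`;
  return `${label}: n=${d.n} min=${d.min.toFixed(3)} median=${d.median.toFixed(3)} p90=${d.p90.toFixed(3)} max=${d.max.toFixed(3)} mean=${d.mean.toFixed(3)}`;
}
```

Returning `null` forces every caller to handle the empty case at the type level, which is harder to overlook than a NaN that only shows up in printed output.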

🟡 LOW envInt accepts non-integer string prefixes — src/autodata/powered.ts

Lines 64-70 use Number.parseInt(raw, 10), which silently accepts inputs like '14abc' as 14 or '0x10' as 0 for slot/sample/retry counts. This is a silent misconfiguration risk. Fix by validating with /^\d+$/ before parsing and throwing for non-integer strings.
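A minimal sketch of the stricter parse, assuming the knob value is passed in as a string. `parseStrictInt` is a hypothetical helper rather than the repo's `envInt`; only the `/^\d+$/` check comes from this finding.

```typescript
// Parse a non-negative integer knob, rejecting anything that is not all digits.
// Unset or empty values fall back to the provided default.
function parseStrictInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  if (!/^\d+$/.test(raw)) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`);
  }
  return Number.parseInt(raw, 10);
}
```

A caller would pass the raw environment value, for example the value of `AUTODATA_SLOTS_PER_DOC`, together with the code default. Inputs such as `14abc` or `0x10` then fail loudly instead of being truncated at the first non-digit.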


tangletools · 2026-06-26T15:47:13Z · trace

@tangletools tangletools left a comment


✅ Approved — 10 non-blocking findings — 07ee88dd

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-26T15:47:13Z · immutable trace

@drewstone drewstone merged commit 035a8d4 into main Jun 26, 2026
1 check passed