feat(autodata): powered accept-rate with CI — settle whether the loop reliably discriminates by drewstone · Pull Request #44 · tangle-network/agent-knowledge

drewstone · 2026-06-26T14:49:31Z

What this settles

At n=3 the Autodata accept result was noise (one run 2/3, an independent re-run 0/3). This runs it at power and reports the accepted-rate with a confidence interval so n=1–2 luck can't fake it.

Verdict: WORKS AT POWER — not a coin-flip. Accept-rate 38%, Wilson 95% CI [23%, 55%] (12 of 32 slots cleared weak<0.5 ∧ strong≥0.65 ∧ gap≥0.2). The CI lower bound excludes ~0, and even the harder of the two docs (mixtral, 19% [7%, 43%]) excludes 0.
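
The three-part accept rule quoted above can be written as a small predicate. This is only a sketch: the type and function names are hypothetical, and it assumes the gap is defined as strong score minus weak score, which the description implies but does not state.

```typescript
// Hypothetical shape for the three gates quoted in the verdict.
interface Thresholds {
  maxWeak: number;   // weak solver must score strictly below this
  minStrong: number; // strong solver must score at least this
  minGap: number;    // strong - weak must be at least this (assumed definition)
}

const DEFAULT_THRESHOLDS: Thresholds = { maxWeak: 0.5, minStrong: 0.65, minGap: 0.2 };

interface GateResult {
  weakLo: boolean;
  strongHi: boolean;
  gapWide: boolean;
  accept: boolean;
}

// Evaluate each gate separately so a reject can be attributed to the gate that failed.
function checkGates(
  weak: number,
  strong: number,
  thr: Thresholds = DEFAULT_THRESHOLDS,
): GateResult {
  const weakLo = weak < thr.maxWeak;
  const strongHi = strong >= thr.minStrong;
  const gapWide = strong - weak >= thr.minGap;
  return { weakLo, strongHi, gapWide, accept: weakLo && strongHi && gapWide };
}
```

Returning the individual gate flags alongside `accept` is what makes a per-gate breakdown (which gate binds most often) possible later.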

Design (powered, fixed-slots)

32 independent slots (each = one full challenger→refine→accept cycle), samples=4 per solver, maxRetries=2.
2 non-memorized grounding docs, 16 slots each: Mixtral (2401.04088) + DeepSeek-V3 (2412.19437), both post-llama-3.1-8b-cutoff MoE papers, prose chunks (LaTeX breaks the challenger's strict-JSON).
CIs are agent-eval's published wilson / pairedBootstrap — not hand-rolled. Stats recomputed from the on-disk per-attempt JSONL (interruption-safe).
Reuses the existing buildAutodataDataset loop unchanged — it already runs exactly target independent slots; only cross-slot aggregation + the two CIs are added.
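
For readers unfamiliar with the paired bootstrap used for the gap-widening CI, the idea can be sketched as follows. This is a generic percentile bootstrap over per-slot deltas with a seeded generator, written for illustration; it is not agent-eval's `pairedBootstrap`, whose exact signature and resampling details are not shown in this PR. All names and the sample numbers are hypothetical.

```typescript
// mulberry32: a small seeded PRNG so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDelta: number;
  lo: number;
  hi: number;
}

// Percentile bootstrap on paired differences (after[i] - before[i]).
function pairedBootstrapSketch(
  before: number[],
  after: number[],
  resamples = 2000,
  seed = 1,
): BootstrapResult {
  if (before.length !== after.length || before.length === 0) {
    throw new Error("paired arrays must be non-empty and of equal length");
  }
  const deltas = after.map((v, i) => v - before[i]);
  const n = deltas.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample slots with replacement, keeping each pair intact.
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (resamples - 1))];
  const hi = means[Math.ceil(0.975 * (resamples - 1))];
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  return { meanDelta, lo, hi };
}

// Made-up plain and refined gaps for four slots, purely for illustration.
const sketch = pairedBootstrapSketch([0.1, 0.2, 0.0, 0.3], [0.2, 0.35, 0.1, 0.5]);
console.log(sketch);
```

Because whole pairs are resampled, slot-to-slot variation in difficulty cancels out of the delta, which is why a paired interval can exclude 0 even when the two marginal gap distributions overlap.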

Numbers

metric	value
accept-rate	38% CI [23%, 55%] (12/32)
accept-rate (producing slots only)	44% CI [28%, 63%] (12/27, excl. 5 LaTeX-in-JSON challenger failures)
— mixtral	19% CI [7%, 43%]
— deepseek-v3	56% CI [33%, 77%]
gap-widening Δ (plain→refined)	mean +0.103, CI [+0.029, +0.193] (paired bootstrap, n=27) — excludes 0
binding gate (weak<0.5)	39% of attempts (strong≥0.65: 88%, gap≥0.2: 52%)
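
The accept-rate rows above can be checked by hand with the standard Wilson score interval. The function below is a minimal sketch of that formula with z = 1.96; it is not agent-eval's `wilson`, and the name `wilsonSketch` and the choice to throw on n = 0 are assumptions made for this example. The per-doc counts 3/16 and 9/16 are inferred from the reported 19% and 56% over 16 slots each.

```typescript
interface Interval {
  rate: number;
  lo: number;
  hi: number;
}

// Wilson score interval for a binomial proportion.
function wilsonSketch(successes: number, n: number, z = 1.96): Interval {
  if (n <= 0) throw new Error("n must be positive in this sketch");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { rate: p, lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}

const pct = (x: number): number => Math.round(x * 100);

// Rows from the table: pooled, producing slots only, mixtral, deepseek-v3.
for (const [k, n] of [[12, 32], [12, 27], [3, 16], [9, 16]]) {
  const ci = wilsonSketch(k, n);
  console.log(`${k}/${n}: ${pct(ci.rate)}% CI [${pct(ci.lo)}%, ${pct(ci.hi)}%]`);
}
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at small n, which matters for the 16-slot per-doc rows.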

Total live spend: $0.57 (32-slot run; ~$1.0 incl. pilots).

Two autopsied accepted examples (in docs/results/autodata-live.md) confirm real weak-fails-strong-derives discrimination on both docs — not a judge artifact or leakage.

The honest caveats (quantified, neither overturns the verdict)

Doc-dependent: 19% (mixtral) → 56% (deepseek-v3); pooled 38% is a real two-doc average.
The binding constraint is the weak model's competence, not the method: llama-3.1-8b struggles on only ~39% of attempts (weak median 0.55). The rate is a property of the tier gap — exactly what the discriminative reward should measure.

Verify

pnpm typecheck && pnpm lint && pnpm build && pnpm test all green (203 tests pass, 12 live-gated skipped). New offline coverage in src/autodata/powered.test.ts.

Reproduce: dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts

… reliably discriminates

Run a fixed K=32 independent slots (16 each over two non-memorized MoE papers, samples=4) through the causal-challenger -> refine -> accept loop and report the accepted-rate with a Wilson 95% CI, the per-slot gap distribution, and the plain-vs-refined gap-widening with a paired-bootstrap CI. This settles the n=3 noise (one run 2/3, a re-run 0/3): acceptance is a real 38% rate, CI [23%, 55%] (12/32), not a coin-flip ~0.

- powered.ts: a fixed-slots harness over the existing buildAutodataDataset (which already runs exactly `target` independent slots); only the cross-slot aggregation + the two CIs are added. CIs are agent-eval's published `wilson` / `pairedBootstrap`, never hand-rolled. Stats are recomputed from the on-disk per-attempt JSONL so an interrupted run loses no data. Surfaces challenger-stage (LaTeX-in-JSON) failures separately so they are not silently miscounted as discrimination rejects.
- powered.test.ts: offline unit coverage for analyzeTrails (denominator = requested target, per-slot best-gap pairing, the accept-rule decomposition, multi-doc aggregation).
- docs/results/autodata-live.md: replace the noisy n=3 section with the powered rate + CI, the gap distribution, and two autopsied accepted examples (real weak-fails-strong-derives on both docs).

tangletools

✅ Auto-approved PR — `07ee88dd`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T14:49:39Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	1 (1 low)
Heuristic	0.0s
Duplication	0.0s
Interrogation	137.8s (2 bridge agents)
Total	137.8s

💰 Value — sound

Adds a fixed-slots powered measurement harness over the existing autodata loop, reporting accept-rate with Wilson CIs and gap-widening with paired-bootstrap CIs so the n=3 coin-flip question can be settled at power.

What it does: Introduces src/autodata/powered.ts (and tests) that runs a configured number of independent autodata slots across two non-memorized grounding docs, writes per-attempt JSONL trails, and recomputes durable statistics from those trails: accept-rate with a Wilson 95% CI, per-slot best-gap distribution, plain→refined gap-widening with a paired-bootstrap CI, accept-rule decomposition, and per-doc brea
Goals it achieves: Settles whether the causal-challenger autodata loop reliably manufactures discriminating examples (accept-rate bounded away from zero) after the prior n=3 runs were too noisy to distinguish a real rate from ~0. It also makes the measurement interruption-safe and bounded-cost, with explicit confidence intervals and a cost gate.
Assessment: Good change, coherent and well-scoped. It layers on top of the existing loop without rewriting it, follows the repo's patterns (env knobs, smokeTestModels cost gate, JSONL trails, agent-eval stats, fail-loud), and adds offline unit tests for the aggregation edge cases (denominator = requested target, multi-doc pooling, condition decomposition).
Better / existing approach: none — this is the right approach. I checked: agent-eval exports wilson and pairedBootstrap but does not export a generic quantile/describe helper (its quantile is internal to matrix aggregation), so the local descriptive helpers are not duplication. The existing run.ts and calibrate.ts runners are single-doc / until-N-accepted and do no cross-slot CI aggregation, so there is no exis
Model: opencode/kimi-for-coding/k2p7
Bridge attempts: 1

🎯 Usefulness — sound

A well-fit powered-measurement runnable (third sibling of run.ts/calibrate.ts) that reuses the existing dataset loop + agent-eval's published CIs to settle whether the autodata loop reliably discriminates — reachable, on-pattern, interruption-safe.

Assessment: Net positive, not dead-end. It settles the actual question the prior n=3 run left open (accept-rate with a CI that excludes ~0) and emits a machine-readable RESULT_JSON (powered.ts:441) plus feeds the cited docs/results/autodata-live.md. The per-doc and per-condition decomposition is genuinely informative (identifies weak<0.5 as the binding gate), and analyzeTrails is reusable for any future re-
Integration: Reachable and wired correctly. powered.ts is a standalone runnable guarded at powered.ts:445 (process.argv[1].endsWith('powered.ts')), invoked exactly like the sibling pnpm tsx src/autodata/run.ts / calibrate.ts. analyzeTrails, DocTrail, PoweredStats are re-exported from src/autodata/index.ts:39 for programmatic re-use. It composes only existing pieces: buildAutodataDataset (build-
Fit with existing patterns: Excellent — fits the established grain precisely. It is a third sibling of run.ts and calibrate.ts: identical shape (env knobs via a local envInt, cost-gate smoke test, ground doc, call buildAutodataDataset, print a prose report). The inferential stats reuse agent-eval's published wilson/pairedBootstrap rather than hand-rolling, which matches the AGENTS.md layering rule that agent-knowledg
Real-world viability: Holds up off the happy path. Interruption-safe by design: final stats are recomputed from the on-disk JSONL via readTrail, which treats a missing trail file as an empty trail (powered.ts:302, ENOENT → []) so all-challenger-failed slots still count against the denominator; this is explicitly tested at powered.test.ts:42-52. analyzeTrails is pure + seeded-deterministic (powered.ts:180), so re-ru
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added src/autodata/powered.ts

console.log('\n══════════════════════════════════════════════════════════════════════════')

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260626T153836Z}

tangletools · 2026-06-26T15:47:16Z

✅ No Blockers — `07ee88dd`

Readiness 80/100 · Confidence 70/100 · 10 findings (1 medium, 9 low)

	opencode-kimi	glm	deepseek	aggregate
Readiness	80	83	85	80
Confidence	70	70	70	70
Correctness	80	83	85	80
Security	80	83	85	80
Testing	80	83	85	80
Architecture	80	83	85	80

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Condition decomposition thresholds hardcoded, diverge from configurable accept rule — src/autodata/powered.ts

Lines 171-173 hardcode minStrong=0.65, maxWeak=0.5, minGap=0.2 for the per-attempt condition decomposition (lines 238-241). But buildAutodataDataset accepts DiscriminativeThresholds (build-dataset.ts:22-26) that can override these thresholds. If a caller runs the loop with custom thresholds and then imports analyzeTrails to decompose the resulting trails, the strongHi/weakLo/gapWide/all fractions will be computed against the wrong gates. The per-slot accept-rate is unaffected (it uses decision.accept). Fix: accept optional thr

🟡 LOW Reproduce command does not export the env vars needed to match the headline config — docs/results/autodata-live.md

The Design section (lines 66-69) and the reproduce comment (line 149) state the headline run used AUTODATA_SLOTS_PER_DOC=16 and AUTODATA_MAXRETRIES=2. But the reproduce command at line 148 (pnpm tsx src/autodata/powered.ts) runs with code defaults, which are slotsPerDoc=14 and maxRetries=3 (powered.ts:377-379). The knobs are listed only as an inert comment (# knobs: ...), not exported, so

🟡 LOW Unexplained n=33 attempt denominator — docs/results/autodata-live.md

Lines 82-89 report the weak/strong score distributions and accept-rule decomposition over '33 quality-clean attempts', while the experimental design just above describes 32 total slots (16 per doc) with 5 challenger-stage failures leaving 27 producing slots. The relationship between 32 slots, 27 producing slots, and 33 attempts is not stated, so a reader has to infer that a slot may emit multiple quality-clean candidates across the refine fold. Concrete: the condition table says 'weak < 0.50 = 39%' (≈13/33) and 'all-three = 36%' (≈12/33), while the headline accept-rate is 12/32 slots. Impact: minor friction reconciling denominators; not a numerical error.

🟡 LOW AUTODATA_DOCS tag interpolated into paths unsanitized — src/autodata/powered.ts

Lines 413, 417, and 432 build paths data/powered-${doc.tag}.jsonl and data/powered-${doc.tag}-attempts.jsonl from AUTODATA_DOCS segments without rejecting path separators or `..`. A tag like ../../tmp/x would cause writes/reads/deletes outside data/. Fix by validating tags against /^[a-zA-Z0-9_-]+$/ or sanitizing the path segment.
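
A minimal sketch of the suggested validation, using the regular expression from the finding. The helper name `trailPathFor` is hypothetical and only the first of the two path templates is shown.

```typescript
// Allow only characters that cannot form a path separator or a ".." segment.
const SAFE_TAG = /^[a-zA-Z0-9_-]+$/;

// Build the trail path for a doc tag, refusing anything that could escape data/.
function trailPathFor(tag: string): string {
  if (!SAFE_TAG.test(tag)) {
    throw new Error(`unsafe doc tag: ${JSON.stringify(tag)}`);
  }
  return `data/powered-${tag}.jsonl`;
}
```

Rejecting bad tags outright, instead of stripping characters, keeps two different inputs from silently mapping to the same trail file.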

🟡 LOW Biome lint warnings in new code — src/autodata/powered.ts

Lines 441 (console.log('RESULT_JSON ' + ...)) and 445 (if (process.argv[1] && process.argv[1].endsWith(...))) trigger lint/style/useTemplate and lint/complexity/useOptionalChain. Both are auto-fixable. Leaving them creates noise in the project's lint output. Fix by running biome check --write src/autodata/powered.ts for these two lines.

🟡 LOW Cost gate omits the judge model — src/autodata/powered.ts

Line 392 calls smokeTestModels with only [CHALLENGER_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL]. The loop still invokes the judge via buildAutodataDataset, so a dead or misconfigured AUTODATA_JUDGE_MODEL is not caught before the burn. Add JUDGE_MODEL to the smoke-test list so all four live roles are gated.

🟡 LOW Quality-failed plain drafts contaminate the paired widening CI — src/autodata/powered.ts

In analyzeTrails (lines ~270-275), earliest is selected from ALL rows via [...rows].sort((a,b) => a.iteration - b.iteration)[0], without filtering on qualityOk. Meanwhile bestGap (the refined side of the pair) is computed from quality-clean attempts only. So if the first (plain) draft fails the quality gate (leak/rubric) but a later refined draft is clean, the paired delta = cleanBestGap - failedPlainGap pairs an invalid baseline with a valid refined gap. The conditions decomposition correctly excludes quality-failed attempts (line ~283: if (!r.qualityOk) continue), but the widening pairing does not — an inconsistency. A quality-failed plain draft could have an artificially low gap (scores from a leaky example), inflating widening.meanDelta upward. Fix: filter earliest to q
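
One way to keep both sides of the pair on the same footing is to pick the baseline from quality-clean rows only. The sketch below assumes a hypothetical row shape with `iteration`, `qualityOk`, and `gap` fields; the real trail rows in powered.ts may differ.

```typescript
// Hypothetical per-attempt row; field names are assumptions for this sketch.
interface AttemptRow {
  iteration: number;
  qualityOk: boolean;
  gap: number;
}

// Pair the earliest quality-clean gap with the best quality-clean gap for one slot.
// Returns null when the slot has no quality-clean attempt to pair.
function pairForWidening(rows: AttemptRow[]): { plainGap: number; bestGap: number } | null {
  const clean = rows.filter((r) => r.qualityOk);
  if (clean.length === 0) return null;
  const earliest = [...clean].sort((a, b) => a.iteration - b.iteration)[0];
  const bestGap = Math.max(...clean.map((r) => r.gap));
  return { plainGap: earliest.gap, bestGap };
}
```

A stricter alternative would drop the slot from the pairing whenever its first draft fails the quality gate, since the earliest clean row is then already a refined draft and not a plain one.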

🟡 LOW Two biome lint warnings in powered.ts — src/autodata/powered.ts

Line 441: console.log('RESULT_JSON ' + JSON.stringify(...)) triggers lint/style/useTemplate (prefer template literal). Line 445: process.argv[1] && process.argv[1].endsWith(...) triggers lint/complexity/useOptionalChain (prefer ?.). Both are auto-fixable nits, non-blocking.

🟡 LOW describe() emits NaN fields when every slot's challenger fails — src/autodata/powered.ts

If every slot produces zero attempts (all challenger failures), bestGapPerSlotAll and plainGapsPaired/refinedGapsPaired are empty. describe([]) returns {n:0, min:NaN, median:NaN, p90:NaN, max:NaN, mean:NaN}. The dist() formatter then prints 'NaN' values in the report. Not a crash (wilson(0,0) and pairedBootstrap([],[]) are safe per verified source), and unlikely given the cost gate, but the report degrades silently. Consider short-circuiting printReport with a 'no slots produced attempts' message when slotsWithAttempts === 0.
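
The empty-input case can be made explicit so the report never prints NaN. This sketch uses hypothetical names (`describeOrNull`, `formatDist`) and a linear-interpolation quantile; the local helpers in powered.ts may compute quantiles differently.

```typescript
interface Summary {
  n: number;
  min: number;
  median: number;
  p90: number;
  max: number;
  mean: number;
}

// Linear-interpolation quantile over an already sorted, non-empty array.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Return null for empty input so callers must handle the case explicitly.
function describeOrNull(xs: number[]): Summary | null {
  if (xs.length === 0) return null;
  const sorted = [...xs].sort((a, b) => a - b);
  const mean = sorted.reduce((s, x) => s + x, 0) / sorted.length;
  return {
    n: sorted.length,
    min: sorted[0],
    median: quantile(sorted, 0.5),
    p90: quantile(sorted, 0.9),
    max: sorted[sorted.length - 1],
    mean,
  };
}

// Format one report line, with a clear message when there is nothing to summarise.
function formatDist(label: string, xs: number[]): string {
  const d = describeOrNull(xs);
  if (d === null) return `${label}: no slots produced attempts`;
  const f = (v: number): string => v.toFixed(3);
  return `${label}: n=${d.n} min=${f(d.min)} median=${f(d.median)} p90=${f(d.p90)} max=${f(d.max)} mean=${f(d.mean)}`;
}
```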

🟡 LOW envInt accepts non-integer string prefixes — src/autodata/powered.ts

Lines 64-70 use Number.parseInt(raw, 10), which silently accepts inputs like '14abc' as 14 or '0x10' as 0 for slot/sample/retry counts. This is a silent misconfiguration risk. Fix by validating with /^\d+$/ before parsing and throwing for non-integer strings.
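
A sketch of the stricter parse the finding asks for. The name `envIntStrict` and the choice to pass the environment in as a plain record (so the function can be tested offline) are assumptions; the existing `envInt` may have a different signature.

```typescript
// Parse a non-negative integer knob, failing loudly on anything else.
function envIntStrict(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  // Digits only: rejects "14abc", "0x10", "-3", "1.5", and surrounding whitespace.
  if (!/^\d+$/.test(raw)) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`);
  }
  return Number.parseInt(raw, 10);
}

// Usage in a runner would look like: envIntStrict(process.env, "AUTODATA_SLOTS_PER_DOC", 16)
```

With radix 10, `Number.parseInt` stops at the first non-digit, so a typo shrinks or zeroes a count without raising an error, and the run completes with the wrong configuration.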

_{tangletools · 2026-06-26T15:47:13Z · trace}

tangletools

✅ Approved — 10 non-blocking findings — `07ee88dd`

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-26T15:47:13Z · immutable trace}

tangletools approved these changes Jun 26, 2026

tangletools reviewed Jun 26, 2026

drewstone merged commit 035a8d4 into main Jun 26, 2026
1 check passed

drewstone merged 1 commit into main from autodata/powered-accept-rate
