Skip to content

feat(supervise): propagated analyze knob — analyst feeds the driver (additive)#404

Merged
drewstone merged 6 commits into
mainfrom
feat/supervise-analyze-knob
Jun 28, 2026
Merged

feat(supervise): propagated analyze knob — analyst feeds the driver (additive)#404
drewstone merged 6 commits into
mainfrom
feat/supervise-analyze-knob

Conversation

@drewstone

@drewstone drewstone commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

The self-improving supervisor — keystone + loop, end-to-end

A driver-steered, self-improving supervisor: the driver steers the worker, an analyst feeds the driver each round ("here's the last round, compose the next prompt"), the driver's prompt is GEPA-optimized on a frozen holdout, and a cost-aware ablation measures what actually helps. Purely additive — every new knob defaults off, so the status-quo supervisor is byte-identical.

1. Keystone — the propagated analyze knob (additive, every level)

supervise() / supervisorAgent / driverAgent gain optional analysts + analyzeOnSettle, threaded to createCoordinationTools (router arm) AND serveCoordinationMcp (sandbox arm). On a worker settle, each analyst lens runs → a finding the driver pulls (await_event) and composes its next steer from — the self-improving up-leg that was previously unreachable through supervise(). Defaults off → status quo byte-identical. 82 supervise/coordination tests green.

2. The loop layers (on the keystone)

  • surface-worker.ts — the graded-worker seam: a makeWorkerAgent running runAgentic over an AgenticSurface, settling with the surface score as the deliverable verdict (settled ⟺ resolved, not a self-report).
  • gepa-driver-prompt.ts — optimize the driver's compose-prompt on TRAIN with an executable JudgeConfig (selfImprove + gepaProposer reading the surface score, not an LLM judge), frozen, certified.
  • self-improving-supervisor.ts — the one-call DX composing supervise(+analyze) over the graded seam.
  • ablation.tsdriverSteer + optimize knobs wired (blind / steered / self-improving over the same supervise()), per-task resilience, cost + paired-bootstrap significance autopsy.

3. Live e2e — proven, no mocks

The driver-steered supervisor (analyze on) over the contamination-proof generated coding task: resolved=true score=1.00 $0.031 / 141s — the driver spawned a worker, the analyst fed the driver, a worker resolved the task, the graded winner was delivered. The loop closes end-to-end.

4. #402 hardening (folded in)

swe-bench-env: temp-dir leak fix (try/catch + rmSync on clone/checkout failure), symlink realpath-jail (read + edit), workspaces Map into the closure, exported jailPath/isTestPath + unit tests. self-improving-coder: read_file path guard (symmetric with write_file), dead run_tests handler removed. swe-self-improve moved to bench/src/ (now under typecheck). ablation per-task try/catch + biome.

Supersedes #402 (examples + instrument folded here). Typecheck clean (examples + core); biome clean.

Honest framing: this is the trustworthy instrument. Whether smart coordination beats simple at equal cost is what the ablation measures next on SWE-bench — and across every clean experiment so far, smart has tied simple. The point of the instrument is to kill our own hype with receipts.

…additive

supervise() / supervisorAgent / driverAgent gain optional analysts + analyzeOnSettle, threaded to
createCoordinationTools (router arm) AND serveCoordinationMcp (sandbox arm). When a worker settles done,
each configured analyst lens runs and re-enters as a finding the driver pulls (await_event) and composes
its next steer from — the self-improving UP-leg that was previously unreachable through supervise() (it
lived only in a hand-wired createCoordinationTools). Purely additive: every field optional, defaults off,
so status-quo supervise(profile, task, {backend, budget}) is byte-identical. Sub-driver propagation rides
a recursive makeWorkerAgent (the caller's seam).
@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 388d41b4

Readiness 89/100 · Confidence 65/100 · 3 findings (3 low)

glm deepseek aggregate
Readiness 89 92 89
Confidence 65 65 65
Correctness 89 92 89
Security 89 92 89
Testing 89 92 89
Architecture 89 92 89

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟡 LOW No test exercises analysts/analyzeOnSettle propagation through supervise()/supervisorAgent() — src/runtime/supervise/supervise.ts

The new options are threaded through three wrappers (supervise.ts:151-152, supervisor-agent.ts:137-138 and 160-161, coordination-driver.ts:210-211), but the existing suite only covers the downstream consumer (coordination.test.ts:482 calls createCoordinationTools directly). A regression that drops one of these spread lines would not be caught. Impact: silent feature loss, not correctness. Fix: one supervisor-agent.test.ts case that builds supervisorAgent with analysts + analyzeOnSettle set and asserts the registry reaches the spawned worker settle hook (e.g., via a spy makeWorkerAgent).

🟡 LOW analyzeOnSettle silently no-ops without analysts — src/runtime/supervise/supervise.ts

All three layers (supervise.ts:89, supervisor-agent.ts:100, coordination-driver.ts:67) document that analyzeOnSettle 'Requires analysts', but none of them validate that constraint at construction time. The actual guard lives only in createCoordinationTools at coordination.ts:255 (if … opts.analysts && opts.analyzeOnSettle?.length), so a misconfigured supervisor that sets analyzeOnSettle without analysts silently skips all analysis — the run still completes but wastes compute on a supervisor that appears to be self-improving but isn't. An early ValidationError at supervise() construction would fail loud instead. Not a crash or data risk; fit severity reflects the silent-compute-waste failure mode.

🟡 LOW 'Requires analysts' contract is unenforced at this layer — src/runtime/supervise/supervisor-agent.ts

Docstrings at supervise.ts:88, supervisor-agent.ts:101, and coordination-driver.ts:66 all state analyzeOnSettle requires analysts, but no ValidationError fires when only analyzeOnSettle is set. The downstream guard at coordination.ts:255 (opts.analysts && opts.analyzeOnSettle?.length) silently no-ops, so the user gets status quo instead of a fail-loud. The enforcement belongs in createCoordinationTools (out of shot), but adding a construction-time guard alongside the existing maxTurns/extraTools guards would surface misconfiguration at the layer this shot edits. Low impact — silent no-op, no data loss.


tangletools · 2026-06-28T05:31:35Z · trace

tangletools
tangletools previously approved these changes Jun 28, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 3 non-blocking findings — 388d41b4

Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T05:31:35Z · immutable trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 1 (1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 164.3s (2 bridge agents)
Total 164.3s

💰 Value — sound

Threads an already-implemented analyst-feed primitive (analysts + analyzeOnSettle) through the three higher supervise layers that weren't wiring it; pure forwarding, additive, in-grain, no existing equivalent duplicated.

  • What it does: Adds two optional fields — analysts (an AnalystRegistry of trace-analysis lenses) and analyzeOnSettle (analyst kind ids) — to DriverAgentOptions, SupervisorAgentDeps, and SuperviseOptions, then conditionally spreads them into the next layer down. The router arm (driverAgent) forwards them to its inline createCoordinationTools call (coordination-driver.ts:210-211); supervisorAgent f
  • Goals it achieves: Make the self-improving UP-leg ('analyst feeds the driver') reachable from the canonical one-call API supervise(profile, task, opts). Before this PR the mechanism was fully built at the primitive layer (createCoordinationTools + serveCoordinationMcp) but the supervise → supervisorAgent → driverAgent path wired it at zero levels — so the highest-level, documented entry point could not produ
  • Assessment: A clean, minimal, correct forwarding change. (1) It genuinely closes a real reachability gap — verified that driverAgent's inline createCoordinationTools (coordination-driver.ts:204-212) pre-PR built the tools without these fields, so the router arm literally could not fire the analyst hook. (2) It is byte-identical when unset: every field is optional and the conditional-spread idiom (`...(opts.
  • Better / existing approach: none — this is the right approach. The architecture is a deliberate 3-layer onion where each layer owns a typed, doc-commented options interface and forwards individual knobs via conditional spread; every other optional knob in these same files follows this identical pattern, so the alternative (a shared options-bag passed wholesale) would fight the grain across ~8 existing knobs. The analyst mech
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound

Purely additive propagation of an existing, lower-layer analyst-on-settle mechanism up to the canonical supervise/supervisorAgent/driverAgent entry points — matches the established maxLiveWorkers threading pattern exactly.

  • Integration: Correctly threaded to BOTH arms: the router arm (supervisor-agent.ts:127-141 → driverAgent → coordination-driver.ts:204-212 → createCoordinationTools) and the sandbox arm (supervisor-agent.ts:151-162 → serveCoordinationMcp → coordination-mcp.ts:69-79 → createCoordinationTools). Verified pre-PR state had neither field at these upper layers (git show origin/main confirms), so this PR genuinely opens
  • Fit with existing patterns: Mirrors the grain exactly. The ...(opts.x ? { x: opts.x } : {}) propagation is identical to how maxLiveWorkers is already threaded through the same three files. The separate effort.ts:158 withAnalyst: boolean flag is a DIFFERENT concept (scope-level per-agent ScopeAnalyst passed to runPersonified) and does not compete — different layer, different name, different mechanism. No existing equiva
  • Real-world viability: Additive and default-off, so status-quo callers see byte-identical behavior. The analyzeOnSettle doc says 'Requires analysts' but the lower layer enforces this defensively (coordination.ts:255 short-circuits via opts.analysts && opts.analyzeOnSettle?.length), so a misconfiguration silently no-ops rather than crashing — robustness is preserved. run_analyst (coordination.ts:683) uses optional ch
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

💰 Value Audit

🟡 No eager validation that analyzeOnSettle requires analysts (symmetry with existing fail-loud guards) [maintenance] ``

The doc comments on all three new fields state 'Requires analysts', but no layer validates it. coordination-driver.ts already throws eagerly for half-wired seams of the same shape — extraTools without executeExtraTool (lines 171-175) and negative maxTurns (189-193), both following the 'fail loud at construction, not buried in a swallowed act() throw' rule stated at lines 177-186. By contrast, analyzeOnSettle without analysts is a SILENT no-op (coordination.ts:255 guards `opts.analysts && opts.an


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T053222Z

LOOP LAYERS (on the analyze keystone):
- surface-worker.ts: graded-worker seam — a makeWorkerAgent running runAgentic over an AgenticSurface,
  settling with the surface score as the deliverable verdict (driver spawns/steers graded workers).
- gepa-driver-prompt.ts: optimize the driver compose-prompt on TRAIN with an EXECUTABLE JudgeConfig
  (selfImprove + gepaProposer, reading the surface score — not an LLM judge), frozen, certified.
- self-improving-supervisor.ts: one-call DX composing supervise(+analyze) over the graded seam.
- ablation.ts: driverSteer + optimize knobs WIRED (blind/steered/self-improving) over the same
  supervise(); per-task try/catch resilience; cost+significance autopsy intact.
- index.ts: re-export AnalystRegistry + MakeWorkerAgent from the loops barrel (host authors its seam).

#402 HARDENING:
- swe-bench-env: temp-dir leak fix (try/catch+rmSync on clone/checkout fail), symlink realpath-jail
  (read+edit), workspaces Map into the closure, exported jailPath/isTestPath + unit tests.
- self-improving-coder: read_file path guard (symmetric with write_file), dead run_tests handler gone.
- swe-self-improve.mts moved to bench/src/ (now under typecheck).

Typecheck clean (examples + core); biome clean; 82 supervise tests green on the keystone.
tangletools
tangletools previously approved these changes Jun 28, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 1cb0d45c

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T05:59:55Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 1 (1 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 267.5s (2 bridge agents)
Total 267.5s

💰 Value — sound

Threads two already-implemented options (analysts/analyzeOnSettle) from the public supervise() API down to createCoordinationTools (which already ran them), opening the analyst-feeds-driver loop; purely additive, in-grain plumbing with no reinvention.

  • What it does: The true delta vs main is 11 files (+702/-52). The core runtime change (33 lines across coordination-driver.ts, supervise.ts, supervisor-agent.ts) adds optional analysts + analyzeOnSettle fields to SuperviseOptions / SupervisorAgentDeps / DriverAgentOptions and threads them — on both the router arm and the sandbox arm — down to createCoordinationTools, which ALREADY implemented the pub
  • Goals it achieves: Make the self-improving UP-leg reachable from the canonical supervise → supervisorAgent → driverAgent path. Before this, the analyst-on-settle mechanism existed ONLY inside a hand-wired createCoordinationTools call — the top-level supervise() API wired it at zero levels, so a user of the public API could never get 'the analyst feeds the driver.' After merge, `supervise(profile, task, { analy
  • Assessment: Sound and in the grain. The runtime threading is genuinely additive: every field is optional and defaults off (conditional spreads ...(opts.x ? {x} : {})), so a status-quo supervise() call is unchanged. It mirrors the EXACT established pattern already used for maxLiveWorkers/extraTools in all three files — it does not invent a new threading mechanism. Critically, it does NOT reimplement th
  • Better / existing approach: none — this is the right approach. Searched for an existing path-containment utility to reuse for the bench jail (rg over src/ and packages/ for realpathSync/startsWith(... + sep)/isInsideJail): none exists outside the bench dir, so the localized utility is not a duplication. Confirmed the examples do not reimplement loops — selfImprovingSupervisor calls supervise() with the new knob (se
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound

Threads the analyze knob (analysts + analyzeOnSettle) through supervise → supervisorAgent → driverAgent (router arm) and supervisorAgent → serveCoordinationMcp (sandbox arm), closing a real reachability gap where the self-improving up-leg existed in createCoordinationTools but was zero-wired at the

  • Integration: Fully wired and reachable. supervise.ts:151-152 forwards analysts/analyzeOnSettle into supervisorAgent deps; supervisor-agent.ts:137-138 threads them into driverAgent (router arm) and :160-161 into serveCoordinationMcp (sandbox arm), which already accepts them (coordination-mcp.ts:62-76) and forwards to createCoordinationTools. The underlying mechanism — drainSettlement auto-running each configure
  • Fit with existing patterns: Exactly in-grain. Every other supervisor knob (extraTools, maxLiveWorkers, compaction, maxTurns) follows the identical optional-field-threaded-through-each-layer pattern with the same ...(opts.x ? { x } : {}) spread idiom; the analysts/analyzeOnSettle pair is the same shape. No competing pattern exists — the AnalystRegistry type and the finding/settled/bus taxonomy are the established coordinati
  • Real-world viability: Additive and defaults off — a status-quo supervise(profile, task, {backend, budget}) call constructs no analysts registry and the drainSettlement hook at coordination.ts:255 short-circuits on opts.analysts && opts.analyzeOnSettle?.length, so status-quo behavior is byte-identical. A throwing analyst lens is caught by runTool (coordination-driver.ts:351-358) and folded back as an error string the
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  •  console.log(
    

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T060617Z

…ob for resolved/score

The supervise winner result carries the driver's finalize output in result.out (the best-delivered
worker's SurfaceWorkerOut), NOT a verdict field — reading result.verdict?.score always returned 0.
Read resolved/score off result.out. Live e2e verified: driver-steered supervisor (analyze on) over the
contamination-proof codingEnv resolves the task — resolved=true score=1.00 $0.031, the loop closes
end-to-end (driver spawns worker, analyst feeds driver, graded winner delivered).
tangletools
tangletools previously approved these changes Jun 28, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — b16ce792

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T06:08:59Z

@drewstone drewstone dismissed tangletools’s stale review June 28, 2026 06:11

The merge-base changed after approval.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 34a5b702

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T06:11:27Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (1 low, 3 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 375.8s (2 bridge agents)
Total 375.8s

💰 Value — sound-with-nits

Threads an already-built analyst-feed substrate knob up through the one-call supervise() API (the real, in-grain keystone), plus three example/recipe files composing it into a driver-steered + GEPA-tuned supervisor and a bench security hardening — additive end-to-end, with two honest v1 methodology

  • What it does: (1) KEYSTONE (src/runtime/supervise/{supervise,supervisor-agent,coordination-driver}.ts, +31 lines): the substrate already supported analysts + analyzeOnSettle (src/mcp/tools/coordination.ts:99-106,255-261, added prior; coordination-mcp.ts:62-76). This PR propagates those two optional fields up through driverAgent → supervisorAgent (both the router arm at supervisor-agent.ts:134-135 and the sandbo
  • Goals it achieves: Make the self-improving up-leg (analyst reads a settled worker → finding → driver composes the next steer) reachable through the one-call supervise() API, which was the last closed door: the substrate existed only inside hand-wired createCoordinationTools. Secondary: ship a runnable, honest ablation harness that can measure whether driver-steering and GEPA prompt-tuning actually help, and harden t
  • Assessment: The keystone is unambiguously good and the grain is exact: it matches how every other supervise() knob (maxLiveWorkers, compaction, extraTools) is threaded — optional field + conditional spread at each of three layers, both arms. The substrate already existed and was simply unreachable from the one-call door; this opens it with zero behavior change when unset. The examples compose real primitives
  • Better / existing approach: none for the keystone — threading the existing substrate knob through the one-call API is the right approach, matching the exact pattern of every peer knob. For the examples, the v1 simplifications are acknowledged next-increments, not architecture flaws. searched: all MakeWorkerAgent factories (only workerFromBackend exists, different substrate); all selfImprove facades (only improve() [profile-b
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound-with-nits

A coherent, additive keystone — analyzeOnSettle/analysts on supervise() propagates through both arms to a real, tested consumption site — plus first example loop layers that compose existing substrate seams; nothing dead or competing, one honest v1 proxy worth a human's awareness.

  • Integration: Fully wired and reachable. The keystone options thread supervise() (supervise.ts:84-89,151-152) → supervisorAgent (supervisor-agent.ts:97-101) → BOTH arms: the router arm driverAgent (coordination-driver.ts:61-66,210-211) and the sandbox arm serveCoordinationMcp (coordination-mcp.ts:62-64,75-76), both reaching the real consumption site at coordination.ts:255-260 where a worker settle fires
  • Fit with existing patterns: Follows the established grain. surfaceWorkerSeam is a clean sibling of workerFromBackend (supervise.ts:26) — both return a {makeWorkerAgent, deliverable?} pair that supervise() consumes. optimizeDriverPrompt calls the real selfImprove+gepaProposer from agent-eval with an EXECUTABLE judge reading the surface score (gepa-driver-prompt.ts:119-133) — no reinvention, no LLM-judge flattery
  • Real-world viability: Holds up on the happy path and compiles+tests green. Two honest v1 tradeoffs are explicitly documented in-code, not hidden: (1) v1 workers ignore the driver's per-worker brief — each spawn is a fresh refine attempt on the same task (surface-worker.ts:16-18), so the driver's intelligence in v1 is allocation (spawn/stop), not instruction authoring; (2) the GEPA pass optimizes the candidate string
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  •  console.log(
    

💰 Value Audit

🟡 GEPA optimizes the prompt in one role (refine's between-shot analyst) but deploys it in another (the supervisor's standing systemPrompt) [better-architecture] ``

In gepa-driver-prompt.ts:101-113 the candidate steerer is passed as analystInstruction to runAgentic-refine (refine's between-shot steerer, per strategy.ts:369), and the winner is returned as systemPrompt (gepa-driver-prompt.ts:157). But in self-improving-supervisor.ts:77 that winner is deployed as profile.systemPrompt — the supervisor BRAIN's standing instruction, a different role than the refine analyst instruction it was fitness-scored against. The docblock (gepa-driver-prompt.ts:17-20) hones

🟡 v1 worker ignores the driver's brief, so the driverSteer ablation arm cannot show steering's effect [proportion] ``

surface-worker.ts:68-71 (the 'v1 SIMPLIFICATION' comment): each spawn is a fresh refine on the SAME task; the driver's spawn brief / steer is discarded. The code is honest about this ('the driver's intelligence in v1 is allocation ... not per-worker instruction authoring'). But the downstream consequence is that the ablation's driver-steer / driver-gepa arms (ablation.ts deltas) burn supervisor-inference tokens (reading settles, composing steers, running analysts) whose effect on the worker is n

🎯 Usefulness Audit

🟡 GEPA optimizes a proxy lever; train and deploy regimes diverge [problem-fit] ``

optimizeDriverPrompt measures each candidate's fitness as a between-shot analystInstruction inside one refine rollout (gepa-driver-prompt.ts:101-113), then deploys the winner as the driver's systemPrompt in supervise() (self-improving-supervisor.ts:77). Combined with the v1 simplification that spawned workers ignore the driver's brief (surface-worker.ts:16-18), the string GEPA tunes and the string the driver runs govern different control surfaces — training optimizes within-worker shot


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260628T061936Z

@tangletools

Copy link
Copy Markdown
Contributor

❌ Needs Work — 34a5b702

Readiness 4/100 · Confidence 80/100 · 18 findings (2 high, 5 medium, 11 low)

glm deepseek aggregate
Readiness 4 53 4
Confidence 80 80 80
Correctness 4 53 4
Security 4 53 4
Testing 4 53 4
Architecture 4 53 4

Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH GEPA optimize cost silently dropped from arm costUsd — invalidates the fair-cost invariant — examples/ablation-suite/ablation.ts

Lines 144-156 call optimizeDriverPrompt, whose return type (gepa-driver-prompt.ts:163) is {systemPrompt, lift, shipped} with NO cost field. selfImprove with generations=1, populationSize=2 runs a baseline cell + 2 candidate cells across the train slice (each cell = a full refine rollout via runAgentic), so the GEPA pass spends real $ that often exceeds the held-out supervisor run's cost. None of it is added to the arm's usd accumulator (line 202 only adds sup.usd from the held-out loop). The ablation header

🔴 HIGH Supervisor budget mis-sizing prevents multi-worker steering (driverSteer/optimize arms non-functional as designed) — examples/ablation-suite/ablation.ts

Lines 193-196 pass budget: { maxIterations: arm.knobs.budget, maxTokens: (opts.worker.maxTokens ?? 4000) * Math.max(1, arm.knobs.budget) }. With the default BUDGET=2 and maxTokens=4000 this yields a root pool of maxIterations=2, maxTokens=8000. supervise.ts:109 defaultPerWorker forwards budget.maxIterations unchanged (only divides tokens by 4), so each worker reservation = {maxIterations: 2, maxTokens: 2000}. The surfaceWorkerExecutor reports spent.iterations = r.completions (easily >2 for innerTurns=6), and scope.ts:719 clampSpend clamps it back to the reservation ceiling of 2. After ONE worker settles, freeIterations = 2 - 2 = 0; budget.ts

Other

🟠 MEDIUM No integration test exercises the jail through the actual call() tool dispatch — bench/src/swe-bench-env.test.ts

isInsideJail, jailPath, and isTestPath are unit-tested in isolation (lines 7-69), but no test drives environment.call(handle, 'read_file', {path: 'escape/passwd'}) or edit_file against a workspace containing an escaping symlink. The security boundary is the composition safe() -> resolveInJail() -> readFileSync/writeFileSync inside call() (swe-bench-env.ts:135-176); a wiring bug (wrong variable, missing resolveInJail in one branch) would not be caught by the current suite. The test comment at line 46 acknowledges it 'mirrors'

🟠 MEDIUM Data integrity: uncaptured columns rendered as zero in ablation output — examples/ablation-suite/ablation.ts

Lines 170-222: the driverSteer path accumulates only usd from selfImprovingSupervisorti, to, ms, shots, comps remain 0 (code comment at L177-178 acknowledges 'UNCAPTURED, not a real zero'). Lines 235-246 construct ArmResult with these zeroes, and printAutopsy (L258-291) renders 0s latency and 0.0 shots identically to a genuinely-zero-cost run. An ablation reader comparing driver-steer (0s, 0.0 shots) against baseline (real latency/shots) will draw invalid conclusions about cost. F

🟠 MEDIUM progressAnalyst summary always renders '[object Object]' — up-leg is non-functional — examples/ablation-suite/self-improving-supervisor.ts

Line 62: run: async (_kindId, trace) => ({ summary: \worker produced: ${String(trace).slice(0, 400)}` }). The traceargument is the worker's settled blob — aSurfaceWorkerOutobject ({resolved, score, shots, summary}), confirmed at coordination.ts:256const trace = await opts.blobs.get(w.outRef). String(obj)` on a plain object yields the literal '[object Object]', so every analyst finding re-entered on the bus is the identical string 'worker produced: [object Object]'. The module's own header ([lines 4-9](https://github.com/tangle-network/agent-runtime/blob/34a5b702227fd0eeada24c030a7a7a64187bcb03/examples/ablation-suite/self-i

🟠 MEDIUM No test coverage for analysts/analyzeOnSettle at driverAgent, supervisorAgent, or supervise layers — src/runtime/supervise/coordination-driver.ts

The coordination MCP tools have existing analyst tests (coordination.test.ts:320-341 tests list_analysts/run_analyst, coordination.test.ts:482-486 tests analyzeOnSettle hook), but the new pass-through options at the driverAgent, supervisorAgent, and supervise levels have zero test coverage. coordination-driver.test.ts has no analysts-related tests. The supervise/supervisorAgent test files have no references to 'analysts' or 'analyzeOnSettle' at all. A test confirming that driverAgent({..., analysts: mockRegistry, analyzeOnSettle: ['x']}) correctly passes both to createCoordinationTools would close the coverage gap.

🟠 MEDIUM analyzeOnSettle without analysts silently no-ops — missing ValidationError — src/runtime/supervise/coordination-driver.ts

All 4 files document analyzeOnSettle as 'Requires analysts', but no layer throws ValidationError when analyzeOnSettle is non-empty and analysts is unset. The only gate is a soft opts.analysts && check at coordination.ts:255 in drainSettlement, which silently skips analysis. This is inconsistent with the extraTools/executeExtraTool pattern at coordination-driver.ts:171-174 which fails loud with ValidationError. Users who set analyzeOnSettle without analysts will get no findings and no error — a debugging time-sink. Add a ValidationError in driverAgent (matching the extraTools pattern) or at minimum in createCoordinationTools.

🟡 LOW Test hardcodes /etc/passwd and asserts its realpath, reducing portability — bench/src/swe-bench-env.test.ts

The symlink-escape test (lines 53-60) does symlinkSync('/etc', link), then realpathSync(join(dir,'escape/passwd')) and asserts expect(real).toBe('/etc/passwd'). This assumes /etc/passwd exists and /etc is not itself a symlink (true on Linux and macOS, but not guaranteed in minimal/sandboxed CI images where /etc/passwd may be absent). If realpathSync throws on a missing target the test fails for an environmental rather than logic reason. Low severity since the bench harness is Unix-oriented; consider symlinking to a file the test itself creates inside the temp dir.

🟡 LOW jailPath takes an unused _root parameter — bench/src/swe-bench-env.ts

jailPath(_root, p) at line 34 ignores _root; the comment explains it is for 'signature symmetry' with the realpath check. Intentional and documented, but it can trip unused-arg linters and the symmetry justification is weak (callers could pass root only where needed). Not blocking; mention only as a nit.

🟡 LOW list_files escapes realpath jail through repo symlinks — bench/src/swe-bench-env.ts

The new resolveInJail realpath containment check is applied to read_file (line 140) and edit_file (line 160), but list_files (line 106-134) still uses only the string pre-filter (safe=jailPath). If a checked-out repo contains a symlink pointing outside the workspace (e.g., escape -> /etc), list_files("escape") follows it: existsSync follows the symlink, readdirSync lists the target directory's entries. Ve

🟡 LOW list_files is not jail-protected and leaks symlink names (metadata) — bench/src/swe-bench-env.ts

list_files (lines 106-134) uses only the string safe() filter, not resolveInJail. It walks with lstatSync so it does not follow symlinks or read contents, but it does emit symlink entry names (e.g. a repo symlink creds -> /root/.aws/creds appears in the listing). Contents remain protected because read_file/edit_file are realpath-jailed, and this is not a regression (old code behaved identically), so impact is metadata-only. Noted because the PR's theme is jail hardening; a follow-up could skip symlink entries in list_files for defense in depth.

🟡 LOW Thrown task's real spend lost from arm cost accounting — examples/ablation-suite/ablation.ts

Lines 171-223: the try/catch counts a thrown task as unresolved (perTask.push(0)) but the throw escapes before usd += sup.usd / usd += r.usd. For runAgentic the throw is post-hoc (strategy.ts:1048-1053 throws AFTER the supervisor run completed and spent real $), and for the supervisor path a network/quota throw likewise follows real conserved spend. The ablation header asserts '$ are real' for driverSteer arms, but a single thrown task silently drops its full $ from the arm total. Low severity for an example, but contradicts the module's own honesty claim. FIX: wrap each branch in its own try that captures partial spend before re-throwing into t

🟡 LOW No guard on empty train set in optimizeDriverPrompt — examples/ablation-suite/gepa-driver-prompt.ts

Line 78: const trainTasks = await opts.tasks(opts.trainOffset, opts.trainN). If the task generator returns fewer tasks than requested, scenarios can be empty or shorter. With holdoutFraction: 0.34 applied to a near-empty set, the held-out gate has insufficient scenarios to certify the winner. selfImprove likely handles this internally, but a guard (e.g. if (scenarios.length < 2) throw ...) at the call site would prevent silent no-op optimization passes that report a lift of 0.

🟡 LOW Cast-based type recovery from supervise winner output — examples/ablation-suite/self-improving-supervisor.ts

Lines 95-97: const out = result.kind === 'winner' ? (result.out as SurfaceWorkerOut | undefined) : undefined. The type system cannot verify that result.out is a SurfaceWorkerOut — it relies on the co-located invariant that makeWorkerAgent was built by surfaceWorkerSeam. If this function is forked and composed with a different makeWorkerAgent, the cast silently produces a structurally-wrong object. Consider a tag/discriminant on the output or a runtime shape check for robustness.

🟡 LOW deliverable passed to supervise() is dead — makeWorkerAgent override ignores it — examples/ablation-suite/self-improving-supervisor.ts

Line 86 passes deliverable: seam.deliverable to supervise(), but supervise.ts:131-139 only consumes opts.deliverable when deriving the worker seam via workerFromBackend(opts.backend, opts.deliverable) — a path skipped entirely when opts.makeWorkerAgent is supplied (which this seam does). The seam's makeWorkerAgent already encodes verdict: { valid: r.resolved } in the executor result, so worker settlement IS correctly gated to resolved — but the explicit deliverable argument is decorative and misleads anyone reading the call into thinking supervise uses it. Either drop the argument or document that the gate lives in the executor's verd

🟡 LOW progressAnalyst String() coercion may produce [object Object] — examples/ablation-suite/self-improving-supervisor.ts

Lines 59-60: run: async (_kindId: string, trace: unknown) => ({ summary: worker produced: ${String(trace).slice(0, 400)} }). For non-string, non-primitive trace values, String() returns '[object Object]', giving the driver a useless one-liner. The SurfaceWorkerOut.summary is a string so the current surface-worker path is fine, but the analyst registers under the generic AnalystRegistry interface where trace is unknown. Use typeof trace === 'string' ? trace.slice(0, 400) : JSON.stringify(trace).slice(0, 400) to handle arbitrary shape.

🟡 LOW Worker executor ignores spawn-scoped abort signal — examples/ablation-suite/surface-worker.ts

Lines 62-64: async execute() declares zero parameters, but Executor.execute (types.ts:84) receives (task, signal). runAgentic (strategy.ts:1030) does not accept an AbortSignal either. So when the supervisor trips the intensity breaker or exhausts budget and aborts the child (scope.ts:605 childAbort.signal.aborted), the in-flight worker's refine rollout runs to completion anyway. The worker self-limits via its own budget/innerTurns, so this is not a liveness bug, but abort-propagation — the substrate's intended cancellation channel — is absent. Acceptable for an example; worth a one-line comment so readers don't copy it into a long-running

🟡 LOW analyzeOnSettle-requires-analysts invariant documented but not enforced — src/runtime/supervise/coordination-driver.ts

All three option interfaces (DriverAgentOptions:66-67, SupervisorAgentDeps:99-101, SuperviseOptions:85-89) document 'Requires analysts' for analyzeOnSettle, but no construction-time check enforces it. The codebase explicitly fail-loud-checks analogous paired requirements at coordination-driver.ts:171-175 ('extraTools requires executeExtraTool … a silent no-op the house rules forbid') and at coordination-driver.ts:189-193 (maxTurns>=0). A caller wiring { analyzeOnSettle: ['profileRichness'] } without analysts gets silently no analysis (coordination.ts:255 short-circuits on opts.analysts && …) — exactly the silent no-op the existing house-rules comment forbids. Fix: add to the validation block in driverAgent() around [line 171](https://github.com/tangle-network/agent-runtime/blob/34a5b


tangletools · 2026-06-28T06:31:05Z · trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ 2 Blocking Findings — 34a5b702

Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 11 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-28T06:31:05Z · immutable trace

…egrity, fail-loud, docs

- coordination-driver: fail loud when analyzeOnSettle is set without analysts (matches the extraTools
  guard — no silent no-op) + a keystone test (fail-loud + analysts/analyzeOnSettle pass-through).
- self-improving-supervisor: the progress analyst read the worker blob via String() → '[object Object]';
  now reads the real SurfaceWorkerOut fields (the up-leg feeds a real summary). Size perWorker so MULTIPLE
  workers fit the pool (the default reserved the whole iteration pool per worker → only one ever spawned,
  defeating spawn-a-refined-worker steering). Return the full conserved spend (tokens/latency), not just $.
- gepa-driver-prompt: return the TRAIN-side optimization cost (totalCostUsd).
- ablation: count GEPA's train cost into the arm $ (fair-cost invariant); capture the supervisor arm's
  real tokens/latency (kill the fake-zero columns); size the supervisor budget for multi-worker steering.
- docs/api regenerated (freshness gate: keystone shifted supervisor-agent line numbers).

Addresses the multi-shot reviewer's 2 HIGH + the correctness/cost/coverage MEDIUMs on 34a5b70.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 0470036a

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-28T07:17:40Z

@drewstone drewstone merged commit 0e1cbab into main Jun 28, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants