feat(bench): score ToolLLM API-selection subset#411
Conversation
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 1 (1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 114.7s (2 bridge agents) |
| Total | 114.7s |
💰 Value — sound-with-nits
Flips the ToolLLM adapter from a score-refusing stub into a real deterministic API-selection judge (recall/precision over published relevant APIs), built in the exact grain of agentbench.ts — ship.
- What it does: Replaces the ToolLLM adapter's refusing judge with a deterministic API-selection scorer: it normalizes each row's
relevant APIsinto [tool,api] pairs, parses the worker's emitted JSON{api_calls:[...]}(with a substring fallback), and scores recall + precision over the expected set, exposing the expected set as goldArtifact and throwing loud at load/preflight when rows lackrelevant APIs. - Goals it achieves: Make ToolLLM actually scoreable in agent-bench without invoking the stochastic, LLM-judged ToolEval pass-rate — turning a load-only stub into a contributing deterministic benchmark, and satisfying the goldArtifact self-calibration contract (types.ts:45) so the judge can be verified before spending model tokens.
- Assessment: Coherent and in-grain. The structure (rowToTask→readMeta→selectRows→fixtures/official split→goldArtifact→JSON-detail BenchScore) is a near-1:1 echo of agentbench.ts:61-161; the deterministicJudge:'api-selection' literal and assertDeterministicSubset guard enforce the repo's fail-loud-on-nondeterminism rule; the score math is correct against the fixture (2 gold APIs: empty→0, gold→1, one-of-two→0.5
- Better / existing approach: none — this is the right approach. Checked for a shared set-precision/recall helper (hotpotqa.ts:114 tokenF1 is token-bag F1, a different metric — reuse would be wrong; no set-level helper exists and one caller doesn't justify extracting one). The fenced-JSON regex (toollm.ts:153) is the 10th copy across adapters but is the established per-language grain (agentbench/tau2/swe-bench/appworld each tu
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
Upgrades the ToolLLM adapter from a score-refusing stub into a working, deterministic API-selection recall scorer over published ToolBench labels, fully wired into the benchmark registry and gate runner.
- Integration: Fully reachable. Registered at bench/src/adapters.ts:45 (
toollm: createToolLlmAdapter); consumed by every gate runner viaresolveAdapter('toollm')— runGate (bench/src/gate.ts:330-336) calls preflight/loadTasks/output.parse/judge, corpus-replay.mts:268 and trata-gate.mts:128 call adapter.judge, and goldArtifact feeds weak-vs-gold calibration. Before this PR judge() threw and goldArtifact() ret - Fit with existing patterns: Follows the established adapter pattern exactly; no competing implementation exists (grep for api_calls/relevant APIs/precision/recall across bench/src/benchmarks finds only this file plus an unrelated token-F1 in hotpotqa). Respects the deterministic-judge principle in types.ts:3-8 ('no self-authored judge, no invented score noise') — the score is recall over the dataset's PUBLISHED `relevant API
- Real-world viability: Robust on the paths that matter. JSON extraction handles fenced code blocks and both api_calls/calls shapes (toollm.ts:152-170); substring fallback covers non-JSON model output (toollm.ts:172-178); empty artifact scores 0, gold scores 1, partial scores recall (verified by test at external-adapters.test.mts:81-83). Fails loud on missing labels in BOTH preflight and loadTasks (toollm.ts:127-135), so
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
💰 Value Audit
🟡 Substring fallback makes precision trivially 1.0, so free-text answers can resolve [maintenance] ``
extractCalledApis (toollm.ts:162-179) falls back to
expected.filter(...)— which can only return pairs already inexpected. So in fallback modecalled ⊆ expectedand precision is always 1, reducingresolvedtorecall===1. A verbose free-text answer that names both expected tool/api strings (no valid JSON) scores resolved=true. This is a reasonable generous-extraction choice for deterministic gold calibration and not a correctness bug, but a reviewer should confirm that behavior is inte
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 3 non-blocking findings — 063670d5
Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-29T05:09:30Z · immutable trace
… into feat/bench-toollm-deterministic
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 6d0615ab
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-29T05:12:08Z
Summary
relevant APIsrelevant APIslabelsVerification
pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts— 4/4 passpnpm exec tsc --noEmit -p tsconfig.jsonfrombench/— passpnpm run typecheck— passpnpm run build— passpnpm pack --dry-runfrombench/— includes ToolLLM adaptertoollm— resolvesgit diff --check— passgit merge-tree --write-tree origin/main HEAD— cleanScope note
This is a deterministic ToolLLM subset: API-selection recall over published
relevant APIslabels. It is not a full ToolEval pass-rate replacement.