feat(bench): score ToolLLM API-selection subset by drewstone · Pull Request #411 · tangle-network/agent-runtime

drewstone · 2026-06-29T05:05:00Z

Summary

score ToolLLM/ToolBench deterministic API-selection labels from relevant APIs
keep full ToolEval pass-rate explicitly out of scope because it is LLM-judged/stochastic
expose goldArtifact as the expected API-call list for weak-vs-gold calibration
fail loud when ToolBench rows do not include deterministic relevant APIs labels

Verification

pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts — 4/4 pass
pnpm exec tsc --noEmit -p tsconfig.json from bench/ — pass
pnpm run typecheck — pass
pnpm run build — pass
pnpm pack --dry-run from bench/ — includes ToolLLM adapter
registry smoke for toollm — resolves
git diff --check — pass
git merge-tree --write-tree origin/main HEAD — clean

Scope note

This is a deterministic ToolLLM subset: API-selection recall over published relevant APIs labels. It is not a full ToolEval pass-rate replacement.

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	1 (1 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	114.7s (2 bridge agents)
Total	114.7s

💰 Value — sound-with-nits

Flips the ToolLLM adapter from a score-refusing stub into a real deterministic API-selection judge (recall/precision over published relevant APIs), built in the exact grain of agentbench.ts — ship.

What it does: Replaces the ToolLLM adapter's refusing judge with a deterministic API-selection scorer: it normalizes each row's relevant APIs into [tool,api] pairs, parses the worker's emitted JSON {api_calls:[...]} (with a substring fallback), and scores recall + precision over the expected set, exposing the expected set as goldArtifact and throwing loud at load/preflight when rows lack relevant APIs.
Goals it achieves: Make ToolLLM actually scoreable in agent-bench without invoking the stochastic, LLM-judged ToolEval pass-rate — turning a load-only stub into a contributing deterministic benchmark, and satisfying the goldArtifact self-calibration contract (types.ts:45) so the judge can be verified before spending model tokens.
Assessment: Coherent and in-grain. The structure (rowToTask→readMeta→selectRows→fixtures/official split→goldArtifact→JSON-detail BenchScore) is a near-1:1 echo of agentbench.ts:61-161; the deterministicJudge:'api-selection' literal and assertDeterministicSubset guard enforce the repo's fail-loud-on-nondeterminism rule; the score math is correct against the fixture (2 gold APIs: empty→0, gold→1, one-of-two→0.5
Better / existing approach: none — this is the right approach. Checked for a shared set-precision/recall helper (hotpotqa.ts:114 tokenF1 is token-bag F1, a different metric — reuse would be wrong; no set-level helper exists and one caller doesn't justify extracting one). The fenced-JSON regex (toollm.ts:153) is the 10th copy across adapters but is the established per-language grain (agentbench/tau2/swe-bench/appworld each tu
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 2
Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Upgrades the ToolLLM adapter from a score-refusing stub into a working, deterministic API-selection recall scorer over published ToolBench labels, fully wired into the benchmark registry and gate runner.

Integration: Fully reachable. Registered at bench/src/adapters.ts:45 (toollm: createToolLlmAdapter); consumed by every gate runner via resolveAdapter('toollm') — runGate (bench/src/gate.ts:330-336) calls preflight/loadTasks/output.parse/judge, corpus-replay.mts:268 and trata-gate.mts:128 call adapter.judge, and goldArtifact feeds weak-vs-gold calibration. Before this PR judge() threw and goldArtifact() ret
Fit with existing patterns: Follows the established adapter pattern exactly; no competing implementation exists (grep for api_calls/relevant APIs/precision/recall across bench/src/benchmarks finds only this file plus an unrelated token-F1 in hotpotqa). Respects the deterministic-judge principle in types.ts:3-8 ('no self-authored judge, no invented score noise') — the score is recall over the dataset's PUBLISHED `relevant API
Real-world viability: Robust on the paths that matter. JSON extraction handles fenced code blocks and both api_calls/calls shapes (toollm.ts:152-170); substring fallback covers non-JSON model output (toollm.ts:172-178); empty artifact scores 0, gold scores 1, partial scores recall (verified by test at external-adapters.test.mts:81-83). Fails loud on missing labels in BOTH preflight and loadTasks (toollm.ts:127-135), so
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

💰 Value Audit

🟡 Substring fallback makes precision trivially 1.0, so free-text answers can resolve [maintenance] ``

extractCalledApis (toollm.ts:162-179) falls back to expected.filter(...) — which can only return pairs already in expected. So in fallback mode called ⊆ expected and precision is always 1, reducing resolved to recall===1. A verbose free-text answer that names both expected tool/api strings (no valid JSON) scores resolved=true. This is a reasonable generous-extraction choice for deterministic gold calibration and not a correctness bug, but a reviewer should confirm that behavior is inte

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260629T050846Z}

tangletools · 2026-06-29T05:09:32Z

✅ No Blockers — `063670d5`

Review health 100/100 · Reviewer score 86/100 · Confidence 70/100 · 3 findings (3 low)

deepseek: Correctness 86 · Security 86 · Testing 86 · Architecture 86

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟡 LOW Fuzzy text-match fallback can misattribute API calls — bench/src/benchmarks/toollm.ts

extractCalledApis at line 162-178 falls back to substring matching in the agent's full text when JSON parsing fails. Line 173: expected.filter(([tool, api]) => { ... return compactText.includes(...) || (lower.includes(tool.toLowerCase()) && lower.includes(api.toLowerCase())) }). This could return API pairs where the tool/API names appear incidentally or separately in the output, inflating recall with false positives. Since score = recall, these would count as credit. The precision penalty partially offsets this since `r

🟡 LOW Unreachable defensive guard in scoreApiSelection — bench/src/benchmarks/toollm.ts

Line 183-184 throws if meta.relevantApis.length === 0, but line 192 has const score = expected.size === 0 ? 0 : recall. The expected.size === 0 branch is unreachable under normal operation since expected is built from meta.relevantApis.map(apiKey) and normalization never produces empty strings. Harmless as defense-in-depth, but the dead code is misleading. Remove the guard or replace with a comment explaining it's unreachable.

🟡 LOW extractCalledApis may return duplicates; deduplication is caller-side only — bench/src/benchmarks/toollm.ts

extractCalledApis returns Array<[string, string]> from normalizeApiPairs, which (line 90-104) pushes every matching item without deduplication. The expected.filter(...) fallback path also has no dedup guard against duplicates in the expected array. Currently scoreApiSelection handles this via new Set(calledPairs.map(apiKey)) at line 188, so the score is correct. But a future caller of extractCalledApis directly would get duplicates. Consider adding [...new Set(...)] dedup in the return of `extractCalledApis

_{tangletools · 2026-06-29T05:09:30Z · trace}

tangletools

✅ Approved — 3 non-blocking findings — `063670d5`

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-29T05:09:30Z · immutable trace}

… into feat/bench-toollm-deterministic

tangletools

✅ Refreshed approval after new commits — `6d0615ab`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-29T05:12:08Z}

feat(bench): score ToolLLM API-selection subset

063670d

tangletools reviewed Jun 29, 2026

View reviewed changes

tangletools previously approved these changes Jun 29, 2026

View reviewed changes

drewstone added 2 commits June 28, 2026 23:11

feat(bench): score ToolLLM API-selection subset

3808cd4

Merge remote-tracking branch 'origin/feat/bench-toollm-deterministic'…

6d0615a

… into feat/bench-toollm-deterministic

drewstone dismissed tangletools’s stale review via 6d0615a June 29, 2026 05:12

tangletools approved these changes Jun 29, 2026

View reviewed changes

drewstone merged commit 67e8512 into main Jun 29, 2026
1 check passed

drewstone mentioned this pull request Jun 29, 2026

feat(bench): add external benchmark adapters for WebArena, tau-bench, DABStep, AgentBench, and ToolLLM #408

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(bench): score ToolLLM API-selection subset#411

feat(bench): score ToolLLM API-selection subset#411
drewstone merged 3 commits into
mainfrom
feat/bench-toollm-deterministic

drewstone commented Jun 29, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jun 29, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jun 29, 2026

Summary

Verification

Scope note

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟡 Value Audit — sound-with-nits

💰 Value — sound-with-nits

🎯 Usefulness — sound

💰 Value Audit

Uh oh!

tangletools commented Jun 29, 2026

✅ No Blockers — 063670d5

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 3 non-blocking findings — 063670d5

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Refreshed approval after new commits — 6d0615ab

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ No Blockers — `063670d5`

✅ Approved — 3 non-blocking findings — `063670d5`

✅ Refreshed approval after new commits — `6d0615ab`