Skip to content

feat(bench): score ToolLLM API-selection subset#411

Merged
drewstone merged 3 commits into
mainfrom
feat/bench-toollm-deterministic
Jun 29, 2026
Merged

feat(bench): score ToolLLM API-selection subset#411
drewstone merged 3 commits into
mainfrom
feat/bench-toollm-deterministic

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Summary

  • score ToolLLM/ToolBench deterministic API-selection labels from relevant APIs
  • keep full ToolEval pass-rate explicitly out of scope because it is LLM-judged/stochastic
  • expose goldArtifact as the expected API-call list for weak-vs-gold calibration
  • fail loud when ToolBench rows do not include deterministic relevant APIs labels

Verification

  • pnpm exec tsx --test bench/src/benchmarks/external-adapters.test.mts — 4/4 pass
  • pnpm exec tsc --noEmit -p tsconfig.json from bench/ — pass
  • pnpm run typecheck — pass
  • pnpm run build — pass
  • pnpm pack --dry-run from bench/ — includes ToolLLM adapter
  • registry smoke for toollm — resolves
  • git diff --check — pass
  • git merge-tree --write-tree origin/main HEAD — clean

Scope note

This is a deterministic ToolLLM subset: API-selection recall over published relevant APIs labels. It is not a full ToolEval pass-rate replacement.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 1 (1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 114.7s (2 bridge agents)
Total 114.7s

💰 Value — sound-with-nits

Flips the ToolLLM adapter from a score-refusing stub into a real deterministic API-selection judge (recall/precision over published relevant APIs), built in the exact grain of agentbench.ts — ship.

  • What it does: Replaces the ToolLLM adapter's refusing judge with a deterministic API-selection scorer: it normalizes each row's relevant APIs into [tool,api] pairs, parses the worker's emitted JSON {api_calls:[...]} (with a substring fallback), and scores recall + precision over the expected set, exposing the expected set as goldArtifact and throwing loud at load/preflight when rows lack relevant APIs.
  • Goals it achieves: Make ToolLLM actually scoreable in agent-bench without invoking the stochastic, LLM-judged ToolEval pass-rate — turning a load-only stub into a contributing deterministic benchmark, and satisfying the goldArtifact self-calibration contract (types.ts:45) so the judge can be verified before spending model tokens.
  • Assessment: Coherent and in-grain. The structure (rowToTask→readMeta→selectRows→fixtures/official split→goldArtifact→JSON-detail BenchScore) is a near-1:1 echo of agentbench.ts:61-161; the deterministicJudge:'api-selection' literal and assertDeterministicSubset guard enforce the repo's fail-loud-on-nondeterminism rule; the score math is correct against the fixture (2 gold APIs: empty→0, gold→1, one-of-two→0.5
  • Better / existing approach: none — this is the right approach. Checked for a shared set-precision/recall helper (hotpotqa.ts:114 tokenF1 is token-bag F1, a different metric — reuse would be wrong; no set-level helper exists and one caller doesn't justify extracting one). The fenced-JSON regex (toollm.ts:153) is the 10th copy across adapters but is the established per-language grain (agentbench/tau2/swe-bench/appworld each tu
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Upgrades the ToolLLM adapter from a score-refusing stub into a working, deterministic API-selection recall scorer over published ToolBench labels, fully wired into the benchmark registry and gate runner.

  • Integration: Fully reachable. Registered at bench/src/adapters.ts:45 (toollm: createToolLlmAdapter); consumed by every gate runner via resolveAdapter('toollm') — runGate (bench/src/gate.ts:330-336) calls preflight/loadTasks/output.parse/judge, corpus-replay.mts:268 and trata-gate.mts:128 call adapter.judge, and goldArtifact feeds weak-vs-gold calibration. Before this PR judge() threw and goldArtifact() ret
  • Fit with existing patterns: Follows the established adapter pattern exactly; no competing implementation exists (grep for api_calls/relevant APIs/precision/recall across bench/src/benchmarks finds only this file plus an unrelated token-F1 in hotpotqa). Respects the deterministic-judge principle in types.ts:3-8 ('no self-authored judge, no invented score noise') — the score is recall over the dataset's PUBLISHED `relevant API
  • Real-world viability: Robust on the paths that matter. JSON extraction handles fenced code blocks and both api_calls/calls shapes (toollm.ts:152-170); substring fallback covers non-JSON model output (toollm.ts:172-178); empty artifact scores 0, gold scores 1, partial scores recall (verified by test at external-adapters.test.mts:81-83). Fails loud on missing labels in BOTH preflight and loadTasks (toollm.ts:127-135), so
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

💰 Value Audit

🟡 Substring fallback makes precision trivially 1.0, so free-text answers can resolve [maintenance] ``

extractCalledApis (toollm.ts:162-179) falls back to expected.filter(...) — which can only return pairs already in expected. So in fallback mode called ⊆ expected and precision is always 1, reducing resolved to recall===1. A verbose free-text answer that names both expected tool/api strings (no valid JSON) scores resolved=true. This is a reasonable generous-extraction choice for deterministic gold calibration and not a correctness bug, but a reviewer should confirm that behavior is inte


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260629T050846Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 063670d5

Review health 100/100 · Reviewer score 86/100 · Confidence 70/100 · 3 findings (3 low)

deepseek: Correctness 86 · Security 86 · Testing 86 · Architecture 86

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟡 LOW Fuzzy text-match fallback can misattribute API calls — bench/src/benchmarks/toollm.ts

extractCalledApis at line 162-178 falls back to substring matching in the agent's full text when JSON parsing fails. Line 173: expected.filter(([tool, api]) => { ... return compactText.includes(...) || (lower.includes(tool.toLowerCase()) && lower.includes(api.toLowerCase())) }). This could return API pairs where the tool/API names appear incidentally or separately in the output, inflating recall with false positives. Since score = recall, these would count as credit. The precision penalty partially offsets this since `r

🟡 LOW Unreachable defensive guard in scoreApiSelection — bench/src/benchmarks/toollm.ts

Line 183-184 throws if meta.relevantApis.length === 0, but line 192 has const score = expected.size === 0 ? 0 : recall. The expected.size === 0 branch is unreachable under normal operation since expected is built from meta.relevantApis.map(apiKey) and normalization never produces empty strings. Harmless as defense-in-depth, but the dead code is misleading. Remove the guard or replace with a comment explaining it's unreachable.

🟡 LOW extractCalledApis may return duplicates; deduplication is caller-side only — bench/src/benchmarks/toollm.ts

extractCalledApis returns Array<[string, string]> from normalizeApiPairs, which (line 90-104) pushes every matching item without deduplication. The expected.filter(...) fallback path also has no dedup guard against duplicates in the expected array. Currently scoreApiSelection handles this via new Set(calledPairs.map(apiKey)) at line 188, so the score is correct. But a future caller of extractCalledApis directly would get duplicates. Consider adding [...new Set(...)] dedup in the return of `extractCalledApis


tangletools · 2026-06-29T05:09:30Z · trace

tangletools
tangletools previously approved these changes Jun 29, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 3 non-blocking findings — 063670d5

Full multi-shot audit completed 2/2 planned shots over 3 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-29T05:09:30Z · immutable trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 6d0615ab

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-29T05:12:08Z

@drewstone drewstone merged commit 67e8512 into main Jun 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants