fix(web_search): make "check online" actually work on local models (Auto + 7B)#66
Merged
Conversation
…lucinating on "check online"
BUG (user-reported): with Auto -> a local model, "Check online and tell me when spacex ipo happened" did a
CODEBASE/file search and then fabricated a confident false fact ("SpaceX IPO'd Dec 5 2015"; SpaceX has
never IPO'd). Root cause (4-lens investigation): (1) web_search/browse_url are curated OUT of the local
toolset for ALL local models, so the model can't browse; (2) "check online" wasn't recognized as web intent
(keyword lists matched "search online"/"check the web" but not "check online"), so it fell through to a
synthesized search_for_files; (3) the local prompt claimed it could "fetch current/web info" while the tools
were removed; (4) no anti-hallucination guard, so it invented an answer.
Fix (per the chosen direction -- enable web on CAPABLE local models):
- New CAPABLE_LOCAL_TOOLSET = COMPACT_LOCAL_TOOLSET + web_search + browse_url, selected by localToolsetFor
(prompts.ts). A capable local coder (>=7B, isCapableLocalCoder) gets the web tools at BOTH the prompt
catalog (availableTools/systemToolsXMLPrompt/chat_systemMessage_local, threaded from
convertToLLMMessageService) AND the execution chokepoint + offered-list (chatThreadService _runToolCall,
threaded via recomputeModelState -> all 4 call sites). Small local models stay on COMPACT (no web).
- Recognize "check online"/"go online"/"look online"/"search the internet"/"on the internet" as web intent
in the task-type detector + tool-synthesis (chatThreadService) and the pure WEB_QUERY_WORDS
(common/toolSynthesisDecision.ts).
- Prompt: the local system message only claims web access when the web tools are actually offered (capable
coder); a small local model is told it does NOT have web access and to say so instead of browsing/searching.
- Anti-hallucination guard added to the local prompt: if a tool returns nothing or you lack a source/tool,
say so -- never fabricate.
The same isCapableLocalCoder signal (name + ollama param_size) is computed identically on both the prompt
and chokepoint sides, so offered == executable. tsgo 0; common suite 888 -> 889 (+1: capable-toolset gate).
Live re-test pending on the running build (a >=7B coder should now web-search "check online"; a small local
model should refuse gracefully).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fresh facts over stale training memory
BUG-2 (user-reported, after the BUG-1 routing fix): asked to check online when SpaceX IPO'd, the agent DID
search the web and the results were actually CORRECT and current ("SpaceX ... IPO on June 12, 2026 ... $135/
share"), but the local 7B model dismissed them as "unrelated" and answered from STALE training memory ("June
29 2019, $42" -- all false; SpaceX had not IPO'd before 2026). The search backend was fine (Method 1 Instant-
Answer returns empty -> falls through to Method 2 DDG-HTML which returns the right results); the failure was
GROUNDING -- the model overrode fresh retrieval with parametric memory, worsened by the prompt-injection fence
labeling results "UNTRUSTED" (a weak model over-reads that as "don't trust the facts").
Fix: add a GROUNDING preamble to the web_search + browse_url tool RESULTS (stringOfResult): treat the FACTS in
current web results as authoritative and PREFER them over training knowledge; answer ONLY from them; if the
answer isn't present, say you couldn't find it instead of guessing. Crucially it distinguishes FACTS (use them)
from INSTRUCTIONS (still don't obey, per the unchanged injection fence) -- so the anti-injection defense is NOT
weakened. The empty-results string now also tells the model to say it couldn't find it rather than answer from
memory.
This is the harness-side lever; a 7B model may still occasionally override (model-honesty limit) -- the robust
fix (constrained/structured decoding, stronger model) is tracked in the agent-mode modernization prompt. tsgo
0; no test delta (browser tool-result formatting). Live re-test pending.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
web_search returned titles+URLs but "No snippet available" (or URL-encoded redirect-URL garbage) for most results, so the agent received no real facts and hallucinated answers (BUG-2: "check online ... when did SpaceX IPO" produced a fabricated "June 29, 2019, $42 per share, SPACX"). Root cause: the renderer cannot fetch html.duckduckgo.com directly (CORS), so web_search routes through webContentExtractorService, which returns the page as accessibility-tree markdown (NOT raw HTML). The old parser walked raw character ranges and reconstructed `[title](decoded-url)` to locate snippets, but the markdown holds DDG's *redirect* URLs and interleaves the title / displayed-url / snippet links, with footnote markers ([12]) inside the snippet text -- so indexOf failed and the regex broke, yielding empty or garbage snippets. Fix: parse DDG's very regular per-result structure (## heading link = title; longest prose link-text = snippet; uddg= param = canonical url) with a footnote-aware link matcher. Extracted to a pure common/webSearchParse.ts so it is node-testable; 8 unit tests pin a golden fixture captured from the real extractor output (clean title/url/snippet, no redirect/encoded/footnote/markdown leakage, entity decoding, displayed-url rejection, ad filtering, maxResults). Live-verified over CDP: "when did SpaceX IPO happen" and "latest stable node.js LTS version" each return 5 clean results with correct, current facts (SpaceX IPO June 12, 2026; Node.js 24.11.0 LTS). 897 node tests pass (was 889; +8), tsgo 0, cdp-smoke 11/11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt answers correctly
Even after the parser fix, the agent still answered "check online ... when did
SpaceX IPO" wrong on local models. Live CDP instrumentation of the real agent
loop (not the tool in isolation) found TWO behavioral bugs downstream of the
now-correct tool:
1. Bad SYNTHESIZED query. When the model gives up ("I do not know.") the harness
synthesizes a web_search from intent. The query was built by extractKeywords =
first 5 words after a tiny stop-word list, so "check online and tell me when
SpaceX IPO'd" became "check online and tell when" -> DuckDuckGo returned
"check online" (DVLA / vehicle-tax) results and the agent honestly reported it
found nothing -- the real subject "SpaceX IPO" was dropped (past word 5).
Fix: pure common/webSearchQuery.ts extractWebSearchQuery() strips the
web-intent triggers + command/politeness framing and keeps the SUBJECT.
2. Self-limited result count. When the model emitted its OWN call it used k=1;
that single snippet (a price/valuation blurb) had no date and the model
FABRICATED "May 15, 2026". Fix: clamp web_search results to a floor of 5
(cap 10) so the answer-bearing snippet (Wikipedia "...IPO on June 12, 2026...")
is present.
Live-verified over CDP (qwen2.5-coder:7b, Agent mode, real chat). Before: bad
query -> DVLA results -> "could not find"; k=1 -> fabricated "May 15, 2026".
After: query "SpaceX IPO date" -> 5 results (hasJune12=true reaches the model)
-> grounded answer "SpaceX completed its IPO on June 12, 2026 ... $135 per share."
13 new unit tests (incl. the exact regression case); 897->909 node tests, tsgo 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Auto->llama3) Testing "check online ... SpaceX IPO" in AUTO mode surfaced a distinct bug from the qwen2.5-coder path. Auto resolved to a capable GENERAL model (llama3:8b), which was DENIED web_search and fell back to stale training knowledge: Search the web "SpaceX IPO date" Error: The web_search tool isn't available for this model. Use one of: read_file, ... -> "SpaceX has not gone public through an IPO..." (WRONG) Root cause: web tools (web_search/browse_url) were gated on isCapableLocalCoder -- a CODER >=7B (codingModelScoreBonus>=25 AND >=7B). llama3:8b is capable but not a coder, so it got the COMPACT toolset (no web). Web search is a GENERAL capability, not coding-specific -- any sufficiently large local model should have it. Fix: new size-only gate isCapableLocalModel (>=7B, or unnumbered/flagship tag), used for the web-tool toolset decision at BOTH the prompt catalog (convertToLLMMessageService) AND the execution chokepoint (chatThreadService). Renamed the threaded toolset boolean isCapableLocalCoder -> isCapableLocalModel through prompts.ts/chatThreadService/convertToLLMMessageService for honesty; isCapableLocalCoder the FUNCTION stays (onboarding/routing still want a coder). Live-verified over CDP (fresh thread each, real chat): - Auto -> qwen2.5-coder:7b (7.6B): gate true, answers "June 12, 2026" correctly. - llama3:latest (8.0B, GENERAL, the original failure): gate now true at BOTH the chokepoint and the prompt; searches "SpaceX IPO date" (5 results) and answers "Based on the search results ... IPO on June 12, 2026 ... ticker SPCX." CORRECT. - llama3.2:3b (3.2B): gate correctly FALSE -- small models stay web-less. 909->913 node tests (+4 isCapableLocalModel cases incl. the llama3:8b regression), tsgo 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Asking the agent to "check online and tell me when SpaceX IPO'd" on a local model (both Auto and an explicit qwen2.5-coder:7b) returned wrong, hallucinated answers (e.g. fabricated "June 29, 2019" / "May 15, 2026", or "SpaceX has not gone public"). Found and fixed via end-to-end instrumentation of the real agent loop + live CDP verification.
Five stacked root causes (each its own commit)
"No snippet available"/ URL-encoded garbage, so the model had no facts. The renderer can't fetch DuckDuckGo directly (CORS), so results come back as accessibility-tree markdown (not raw HTML); the old parser broke on DDG redirect URLs + footnote markers. Rewrote it (pure, node-testedwebSearchParse.ts)."check online and tell when"→ DVLA results); and when the model self-queried it asked for 1 result whose snippet lacked the date → fabrication. New purewebSearchQuery.tskeeps the subject; web_search now floors results to 5 (cap 10).isCapableLocalModel(≥7B, coder or general) at both the prompt and chokepoint gates.Verification (live, over CDP, real chat)
Tests / quality
webSearchParse,webSearchQuery, andisCapableLocalModel(incl. the exact regression cases), on golden fixtures captured from the real extractor output.Notes
🤖 Generated with Claude Code