fix(web_search): make "check online" actually work on local models (Auto + 7B) by Pterjudin · Pull Request #66 · OpenCortexIDE/cortexide

Pterjudin · 2026-06-18T19:54:30Z

Problem

Asking the agent to "check online and tell me when SpaceX IPO'd" on a local model (both Auto and an explicit qwen2.5-coder:7b) returned wrong, hallucinated answers (e.g. fabricated "June 29, 2019" / "May 15, 2026", or "SpaceX has not gone public"). Found and fixed via end-to-end instrumentation of the real agent loop + live CDP verification.

Five stacked root causes (each its own commit)

Web tools curated out for all local models → "check online" did a codebase search and hallucinated. Enabled web tools for capable local models at the prompt catalog + execution chokepoint; added "check online" (+variants) to the web-intent detectors.
Untrusted results dismissed → added a GROUNDING preamble to web_search/browse_url results so the model prefers fresh facts over stale training memory (without weakening the prompt-injection fence).
Snippet parser broken → web_search returned titles+URLs but "No snippet available" / URL-encoded garbage, so the model had no facts. The renderer can't fetch DuckDuckGo directly (CORS), so results come back as accessibility-tree markdown (not raw HTML); the old parser broke on DDG redirect URLs + footnote markers. Rewrote it (pure, node-tested webSearchParse.ts).
Bad synthesized query + k=1 → when the model gives up, the harness synthesized a query from the first 5 words ("check online and tell when" → DVLA results); and when the model self-queried it asked for 1 result whose snippet lacked the date → fabrication. New pure webSearchQuery.ts keeps the subject; web_search now floors results to 5 (cap 10).
Auto → general model denied web tools → web tools were gated on being a capable coder; Auto resolves general questions to a general model (llama3:8b) which was then denied web_search. New size-only gate isCapableLocalModel (≥7B, coder or general) at both the prompt and chokepoint gates.

Verification (live, over CDP, real chat)

Model	Result
Auto → qwen2.5-coder:7b (7.6B)	✅ "…IPO on June 12, 2026 … $135/share"
llama3:latest (8.0B, general)	✅ searches "SpaceX IPO date" → 5 results → grounded "June 12, 2026 … ticker SPCX"
llama3.2:3b (3.2B)	✅ correctly stays web-less (small models excluded by design)

Tests / quality

+31 new unit tests across webSearchParse, webSearchQuery, and isCapableLocalModel (incl. the exact regression cases), on golden fixtures captured from the real extractor output.
Full node suite green (913 passing), tsgo 0, cdp-smoke 11/11. Behavior-preserving extraction of the parser; every claim maps to tested code.

Notes

No telemetry; secret redaction and local-only egress respected (DuckDuckGo fetch already runs through the main-process web extractor).
Known follow-up (not in this PR): in long conversations, tool results get the highest context-trim weight and can be truncated/dropped — separate latent reliability item.

🤖 Generated with Claude Code

…lucinating on "check online" BUG (user-reported): with Auto -> a local model, "Check online and tell me when spacex ipo happened" did a CODEBASE/file search and then fabricated a confident false fact ("SpaceX IPO'd Dec 5 2015"; SpaceX has never IPO'd). Root cause (4-lens investigation): (1) web_search/browse_url are curated OUT of the local toolset for ALL local models, so the model can't browse; (2) "check online" wasn't recognized as web intent (keyword lists matched "search online"/"check the web" but not "check online"), so it fell through to a synthesized search_for_files; (3) the local prompt claimed it could "fetch current/web info" while the tools were removed; (4) no anti-hallucination guard, so it invented an answer. Fix (per the chosen direction -- enable web on CAPABLE local models): - New CAPABLE_LOCAL_TOOLSET = COMPACT_LOCAL_TOOLSET + web_search + browse_url, selected by localToolsetFor (prompts.ts). A capable local coder (>=7B, isCapableLocalCoder) gets the web tools at BOTH the prompt catalog (availableTools/systemToolsXMLPrompt/chat_systemMessage_local, threaded from convertToLLMMessageService) AND the execution chokepoint + offered-list (chatThreadService _runToolCall, threaded via recomputeModelState -> all 4 call sites). Small local models stay on COMPACT (no web). - Recognize "check online"/"go online"/"look online"/"search the internet"/"on the internet" as web intent in the task-type detector + tool-synthesis (chatThreadService) and the pure WEB_QUERY_WORDS (common/toolSynthesisDecision.ts). - Prompt: the local system message only claims web access when the web tools are actually offered (capable coder); a small local model is told it does NOT have web access and to say so instead of browsing/searching. - Anti-hallucination guard added to the local prompt: if a tool returns nothing or you lack a source/tool, say so -- never fabricate. The same isCapableLocalCoder signal (name + ollama param_size) is computed identically on both the prompt and chokepoint sides, so offered == executable. tsgo 0; common suite 888 -> 889 (+1: capable-toolset gate). Live re-test pending on the running build (a >=7B coder should now web-search "check online"; a small local model should refuse gracefully). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fresh facts over stale training memory BUG-2 (user-reported, after the BUG-1 routing fix): asked to check online when SpaceX IPO'd, the agent DID search the web and the results were actually CORRECT and current ("SpaceX ... IPO on June 12, 2026 ... $135/ share"), but the local 7B model dismissed them as "unrelated" and answered from STALE training memory ("June 29 2019, $42" -- all false; SpaceX had not IPO'd before 2026). The search backend was fine (Method 1 Instant- Answer returns empty -> falls through to Method 2 DDG-HTML which returns the right results); the failure was GROUNDING -- the model overrode fresh retrieval with parametric memory, worsened by the prompt-injection fence labeling results "UNTRUSTED" (a weak model over-reads that as "don't trust the facts"). Fix: add a GROUNDING preamble to the web_search + browse_url tool RESULTS (stringOfResult): treat the FACTS in current web results as authoritative and PREFER them over training knowledge; answer ONLY from them; if the answer isn't present, say you couldn't find it instead of guessing. Crucially it distinguishes FACTS (use them) from INSTRUCTIONS (still don't obey, per the unchanged injection fence) -- so the anti-injection defense is NOT weakened. The empty-results string now also tells the model to say it couldn't find it rather than answer from memory. This is the harness-side lever; a 7B model may still occasionally override (model-honesty limit) -- the robust fix (constrained/structured decoding, stronger model) is tracked in the agent-mode modernization prompt. tsgo 0; no test delta (browser tool-result formatting). Live re-test pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

web_search returned titles+URLs but "No snippet available" (or URL-encoded redirect-URL garbage) for most results, so the agent received no real facts and hallucinated answers (BUG-2: "check online ... when did SpaceX IPO" produced a fabricated "June 29, 2019, $42 per share, SPACX"). Root cause: the renderer cannot fetch html.duckduckgo.com directly (CORS), so web_search routes through webContentExtractorService, which returns the page as accessibility-tree markdown (NOT raw HTML). The old parser walked raw character ranges and reconstructed `[title](decoded-url)` to locate snippets, but the markdown holds DDG's *redirect* URLs and interleaves the title / displayed-url / snippet links, with footnote markers ([12]) inside the snippet text -- so indexOf failed and the regex broke, yielding empty or garbage snippets. Fix: parse DDG's very regular per-result structure (## heading link = title; longest prose link-text = snippet; uddg= param = canonical url) with a footnote-aware link matcher. Extracted to a pure common/webSearchParse.ts so it is node-testable; 8 unit tests pin a golden fixture captured from the real extractor output (clean title/url/snippet, no redirect/encoded/footnote/markdown leakage, entity decoding, displayed-url rejection, ad filtering, maxResults). Live-verified over CDP: "when did SpaceX IPO happen" and "latest stable node.js LTS version" each return 5 clean results with correct, current facts (SpaceX IPO June 12, 2026; Node.js 24.11.0 LTS). 897 node tests pass (was 889; +8), tsgo 0, cdp-smoke 11/11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nt answers correctly Even after the parser fix, the agent still answered "check online ... when did SpaceX IPO" wrong on local models. Live CDP instrumentation of the real agent loop (not the tool in isolation) found TWO behavioral bugs downstream of the now-correct tool: 1. Bad SYNTHESIZED query. When the model gives up ("I do not know.") the harness synthesizes a web_search from intent. The query was built by extractKeywords = first 5 words after a tiny stop-word list, so "check online and tell me when SpaceX IPO'd" became "check online and tell when" -> DuckDuckGo returned "check online" (DVLA / vehicle-tax) results and the agent honestly reported it found nothing -- the real subject "SpaceX IPO" was dropped (past word 5). Fix: pure common/webSearchQuery.ts extractWebSearchQuery() strips the web-intent triggers + command/politeness framing and keeps the SUBJECT. 2. Self-limited result count. When the model emitted its OWN call it used k=1; that single snippet (a price/valuation blurb) had no date and the model FABRICATED "May 15, 2026". Fix: clamp web_search results to a floor of 5 (cap 10) so the answer-bearing snippet (Wikipedia "...IPO on June 12, 2026...") is present. Live-verified over CDP (qwen2.5-coder:7b, Agent mode, real chat). Before: bad query -> DVLA results -> "could not find"; k=1 -> fabricated "May 15, 2026". After: query "SpaceX IPO date" -> 5 results (hasJune12=true reaches the model) -> grounded answer "SpaceX completed its IPO on June 12, 2026 ... $135 per share." 13 new unit tests (incl. the exact regression case); 897->909 node tests, tsgo 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Auto->llama3) Testing "check online ... SpaceX IPO" in AUTO mode surfaced a distinct bug from the qwen2.5-coder path. Auto resolved to a capable GENERAL model (llama3:8b), which was DENIED web_search and fell back to stale training knowledge: Search the web "SpaceX IPO date" Error: The web_search tool isn't available for this model. Use one of: read_file, ... -> "SpaceX has not gone public through an IPO..." (WRONG) Root cause: web tools (web_search/browse_url) were gated on isCapableLocalCoder -- a CODER >=7B (codingModelScoreBonus>=25 AND >=7B). llama3:8b is capable but not a coder, so it got the COMPACT toolset (no web). Web search is a GENERAL capability, not coding-specific -- any sufficiently large local model should have it. Fix: new size-only gate isCapableLocalModel (>=7B, or unnumbered/flagship tag), used for the web-tool toolset decision at BOTH the prompt catalog (convertToLLMMessageService) AND the execution chokepoint (chatThreadService). Renamed the threaded toolset boolean isCapableLocalCoder -> isCapableLocalModel through prompts.ts/chatThreadService/convertToLLMMessageService for honesty; isCapableLocalCoder the FUNCTION stays (onboarding/routing still want a coder). Live-verified over CDP (fresh thread each, real chat): - Auto -> qwen2.5-coder:7b (7.6B): gate true, answers "June 12, 2026" correctly. - llama3:latest (8.0B, GENERAL, the original failure): gate now true at BOTH the chokepoint and the prompt; searches "SpaceX IPO date" (5 results) and answers "Based on the search results ... IPO on June 12, 2026 ... ticker SPCX." CORRECT. - llama3.2:3b (3.2B): gate correctly FALSE -- small models stay web-less. 909->913 node tests (+4 isCapableLocalModel cases incl. the llama3:8b regression), tsgo 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tajudeen and others added 5 commits June 18, 2026 20:53

Pterjudin merged commit 8375fe9 into main Jun 20, 2026
11 of 23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(web_search): make "check online" actually work on local models (Auto + 7B)#66

fix(web_search): make "check online" actually work on local models (Auto + 7B)#66
Pterjudin merged 5 commits into
mainfrom
fix/web-search-local-models

Pterjudin commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Pterjudin commented Jun 18, 2026

Problem

Five stacked root causes (each its own commit)

Verification (live, over CDP, real chat)

Tests / quality

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant