test: reproduce large→small model-switch overflow (GLM-5.2 512k → GPT-5.5 272k)#187

Open
snoproblem wants to merge 1 commit into cortexkit:master from snoproblem:reproduce/model-switch-overflow

@snoproblem snoproblem commented Jun 25, 2026


Bug

When switching from a large-context model to a smaller one mid-session (e.g. GLM-5.2 512k → GPT-5.5 272k), the historian does not reduce the context to fit the new model before sending the prompt, resulting in "Input exceeds context window" errors.

Root cause

The model-change branch in transform.ts:502-547 detects the switch and clears lastContextPercentage / lastInputTokens to 0. This suppresses every reduction path on the same pass:

  • Historian trigger (checkCompartmentTrigger): 0% < proactive floor → shouldFire=false
  • 95% emergency block (transform-compartment-phase.ts:328): 0% < 95% → no block
  • Overflow recovery bump (transform.ts:616-628): needsEmergencyRecovery was just cleared → no bump

The oversized prompt — sized for the old model's window — is sent to the new smaller model and rejected. Recovery only arms on the second pass, after the real overflow error fires the event handler's recordOverflowDetected. The user sees the error on the first request.
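The suppression described above can be illustrated with a toy model of a single pass. This is only a sketch: `PressureState`, `reductionPaths`, and the 50% proactive floor are hypothetical stand-ins rather than the plugin's real code; only the 95% threshold and the `needsEmergencyRecovery` flag come from the description.

```typescript
// Hypothetical, simplified model of one transform pass; not the plugin's real code.
interface PressureState {
  lastContextPercentage: number; // 0..100
  needsEmergencyRecovery: boolean;
}

const PROACTIVE_FLOOR = 50; // assumed value, for illustration only
const EMERGENCY_BLOCK = 95; // the 95% threshold named in the description

function reductionPaths(state: PressureState) {
  // When recovery is armed, the effective percentage is bumped to 95%.
  const effective = state.needsEmergencyRecovery
    ? Math.max(state.lastContextPercentage, EMERGENCY_BLOCK)
    : state.lastContextPercentage;
  return {
    historianFires: effective >= PROACTIVE_FLOOR,
    emergencyBlock: effective >= EMERGENCY_BLOCK,
    recoveryBump: state.needsEmergencyRecovery,
  };
}

// State right after the model-change branch has cleared everything:
const afterSwitch = reductionPaths({ lastContextPercentage: 0, needsEmergencyRecovery: false });
// Same zeroed usage, but with recovery armed:
const withRecoveryArmed = reductionPaths({ lastContextPercentage: 0, needsEmergencyRecovery: true });

console.log(afterSwitch, withRecoveryArmed);
```

In this model every path is off for `afterSwitch` and every path is on for `withRecoveryArmed`, which is why arming recovery alone is enough to re-enable the reductions on the same pass.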

Fix direction

In the model-change branch (transform.ts:502), before clearing, compare sessionMeta.lastInputTokens (the old model's last measured input tokens) against the new model's context limit (resolveTrustedContextLimit). When oldInputTokens > newContextLimit, call recordOverflowDetected(db, sessionId, newContextLimit, newModelKey) to arm recovery. That makes the existing bump-to-95% path (transform.ts:616-628) fire on the same pass, so the historian and emergency drops run before the prompt is sent.
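A minimal sketch of that guard, under stated assumptions: `onModelChange`, `SessionMeta`, and `recordOverflowDetectedStub` are hypothetical stand-ins (the real recordOverflowDetected takes a db handle first, and the real branch does more work). The sketch reads the old token count before resetting it and arms recovery after the reset, on the assumption that the branch's own clearing would otherwise undo the arming.

```typescript
// Stand-in types and stubs; the real signatures in transform.ts may differ.
interface SessionMeta {
  lastInputTokens: number;
  lastContextPercentage: number;
}

interface OverflowRecord {
  sessionId: string;
  limit: number;
  modelKey: string;
}

const armedRecords: OverflowRecord[] = [];

// Stub standing in for recordOverflowDetected (db argument omitted here).
function recordOverflowDetectedStub(sessionId: string, limit: number, modelKey: string): void {
  armedRecords.push({ sessionId, limit, modelKey });
}

function onModelChange(
  sessionId: string,
  meta: SessionMeta,
  newModelKey: string,
  newContextLimit: number | undefined,
): void {
  const oldInputTokens = meta.lastInputTokens; // capture BEFORE the reset

  // Existing behaviour: usage measured on the old model is discarded.
  meta.lastInputTokens = 0;
  meta.lastContextPercentage = 0;

  // Proposed addition: arm recovery when the old prompt cannot fit the new window.
  if (newContextLimit !== undefined && oldInputTokens > newContextLimit) {
    recordOverflowDetectedStub(sessionId, newContextLimit, newModelKey);
  }
}

// 300k tokens measured on the old model, new window of 272k (taken as 272_000 here).
onModelChange("s1", { lastInputTokens: 300_000, lastContextPercentage: 58 }, "openrouter/openai/gpt-5.5", 272_000);
console.log(armedRecords.length); // 1
```

A switch to a model whose window still fits the old prompt leaves `armedRecords` untouched, so the ordinary large-to-large switch keeps its current behaviour.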

Reproduction

This PR adds a failing test (transform.test.ts) reproducing the scenario:

  • Session at 300k input tokens on GLM-5.2 (58% of 512k — no trigger fired)
  • Switch to GPT-5.5 (272k) — 300k is ~110% of the new window
  • Asserts needsEmergencyRecovery === true and the oversized tool output is dropped

The test currently fails: needsEmergencyRecovery is false and the tool output stays active.
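The percentages in the scenario can be checked with a few lines of arithmetic. This sketch assumes "512k" and "272k" mean 512,000 and 272,000 tokens and that percentages are truncated; `usagePercent` is a hypothetical helper, not a function from the plugin.

```typescript
// Hypothetical helper: context usage as a whole-number percentage (truncated).
function usagePercent(inputTokens: number, contextLimit: number): number {
  return Math.floor((inputTokens / contextLimit) * 100);
}

const onOldModel = usagePercent(300_000, 512_000); // 58
const onNewModel = usagePercent(300_000, 272_000); // 110

console.log(onOldModel, onNewModel);
```

The same prompt sits comfortably below any 95% threshold on the old model and is already past 100% on the new one.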



Summary by cubic

Adds a failing test in packages/plugin/src/hooks/magic-context/transform.test.ts that reproduces an overflow when switching mid-session from a large-context model (GLM-5.2 512k) to a smaller one (GPT-5.5 272k).
The test expects emergency recovery to arm immediately and drop oversized tool output before send; it currently fails, confirming the model-change branch clears usage and allows an oversized prompt through.

Written for commit 7105f3f. Summary will update on new commits.

Greptile Summary

This PR adds a single failing test that documents the large→small model-switch overflow bug: when a session built up 300k tokens on GLM-5.2 (512k) is switched to GPT-5.5 (272k), the model-change branch in transform.ts clears all pressure state to zero, which suppresses every reduction path so the oversized prompt is sent to the new smaller model unchanged.

  • The test correctly reproduces the bug mechanism — liveModelBySession holds the new model while the last assistant message still carries the old model, triggering the model-change branch, and the session meta carries lastInputTokens: 300_000 against a 272k new window.
  • Two structural problems prevent the test from turning green once the fix lands: the test does not write a synthetic models.json with GPT-5.5's context limit (so resolveTrustedContextLimit returns undefined and the fix's overflow comparison never fires), and the OpenCode DB is seeded with id: "m-raw-assistant-old" while the live messages array uses id: "m-assistant-old" (a cross-reference mismatch that can affect message-index reconciliation and tool-tag eligibility).

Confidence Score: 2/5

The test is safe to merge in isolation — no production code is touched — but as written it will not turn green when the corresponding fix lands, leaving CI in a permanently failing state for this case.

The test correctly identifies the bug and the right observable outcomes, but cannot fulfill its role as a reproducer: without a models.json fixture carrying GPT-5.5's context limit, the fix mechanism gets undefined for the new model and skips arming recovery entirely. A separate OpenCode DB / live-messages ID mismatch could cause the tool-tag assertion to fail for an unrelated reason even if recovery were somehow armed. Both issues need to be resolved before the test can serve as a reliable regression guard.

packages/plugin/src/hooks/magic-context/transform.test.ts — specifically the new describe block starting at line 2551.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/plugin/src/hooks/magic-context/transform.test.ts | Adds a failing regression test for the large→small model-switch overflow bug; has two blocking structural issues: no synthetic models.json mock for GPT-5.5 (so resolveTrustedContextLimit returns undefined and the proposed fix never arms recovery), and an OpenCode DB message ID mismatch (m-raw-assistant-old vs m-assistant-old in the live messages array) that could break tool-tag eligibility checks. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant TF as transform.ts
    participant MC as model-change branch (502-547)
    participant OVF as overflow state (DB)
    participant LCU as loadContextUsage
    participant BUMP as 95% bump path (616-628)
    participant ED as emergency-drop

    TF->>MC: "liveModel=GPT-5.5, lastAssistant=GLM-5.2 → mismatch"
    MC->>OVF: clearEmergencyRecovery()
    MC->>MC: contextUsageMap.delete(sessionId)
    Note over MC: (BUG) no recordOverflowDetected called
    Note over MC: (FIX) resolveTrustedContextLimit(GPT-5.5)<br/>needs models.json mock — returns undefined without it
    MC->>LCU: loadContextUsage → 0%
    LCU-->>TF: "percentage=0%"
    TF->>OVF: "getOverflowState → needsEmergencyRecovery=false"
    Note over BUMP: 0% < 95% but needsEmergencyRecovery=false → no bump
    TF->>ED: skipped — no 95% bump
    Note over ED: tool output stays active → prompt sent oversized → overflow error

Reviews (1): Last reviewed commit: "test: reproduce large→small model-switch..."

Greptile also left 3 inline comments on this PR.

…-5.5 272k)

When switching from a large-context model to a smaller one mid-session, the
model-change branch in transform.ts clears lastContextPercentage /
lastInputTokens to 0. This suppresses every reduction path on the same
pass: the historian trigger (0% < proactive floor), the 95% emergency
block, and the overflow-recovery bump (needsEmergencyRecovery was just
cleared). The oversized prompt — sized for the old model's window — is
sent to the new smaller model and rejected with 'Input exceeds context
window'. Recovery only arms on the SECOND pass, after the real overflow
error fires the event handler.

The test asserts the expected post-fix behavior: the model-change branch
should detect oldInputTokens > newContextLimit and arm emergency recovery
(recordOverflowDetected) so the existing 95% bump path runs the historian
+ emergency drops BEFORE the prompt leaves. Currently fails — the fix
lands in transform.ts:502-547.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 1 file


Comment on lines +2602 to +2604
const liveModelBySession = new Map<string, { providerID: string; modelID: string }>([
    [sessionId, { providerID: "openrouter", modelID: "openai/gpt-5.5" }],
]);

P1 Missing models-dev-cache mock for GPT-5.5 — test will remain red after the fix

The proposed fix (per the PR description) detects overflow by calling resolveTrustedContextLimit(newModel.providerID, newModel.modelID, ...) and comparing the result to sessionMeta.lastInputTokens. resolveTrustedContextLimit reads from getSdkContextLimit, which is backed by the models-dev-cache populated from the OpenCode SDK or a persisted models.json file. In this test, neither is present: useTempDataHome points XDG_CACHE_HOME at an empty temp directory, clearModelsDevCache() runs in afterEach, and no synthetic models.json is written with GPT-5.5's 272k limit. getSdkContextLimit("openrouter", "openai/gpt-5.5") will therefore return undefined, the comparison 300_000 > undefined evaluates to false, recordOverflowDetected is never called, and needsEmergencyRecovery stays false — exactly the same outcome as the unfixed code. The useTempDataHome comment (lines 103–108) explicitly documents the pattern: tests that need model-capability lookup should "write a synthetic models.json into <temp>/opencode/models.json". A fixture entry for openrouter/openai/gpt-5.5 with the correct contextWindow value would let getSdkContextLimit return the 272k limit and allow the test to turn green once the fix lands.
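The `300_000 > undefined` behaviour is standard JavaScript semantics and can be shown in isolation. In this sketch `naiveOverflow` and `checkedOverflow` are hypothetical names, not functions from the plugin; the second variant is one possible way to keep an unresolved limit distinguishable from a prompt that genuinely fits.

```typescript
// Hypothetical helpers; only the comparison semantics are the point here.
function naiveOverflow(oldInputTokens: number, limit: number | undefined): boolean {
  // The cast mirrors what happens at runtime when an unresolved limit
  // reaches the comparison: undefined becomes NaN, so the result is false.
  return oldInputTokens > (limit as number);
}

type OverflowCheck = "overflow" | "fits" | "unknown-limit";

function checkedOverflow(oldInputTokens: number, limit: number | undefined): OverflowCheck {
  if (limit === undefined) return "unknown-limit";
  return oldInputTokens > limit ? "overflow" : "fits";
}

console.log(naiveOverflow(300_000, undefined)); // false
console.log(checkedOverflow(300_000, undefined)); // "unknown-limit"
console.log(checkedOverflow(300_000, 272_000)); // "overflow"
```

With the naive form, a missing fixture and a prompt that fits are indistinguishable, which is why the test would stay red without a resolvable limit for the new model.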


Comment on lines +2577 to +2586
createOpenCodeDbForTransform(sessionId, [
    { id: "m-raw-1", role: "user", text: "earlier work" },
    {
        id: "m-raw-assistant-old",
        role: "assistant",
        text: "old model response",
        providerID: "openrouter",
        modelID: "z-ai/glm-5.2",
    },
    { id: "m-raw-2", role: "user", text: "continue" },

P1 OpenCode DB message ID mismatches the in-flight messages array

The OpenCode DB is seeded with id: "m-raw-assistant-old" for the assistant row, but the in-flight messages array uses id: "m-assistant-old". Every other test in the file uses the same IDs in both the DB and the live messages so that message-index reconciliation can cross-reference them. Here the reconciler will look for m-assistant-old in the DB, find only m-raw-assistant-old, and treat the in-flight assistant message as unreconciled. Depending on how the historian eligibility / head-tail boundary logic uses the reconciled index, this can cause the tool-output tag to be excluded from eligible history even after the 95% bump fires, so toolTag?.status may stay "active" and the second assertion fails for a reason unrelated to the bug under test.
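A fixture mismatch of this kind can be caught with a small consistency check before the transform runs. The sketch below is illustrative only: `findUnseededIds` is a hypothetical helper, and the ID lists contain just the IDs visible in the quoted excerpts, which are truncated, so the real fixtures may hold more entries.

```typescript
// Hypothetical helper: live message IDs that have no matching row in the seeded DB.
function findUnseededIds(seededIds: readonly string[], liveIds: readonly string[]): string[] {
  const seeded = new Set(seededIds);
  return liveIds.filter((id) => !seeded.has(id));
}

// Only the IDs visible in the quoted excerpts (both excerpts are truncated).
const seededIds = ["m-raw-1", "m-raw-assistant-old", "m-raw-2"];
const liveIds = ["m-user", "m-assistant-old"];

const unseeded = findUnseededIds(seededIds, liveIds);
console.log(unseeded); // ["m-user", "m-assistant-old"]
```

An assertion that this list is empty, placed at the top of the test, would surface the mismatch directly instead of through a failing tool-tag assertion.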

Comment on lines +2637 to +2653
const messages: TestMessage[] = [
    {
        info: { id: "m-user", role: "user", sessionID: sessionId },
        parts: [{ type: "text", text: "continue" }],
    },
    {
        info: {
            id: "m-assistant-old",
            role: "assistant",
            providerID: "openrouter",
            modelID: "z-ai/glm-5.2",
        },
        parts: [
            { type: "text", text: "ok" },
            { type: "tool", callID: "call-1", state: { output: bigOutput } },
        ],
    },

P2 Assistant message in the live messages array is missing sessionID

The user message correctly carries sessionID: sessionId in its info, but the assistant message (m-assistant-old) does not. Several other tests in the file omit sessionID on non-first messages, so this may be harmless if the transform derives the session from the first message. However, the tagger may scope tags by the sessionID found on the message's info. A missing sessionID on the tool-bearing message could cause the resulting tag to be written without session scope, or not to be returned by getTagsBySession(db, sessionId). The toolTag lookup would then be undefined, and the second assertion would fail for a reason unrelated to the bug under test.
