fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0 #92
Open
dmmdea wants to merge 2 commits into
…d of silently scoring 0

A nightly sleep cycle could run for weeks emitting held-out 0.0 -> 0.0 (gate reject, zero edits), indistinguishable from "nothing to learn", when the real cause was the codex backend returning an error (expired auth / model unsupported on the account / outdated CLI) that got scored as a failed rollout.

backend (CodexCliBackend):
- split _call into _call_once + a retry wrapper: transient empties/timeouts are retried instead of silently returning "" (mirrors AzureOpenAIBackend's guard);
- on a non-zero exit, surface the reason via last_call_error and return "" rather than leaking the CLI error text as if it were a model response;
- fail fast (no retries) on fatal auth/model/version errors (401, refresh_token_reused, token_expired, "not supported when using Codex with a ChatGPT account", "requires a newer version of Codex").

backend (CliBackend.reflect): retain last_reflect_raw so a no-edits night is diagnosable.

consolidate: ConsolidationResult now carries per-task held-out detail (response, hard/soft, fail_reason) + reflect_raw + call_error.

cycle: write diagnostics.json per cycle so a 0.0 night self-explains instead of being a black box.

tests: 4 new (retry-not-silent-zero, auth-error-surfaced-not-scored, holdout-detail, reflect-raw).

Also gitignore the .skillopt-sleep/ runtime dir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
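The retry-plus-fail-fast behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RetryingBackend`, `call_once`, `is_fatal`, `FATAL_MARKERS`, and `max_attempts` are hypothetical names, and matching fatal errors by substring is an assumption. Only `last_call_error` and the marker strings come from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Marker strings quoted from the commit message; substring matching is an assumption.
FATAL_MARKERS = (
    "401",
    "refresh_token_reused",
    "token_expired",
    "not supported when using Codex with a ChatGPT account",
    "requires a newer version of Codex",
)


def is_fatal(error_text: str) -> bool:
    """Return True when the text contains a marker that retrying cannot fix."""
    return any(marker in error_text for marker in FATAL_MARKERS)


@dataclass
class RetryingBackend:
    """Hypothetical stand-in for a CLI backend (not the real CodexCliBackend)."""

    # call_once performs one CLI invocation and returns (exit_code, stdout, stderr).
    call_once: Callable[[str], Tuple[int, str, str]]
    max_attempts: int = 3
    last_call_error: Optional[str] = None

    def call(self, prompt: str) -> str:
        self.last_call_error = None
        for attempt in range(1, self.max_attempts + 1):
            code, out, err = self.call_once(prompt)
            if code == 0 and out.strip():
                self.last_call_error = None  # clear errors from earlier attempts
                return out
            # Record why the attempt failed; the error text is never returned
            # as if it were a model response.
            reason = err.strip() or out.strip() or "empty response"
            self.last_call_error = f"attempt {attempt}: exit {code}: {reason}"
            if is_fatal(reason):
                break  # auth/model/version problems will not heal on retry
        return ""
```

A caller would score the empty string as before, but can now check `last_call_error` to tell a failed backend call apart from a genuinely empty answer.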
…lout path

Follow-up from a fresh-context review of the prior commit: CodexCliBackend.attempt_with_tools (the rollout path for tool-requiring tasks) ran codex exec inline, swallowed all exceptions, and never set last_call_error, so an auth/model/version failure on the tool path still produced a silent empty->0 with no diagnostic signal, the exact failure class the prior commit fixed for the _call path. Now it surfaces timeout/exception/non-zero-exit via last_call_error (response stays empty; never leaks the CLI error text), so a failed tool rollout shows up in diagnostics.json. Adds a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
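As an illustration of the pattern this follow-up describes (report timeout, exception, or non-zero exit as an error while keeping the response empty), here is a small sketch built on `subprocess.run`. The function name `run_tool_rollout` and its return shape are hypothetical, and the Python interpreter stands in for the real CLI. It is not the actual `attempt_with_tools` implementation.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_tool_rollout(cmd: List[str], timeout_s: float) -> Tuple[str, Optional[str]]:
    """Run a CLI command and return (response, call_error).

    On any failure the response is "" and the reason is reported in call_error,
    so the CLI's error text is never mistaken for a model response.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "", f"timeout after {timeout_s}s"
    except OSError as exc:  # e.g. the executable is missing
        return "", f"exception: {exc!r}"
    if proc.returncode != 0:
        detail = (proc.stderr or proc.stdout).strip()
        return "", f"exit {proc.returncode}: {detail}"
    return proc.stdout, None


# Usage, with the Python interpreter standing in for the real CLI.
ok = run_tool_rollout([sys.executable, "-c", "print('done')"], timeout_s=30)
bad = run_tool_rollout(
    [sys.executable, "-c", "import sys; sys.stderr.write('token_expired'); sys.exit(1)"],
    timeout_s=30,
)
```

Returning the error alongside the response, rather than raising, keeps the rollout loop running while still leaving a trace for the per-cycle diagnostics.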
@dmmdea please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Problem
A nightly sleep cycle could run for a long time emitting held-out `0.000 -> 0.000` (gate reject, zero edits), indistinguishable from "nothing to learn", when the real cause was the `codex` backend returning an error (expired auth / a model unsupported on a ChatGPT account / an outdated CLI) that got scored as a failed rollout. Nothing surfaced the error, so the loop silently never learned, and the per-cycle report gave no way to tell a genuine no-improvement night from a backend that never reached the model.

Fix
- `CodexCliBackend` (`_call` path)
  - Split `_call` into `_call_once` + a retry wrapper: transient empty/timeout calls are retried instead of silently returning `""` (mirrors `AzureOpenAIBackend`'s existing guard).
  - On a non-zero `exec` exit, surface the reason via `last_call_error` and return `""` rather than leaking the CLI's error text as if it were a model response.
  - Fail fast (no retries) on fatal auth/model/version errors: `401 Unauthorized`, `refresh_token_reused`, `token_expired`, `not supported when using Codex with a ChatGPT account`, `requires a newer version of Codex`.
- `CodexCliBackend.attempt_with_tools` (tool-rollout path)
  - Surface failures via `last_call_error` (timeout / exception / non-zero exit) instead of swallowing the exception, so an auth/model/version failure on the tool path is visible in diagnostics rather than a silent empty→0. Response stays empty; the CLI error text is never returned as a "response".
- `CliBackend.reflect`: retain `last_reflect_raw` so a no-edits night is diagnosable (empty/non-JSON optimizer reply vs genuinely no failures).
- `consolidate`: `ConsolidationResult` now carries per-task held-out detail (`response_len`, `response_head`, `hard`/`soft`, `why`) + `reflect_raw` + `call_error`. Observability only; the gate's accept logic and the learning algorithm are unchanged.
- `cycle`: write a `diagnostics.json` per cycle so a `0.0` night self-explains; fully `try`/`except`-wrapped so it can never break the cycle.
- `.gitignore`: ignore the `.skillopt-sleep/` runtime dir.

Why it matters
Turns a silent multi-night failure into a single-glance diagnosis: a backend auth/model/version problem now shows up immediately as a populated `call_error` in `diagnostics.json` instead of masquerading as "the gate found nothing worth adopting."

How it was tested
- `tests/test_sleep_engine.py` suite passes (52 passed, 1 skipped; the only 3 failures are pre-existing Windows path-separator assertions that also fail on `main`). Tests: `_call` retry-not-silent-zero; auth-error surfaced-not-scored + no-retry; tool-path failure surfaced-not-silent; `holdout_detail` capture; `reflect_raw` capture.
- A live run went from `held-out 0.000 -> 0.000 reject` to `0.000 -> 1.000 accept_new_best` (2 edits staged), with a populated `diagnostics.json` (empty `call_error`, real `reflect_raw`, non-empty responses), i.e. the loop produced its first real learning signal once the backend was reachable.
- The `attempt_with_tools` tool-path change is covered by a unit test but was not exercised against a live failing tool-rollout end-to-end.
- No new `ruff` violations (net −1 vs `main`); `py_compile` clean.

Review
Reviewed in a fresh, independent context against the stated intent (which surfaced the tool-path gap, now fixed).
Risk: low
Offline nightly tool; the change is error-handling hardening + read-only observability, not on any request/auth/data/money path, and does not alter the gate or learning algorithm. Worst case for a consumer is an unchanged cycle plus a `diagnostics.json` file.

🤖 Generated with Claude Code
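For readers who want a concrete picture of the per-cycle `diagnostics.json` write mentioned under Fix, a best-effort writer might look like the sketch below. The function name `write_diagnostics`, its parameters, and the payload layout are assumptions made for illustration. Only the file name and the field names `holdout_detail`, `reflect_raw`, and `call_error` appear in this PR.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def write_diagnostics(
    cycle_dir: Path,
    holdout_detail: List[Dict[str, Any]],
    reflect_raw: Optional[str],
    call_error: Optional[str],
) -> bool:
    """Best-effort write of diagnostics.json; returns False instead of raising."""
    try:
        payload = {
            "holdout_detail": holdout_detail,  # per-task held-out detail
            "reflect_raw": reflect_raw,        # raw optimizer reply, if any
            "call_error": call_error,          # populated when the backend failed
        }
        cycle_dir.mkdir(parents=True, exist_ok=True)
        path = cycle_dir / "diagnostics.json"
        # default=str keeps the write from failing on non-JSON-native values.
        path.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
        return True
    except Exception:
        # Observability must never break the cycle itself.
        return False
```

Wrapping the whole body in one `try`/`except` is what lets a caller invoke this at the end of a cycle without any risk of a disk or serialization error aborting the run.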