fix(hub): surface silently swallowed errors in hubsync hook and replication loop#114
fix(hub): surface silently swallowed errors in hubsync hook and replication loop#114omergk28 wants to merge 1 commit into
Conversation
…cation loop The session-start hubsync hook returned "" on config-load, dial, sync-RPC, and write failures with no trace, and conflated a real sync error with the genuine "nothing new" result. The follower replication loop returned silently on dial, stream-open, send, and close-send failures and on every receive error. Operators could not distinguish a healthy quiet sync from one that never reached the network. Wire every silent return through the warn sink with format constants in config/warn. Behavior is otherwise unchanged: both functions keep their signatures and never-block contracts. - Sync now warns per failure site and stays silent only for a genuine zero-entry result (the un-conflation from ActiveMemory#100). - replicateOnce warns per failure site; the receive path stays silent for the eof() end-of-stream (the one deliberate deviation from the issue's proposed code, which would have warned once per replication cycle) and for caller shutdown. - Two ActiveMemory#100 sub-items (close-defer warn, append warn + keep consuming) had already landed; this covers the rest and adds the regression tests the issue asked for, including pinning keep-consuming-after-append-error and clean-EOF-no-warn. Closes ActiveMemory#100. Spec: specs/fix-hub-silent-error-suppression.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Omer Kocaoglu <omergk28@gmail.com>
|
The fix is correct, and the key insight is right Some nits below: The real bug was the conflation of "error" with "nothing new": // internal/cli/system/core/hubsync/sync.go — before
entries, syncErr := client.Sync(context.Background(), cfg.Types, 0)
if syncErr != nil || len(entries) == 0 {
return ""
}The PR un-conflates them so a real error warns while a genuine empty result stays silent: // after
if syncErr != nil {
logWarn.Warn(cfgWarn.HubSyncPull, cfg.HubAddr, syncErr)
return ""
}
if len(entries) == 0 {
// Genuine empty result: not an error, no warning.
return ""
}Every other silent // internal/hub/replicate.go
if recvErr := stream.RecvMsg(msg); recvErr != nil {
if !eof(recvErr) && ctx.Err() == nil {
logWarn.Warn(cfgWarn.HubReplicateRecv, masterAddr, recvErr)
}
return
}…and the spec documents this as a deliberate deviation from issue #100's proposed code (which warned on every recv error and would have made clean cycles noisy). Things I checked that hold up
Findings (all minor / adjacent — none are blockers)1. (adjacent, pre-existing) The "must not block the session" contract isn't actually guaranteedThe doc update reaffirms the contract: // internal/cli/system/core/hubsync/doc.go
// Every error is surfaced as a stderr warning via the warn sink,
// but never propagates: the hook must not block the session start.But the underlying RPC has no deadline: // sync.go — context.Background(), no timeout
entries, syncErr := client.Sync(context.Background(), cfg.Types, 0)
// internal/hub/client.go — lazy NewClient, fail-fast (not WaitForReady), no dial timeout
conn, dialErr := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), ...)A hub that accepts the TCP connection but never responds (hung server / black-hole proxy) makes 2. (forward-looking) Replication warning cadence when wired
3. (trivial)
|
Summary
Two hub code paths swallowed errors with no logging surface — the session-start hubsync hook (
internal/cli/system/core/hubsync) and the cluster replication loop (internal/hub/replicate.go). Operators couldn't tell a healthy quiet sync from one that never reached the network.This PR wires every silent return through the established
warnsink with format constants inconfig/warn, and un-conflates the sync-error check from the empty-result check (only the error warns now). Both functions keep their signatures and never-block contracts — logging is the only behavior change.Closes #100.
Already landed
Two of #100's sub-items were already in main before this PR (the
conn.Closedefer warn and thestore.Appendwarn + keep-consuming, viaCloseHubClient/HubReplicateAppend). This PR covers the rest and adds the regression tests the issue asked for — including pinning the already-landed keep-consuming behavior, which previously had no coverage.One deliberate deviation from the issue's proposed code
The issue's sketch warns on every
RecvMsgerror — butio.EOFis the normal end of every sync stream, so that would emit a warning once perReplicateIntervalon perfectly healthy replication. The receive path here stays silent foreof()(the hub package's own end-of-stream helper, same as the other two RecvMsg sites) and for caller shutdown (ctx.Err() != nil), and warns on everything else.TestReplicateOnce_CleanReplicationDoesNotWarnpins this.Tests (mapped to the issue's Tests Required)
TestSync_WarnsOnLoadErrorTestSync_WarnsOnDialError— empirically, the only eagergrpc.NewClientfailure is a control character failing URL parse; every other malformed target (://invalid…) defers to first use.TestSync_WarnsOnPullErrorcovers the lazy path (closed port → Sync RPC fails)TestSync_NoWarnOnEmptyResult— real in-process hub, zero entries, asserts an empty warn bufferTestReplicateOnce_WarnsOnDialError+TestReplicateOnce_WarnsOnTransportErrorTestReplicateOnce_KeepsConsumingAfterAppendError— read-only follower dir, asserts exactly 2 append warnings (root/Windows skip guards per therc/require_test.goprecedent)All use the existing
warn.SetSinkseam; the hubsync package previously had zero tests.Verification
go test -race -count=2on both packages — clean.syncErrwith the empty check failsTestSync_NoWarnOnEmptyResult; (2) removing the EOF guard failsTestReplicateOnce_CleanReplicationDoesNotWarn; (3) dropping the dial warn failsTestSync_WarnsOnDialError.make audit— all checks passed.🤖 Generated with Claude Code