Skip to content

fix(connect): close TOCTOU race in connect sync lock acquisition#113

Open
omergk28 wants to merge 1 commit into
ActiveMemory:mainfrom
omergk28:fix/93-connect-sync-lock-toctou
Open

fix(connect): close TOCTOU race in connect sync lock acquisition#113
omergk28 wants to merge 1 commit into
ActiveMemory:mainfrom
omergk28:fix/93-connect-sync-lock-toctou

Conversation

@omergk28

@omergk28 omergk28 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

loadState in internal/cli/connection/core/sync guarded ctx connect sync with a stat-then-write pair — racy by construction. Two processes racing past the os.Stat could both acquire, both load the same LastSequence, and both write duplicate entries into .context/hub/.

This PR replaces the pair with the atomic create-or-fail primitive the codebase already ships for exactly this purpose: io.SafeTryLock (O_CREATE|O_EXCL, one syscall) + io.SafeUnlock, with prior art at internal/cli/dream/core/pass/pass.go:62-70. !acquired maps to the pre-existing os.ErrExist contract, so caller-facing behavior is unchanged.

Closes #93.

Changes

  • internal/cli/connection/core/sync/state.go — atomic lock acquisition; release delegates to SafeUnlock (warn-on-failure logging kept)
  • internal/config/hub/{hub.go,doc.go}LockSentinel removed: the lock is the file's existence, not its content, and the racy write was the constant's only consumer
  • internal/cli/connection/core/sync/state_test.go — new; the package previously had zero tests
  • specs/fix-connect-sync-lock-toctou.md — spec per the project's spec-per-commit invariant

Tests (mapped to the issue's Tests Required)

Issue ask Test
N goroutines, exactly one succeeds, N−1 get os.ErrExist TestLoadState_RejectsConcurrentSyncs (16-way; winners held until all results are collected, so the count is deterministic)
Lock released after failure so the next attempt proceeds TestLoadState_ReleaseRemovesLock + TestLoadState_ReleasesLockOnCorruptState (post-acquisition failure must not leak the lock)
Lock lives at <ctxDir>/hub/.sync.lock TestLoadState_LockFileLocation

Verification

  • Mutation check: the new contention test was run against the old stat-then-write code — it fails with winners = 16, want exactly 1, proving both that the race is real and that the test catches it. Against the fix: exactly 1 winner, 15 × os.ErrExist.
  • go test -race -count=10 ./internal/cli/connection/core/sync/ — clean.
  • Full CI suite green (fmt, vet, golangci-lint, style, drift, docs, full test suite, shellcheck).

Notes

🤖 Generated with Claude Code

The sync lock was a stat-then-write: two processes racing past the
existence check could both acquire, both load the same LastSequence,
and both write duplicate entries into .context/hub/. Replace the
pair with the atomic io.SafeTryLock (O_CREATE|O_EXCL, single syscall)
and release via io.SafeUnlock — the same primitive the dream pass
already uses, preserving the os.ErrExist contract for callers.

- LockSentinel removed: the lock is the file's existence, not its
  content, and the racy write was the constant's only consumer.
- state_test.go regression-pins the contract: 16-way contention
  yields exactly one winner, release frees the next sync, a corrupt
  state file does not leak the lock, and the lock path stays at
  <ctxDir>/hub/.sync.lock.

Closes ActiveMemory#93.

Spec: specs/fix-connect-sync-lock-toctou.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Omer Kocaoglu <omergk28@gmail.com>
@josealekhine

Copy link
Copy Markdown
Member

Thanks for your contributions @omergk28; as usual, your rock 🚀 .

Bottom line: textbook-correct concurrency fix, excellent tests.

The fix is correct, and I checked the part that matters

The old code was a genuine TOCTOU — os.Stat (check) then SafeWriteFile (use), with an unbounded window where two syncs both see "no lock" and both proceed:

// before
if _, statErr := os.Stat(lockPath); statErr == nil {
    return s, nil, os.ErrExist
}
if writeErr := io.SafeWriteFile(lockPath, []byte(cfgHub.LockSentinel), fs.PermFile); writeErr != nil {
    return s, nil, writeErr
}

The fix swaps it for io.SafeTryLock, and I confirmed that primitive is genuinely atomic rather than another stat-then-write:

// internal/io/security.go — the load-bearing primitive
f, openErr := os.OpenFile(clean, os.O_CREATE|os.O_EXCL|os.O_WRONLY, perm)
if openErr != nil {
    if os.IsExist(openErr) { return false, nil }   // already held → (false, nil)
    return false, openErr                            // real error → (false, err)
}
return true, f.Close()                               // acquired → (true, nil)

O_CREATE|O_EXCL is the canonical single-syscall create-or-fail, and the return semantics are exactly what state.go consumes:

// internal/cli/connection/core/sync/state.go — after
acquired, lockErr := io.SafeTryLock(lockPath, fs.PermFile)
if lockErr != nil {
    return s, nil, lockErr
}
if !acquired {
    return s, nil, os.ErrExist   // preserves the caller-facing contract
}

No window remains.

Things I checked that hold up

  • Lock released on every post-acquire error path. loadState calls release() before returning on both the read-error and corrupt-JSON paths, so the lock can't leak (and TestLoadState_ReleasesLockOnCorruptState pins it).

  • No nil-defer panic. The caller returns early on error before deferring release:

    // internal/cli/connection/core/sync/sync.go (Run)
    syncState, releaseLock, stateErr := loadState()
    if stateErr != nil {
        return stateErr     // returns before the defer; releaseLock is nil here
    }
    defer releaseLock()     // only runs when non-nil; lock held for the whole sync
  • Clean removal. LockSentinel is dead after the change and fully removed (const + package-doc inventory); no dangling references anywhere in internal/.

  • Reuses an established primitive. SafeTryLock/SafeUnlock already back internal/cli/dream/core/pass/pass.go:62-70, so this brings connect-sync in line with existing prior art rather than inventing a new lock.

  • Lock and state files are distinctFileSyncLock = ".sync.lock" vs FileSyncState = ".sync-state.json".

  • Tests are excellent where there were none before. TestLoadState_RejectsConcurrentSyncs races 16 goroutines and asserts exactly 1 winner / 15 os.ErrExist, and deliberately holds the winner's lock until all attempts finish (so a late re-acquire can't skew the count). It passes under -race, stressed 5×. Plus release-removes-lock, release-on-corrupt-state, and lock-location tests.


Findings

1. Stale-lock wedge + opaque error

A crashed or SIGKILL'd sync leaves .sync.lock behind, and now that mutual exclusion is reliable, every subsequent sync hard-fails with os.ErrExist until the file is removed by hand. The spec acknowledges stale-lock recovery (PID-in-lockfile / age-based) as out-of-scope — fair — but making the lock correct raises the stakes of that deferred work. Compounding it, Run returns the raw os.ErrExist:

// sync.go
syncState, releaseLock, stateErr := loadState()
if stateErr != nil {
    return stateErr   // user sees "file already exists", no path hint
}

So a wedged lock surfaces as "file already exists" with no hint to remove <ctxDir>/hub/.sync.lock. The opaque message is pre-existing (the old code returned os.ErrExist too), but it's a cheap win to wrap it with the lock path + an "another sync is running, or remove the stale lock" hint — even ahead of full recovery. Low-medium; follow-up, not a blocker.

2. (theoretical) Lock leak if f.Close() fails after a successful create

In SafeTryLock, a create-succeeds-but-Close-fails returns (true, closeErr); the caller treats a non-nil error as "not acquired" and returns with no release func, yet the file was created → leaked lock. Close() on a freshly O_EXCL-created empty local file essentially never fails (only exotic ENOSPC-on-close / NFS). Pre-existing primitive; negligible. Footnote only.

3. (footnote) NFS caveat

O_EXCL atomicity holds on local filesystems and modern NFSv3+/v4; only ancient NFSv2 is unsafe. The lock lives in the project's .context/hub/, virtually always local. Negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ctx connect sync lock has a TOCTOU race; concurrent syncs can both pass the check

2 participants