Reclaim hotblocks disk: range-tombstone cleanup + startup DeleteFilesInRange #79
Open
mo4islona wants to merge 1 commit into
mo4islona added a commit that referenced this pull request on Jun 25, 2026
…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on error instead of busy-looping failing writes; startup reclaims unconfigured datasets' files before serving with a zero grace, since no readers exist yet (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0e7ab30 to
be14d62
Compare
b19920a to
7659b7c
Compare
98eac3f to
63dddfc
Compare
Fixes the NET-819 full-disk incident and NET-798 (blocking startup cleanup). Two independent mechanisms:

Phase 1 -- routine logical cleanup (every 10s): one range tombstone per deleted table instead of the old per-key delete loop. Snapshot-safe (no grace needed); lets compaction drop the data. This is the normal-operation reclaim path.

Compaction tuning (now load-bearing): compact-on-deletion collector + 24h periodic compaction to find tombstone-heavy / stale SSTs, plus the backlog one-liners max_background_jobs=8, max_subcompactions=4 and level_compaction_dynamic_level_bytes -- the default 2 background jobs could not keep up during the incident.

Physical reclaim -- startup only: reclaim_disk_space unlinks whole SST files below the min-live-TableId watermark via DeleteFilesInRange. No writes and no scratch space, so it frees disk even at 100% where compaction deadlocks. It is the "reclaim before ingest" step. Crash-orphaned dirty markers are purged at startup so they don't pin the watermark.

The file unlink ignores snapshots, so it can break an in-flight pre-deletion query; it therefore runs only at startup, where there are no live readers. A runtime disk-pressure trigger (free-space threshold or write ENOSPC) is left as a documented future improvement -- it is the emergency reclaim compaction cannot deliver at a full disk.

This supersedes the earlier grace-period design: per-table deletion timestamps, the wall-clock grace gate, and the runtime orphan reaper are all removed, along with their CLI flags. That eliminates a class of wall-clock data-loss bugs (clock jumps, slow clients outliving the grace) in exchange for running the unlink only at startup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LeUumqwJBHzqtFM3x3uZXQ
63dddfc to
718995e
Compare
What & why
Fixes the NET-819 mainnet full-disk incident and NET-798 (blocking disk cleanup on startup).
During the incident the hotblocks volume hit 100% and compaction could not reclaim space: at a full disk compaction deadlocks (it needs scratch space to write merged output before deleting its inputs), and write-based deletion (tombstones) only grows the DB until compaction runs. The one operation that frees space without writing is `DeleteFilesInRange`, which unlinks whole SST files at the filesystem level.
Approach
Two independent mechanisms.
Phase 1: routine logical cleanup (`logical_cleanup`, every 10s). Replaces the old per-key delete loop with a single range tombstone per deleted table. Snapshot-safe (range tombstones respect MVCC, so no grace is needed). Frees no space directly; it lets compaction drop the data. This is the normal-operation reclaim path.
Compaction tuning (now load-bearing, since routine reclaim relies entirely on it): `max_background_jobs=8`, `max_subcompactions=4`, `level_compaction_dynamic_level_bytes` → the default 2 background jobs could not keep up during the incident.
Physical reclaim, startup only (`reclaim_disk_space`). Unlinks whole `CF_TABLES` SST files below the min-live-`TableId` watermark via `DeleteFilesInRange`. No writes, no scratch space → frees disk even at 100%, where compaction deadlocks. This is the "reclaim before ingest" step (NET-798). Crash-orphaned `DIRTY_TABLES` markers are purged at startup so they don't pin the watermark.
Trade-off & deferred work
The file unlink ignores snapshots, so it can break an in-flight query whose snapshot predates a deletion. We therefore run it only at startup, where there are no live readers — it never corrupts the DB, only risks an in-flight read, and only at a moment when there are none.
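The difference between the two deletion paths can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyDb`, `SstFile`, `RangeTombstone` and `demo` are hypothetical names invented here, sequence numbers stand in for snapshots, and nothing below is the sqd-storage or RocksDB API. It shows a reader pinned before the deletion still seeing its data after a range tombstone, and losing it once the file is unlinked.

```rust
use std::ops::Range;

/// One immutable "SST file": every entry is (key, sequence number).
struct SstFile {
    entries: Vec<(u64, u64)>,
}

/// A range tombstone written at sequence `seq`, hiding older entries in `keys`.
struct RangeTombstone {
    keys: Range<u64>,
    seq: u64,
}

/// Toy store (hypothetical, for illustration only).
#[derive(Default)]
struct ToyDb {
    files: Vec<SstFile>,
    tombstones: Vec<RangeTombstone>,
    last_seq: u64,
}

impl ToyDb {
    fn next_seq(&mut self) -> u64 {
        self.last_seq += 1;
        self.last_seq
    }

    /// Write one table's keys as a single file.
    fn write_table(&mut self, keys: Range<u64>) {
        let seq = self.next_seq();
        let entries = keys.map(|k| (k, seq)).collect();
        self.files.push(SstFile { entries });
    }

    /// In this model a snapshot is just the sequence number it was taken at.
    fn snapshot(&self) -> u64 {
        self.last_seq
    }

    /// Logical delete: one tombstone for the whole range (this is a write).
    fn delete_range(&mut self, keys: Range<u64>) {
        let seq = self.next_seq();
        self.tombstones.push(RangeTombstone { keys, seq });
    }

    /// Physical delete: drop files lying wholly inside `keys` (no write,
    /// and no regard for any snapshot).
    fn unlink_files_in_range(&mut self, keys: Range<u64>) {
        self.files
            .retain(|f| !f.entries.iter().all(|(k, _)| keys.contains(k)));
    }

    /// Is `key` readable by a reader pinned at `snapshot`?
    fn visible(&self, key: u64, snapshot: u64) -> bool {
        let entries = self.files.iter().flat_map(|f| f.entries.iter());
        for &(k, seq) in entries {
            if k != key || seq > snapshot {
                continue;
            }
            // A tombstone hides the entry only if the reader can see the tombstone.
            let hidden = self.tombstones.iter().any(|t| {
                t.keys.contains(&k) && t.seq > seq && t.seq <= snapshot
            });
            if !hidden {
                return true;
            }
        }
        false
    }
}

/// Returns (old reader after tombstone, new reader after tombstone,
/// old reader after unlink).
fn demo() -> (bool, bool, bool) {
    let mut db = ToyDb::default();
    db.write_table(0..100);
    let old_snapshot = db.snapshot(); // an in-flight query

    db.delete_range(0..100);
    let old_after_tombstone = db.visible(42, old_snapshot);
    let new_after_tombstone = db.visible(42, db.snapshot());

    db.unlink_files_in_range(0..100);
    let old_after_unlink = db.visible(42, old_snapshot);

    (old_after_tombstone, new_after_tombstone, old_after_unlink)
}
```

In this model `demo()` returns `(true, false, false)`: the tombstone hides the table from new readers only, while the unlink removes it for the pre-deletion reader as well, which is why the unlink is confined to startup.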
A runtime disk-pressure trigger (free-space threshold via `statvfs`, or on a write `ENOSPC`), the emergency reclaim that compaction cannot deliver at a full disk, is documented in code as a future improvement, not built here.
Changed from the earlier iteration
This supersedes the grace-period design. Removed: per-table deletion timestamps, the wall-clock grace gate (`--reclaim-grace-secs`), the runtime orphan reaper (`--max-build-secs`), and the two-phase bookkeeping. That eliminates a class of wall-clock data-loss bugs (clock jumps, slow clients outliving the grace, malformed-value coercion) in exchange for running the unlink only at startup. `reclaim_disk_space` and Phase 1 are now fully decoupled (watermark vs. work-queue).
Testing
`crates/storage/tests/cleanup_reclaim.rs` (8 tests): snapshot-safe logical delete, file unlink below the watermark, live-table watermark pinning, idempotency, startup orphan purge, reclaim/Phase-1 decoupling, and the documented "unlink breaks a live pre-deletion reader" trade-off. Full `sqd-storage` suite green (cleanup 8, compaction 5, database_ops 10, write_read 15, lib 1).
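The watermark behaviour these tests exercise can be sketched as a small pure function. This is an illustrative model, not the code in this PR: `SstSpan`, `live_watermark`, `reclaimable` and `sample_files` are hypothetical names, table ids are assumed to be ordered integers, and treating "nothing live" as "everything reclaimable" is a choice made for the sketch only.

```rust
/// Assumption for this sketch: table ids are ordered integers.
type TableId = u64;

/// One SST file, described by the highest table id whose keys it contains.
struct SstSpan {
    name: &'static str,
    max_table: TableId,
}

/// Lowest table id still referenced by a chunk or by a dirty (in-progress)
/// table. `None` means nothing is live.
fn live_watermark(chunk_tables: &[TableId], dirty_tables: &[TableId]) -> Option<TableId> {
    chunk_tables.iter().chain(dirty_tables).copied().min()
}

/// Files lying wholly below the watermark are safe to unlink; a file that
/// reaches the watermark (a boundary file) is left for compaction.
fn reclaimable(files: &[SstSpan], watermark: Option<TableId>) -> Vec<&'static str> {
    files
        .iter()
        .filter(|f| watermark.map_or(true, |w| f.max_table < w))
        .map(|f| f.name)
        .collect()
}

/// Three example files covering tables up to 3, 6 and 9 respectively.
fn sample_files() -> Vec<SstSpan> {
    vec![
        SstSpan { name: "a", max_table: 3 },
        SstSpan { name: "b", max_table: 6 },
        SstSpan { name: "c", max_table: 9 },
    ]
}
```

With live chunk tables `[8, 12]` and a leftover dirty marker for table `6`, the watermark is `6` and only file `a` qualifies. Once that marker is purged the watermark rises to `8` and `b` qualifies too, while `c` still holds live data and stays. That is the pinning effect the startup orphan purge is meant to avoid.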