Reclaim hotblocks disk: range-tombstone cleanup + startup DeleteFilesInRange#79

Open: mo4islona wants to merge 1 commit into master from sync-dataset-cleanup-flag

@mo4islona mo4islona commented Jun 25, 2026

Contributor

What & why

Fixes the NET-819 mainnet full-disk incident and NET-798 (blocking disk cleanup on startup).

During the incident the hotblocks volume hit 100% and compaction could not reclaim space. On a full disk, compaction deadlocks: it needs scratch space to write merged output before it can delete its inputs. Write-based deletion (tombstones) only grows the DB until compaction runs. The one operation that frees space without writing is DeleteFilesInRange, which unlinks whole SST files at the filesystem level.

Approach

Two independent mechanisms.

Phase 1 — routine logical cleanup (logical_cleanup, every 10s). Replaces the old per-key delete loop with a single range tombstone per deleted table. Snapshot-safe (range tombstones respect MVCC — no grace needed). Frees no space directly; it lets compaction drop the data. This is the normal-operation reclaim path.
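
To illustrate why one range tombstone can stand in for the per-key loop: if every row key starts with its table's ID, a whole table is one contiguous key range. The sketch below assumes a hypothetical key layout (an 8-byte big-endian TableId prefix); the real encoding in sqd-storage may differ, and `table_key_range` and `in_range` are illustrative names, not functions from this PR.

```rust
/// Hypothetical key layout: [8-byte big-endian table id][row key bytes].
/// Big-endian makes byte-wise key order match numeric table-id order.
fn table_key_range(table_id: u64) -> (Vec<u8>, Option<Vec<u8>>) {
    let start = table_id.to_be_bytes().to_vec();
    // The exclusive upper bound is the next table's prefix. For the
    // maximum id there is no next prefix, so `None` means "to the end".
    let end = table_id
        .checked_add(1)
        .map(|next| next.to_be_bytes().to_vec());
    (start, end)
}

/// True if `key` falls inside the half-open range [start, end).
fn in_range(key: &[u8], start: &[u8], end: Option<&[u8]>) -> bool {
    key >= start && end.map_or(true, |e| key < e)
}
```

Under that assumption, the `(start, end)` pair is what a single range delete would cover, so the cost of deleting a table no longer depends on how many keys it holds.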

Compaction tuning (now load-bearing, since routine reclaim relies entirely on it):

  • compact-on-deletion collector + 24h periodic compaction → find tombstone-heavy / stale SSTs;
  • max_background_jobs=8, max_subcompactions=4, level_compaction_dynamic_level_bytes → the default 2 background jobs could not keep up during the incident.

Physical reclaim — startup only (reclaim_disk_space). Unlinks whole CF_TABLES SST files below the min-live-TableId watermark via DeleteFilesInRange. No writes, no scratch space → frees disk even at 100%, where compaction deadlocks. This is the "reclaim before ingest" step (NET-798). Crash-orphaned DIRTY_TABLES markers are purged at startup so they don't pin the watermark.
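
The watermark rule can be sketched as a small model. This is a simplified illustration, not the code in this PR: `SstFile`, `watermark`, and `reclaimable` are hypothetical names, and each file is reduced to the largest TableId it contains, whereas RocksDB tracks full smallest/largest keys per file.

```rust
/// One SST file, reduced to the highest table id stored in it.
/// (Hypothetical model for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct SstFile {
    name: &'static str,
    largest_table: u64,
}

/// Watermark = smallest table id still referenced by a live chunk or by a
/// dirty-table marker. `None` means nothing is live.
fn watermark(live_chunk_tables: &[u64], dirty_tables: &[u64]) -> Option<u64> {
    live_chunk_tables.iter().chain(dirty_tables).copied().min()
}

/// Only files that lie entirely below the watermark may be unlinked.
/// A file that reaches the watermark still holds live data and is kept.
fn reclaimable(files: &[SstFile], watermark: u64) -> Vec<&SstFile> {
    files.iter().filter(|f| f.largest_table < watermark).collect()
}
```

In this model a stale dirty marker for table 3 gives `watermark(&[10, 12], &[3]) == Some(3)` and pins every file holding table 3 or above. Once the marker is purged the watermark rises to 10, which is why orphaned markers are cleared before the reclaim runs.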

Trade-off & deferred work

The file unlink ignores snapshots, so it can break an in-flight query whose snapshot predates a deletion. We therefore run it only at startup, where there are no live readers — it never corrupts the DB, only risks an in-flight read, and only at a moment when there are none.
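
The difference between the two deletion paths can be shown with a toy visibility model. It is a deliberately simplified sketch, not RocksDB's actual MVCC machinery; the `Table` struct and its fields are hypothetical.

```rust
/// Toy model of one table's visibility to snapshot readers.
struct Table {
    /// Sequence number at which the data was written.
    written_at: u64,
    /// Sequence number of the range tombstone, if one was written.
    tombstoned_at: Option<u64>,
    /// False once the underlying SST files have been unlinked.
    files_present: bool,
}

impl Table {
    /// Can a reader holding a snapshot taken at `snapshot_seq` read the table?
    fn readable_at(&self, snapshot_seq: u64) -> bool {
        let written = self.written_at <= snapshot_seq;
        // A tombstone only hides data from snapshots taken at or after it.
        let deleted = self.tombstoned_at.map_or(false, |t| t <= snapshot_seq);
        // An unlink removes the data for every snapshot, old or new.
        self.files_present && written && !deleted
    }
}
```

With data written at sequence 1 and tombstoned at 5, a snapshot taken at 3 still reads the table, while one taken at 6 does not. After the files are unlinked, the snapshot at 3 loses its data too, which is the trade-off described above.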

A runtime disk-pressure trigger (free-space threshold via statvfs, or on a write ENOSPC) — the emergency reclaim that compaction cannot deliver at a full disk — is documented in code as a future improvement, not built here.
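
As a rough idea of what such a trigger might look like, here is a sketch of the decision function only. It is not part of this PR: `disk_pressure` and `min_free_ratio` are hypothetical names, the free/total byte counts are assumed to come from a statvfs-style call that is not shown, and the ENOSPC value of 28 is the Linux errno.

```rust
use std::io;

/// ENOSPC on Linux; other platforms may use a different errno value.
const ENOSPC_LINUX: i32 = 28;

/// Decide whether an emergency reclaim should be attempted.
/// Fires when the last write failed with ENOSPC, or when the free-space
/// ratio has dropped below `min_free_ratio`.
fn disk_pressure(
    free_bytes: u64,
    total_bytes: u64,
    min_free_ratio: f64,
    last_write_err: Option<&io::Error>,
) -> bool {
    let enospc = last_write_err.and_then(|e| e.raw_os_error()) == Some(ENOSPC_LINUX);
    let low_space =
        total_bytes > 0 && (free_bytes as f64) < (total_bytes as f64) * min_free_ratio;
    enospc || low_space
}
```

Any runtime caller would still have to accept the snapshot trade-off described above, since the unlink itself is unchanged.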

Changed from the earlier iteration

This supersedes the grace-period design. Removed: per-table deletion timestamps, the wall-clock grace gate (--reclaim-grace-secs), the runtime orphan reaper (--max-build-secs), and the two-phase bookkeeping. That eliminates a class of wall-clock data-loss bugs (clock jumps, slow clients outliving the grace period, malformed-value coercion) in exchange for running the unlink only at startup. reclaim_disk_space and Phase 1 are now fully decoupled: the former works from the watermark, the latter from its work queue.

Testing

crates/storage/tests/cleanup_reclaim.rs (8 tests): snapshot-safe logical delete, file unlink below the watermark, live-table watermark pinning, idempotency, startup orphan purge, reclaim/Phase-1 decoupling, and the documented "unlink breaks a live pre-deletion reader" trade-off. Full sqd-storage suite green (cleanup 8, compaction 5, database_ops 10, write_read 15, lib 1).

mo4islona added a commit that referenced this pull request Jun 25, 2026
…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can
actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a
single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of
millions of point deletes, then marks it reclaim-pending. Range tombstones
respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the
live watermark (min live TableId across all chunks + dirty tables) via
DeleteFilesInRange. It performs no writes and needs no scratch space, so it
makes progress even at 100% disk. A per-table deletion timestamp in
CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink
ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so
  compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on
  error instead of busy-looping failing writes; startup reclaims unconfigured
  datasets' files before serving with a zero grace, since no readers exist yet
  (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark
pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and
a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch from 0e7ab30 to be14d62 Compare June 25, 2026 15:08
@mo4islona mo4islona changed the title Add sync_dataset_cleanup flag to defer table purge Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) Jun 25, 2026
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch 2 times, most recently from b19920a to 7659b7c Compare June 25, 2026 15:22
@mo4islona mo4islona changed the title Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) Reclaim hotblocks disk via DeleteFilesInRange + range deletes Jun 25, 2026
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch 5 times, most recently from 98eac3f to 63dddfc Compare June 26, 2026 08:41
@mo4islona mo4islona changed the title Reclaim hotblocks disk via DeleteFilesInRange + range deletes Reclaim hotblocks disk: range-tombstone cleanup + startup DeleteFilesInRange Jun 26, 2026
@mo4islona mo4islona marked this pull request as ready for review June 26, 2026 09:25
Fixes the NET-819 full-disk incident and NET-798 (blocking startup cleanup).

Two independent mechanisms:

Phase 1 -- routine logical cleanup (every 10s): one range tombstone per deleted
table instead of the old per-key delete loop. Snapshot-safe (no grace needed);
lets compaction drop the data. This is the normal-operation reclaim path.

Compaction tuning (now load-bearing): compact-on-deletion collector + 24h
periodic compaction to find tombstone-heavy / stale SSTs, plus the backlog
one-liners max_background_jobs=8, max_subcompactions=4 and
level_compaction_dynamic_level_bytes -- the default 2 background jobs could not
keep up during the incident.

Physical reclaim -- startup only: reclaim_disk_space unlinks whole SST files
below the min-live-TableId watermark via DeleteFilesInRange. No writes and no
scratch space, so it frees disk even at 100% where compaction deadlocks. It is
the "reclaim before ingest" step. Crash-orphaned dirty markers are purged at
startup so they don't pin the watermark.

The file unlink ignores snapshots, so it can break an in-flight pre-deletion
query; it therefore runs only at startup, where there are no live readers. A
runtime disk-pressure trigger (free-space threshold or write ENOSPC) is left as
a documented future improvement -- it is the emergency reclaim compaction cannot
deliver at a full disk.

This supersedes the earlier grace-period design: per-table deletion timestamps,
the wall-clock grace gate, and the runtime orphan reaper are all removed, along
with their CLI flags. That eliminates a class of wall-clock data-loss bugs
(clock jumps, slow clients outliving the grace) in exchange for running the
unlink only at startup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LeUumqwJBHzqtFM3x3uZXQ
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch from 63dddfc to 718995e Compare June 26, 2026 09:27