Reclaim hotblocks disk: range-tombstone cleanup + startup DeleteFilesInRange by mo4islona · Pull Request #79 · subsquid/data

mo4islona · 2026-06-25T11:48:03Z

What & why

Fixes the NET-819 mainnet full-disk incident and NET-798 (blocking disk cleanup on startup).

During the incident the hotblocks volume hit 100% and compaction could not reclaim space: on a full disk, compaction deadlocks (it needs scratch space to write merged output before it can delete its inputs), and write-based deletion (tombstones) only grows the DB until compaction runs. The one operation that frees space without writing is DeleteFilesInRange, which unlinks whole SST files at the filesystem level.

Approach

Two independent mechanisms.

Phase 1 — routine logical cleanup (logical_cleanup, every 10s). Replaces the old per-key delete loop with a single range tombstone per deleted table. Snapshot-safe (range tombstones respect MVCC — no grace needed). Frees no space directly; it lets compaction drop the data. This is the normal-operation reclaim path.
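To illustrate why one range tombstone can stand in for the per-key loop, here is a minimal sketch that computes the half-open key range covering a whole table. It assumes every key of a table starts with that table's id bytes; this key layout and the name `table_key_range` are hypothetical and are not taken from the PR.

```rust
/// Hypothetical key layout: every key of a table starts with the table id bytes.
/// Returns the half-open range [start, end) that covers all keys with that prefix.
/// `end` is None when the prefix consists only of 0xFF bytes (or is empty),
/// in which case the range is unbounded above.
fn table_key_range(table_id: &[u8]) -> (Vec<u8>, Option<Vec<u8>>) {
    let start = table_id.to_vec();
    let mut end = table_id.to_vec();
    // Strip trailing 0xFF bytes, then increment the last remaining byte:
    // the result is the smallest key greater than every key with the prefix.
    while let Some(&last) = end.last() {
        if last == 0xFF {
            end.pop();
        } else {
            *end.last_mut().unwrap() += 1;
            return (start, Some(end));
        }
    }
    (start, None)
}
```

Under that assumption, a single range delete over `[start, end)` logically removes the table regardless of how many keys it holds, which is what makes it cheaper than a delete per key.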

Compaction tuning (now load-bearing, since routine reclaim relies entirely on it):

- compact-on-deletion collector + 24h periodic compaction → find tombstone-heavy / stale SSTs;
- max_background_jobs=8, max_subcompactions=4, level_compaction_dynamic_level_bytes → the default 2 background jobs could not keep up during the incident.
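The idea behind a compact-on-deletion collector can be sketched as a sliding-window check: a file is flagged for compaction once enough deletion entries appear close together. The function below is a simplified model for illustration only, not RocksDB's implementation; the name `needs_compaction` and the window and trigger parameters are assumptions.

```rust
use std::collections::VecDeque;

/// Simplified model: returns true if any `window` consecutive entries
/// contain at least `trigger` deletion entries.
/// `is_deletion[i]` is true when entry i is a deletion.
/// A zero `window` or `trigger` is treated as "disabled" in this sketch.
fn needs_compaction(is_deletion: &[bool], window: usize, trigger: usize) -> bool {
    if window == 0 || trigger == 0 {
        return false;
    }
    let mut recent: VecDeque<bool> = VecDeque::with_capacity(window);
    let mut deletions_in_window = 0usize;
    for &d in is_deletion {
        // Evict the oldest entry once the window is full.
        if recent.len() == window {
            if let Some(true) = recent.pop_front() {
                deletions_in_window -= 1;
            }
        }
        recent.push_back(d);
        if d {
            deletions_in_window += 1;
        }
        if deletions_in_window >= trigger {
            return true;
        }
    }
    false
}
```

A file with scattered deletions stays below the trigger, while a dense run of deletions flags it, which is the behaviour that lets compaction find tombstone-heavy files without waiting for the periodic pass.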

Physical reclaim — startup only (reclaim_disk_space). Unlinks whole CF_TABLES SST files below the min-live-TableId watermark via DeleteFilesInRange. No writes, no scratch space → frees disk even at 100%, where compaction deadlocks. This is the "reclaim before ingest" step (NET-798). Crash-orphaned DIRTY_TABLES markers are purged at startup so they don't pin the watermark.
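The watermark rule can be sketched as follows. This is a toy model under stated assumptions: table ids are plain `u64` values, each SST file covers an inclusive range of table ids, and the names `SstFile`, `watermark`, and `reclaimable` are hypothetical rather than the PR's actual types.

```rust
/// Hypothetical stand-in for an SST file in CF_TABLES: it covers the
/// inclusive table-id range [smallest, largest].
#[derive(Debug, Clone, PartialEq)]
struct SstFile {
    name: &'static str,
    smallest: u64,
    largest: u64,
}

/// Watermark = smallest table id still referenced by a chunk or by a
/// dirty-table marker. None means nothing is referenced at all.
fn watermark(chunk_tables: &[u64], dirty_tables: &[u64]) -> Option<u64> {
    chunk_tables.iter().chain(dirty_tables.iter()).copied().min()
}

/// Files that lie entirely below the watermark. A file that straddles the
/// watermark is kept whole, since a file-level delete can only drop files
/// that are fully inside the requested range.
fn reclaimable(files: &[SstFile], watermark: u64) -> Vec<&'static str> {
    files
        .iter()
        .filter(|f| f.largest < watermark)
        .map(|f| f.name)
        .collect()
}
```

In this model a single stale dirty marker with a low table id drags the watermark down and blocks every file above it, which is why purging crash-orphaned markers before computing the watermark matters.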

Trade-off & deferred work

The file unlink ignores snapshots, so it can break an in-flight query whose snapshot predates a deletion. We therefore run it only at startup, where there are no live readers — it never corrupts the DB, only risks an in-flight read, and only at a moment when there are none.

A runtime disk-pressure trigger (free-space threshold via statvfs, or on a write ENOSPC) — the emergency reclaim that compaction cannot deliver at a full disk — is documented in code as a future improvement, not built here.
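For orientation only, the free-space threshold half of such a trigger might reduce to a check like the one below. Nothing like this exists in the PR; the function name and the percentage parameter are hypothetical, and the two block counts are assumed to come from the `f_bavail` and `f_blocks` fields of a `statvfs` result obtained elsewhere.

```rust
/// Hypothetical threshold check: true when the share of blocks available
/// to the process drops below `min_free_percent` of the filesystem.
/// `avail_blocks` and `total_blocks` are assumed to be statvfs f_bavail
/// and f_blocks; a zero-sized filesystem never reports pressure here.
fn under_disk_pressure(avail_blocks: u64, total_blocks: u64, min_free_percent: u64) -> bool {
    if total_blocks == 0 {
        return false;
    }
    // Compare in u128 so the multiplications cannot overflow.
    (avail_blocks as u128) * 100 < (total_blocks as u128) * (min_free_percent as u128)
}
```

Any real trigger would still have to deal with the snapshot trade-off described above, since live readers exist at runtime.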

Changed from the earlier iteration

This supersedes the grace-period design. Removed: per-table deletion timestamps, the wall-clock grace gate (--reclaim-grace-secs), the runtime orphan reaper (--max-build-secs), and the two-phase bookkeeping. That eliminates a class of wall-clock data-loss bugs (clock jumps, slow clients outliving the grace, malformed-value coercion) in exchange for running the unlink only at startup. reclaim_disk_space and Phase 1 are now fully decoupled (watermark vs. work-queue).

Testing

crates/storage/tests/cleanup_reclaim.rs (8 tests): snapshot-safe logical delete, file unlink below the watermark, live-table watermark pinning, idempotency, startup orphan purge, reclaim/Phase-1 decoupling, and the documented "unlink breaks a live pre-deletion reader" trade-off. Full sqd-storage suite green (cleanup 8, compaction 5, database_ops 10, write_read 15, lib 1).

…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on error instead of busy-looping failing writes; startup reclaims unconfigured datasets' files before serving with a zero grace, since no readers exist yet (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fixes the NET-819 full-disk incident and NET-798 (blocking startup cleanup). Two independent mechanisms:

Phase 1 -- routine logical cleanup (every 10s): one range tombstone per deleted table instead of the old per-key delete loop. Snapshot-safe (no grace needed); lets compaction drop the data. This is the normal-operation reclaim path.

Compaction tuning (now load-bearing): compact-on-deletion collector + 24h periodic compaction to find tombstone-heavy / stale SSTs, plus the backlog one-liners max_background_jobs=8, max_subcompactions=4 and level_compaction_dynamic_level_bytes -- the default 2 background jobs could not keep up during the incident.

Physical reclaim -- startup only: reclaim_disk_space unlinks whole SST files below the min-live-TableId watermark via DeleteFilesInRange. No writes and no scratch space, so it frees disk even at 100% where compaction deadlocks. It is the "reclaim before ingest" step. Crash-orphaned dirty markers are purged at startup so they don't pin the watermark.

The file unlink ignores snapshots, so it can break an in-flight pre-deletion query; it therefore runs only at startup, where there are no live readers. A runtime disk-pressure trigger (free-space threshold or write ENOSPC) is left as a documented future improvement -- it is the emergency reclaim compaction cannot deliver at a full disk.

This supersedes the earlier grace-period design: per-table deletion timestamps, the wall-clock grace gate, and the runtime orphan reaper are all removed, along with their CLI flags. That eliminates a class of wall-clock data-loss bugs (clock jumps, slow clients outliving the grace) in exchange for running the unlink only at startup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LeUumqwJBHzqtFM3x3uZXQ

mo4islona force-pushed the sync-dataset-cleanup-flag branch from 0e7ab30 to be14d62 on June 25, 2026 15:08

mo4islona changed the title ~~Add sync_dataset_cleanup flag to defer table purge~~ Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) Jun 25, 2026

mo4islona force-pushed the sync-dataset-cleanup-flag branch 2 times, most recently from b19920a to 7659b7c on June 25, 2026 15:22

mo4islona changed the title ~~Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798)~~ Reclaim hotblocks disk via DeleteFilesInRange + range deletes Jun 25, 2026

mo4islona force-pushed the sync-dataset-cleanup-flag branch 5 times, most recently from 98eac3f to 63dddfc on June 26, 2026 08:41

mo4islona changed the title ~~Reclaim hotblocks disk via DeleteFilesInRange + range deletes~~ Reclaim hotblocks disk: range-tombstone cleanup + startup DeleteFilesInRange Jun 26, 2026

mo4islona marked this pull request as ready for review June 26, 2026 09:25

mo4islona force-pushed the sync-dataset-cleanup-flag branch from 63dddfc to 718995e on June 26, 2026 09:27

mo4islona wants to merge 1 commit into master from sync-dataset-cleanup-flag
