fix: add Cassandra compaction strategies and improve GC reliability #130
Merged
Production incident (2026-06-17): Indy Cassandra cluster experienced 187:1 tombstone ratios, 41-61s GC pauses, and build promotion 504 timeouts caused by default SizeTieredCompactionStrategy on all tables and GC logic that only processes 1 of 24 hourly partitions.

Changes:
- Add compaction strategies to all CREATE TABLE statements:
  - reclaim: TimeWindowCompactionStrategy (time-series, deletion-heavy)
  - pathmap, reversemap, filechecksum: LeveledCompactionStrategy
  - reclaim gc_grace_seconds set to 14400 (4 hours)
- Fix listOrphanedFiles() to query all 24 hourly partitions instead of only the current hour, with per-partition error handling
- Fix listOrphanedFiles() to use executeSession() wrapper for Cassandra reconnection handling on NoHostAvailableException
- Improve gc() error handling with consecutive failure counting that escalates from WARN to ERROR after 3 failures

Backward compatible: CREATE TABLE IF NOT EXISTS won't alter existing tables. Existing deployments need ALTER TABLE to apply new strategies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
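For reviewers who want the GC changes in one place, here is a minimal sketch of the two behaviours described in the commit message: scanning all 24 hourly partitions with per-partition error handling, and counting consecutive failures to pick the log level. It is not the code in this PR. The class name, the `IntFunction` stand-in for the Cassandra query, the in-memory `log` list, and the rule that a cycle counts as failed only when every partition fails are all assumptions made for illustration. It also assumes the third consecutive failure is the first one logged at ERROR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Sketch only: class, method, and field names are hypothetical, not the Indy sources.
class HourlyGcSketch {
    static final int PARTITIONS = 24;      // assumed: one reclaim partition per hour of day
    static final int ERROR_THRESHOLD = 3;  // assumed: the 3rd consecutive failure logs at ERROR

    int consecutiveFailures = 0;
    final List<String> log = new ArrayList<>(); // stands in for a real logger

    // Queries every hourly partition; a failing partition is recorded and skipped.
    static List<String> listOrphanedFiles(IntFunction<List<String>> queryPartition,
                                          List<Integer> failedPartitions) {
        List<String> orphans = new ArrayList<>();
        for (int hour = 0; hour < PARTITIONS; hour++) {
            try {
                orphans.addAll(queryPartition.apply(hour));
            } catch (RuntimeException e) {
                failedPartitions.add(hour);
            }
        }
        return orphans;
    }

    // Runs one GC cycle and returns the number of orphans found, or -1 if the cycle failed.
    int gc(IntFunction<List<String>> queryPartition) {
        List<Integer> failed = new ArrayList<>();
        List<String> orphans = listOrphanedFiles(queryPartition, failed);
        if (failed.size() == PARTITIONS) {
            // Nothing could be read: count the cycle as failed and pick the log level.
            consecutiveFailures++;
            String level = consecutiveFailures >= ERROR_THRESHOLD ? "ERROR" : "WARN";
            log.add(level + ": gc failed, consecutive failures = " + consecutiveFailures);
            return -1;
        }
        consecutiveFailures = 0; // any successful cycle resets the counter
        return orphans.size();
    }
}
```

In this sketch a single unreachable partition costs only that partition's results, which is the point of the per-partition handling: the other 23 hours are still processed in the same cycle.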
jdcasey reviewed on Jun 23, 2026
jdcasey (Member) left a comment:
Just one question about the code. Longer term, it would be better to execute the database setup as a separate cqlsh script so Indy could run with reduced permissions, but this is good as a patch, provided that BoundStatement is safe without a call to close it. Maybe try-with-resources would be good there? I'm not familiar with how Cassandra's API works there.
jdcasey approved these changes on Jun 23, 2026
Author (Member) commented:
I created the cronjob for this compaction to run on critical Cassandra events. PR/MR incoming soon.
Summary
- Fix listOrphanedFiles() to query all 24 hourly partitions instead of only the current hour
- Fix listOrphanedFiles() to use executeSession() wrapper for Cassandra reconnection handling
- Improve gc() error handling with consecutive failure counting (escalates WARN → ERROR after 3 failures)

Context
Production incident on 2026-06-17 in indy--runtime-int:
indystorage.reclaim
Root causes in this codebase:
Operational workarounds applied on prod (still running):
This PR ensures new deployments get correct configuration from the start.
Backward Compatibility
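The commit message notes that CREATE TABLE IF NOT EXISTS won't alter existing tables, so existing deployments need ALTER TABLE to pick up the new strategies. A minimal sketch of generating those statements follows. It assumes the table-to-strategy mapping listed in the commit message and takes the keyspace as a parameter; the class and method names are hypothetical and not part of this PR. No TimeWindowCompactionStrategy window options are set because the PR text does not state them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a hypothetical helper, not code from this PR.
class CompactionMigrationSketch {

    // Builds one ALTER TABLE statement per table, in a fixed order.
    static List<String> alterStatements(String keyspace) {
        // Table -> compaction class, as listed in the commit message.
        Map<String, String> strategies = new LinkedHashMap<>();
        strategies.put("reclaim", "TimeWindowCompactionStrategy");
        strategies.put("pathmap", "LeveledCompactionStrategy");
        strategies.put("reversemap", "LeveledCompactionStrategy");
        strategies.put("filechecksum", "LeveledCompactionStrategy");

        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, String> entry : strategies.entrySet()) {
            StringBuilder cql = new StringBuilder("ALTER TABLE ")
                    .append(keyspace).append('.').append(entry.getKey())
                    .append(" WITH compaction = {'class': '")
                    .append(entry.getValue()).append("'}");
            if (entry.getKey().equals("reclaim")) {
                // 14400 seconds = 4 hours, the value this PR sets for reclaim.
                cql.append(" AND gc_grace_seconds = 14400");
            }
            statements.add(cql.append(';').toString());
        }
        return statements;
    }

    public static void main(String[] args) {
        // Print the statements so an operator can review them before running them in cqlsh.
        alterStatements("indystorage").forEach(System.out::println);
    }
}
```

Printing the statements instead of executing them keeps the schema change a deliberate operator step, which fits the review suggestion to run database setup as a separate cqlsh script.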
Jira: SPRE-5528
🤖 Generated with Claude Code