fix: add Cassandra compaction strategies and improve GC reliability #130

Merged: geored merged 1 commit into Commonjava:master from geored:fix/cassandra-compaction-and-gc on Jun 24, 2026

geored commented Jun 23, 2026

Summary

  • Add compaction strategies to all CREATE TABLE statements (LCS for pathmap/reversemap/filechecksum, TWCS for reclaim)
  • Fix listOrphanedFiles() to query all 24 hourly partitions instead of only the current hour
  • Fix listOrphanedFiles() to use executeSession() wrapper for Cassandra reconnection handling
  • Improve gc() error handling with consecutive failure counting (escalates WARN → ERROR after 3 failures)
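
The consecutive-failure counting in the last item could be sketched roughly as follows. This is not the code from the PR: the class `GcFailureTracker`, its method names, and the in-memory `log` list (a stand-in for a real logger) are hypothetical, and it assumes that ERROR is logged from the third consecutive failure onward and that a successful cycle resets the count.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: counts consecutive gc() failures and escalates the log level. */
final class GcFailureTracker {
    // Threshold from the PR summary; "third failure onward" is an assumption.
    static final int ERROR_THRESHOLD = 3;

    private int consecutiveFailures = 0;

    // Stand-in for a real logger so the sketch has no dependencies.
    final List<String> log = new ArrayList<>();

    /** Runs one GC cycle; never rethrows, but records every failure. */
    void runCycle(Runnable gcTask) {
        try {
            gcTask.run();
            consecutiveFailures = 0; // a success resets the streak
        } catch (RuntimeException e) {
            consecutiveFailures++;
            String level = consecutiveFailures >= ERROR_THRESHOLD ? "ERROR" : "WARN";
            log.add(level + " gc failed " + consecutiveFailures
                    + " time(s) in a row: " + e.getMessage());
        }
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

Keeping the scheduler thread alive while still surfacing repeated failures is the point of this shape: a single transient error stays at WARN, while a persistent outage becomes visible at ERROR.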

Context

Production incident on 2026-06-17 in indy--runtime-int:

  • 187:1 tombstone-to-live-row ratio on indystorage.reclaim
  • 41-61 second GC pauses on Cassandra nodes
  • Build promotion 504 timeouts blocking multiple product releases
  • Stale maven-metadata.xml and npm proxy socket timeouts

Root causes in this codebase:

  1. No compaction strategy on any CREATE TABLE, so every table defaults to SizeTieredCompactionStrategy (STCS), a poor fit for deletion-heavy workloads
  2. `listOrphanedFiles()` only queries current hour partition — 23 of 24 partitions never processed
  3. `listOrphanedFiles()` bypasses reconnection handling (uses raw session.execute)
  4. `gc()` swallows all exceptions silently with no failure tracking
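
To illustrate root cause 2 and its fix, here is a minimal sketch of scanning all 24 hourly partitions with per-partition error handling. It is not the actual `listOrphanedFiles()` code: the class `HourlyPartitionScan`, the `queryPartition` callback, and the assumption that partitions are keyed by hour of day (0 to 23) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical sketch: scan every hourly partition instead of only the current hour. */
final class HourlyPartitionScan {
    static final int HOURS_PER_DAY = 24;

    /**
     * Queries each hourly partition and merges the results.
     * A failure in one partition is recorded and skipped so the
     * remaining partitions are still processed.
     */
    static List<String> listAll(IntFunction<List<String>> queryPartition,
                                List<Integer> failedPartitions) {
        List<String> orphans = new ArrayList<>();
        for (int hour = 0; hour < HOURS_PER_DAY; hour++) {
            try {
                orphans.addAll(queryPartition.apply(hour));
            } catch (RuntimeException e) {
                failedPartitions.add(hour); // remember which partition failed
            }
        }
        return orphans;
    }
}
```

In a real implementation the callback would presumably go through the `executeSession()` wrapper mentioned in root cause 3 rather than calling the session directly.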

Operational workarounds applied on prod (still running):

  • ALTER TABLE compaction strategies
  • Manual nodetool compact
  • Cassandra node restarts

This PR ensures new deployments get correct configuration from the start.
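
As a rough illustration of what such a configuration could look like, the sketch below builds CREATE TABLE statements with an explicit compaction clause. The class `CompactionCql`, the column definitions, and the one-hour TWCS window are assumptions made for the example and are not taken from the PR; the 14400-second `gc_grace_seconds` value for `reclaim` comes from the commit message.

```java
/** Hypothetical sketch: CQL fragments for tables with explicit compaction settings. */
final class CompactionCql {

    /** WITH clause selecting LeveledCompactionStrategy (LCS). */
    static String leveled() {
        return "WITH compaction = {'class': 'LeveledCompactionStrategy'}";
    }

    /** WITH clause selecting TimeWindowCompactionStrategy (TWCS) plus gc_grace_seconds. */
    static String timeWindow(String unit, int size, int gcGraceSeconds) {
        return "WITH compaction = {'class': 'TimeWindowCompactionStrategy', "
                + "'compaction_window_unit': '" + unit + "', "
                + "'compaction_window_size': '" + size + "'} "
                + "AND gc_grace_seconds = " + gcGraceSeconds;
    }

    /** Assembles a full CREATE TABLE IF NOT EXISTS statement. */
    static String createTable(String keyspace, String table,
                              String columns, String withClause) {
        return "CREATE TABLE IF NOT EXISTS " + keyspace + "." + table
                + " (" + columns + ") " + withClause + ";";
    }
}
```

For example, `CompactionCql.createTable("indystorage", "reclaim", "partition int, deletion timestamp, fileid text, PRIMARY KEY (partition, deletion, fileid)", CompactionCql.timeWindow("HOURS", 1, 14400))` yields a statement for the `reclaim` table; the column list here is illustrative only.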

Backward Compatibility

  • CREATE TABLE IF NOT EXISTS won't alter existing tables
  • Existing deployments need ALTER TABLE to apply new strategies
  • No data migration or API changes
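
For existing deployments, the required ALTER TABLE statements could be generated along these lines. This is a sketch, not part of the PR: the class `CompactionMigration` is hypothetical, the keyspace name is passed in by the caller, and TWCS window options are omitted, so Cassandra's defaults would apply unless they are added.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: ALTER TABLE statements for tables created before this change. */
final class CompactionMigration {

    /** Table-to-strategy mapping as described in the PR summary. */
    static Map<String, String> strategies() {
        Map<String, String> m = new LinkedHashMap<>(); // keeps insertion order
        m.put("pathmap", "LeveledCompactionStrategy");
        m.put("reversemap", "LeveledCompactionStrategy");
        m.put("filechecksum", "LeveledCompactionStrategy");
        m.put("reclaim", "TimeWindowCompactionStrategy");
        return m;
    }

    /** Builds one ALTER TABLE statement per table in the given keyspace. */
    static List<String> alterStatements(String keyspace) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : strategies().entrySet()) {
            out.add("ALTER TABLE " + keyspace + "." + e.getKey()
                    + " WITH compaction = {'class': '" + e.getValue() + "'};");
        }
        return out;
    }
}
```

Calling `CompactionMigration.alterStatements("indystorage")` produces four statements that an operator could review and run through cqlsh.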

Jira: SPRE-5528

🤖 Generated with Claude Code

Production incident (2026-06-17): Indy Cassandra cluster experienced
187:1 tombstone ratios, 41-61s GC pauses, and build promotion 504
timeouts caused by default SizeTieredCompactionStrategy on all tables
and GC logic that only processes 1 of 24 hourly partitions.

Changes:
- Add compaction strategies to all CREATE TABLE statements:
  - reclaim: TimeWindowCompactionStrategy (time-series, deletion-heavy)
  - pathmap, reversemap, filechecksum: LeveledCompactionStrategy
  - reclaim gc_grace_seconds set to 14400 (4 hours)
- Fix listOrphanedFiles() to query all 24 hourly partitions instead
  of only the current hour, with per-partition error handling
- Fix listOrphanedFiles() to use executeSession() wrapper for
  Cassandra reconnection handling on NoHostAvailableException
- Improve gc() error handling with consecutive failure counting
  that escalates from WARN to ERROR after 3 failures

Backward compatible: CREATE TABLE IF NOT EXISTS won't alter existing
tables. Existing deployments need ALTER TABLE to apply new strategies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jdcasey left a comment

Just one question about the code. Longer term, it would be better to execute the database setup as a separate cqlsh script so Indy could run with reduced permissions, but this is good as a patch (provided that BoundStatement is safe without a call to close it; maybe try-with-resources would be good there? I'm not familiar with how Cassandra's API works there).

geored commented Jun 24, 2026

> Indy could run with reduced permissions, but this is good as a patch (providing that BoundStatement is safe without a call to close it...maybe try-with-

I created the cronjob for this compaction to run on critical Cassandra events. PR/MR incoming soon.

geored merged commit 0c7a7e6 into Commonjava:master on Jun 24, 2026
1 check failed