perf(extract): compress shards with zstd instead of snappy #1
Open
ak2k wants to merge 1 commit into
Switch the Parquet writer from snappy to zstd (level 3). On a 184k-work sample (all 19 relationship tables, 2026-03-29 partition) total shard size drops 27% (61.6 MB -> 45.2 MB); at higher levels zstd:6 is -31% and zstd:9 -33%. Reads are no slower — pyarrow full-table read and a DuckDB group-by over work_referenced_works are both within noise of snappy (smaller files offset the marginally higher decode cost). Extraction is JSON-parse-bound, so the added compression CPU is <5% of wall-clock at level 3. The win is mostly storage and HuggingFace upload bandwidth at snapshot scale. zstd is self-describing per file, so existing snappy shards stay readable and only newly written shards use zstd — no re-extraction needed. The worker memory model comment already assumed a zstd compressor; this makes the code match it.
This was referenced Jun 20, 2026
Switches the extracted-shard Parquet writer from snappy to zstd (level 3).
Why
On a 184k-work sample (all 19 relationship tables, the 2026-03-29 works partition), total shard size drops 27% (61.6 MB -> 45.2 MB). Read performance was checked two ways: a full-table pyarrow.read_table across all shards, and a DuckDB GROUP BY over work_referenced_works.
The tables are mostly integer ID columns that Parquet already dictionary/RLE-encodes, so the codec sees less raw redundancy than it would on prose, hence ~30% rather than the larger ratios zstd shows on text. Reads are no slower (slightly faster, because the smaller files offset the marginally higher decode cost), and extraction is JSON-parse-bound, so the extra compression CPU is under 5% of wall-clock at level 3. The payoff is storage and HuggingFace upload bandwidth at snapshot scale.
Compatibility
zstd is self-describing per file, so existing snappy shards stay readable and only newly written shards use zstd — no re-extraction required, and the two coexist indefinitely.
Level 3 is a balance point; levels 6 and 9 buy a few more percent of size reduction at a growing write-CPU cost. If you'd rather trade more CPU for smaller files, it's a one-line change via the new _PARQUET_COMPRESSION_LEVEL constant. The worker memory-model comment already assumed a zstd compressor, so this also makes the code match it.