Skip to content

perf(extract): compress shards with zstd instead of snappy #1

Open
ak2k wants to merge 1 commit into Mearman:main from ak2k:zstd-compression

ak2k commented Jun 20, 2026

Switches the extracted-shard Parquet writer from snappy to zstd (level 3).

Why

On a 184k-work sample (all 19 relationship tables, the 2026-03-29 works partition), total shard size drops 27%:

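| codec  | size    | vs snappy | pyarrow read | duckdb read |
|--------|---------|-----------|--------------|-------------|
| snappy | 61.6 MB |           | 0.09 s       | 0.012 s     |
| zstd:3 | 45.2 MB | −27%      | 0.08 s       | 0.011 s     |
| zstd:6 | 42.6 MB | −31%      | 0.07 s       | 0.011 s     |
| zstd:9 | 41.4 MB | −33%      | 0.07 s       | 0.011 s     |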

(Read columns: "pyarrow read" is a full-table pyarrow.read_table across all shards; "duckdb read" is a DuckDB GROUP BY over work_referenced_works.)
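As a quick check on the "vs snappy" column, the percentages can be recomputed from the reported sizes. This is only arithmetic on the figures above; the function name is illustrative and not part of the PR.

```python
# Shard sizes in MB, copied from the benchmark figures above.
SIZES_MB = {"snappy": 61.6, "zstd:3": 45.2, "zstd:6": 42.6, "zstd:9": 41.4}


def pct_vs_baseline(size_mb: float, baseline_mb: float) -> int:
    """Relative size change against the baseline, rounded to a whole percent."""
    return round((size_mb - baseline_mb) / baseline_mb * 100)


# Compare every zstd level against the snappy baseline.
reductions = {
    codec: pct_vs_baseline(size, SIZES_MB["snappy"])
    for codec, size in SIZES_MB.items()
    if codec != "snappy"
}
print(reductions)  # {'zstd:3': -27, 'zstd:6': -31, 'zstd:9': -33}
```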

The tables are mostly integer ID columns that Parquet already dictionary/RLE-encodes, so the codec sees less raw redundancy than it would on prose — hence ~30% rather than the larger ratios zstd shows on text. Reads are no slower (slightly faster — the smaller files offset the marginally higher decode cost), and extraction is JSON-parse-bound, so the extra compression CPU is under 5% of wall-clock at level 3. The payoff is storage and HuggingFace upload bandwidth at snapshot scale.

Compatibility

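Parquet records the compression codec in each file's metadata (per column chunk), so readers pick the right decompressor automatically. Existing snappy shards stay readable and only newly written shards use zstd; no re-extraction is required, and the two can coexist indefinitely.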

Level 3 is a balance point: levels 6 and 9 shave off a few more percentage points (31% and 33% versus 27%) at the cost of more write CPU. Changing the level is a one-line edit to the new _PARQUET_COMPRESSION_LEVEL constant if you'd rather trade more CPU for smaller files. The worker memory-model comment already assumed a zstd compressor, so this also makes the code match it.

Switch the Parquet writer from snappy to zstd (level 3). On a 184k-work
sample (all 19 relationship tables, 2026-03-29 partition) total shard size
drops 27% (61.6 MB -> 45.2 MB); at higher levels zstd:6 is -31% and zstd:9
-33%. Reads are no slower — pyarrow full-table read and a DuckDB group-by
over work_referenced_works are both within noise of snappy (smaller files
offset the marginally higher decode cost). Extraction is JSON-parse-bound,
so the added compression CPU is <5% of wall-clock at level 3.

The win is mostly storage and HuggingFace upload bandwidth at snapshot
scale. Parquet records the codec in each file's metadata, so existing
snappy shards stay readable and only newly written shards use zstd; no
re-extraction is needed.

The worker memory model comment already assumed a zstd compressor; this
makes the code match it.