perf(extract): compress shards with zstd instead of snappy by ak2k · Pull Request #1 · Mearman/OpenAlex

ak2k · 2026-06-20T16:24:07Z

Switches the extracted-shard Parquet writer from snappy to zstd (level 3).

Why

On a 184k-work sample (all 19 relationship tables, the 2026-03-29 works partition), total shard size drops 27%:

codec	size	vs snappy	pyarrow read	duckdb read
snappy	61.6 MB	—	0.09 s	0.012 s
zstd:3	45.2 MB	−27%	0.08 s	0.011 s
zstd:6	42.6 MB	−31%	0.07 s	0.011 s
zstd:9	41.4 MB	−33%	0.07 s	0.011 s

(The two read columns time a full-table pyarrow.read_table across all shards and a DuckDB GROUP BY over work_referenced_works.)
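For anyone checking the "vs snappy" column, the percentages follow directly from the sizes in the table. A minimal sketch of the arithmetic (the dictionary and function names are illustrative, not part of the PR):

```python
# Shard sizes in MB, copied from the benchmark table above.
sizes_mb = {"snappy": 61.6, "zstd:3": 45.2, "zstd:6": 42.6, "zstd:9": 41.4}


def reduction_vs_snappy(codec: str) -> float:
    """Percentage size reduction relative to the snappy baseline."""
    baseline = sizes_mb["snappy"]
    return (baseline - sizes_mb[codec]) / baseline * 100


for codec in ("zstd:3", "zstd:6", "zstd:9"):
    # Rounds to the 27%, 31% and 33% quoted in the table.
    print(f"{codec}: -{reduction_vs_snappy(codec):.1f}%")
```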

The tables are mostly integer ID columns that Parquet already dictionary/RLE-encodes, so the codec sees less raw redundancy than it would on prose — hence ~30% rather than the larger ratios zstd shows on text. Reads are no slower (slightly faster — the smaller files offset the marginally higher decode cost), and extraction is JSON-parse-bound, so the extra compression CPU is under 5% of wall-clock at level 3. The payoff is storage and HuggingFace upload bandwidth at snapshot scale.

Compatibility

Parquet records the compression codec in each file's own metadata, so existing snappy shards stay readable and only newly written shards use zstd. No re-extraction is required, and the two can coexist indefinitely.

Level 3 is a balance point; levels 6 and 9 buy a few more percent at increasing write CPU cost. If you'd rather trade more CPU for smaller files, it's a one-line change via the new _PARQUET_COMPRESSION_LEVEL constant. The worker memory-model comment already assumed a zstd compressor, so this also makes the code match it.

Switch the Parquet writer from snappy to zstd (level 3). On a 184k-work sample (all 19 relationship tables, 2026-03-29 partition) total shard size drops 27% (61.6 MB -> 45.2 MB); at higher levels zstd:6 is -31% and zstd:9 is -33%. Reads are no slower: a pyarrow full-table read and a DuckDB group-by over work_referenced_works are both within noise of snappy (smaller files offset the marginally higher decode cost). Extraction is JSON-parse-bound, so the added compression CPU is <5% of wall-clock at level 3. The win is mostly storage and HuggingFace upload bandwidth at snapshot scale. Parquet records the codec in each file's metadata, so existing snappy shards stay readable and only newly written shards use zstd; no re-extraction is needed. The worker memory-model comment already assumed a zstd compressor; this makes the code match it.

This was referenced Jun 20, 2026

Add a DuckDB-queryable edge-list Parquet for the citation graph #4

Open

feat(build_csr): export a queryable edge-list Parquet (by_src) #5

Open

ak2k wants to merge 1 commit into Mearman:main from ak2k:zstd-compression