feat(puffin): support deletion-vector-v1 blob read/write#777
Open
zhaoxuan1994 wants to merge 1 commit into
Pull request overview
Adds end-to-end support for Iceberg Puffin deletion-vector-v1 blobs, including framing/validation, write-side generation of DV Puffin files, and read-side application of DVs through DeleteLoader into DeleteFilter row filtering.
Changes:
- Implement `deletion-vector-v1` blob framing (length + magic + Roaring portable bytes + CRC32) plus spec-compliant `Blob` construction helpers.
- Add `DeletionVectorWriter` to emit one DV blob per referenced data file and return manifest-ready `DataFile` metadata; wire DV loading into `DeleteLoader`.
- Extend Puffin writer metadata defaults (`created-by`) and add comprehensive DV-focused unit tests + build system wiring.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/iceberg/test/puffin_reader_writer_test.cc | Adds tests verifying default created-by behavior and caller override precedence. |
| src/iceberg/test/puffin_deletion_vector_test.cc | New tests for DV blob framing (round-trip + corruption cases) and Puffin end-to-end DV read/write. |
| src/iceberg/test/meson.build | Registers new DV-related test sources/targets in Meson. |
| src/iceberg/test/deletion_vector_writer_test.cc | New end-to-end tests for DeletionVectorWriter ↔ DeleteLoader DV round trips and option/state guards. |
| src/iceberg/test/delete_loader_test.cc | Adds DV load tests (load, skip mismatched referenced file, cardinality mismatch, mixed DV + position deletes). |
| src/iceberg/test/delete_filter_test.cc | Adds ComputeAliveRows DV end-to-end test and updates expected error kinds. |
| src/iceberg/test/CMakeLists.txt | Registers new DV-related tests in CMake. |
| src/iceberg/puffin/puffin_writer.cc | Auto-populates created-by property when missing. |
| src/iceberg/puffin/meson.build | Installs new Puffin DV header in Meson. |
| src/iceberg/puffin/deletion_vector.h | New public API for DV blob serialize/deserialize + spec-compliant Blob creation. |
| src/iceberg/puffin/deletion_vector.cc | Implements DV blob framing, magic/CRC validation, and blob construction. |
| src/iceberg/meson.build | Adds new DV source files to the Meson build. |
| src/iceberg/data/meson.build | Installs deletion_vector_writer.h in Meson. |
| src/iceberg/data/deletion_vector_writer.h | New writer API for emitting DV blobs into a Puffin file and returning DataFile metadata. |
| src/iceberg/data/deletion_vector_writer.cc | Implements DV accumulation per referenced file, optimization, Puffin blob emission, and metadata creation. |
| src/iceberg/data/delete_loader.cc | Implements DV loading path: reads referenced bytes, validates, deserializes, and applies positions to the index. |
| src/iceberg/CMakeLists.txt | Adds new DV source files to the CMake build. |
feat(puffin): support deletion-vector-v1 blob read/write
Add end-to-end support for the `deletion-vector-v1` Puffin blob type on top
of the existing PuffinReader/PuffinWriter and RoaringPositionBitmap.
Encoding (puffin/deletion_vector.{h,cc}):
- DV blob framing: 4-byte big-endian length, 0xD1D33964 magic, the portable
  Roaring vector, and a trailing big-endian CRC-32. Reads validate the magic
  and checksum.
- Blob construction: row-position metadata column id,
  snapshot-id/sequence-number = -1, no compression, and the required
  referenced-data-file/cardinality properties.
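For reviewers who want the framing spelled out, here is a minimal standalone sketch of the layout described above. It is not the code in puffin/deletion_vector.cc: `FrameDv`, `UnframeDv`, `Crc32` and the helper names are hypothetical. It also assumes, following the Iceberg Puffin spec, that the length field counts the magic plus the vector and that the CRC-32 covers the same bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical names; a sketch of the framing, not the PR's API.
constexpr uint32_t kDvMagic = 0xD1D33964u;  // written as bytes D1 D3 39 64

// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
uint32_t Crc32(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

void AppendBigEndian32(std::vector<uint8_t>& out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<uint8_t>(v >> shift));
  }
}

uint32_t ReadBigEndian32(const uint8_t* p) {
  return (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
         (uint32_t{p[2]} << 8) | uint32_t{p[3]};
}

// Layout: length | magic | roaring bytes | crc.
// Assumption: length and CRC both cover magic + roaring bytes.
std::vector<uint8_t> FrameDv(const std::vector<uint8_t>& roaring_bytes) {
  std::vector<uint8_t> out;
  const size_t covered = 4 + roaring_bytes.size();
  AppendBigEndian32(out, static_cast<uint32_t>(covered));
  AppendBigEndian32(out, kDvMagic);
  out.insert(out.end(), roaring_bytes.begin(), roaring_bytes.end());
  AppendBigEndian32(out, Crc32(out.data() + 4, covered));
  return out;
}

// Returns the roaring bytes, or nullopt on truncation, size mismatch,
// bad magic, or bad checksum.
std::optional<std::vector<uint8_t>> UnframeDv(const std::vector<uint8_t>& blob) {
  if (blob.size() < 12) return std::nullopt;  // length + magic + crc
  const uint32_t length = ReadBigEndian32(blob.data());
  if (length < 4 || uint64_t{length} + 8 != blob.size()) return std::nullopt;
  if (ReadBigEndian32(blob.data() + 4) != kDvMagic) return std::nullopt;
  const uint32_t stored = ReadBigEndian32(blob.data() + 4 + length);
  if (Crc32(blob.data() + 4, length) != stored) return std::nullopt;
  return std::vector<uint8_t>(blob.begin() + 8, blob.end() - 4);
}
```

A 3-byte payload produces a 15-byte frame whose length field is 7, and flipping any payload byte makes `UnframeDv` return `std::nullopt`.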
Writing (data/deletion_vector_writer.{h,cc}):
- `DeletionVectorWriter` accumulates deleted positions per referenced data
  file, run-length encodes each bitmap before serializing, writes one blob per
  data file, and produces a DataFile per blob carrying
  content_offset/content_size_in_bytes, referenced_data_file and record_count
  for manifest registration.
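The per-file bookkeeping above can be illustrated with a small standalone sketch. It is not the `DeletionVectorWriter` in this PR: `DvAccumulator`, `DvFileEntry` and the `blob_size` callback are hypothetical, the callback stands in for real Roaring serialization, and blobs are assumed to be laid out back to back starting at a caller-supplied offset.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the DV-related fields of a DataFile.
struct DvFileEntry {
  std::string referenced_data_file;
  int64_t content_offset;
  int64_t content_size_in_bytes;
  int64_t record_count;  // number of deleted positions in the blob
};

// Hypothetical accumulator: one position set, and later one blob, per data file.
class DvAccumulator {
 public:
  void Delete(const std::string& data_file, int64_t pos) {
    positions_[data_file].insert(pos);  // duplicates collapse
  }

  // blob_size is a placeholder for "serialize and report the byte count".
  // Assumption: blobs are written contiguously from first_blob_offset.
  std::vector<DvFileEntry> Finish(
      int64_t first_blob_offset,
      const std::function<int64_t(const std::set<int64_t>&)>& blob_size) const {
    std::vector<DvFileEntry> out;
    int64_t offset = first_blob_offset;
    for (const auto& [file, positions] : positions_) {
      const int64_t size = blob_size(positions);
      out.push_back({file, offset, size, static_cast<int64_t>(positions.size())});
      offset += size;
    }
    return out;
  }

 private:
  std::map<std::string, std::set<int64_t>> positions_;
};
```

Each returned entry carries exactly the fields the reader later needs to locate and cross-check its blob.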
Reading (data/delete_loader.cc):
- Read the blob bytes at content_offset/content_size_in_bytes, validate
  referenced_data_file, the 2GB limit and cardinality == record_count, then
  apply positions to the index.
This wires DV deletes into DeleteFilter::ComputeAliveRows.
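The read-side checks can be sketched as two small predicates. This is an illustration only, not the code in data/delete_loader.cc: `DvEntry`, `DvCheck`, `CheckBeforeRead` and `CheckAfterDecode` are hypothetical names, and the exact bound used for the 2GB limit (`INT32_MAX` bytes) is an assumption.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for the DV fields of a manifest entry.
struct DvEntry {
  std::string referenced_data_file;
  int64_t content_offset;
  int64_t content_size_in_bytes;
  int64_t record_count;
};

enum class DvCheck {
  kOk,
  kSkipOtherFile,        // DV belongs to a different data file
  kInvalidRange,
  kTooLarge,
  kCardinalityMismatch,
};

// Checks that can run before any bytes are read.
DvCheck CheckBeforeRead(const DvEntry& e, const std::string& data_file) {
  if (e.referenced_data_file != data_file) return DvCheck::kSkipOtherFile;
  if (e.content_offset < 0 || e.content_size_in_bytes < 0) {
    return DvCheck::kInvalidRange;
  }
  // Assumption: the 2GB limit means the blob size must fit in a signed 32-bit int.
  if (e.content_size_in_bytes > std::numeric_limits<int32_t>::max()) {
    return DvCheck::kTooLarge;
  }
  return DvCheck::kOk;
}

// Check that runs after the bitmap has been deserialized.
DvCheck CheckAfterDecode(const DvEntry& e, int64_t bitmap_cardinality) {
  return bitmap_cardinality == e.record_count ? DvCheck::kOk
                                              : DvCheck::kCardinalityMismatch;
}
```

Splitting the checks this way keeps the cheap metadata validation ahead of the I/O, with the cardinality comparison as the last gate before positions reach the index.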
Puffin writer convenience:
- Auto-populate the `created-by` property when the caller does not provide
  one, matching the Java writer.
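The precedence rule (caller value wins, default fills the gap) amounts to an insert-if-absent. A minimal sketch, assuming file properties are a string map; `WithDefaultCreatedBy` and the default value passed in are hypothetical, not the PuffinWriter API:

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: fills in "created-by" only when the caller left it unset.
std::map<std::string, std::string> WithDefaultCreatedBy(
    std::map<std::string, std::string> properties,
    const std::string& default_value) {
  // try_emplace leaves an existing entry untouched.
  properties.try_emplace(std::string("created-by"), default_value);
  return properties;
}
```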
Behavior matches the Java implementation (BaseDVFileWriter / BaseDeleteLoader /
BitmapPositionDeleteIndex): magic, byte order, CRC coverage, blob fields, RLE
and the read-side validations are all aligned.
Tests:
truncated blob, size mismatch).
mixed DV + position-delete loading.
file over Arrow FileIO.
Not included (deletion-vector upper layer, follow-up): merging with previously
written DVs (compaction), the bulk delete(PositionDeleteIndex) API, and
per-data-file partition/spec in a single DV file.