feat(puffin): support deletion-vector-v1 blob read/write #777

Open
zhaoxuan1994 wants to merge 1 commit into apache:main from zhaoxuan1994:feat/puffin-deletion-vector-v1

@zhaoxuan1994

feat(puffin): support deletion-vector-v1 blob read/write

Add end-to-end support for the deletion-vector-v1 Puffin blob type on top
of the existing PuffinReader/PuffinWriter and RoaringPositionBitmap.

Encoding (puffin/deletion_vector.{h,cc}):

  • SerializeDeletionVectorBlob / DeserializeDeletionVectorBlob implement the
    DV blob framing: 4-byte big-endian length, 0xD1D33964 magic, the portable
    Roaring vector, and a trailing big-endian CRC-32. Reads validate the magic
    and checksum.
  • MakeDeletionVectorBlob builds a spec-compliant Blob: fields set to the
    row-position metadata column id, snapshot-id/sequence-number = -1, no
    compression, and the required referenced-data-file/cardinality properties.
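The framing described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `FrameDeletionVector` and `UnframeDeletionVector` are hypothetical names, the Roaring payload is treated as opaque bytes, and the sketch assumes that both the length prefix and the CRC-32 cover the magic plus the vector (which is how I read the Iceberg spec).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Magic bytes in wire order (assumed from the 0xD1D33964 value in the description).
constexpr std::array<uint8_t, 4> kDvMagic = {0xD1, 0xD3, 0x39, 0x64};

// Plain bitwise CRC-32 using the reflected polynomial 0xEDB88320 (zlib-compatible).
uint32_t Crc32(const uint8_t* data, size_t size) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

void AppendBigEndian32(std::vector<uint8_t>& out, uint32_t value) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<uint8_t>(value >> shift));
  }
}

uint32_t ReadBigEndian32(const uint8_t* p) {
  return (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
         (uint32_t{p[2]} << 8) | uint32_t{p[3]};
}

// Hypothetical helper: wraps already-serialized Roaring bytes in the DV frame.
std::vector<uint8_t> FrameDeletionVector(const std::vector<uint8_t>& vector_bytes) {
  assert(vector_bytes.size() <= 0xFFFFFFFFu - kDvMagic.size());
  std::vector<uint8_t> out;
  // Assumption: the length counts the magic plus the vector bytes.
  AppendBigEndian32(out, static_cast<uint32_t>(kDvMagic.size() + vector_bytes.size()));
  out.insert(out.end(), kDvMagic.begin(), kDvMagic.end());
  out.insert(out.end(), vector_bytes.begin(), vector_bytes.end());
  // Assumption: the CRC covers everything after the length prefix.
  AppendBigEndian32(out, Crc32(out.data() + 4, out.size() - 4));
  return out;
}

// Hypothetical helper: returns the vector bytes, or nullopt if the frame is invalid.
std::optional<std::vector<uint8_t>> UnframeDeletionVector(const std::vector<uint8_t>& blob) {
  if (blob.size() < 12) return std::nullopt;  // length + magic + CRC
  const uint32_t length = ReadBigEndian32(blob.data());
  if (length < 4 || static_cast<size_t>(length) != blob.size() - 8) return std::nullopt;
  if (!std::equal(kDvMagic.begin(), kDvMagic.end(), blob.begin() + 4)) return std::nullopt;
  const uint32_t stored_crc = ReadBigEndian32(blob.data() + 4 + length);
  if (Crc32(blob.data() + 4, length) != stored_crc) return std::nullopt;
  return std::vector<uint8_t>(blob.begin() + 8, blob.begin() + 4 + length);
}
```

A frame built this way is always 12 bytes longer than its payload, and flipping any payload byte makes the checksum comparison fail on read.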

Writing (data/deletion_vector_writer.{h,cc}):

  • DeletionVectorWriter accumulates deleted positions per data file, run-length
    encodes each bitmap before serializing, writes one blob per data file, and
    produces a DataFile per blob carrying content_offset/content_size_in_bytes,
    referenced_data_file and record_count for manifest registration.
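To illustrate the accumulate-then-compress flow described above, here is a rough sketch. It is not the PR's implementation: `PositionAccumulator` is a hypothetical stand-in that uses `std::set` instead of `RoaringPositionBitmap`, and the `(start, length)` pairs only convey the idea of run-length encoding, not Roaring's actual run-container layout.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the writer's per-data-file state.
class PositionAccumulator {
 public:
  // Records one deleted row position for the given data file; duplicates are ignored.
  void Delete(const std::string& data_file, uint64_t position) {
    positions_[data_file].insert(position);
  }

  // Number of distinct deleted positions, which would become record_count.
  uint64_t Cardinality(const std::string& data_file) const {
    auto it = positions_.find(data_file);
    return it == positions_.end() ? 0 : static_cast<uint64_t>(it->second.size());
  }

  // Collapses the sorted positions into (start, run_length) pairs.
  std::vector<std::pair<uint64_t, uint64_t>> Runs(const std::string& data_file) const {
    std::vector<std::pair<uint64_t, uint64_t>> runs;
    auto it = positions_.find(data_file);
    if (it == positions_.end()) return runs;
    for (uint64_t pos : it->second) {
      if (!runs.empty() && runs.back().first + runs.back().second == pos) {
        ++runs.back().second;  // extends the current run
      } else {
        runs.emplace_back(pos, uint64_t{1});  // starts a new run
      }
    }
    return runs;
  }

 private:
  std::map<std::string, std::set<uint64_t>> positions_;
};
```

Keeping one entry per data file mirrors the one-blob-per-data-file layout, so each entry's cardinality maps directly onto the `record_count` of the resulting `DataFile`.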

Reading (data/delete_loader.cc):

  • Implement DeleteLoader::LoadDV: read the blob bytes referenced by
    content_offset/content_size_in_bytes, validate referenced_data_file, the 2GB
    limit and cardinality == record_count, then apply positions to the index.
    This wires DV deletes into DeleteFilter::ComputeAliveRows.
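The read-side checks listed above might look roughly like the sketch below. All names here (`DvFileInfo`, `DvCheck`, `CheckBeforeRead`, `CheckCardinality`) are hypothetical, the 2GB limit is assumed to mean "fits in a signed 32-bit size", and treating a referenced-file mismatch as its own outcome rather than a hard error is also an assumption.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical subset of the DataFile fields that a DV entry carries.
struct DvFileInfo {
  std::string referenced_data_file;
  int64_t content_offset = 0;
  int64_t content_size_in_bytes = 0;
  int64_t record_count = 0;
};

enum class DvCheck { kOk, kOtherDataFile, kInvalidRange, kTooLarge, kCardinalityMismatch };

// Checks that can run before any blob bytes are read.
DvCheck CheckBeforeRead(const DvFileInfo& dv, const std::string& data_file_path) {
  if (dv.referenced_data_file != data_file_path) return DvCheck::kOtherDataFile;
  if (dv.content_offset < 0 || dv.content_size_in_bytes < 0) return DvCheck::kInvalidRange;
  // Assumption: "2GB limit" means the blob size must fit in a signed 32-bit integer.
  if (dv.content_size_in_bytes > std::numeric_limits<int32_t>::max()) {
    return DvCheck::kTooLarge;
  }
  return DvCheck::kOk;
}

// Check that runs after the bitmap has been deserialized.
DvCheck CheckCardinality(const DvFileInfo& dv, uint64_t bitmap_cardinality) {
  if (dv.record_count < 0 ||
      static_cast<uint64_t>(dv.record_count) != bitmap_cardinality) {
    return DvCheck::kCardinalityMismatch;
  }
  return DvCheck::kOk;
}
```

Splitting the checks this way lets the cheap metadata validation reject a file before any I/O, while the cardinality comparison needs the decoded bitmap.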

Puffin writer convenience:

  • PuffinWriter::Make now fills a default created-by property when the caller
    does not provide one, matching the Java writer.
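The default-with-caller-precedence behavior can be sketched in a few lines. `WithDefaultCreatedBy` and the `kDefaultCreatedBy` value are hypothetical; the actual default string used by the PR is not stated here.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical placeholder; the real default value is not given in the description.
constexpr const char* kDefaultCreatedBy = "example-writer 0.0.0";

// Adds created-by only when the caller has not already supplied one.
std::map<std::string, std::string> WithDefaultCreatedBy(
    std::map<std::string, std::string> properties) {
  // try_emplace leaves an existing entry untouched, so the caller's value wins.
  properties.try_emplace(std::string("created-by"), kDefaultCreatedBy);
  return properties;
}
```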

Behavior matches the Java implementation (BaseDVFileWriter / BaseDeleteLoader /
BitmapPositionDeleteIndex): magic, byte order, CRC coverage, blob fields, RLE
and the read-side validations are all aligned.

Tests:

  • DV blob framing round-trip and error cases (bad magic, corrupted CRC,
    truncated blob, size mismatch).
  • DeletionVectorWriter -> DeleteLoader write/read round trip and guard cases.
  • delete_loader: load DV, referenced-file filtering, cardinality mismatch,
    mixed DV + position-delete loading.
  • delete_filter: end-to-end ComputeAliveRows filtering with a real Puffin DV
    file over Arrow FileIO.
  • PuffinWriter created-by default and caller-precedence.

Not included (left for a follow-up on the deletion-vector upper layer): merging
with previously written DVs (compaction), the bulk delete(PositionDeleteIndex)
API, and per-data-file partition/spec handling within a single DV file.

Copilot AI review requested due to automatic review settings June 24, 2026 08:22

Copilot AI left a comment

Pull request overview

Adds end-to-end support for Iceberg Puffin deletion-vector-v1 blobs, including framing/validation, write-side generation of DV Puffin files, and read-side application of DVs through DeleteLoader into DeleteFilter row filtering.

Changes:

  • Implement deletion-vector-v1 blob framing (length + magic + Roaring portable bytes + CRC32) plus spec-compliant Blob construction helpers.
  • Add DeletionVectorWriter to emit one DV blob per referenced data file and return manifest-ready DataFile metadata; wire DV loading into DeleteLoader.
  • Extend Puffin writer metadata defaults (created-by) and add comprehensive DV-focused unit tests + build system wiring.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/iceberg/test/puffin_reader_writer_test.cc | Adds tests verifying default created-by behavior and caller override precedence. |
| src/iceberg/test/puffin_deletion_vector_test.cc | New tests for DV blob framing (round-trip + corruption cases) and Puffin end-to-end DV read/write. |
| src/iceberg/test/meson.build | Registers new DV-related test sources/targets in Meson. |
| src/iceberg/test/deletion_vector_writer_test.cc | New end-to-end tests for DeletionVectorWriter -> DeleteLoader DV round trips and option/state guards. |
| src/iceberg/test/delete_loader_test.cc | Adds DV load tests (load, skip mismatched referenced file, cardinality mismatch, mixed DV + position deletes). |
| src/iceberg/test/delete_filter_test.cc | Adds ComputeAliveRows DV end-to-end test and updates expected error kinds. |
| src/iceberg/test/CMakeLists.txt | Registers new DV-related tests in CMake. |
| src/iceberg/puffin/puffin_writer.cc | Auto-populates created-by property when missing. |
| src/iceberg/puffin/meson.build | Installs new Puffin DV header in Meson. |
| src/iceberg/puffin/deletion_vector.h | New public API for DV blob serialize/deserialize + spec-compliant Blob creation. |
| src/iceberg/puffin/deletion_vector.cc | Implements DV blob framing, magic/CRC validation, and blob construction. |
| src/iceberg/meson.build | Adds new DV source files to the Meson build. |
| src/iceberg/data/meson.build | Installs deletion_vector_writer.h in Meson. |
| src/iceberg/data/deletion_vector_writer.h | New writer API for emitting DV blobs into a Puffin file and returning DataFile metadata. |
| src/iceberg/data/deletion_vector_writer.cc | Implements DV accumulation per referenced file, optimization, Puffin blob emission, and metadata creation. |
| src/iceberg/data/delete_loader.cc | Implements DV loading path: reads referenced bytes, validates, deserializes, and applies positions to the index. |
| src/iceberg/CMakeLists.txt | Adds new DV source files to the CMake build. |

Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
Comment thread src/iceberg/data/deletion_vector_writer.cc
@zhaoxuan1994 force-pushed the feat/puffin-deletion-vector-v1 branch 3 times, most recently from d4e713c to da65ba6 (June 24, 2026 10:51)