Skip to content

feat: implement merge multiple DVs#708

Open
HuaHuaY wants to merge 1 commit into
apache:mainfrom
HuaHuaY:impl_merge_dv
Open

feat: implement merge multiple DVs#708
HuaHuaY wants to merge 1 commit into
apache:mainfrom
HuaHuaY:impl_merge_dv

Conversation

@HuaHuaY

@HuaHuaY HuaHuaY commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

No description provided.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements support for merging multiple deletion vectors (DVs) that reference the same data file by rewriting them into merged Puffin DV blobs, enabling format v3 workflows that previously returned NotImplemented.

Changes:

  • Add DV-merge pipeline in MergingSnapshotUpdate that unions multiple DV bitmaps per referenced data file and writes merged DV blobs into a Puffin container.
  • Introduce TransactionContext::NewDataLocation(...) to generate table-consistent data-file output locations for merged DV Puffin files.
  • Promote Puffin + RoaringPositionBitmap components from iceberg_data into the core iceberg library (exports + build wiring) and add a test covering DV merging.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/iceberg/update/merging_snapshot_update.h Adds DV merge APIs and tracking for merged DV outputs.
src/iceberg/update/merging_snapshot_update.cc Implements DV merging via Puffin reader/writer and bitmap union; adds cleanup handling for merged DV outputs.
src/iceberg/transaction.h Adds TransactionContext::NewDataLocation API.
src/iceberg/transaction.cc Implements NewDataLocation via LocationProvider.
src/iceberg/test/merging_snapshot_update_test.cc Adds an integration-style test validating duplicate DV merging behavior and correctness.
src/iceberg/puffin/puffin_writer.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/puffin/puffin_reader.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/puffin/puffin_format.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/puffin/json_serde_internal.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/puffin/file_metadata.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/deletes/roaring_position_bitmap.h Moves export macro to core ICEBERG_EXPORT.
src/iceberg/meson.build Moves Puffin + bitmap sources/deps into iceberg lib; adjusts croaring linkage.
src/iceberg/CMakeLists.txt Moves Puffin + bitmap sources into iceberg lib; adjusts croaring linkage propagation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
@HuaHuaY HuaHuaY force-pushed the impl_merge_dv branch 2 times, most recently from 284b077 to f25cbe1 Compare June 22, 2026 08:10
Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
Comment thread src/iceberg/meson.build
'catalog/session_catalog.cc',
'catalog/session_context.cc',
'delete_file_index.cc',
'deletes/roaring_position_bitmap.cc',

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we need to move it out of the deletes subdirectory.

ICEBERG_ASSIGN_OR_RAISE(auto blob, reader->ReadBlob(*blob_it));
ICEBERG_ASSIGN_OR_RAISE(
auto dv_bitmap,
RoaringPositionBitmap::Deserialize(std::string_view(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java/spec-compliant deletion-vector-v1 blobs are not raw Roaring payloads: they include the 4-byte length, magic bytes, bitmap, and CRC, and Java validates CRC/cardinality during deserialize. PuffinReader::ReadBlob returns the stored blob bytes unchanged, so passing them directly to RoaringPositionBitmap::Deserialize() will reject or misread DVs written by Java. The writer below has the symmetric issue because bitmap.Serialize() writes only the raw bitmap. Please add DV-level framing/parsing here.


ICEBERG_ASSIGN_OR_RAISE(
auto blob_metadata,
writer.Write(puffin::Blob{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Puffin blob metadata does not match Java or the Puffin DV spec. Java BaseDVFileWriter writes fields=[ROW_POSITION], snapshot-id=-1, sequence-number=-1, and referenced-data-file/cardinality properties. Writing empty fields, concrete snapshot/sequence values, and no properties produces invalid DV footer metadata for other implementations. Please mirror the Java DV writer metadata.

@wgtmac

wgtmac commented Jun 24, 2026

Copy link
Copy Markdown
Member

@zhaoxuan1994 is working on adding puffin reader/writer for v3 DVs. Let's wait for it and then rebase on it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants