Skip to content

Extract DeletionVector logic from PuffinFile#3491

Merged
sungwy merged 4 commits into
apache:mainfrom
ebyhr:ebi/puffin-refactoring
Jun 25, 2026
Merged

Extract DeletionVector logic from PuffinFile#3491
sungwy merged 4 commits into
apache:mainfrom
ebyhr:ebi/puffin-refactoring

Conversation

@ebyhr

@ebyhr ebyhr commented Jun 13, 2026

Copy link
Copy Markdown
Member

Rationale for this change

PuffinFile handles two tasks: format parsing (magic bytes, footer, blobs) and deletion vector domain logic (bitmap deserialization and PyArrow conversion).
This will become problematic when we introduce support for the NDV apache-datasketches-theta-v1 blob in the future.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes - PuffinFile class user needs to call DeletionVector.

@ebyhr ebyhr force-pushed the ebi/puffin-refactoring branch 3 times, most recently from cf9a2ce to 374d25c Compare June 14, 2026 11:49

@sungwy sungwy left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR @ebyhr and I agree this is a good cleanup.

I added some comments

Comment thread pyiceberg/table/deletion_vector.py Outdated
Comment thread pyiceberg/table/puffin.py
@ebyhr

ebyhr commented Jun 22, 2026

Copy link
Copy Markdown
Member Author

Thanks for your review! Addressed comments.

Comment thread pyiceberg/table/puffin.py Outdated
@sungwy sungwy requested a review from kevinjqliu June 22, 2026 21:47
@sungwy

sungwy commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Thanks for this contribution @ebyhr . I've approved it, and I'll leave it open for a day for any additional reviews before we merge.

@moomindani

Copy link
Copy Markdown

Thanks for splitting this out — the read path looks behavior-preserving and the separation is clean.

For the write side, we'd like to coordinate before building on top of this: once it lands we're planning to rebase #3474 (the DV writer, currently PuffinWriter → to be renamed DeletionVectorWriter) on top of it, moving the writer along with its _serialize_bitmaps helper and the DV-specific constants into deletion_vector.py and reusing MAX_JAVA_SIGNED / PROPERTY_REFERENCED_DATA_FILE from here.

Given that, would it make sense to decide now whether the write/serialize path should live on DeletionVector itself (mirroring from_puffin_file / to_vector with a serialize counterpart) or as a separate DeletionVectorWriter class?

Our preference would be to mirror the read-side split — keep the DV bitmap serialization on DeletionVector (as the counterpart to from_puffin_file / to_vector) and leave the generic Puffin-file assembly as a format-level writer in puffin.py, so the write path scales to the future NDV blob the same way the read path now does. That said, we're happy either way.

Comment thread pyiceberg/table/deletion_vector.py Outdated
Comment on lines +32 to +33
class DeletionVector:
_deletion_vectors: dict[str, list[BitMap]]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ebyhr - I thought about this a bit more and I think we'd benefit from revisiting the class design.

A single PuffinFile can hold multiple DVs, and that's read correctly here. But the DeletionVector class ends up encapsulating a map of deletion vectors (keyed by referenced-data-file), which I find a little counterintuitive given the name.

Would it make sense to have DeletionVector map 1-1 to a single DV blob instead? That'd line up with Java, where one DV = one data file's PositionDeleteIndex, loaded via the entry's content_offset/content_size. The from_puffin_file function could then be moved to be a standalone function and return a list of DeletionVectors. WDYT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me. My initial motivation was to create a helper class similar to DVUtil in Iceberg Java.

I've updated this PR. Please let me know if the last change doesn't align with your intention.

@ebyhr ebyhr requested a review from sungwy June 24, 2026 20:52
@sungwy

sungwy commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Thanks for splitting this out — the read path looks behavior-preserving and the separation is clean.

For the write side, we'd like to coordinate before building on top of this: once it lands we're planning to rebase #3474 (the DV writer, currently PuffinWriter → to be renamed DeletionVectorWriter) on top of it, moving the writer along with its _serialize_bitmaps helper and the DV-specific constants into deletion_vector.py and reusing MAX_JAVA_SIGNED / PROPERTY_REFERENCED_DATA_FILE from here.

Given that, would it make sense to decide now whether the write/serialize path should live on DeletionVector itself (mirroring from_puffin_file / to_vector with a serialize counterpart) or as a separate DeletionVectorWriter class?

Our preference would be to mirror the read-side split — keep the DV bitmap serialization on DeletionVector (as the counterpart to from_puffin_file / to_vector) and leave the generic Puffin-file assembly as a format-level writer in puffin.py, so the write path scales to the future NDV blob the same way the read path now does. That said, we're happy either way.

Hi @moomindani - thanks for flagging this, and for thinking about how the two paths line up. I'm leaning the same way you are: keep the DV bitmap serialization on DeletionVector as the counterpart to from_puffin_file / to_vector, and leave the generic Puffin assembly (agreed PuffinWriter sounds like a good name) as a format-level writer in puffin.py. That keeps the write path scaling to the future NDV blob the same way the read path now does. Happy to look at #3474 once this lands. Does that work for you?

@moomindani

Copy link
Copy Markdown

Works for me, thanks @sungwy. The 1-1 DeletionVector shape lines up nicely with the writer side too — set_blob maps to a single DV, so the serialize path can sit on DeletionVector as the counterpart to to_vector. I'll build on this once it lands.

@sungwy sungwy merged commit 01a86f9 into apache:main Jun 25, 2026
17 checks passed
@sungwy

sungwy commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

merged. Thanks a lot @ebyhr , and thanks @moomindani for the review!

moomindani added a commit to moomindani/iceberg-python that referenced this pull request Jun 25, 2026
…inWriter

Following the design agreed in apache#3491, split the write path into:
- DeletionVector serialization (domain): from_positions(), _serialize_bitmap()
  and to_blob() as the write-side counterparts of _deserialize_bitmap() and
  to_vector(), reusing MAX_JAVA_SIGNED / PROPERTY_REFERENCED_DATA_FILE.
- A generic, blob-agnostic PuffinWriter (format) that assembles a Puffin file
  from PuffinBlob payloads, mirroring PuffinFile on the read side and scaling
  to future blob types (e.g. apache-datasketches-theta-v1).

PuffinWriter is a context manager (consistent with ManifestWriter); the file
size is available via len(output_file) after exit. The created-by default uses
pyiceberg.__version__.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants