Skip to content

gh-152140: Replace _Extra class with ZipFile._strip_extra_fields()#152141

Open
danny0838 wants to merge 4 commits into
python:mainfrom
danny0838:gh-152140
Open

gh-152140: Replace _Extra class with ZipFile._strip_extra_fields()#152141
danny0838 wants to merge 4 commits into
python:mainfrom
danny0838:gh-152140

Conversation

@danny0838

@danny0838 danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

The _Extra class was over-engineered. Its only active usage was its strip() method, called by ZipFile._write_end_record() to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by _decodeExtra() or other methods efficiently. Additionally, its split() method called _Extra directly rather than utilizing cls, which was a suboptimal pattern that hindered extensibility.

Remove the _Extra class entirely and reimplement it as a private static method _strip_extra_fields() that processes a bytearray inside ZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.

Additionally, bind this method as a class property in the tests to simplify the code.

…elds()`

The `_Extra` class was over-engineered. Its only active usage was its
`strip()` method, called by `ZipFile._write_end_record()` to provide
the stripping logic for ZIP64 fields.  The context-dependent nature of
extra fields also made it difficult to be reused by `_decodeExtra()`
or other methods efficiently.  Additionally, its `split()` method
called `_Extra` directly rather than utilizing `cls`, which was a
suboptimal pattern that hindered extensibility.

Remove the `_Extra` class entirely and reimplement it as a private
static method `_strip_extra_fields()` that processes a bytearray
inside `ZipFile`, positioned directly beneath its caller.  This
eliminates dead and suboptimal code, achieves clean encapsulation
and code locality, and improves performance by avoiding temporary
class allocations.

@eendebakpt eendebakpt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

  • Early exit for empty extra data: does that occur? If so, how often
  • _strip_extra_fields now returns a bytarray instead of bytes
  • The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

Comment thread Lib/zipfile/__init__.py Outdated
Comment thread Lib/zipfile/__init__.py
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen

# keep remaining trailing bytes (e.g. truncated or malformed data)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this added? In the old read_one no data was returned for trailing data with < 4 bytes.

@danny0838 danny0838 Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original behavior since read_one returns wrapped original bytes with id=None (and as a result will never be stripped) if the input is less than 4 bytes. The current code just add a comment to make it more explicit.

You can check the tests (especially the test_too_short) to confirm that the behavior is not changed.

@danny0838

danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

  • Early exit for empty extra data: does that occur? If so, how often

It depends. Other ZIP tools may add Linux permission or extended date, while Python zipfile doesn't create extra for non-zip64 entries and ignores most extra fields, and it should be very often for the latter case.

  • _strip_extra_fields now returns a bytarray instead of bytes

Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.

  • The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

This is NOT a behavior change, as responsed in another comment.

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing?

Since `struct.unpack_from` is already an offset-based approach at the
C level, and `result.extend` requires data copy anyway, avoiding a
`memoryview` wrapper prevents redundant Python object allocation and
pointer-shifting overhead, yielding optimal runtime memory footprint
and CPU performance.
@danny0838

danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Here's the benchmark:

import timeit
import struct
import random

random.seed(42)

class _Extra(bytes):
    FIELD_STRUCT = struct.Struct('<HH')

    def __new__(cls, val, id=None):
        return super().__new__(cls, val)

    def __init__(self, val, id=None):
        self.id = id

    @classmethod
    def read_one(cls, raw):
        try:
            xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
        except struct.error:
            xid = None
            xlen = 0
        return cls(raw[:4+xlen], xid), raw[4+xlen:]

    @classmethod
    def split(cls, data):
        # use memoryview for zero-copy slices
        rest = memoryview(data)
        while rest:
            extra, rest = _Extra.read_one(rest)
            yield extra

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        return b''.join(
            ex
            for ex in cls.split(data)
            if ex.id not in xids
        )

def strip_extra_fields(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

s = struct.Struct("<HH")
valid_extra_data = (
    s.pack(0x0001, 24) + b"X" * 24 +  # ZIP64 Extended Information
    s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
    s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
    s.pack(0x0001, 16) + b"Y" * 16 +  # ZIP64
    s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""

dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)

def run_legacy():
    for data in dataset:
        _Extra.strip(data, (1,))

def run_new():
    for data in dataset:
        strip_extra_fields(data, (1,))

def run_new2():
    for data in dataset:
        strip_extra_fields_no_early_return(data, (1,))

def run_new3():
    for data in dataset:
        strip_extra_fields_no_mv(data, (1,))

def run_new4():
    for data in dataset:
        strip_extra_fields_no_mv_no_early_return(data, (1,))

NUMBER = 10000

legacy_time = timeit.timeit(stmt="run_legacy()", globals=globals(), number=NUMBER)
new_time = timeit.timeit(stmt="run_new()", globals=globals(), number=NUMBER)
new_time2 = timeit.timeit(stmt="run_new2()", globals=globals(), number=NUMBER)
new_time3 = timeit.timeit(stmt="run_new3()", globals=globals(), number=NUMBER)
new_time4 = timeit.timeit(stmt="run_new4()", globals=globals(), number=NUMBER)

print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
print(f"Legacy (_Extra class): {legacy_time:.4f} seconds")
print(f"New (static method): {new_time:.4f} seconds")
print(f"New2 (no early return): {new_time2:.4f} seconds")
print(f"New3 (no memoryview): {new_time3:.4f} seconds")
print(f"New4 (no memoryview, no early return): {new_time4:.4f} seconds")
print(f"Speedup: {legacy_time / new_time3:.2f}x")


import tracemalloc

def profile_memory(func_to_run):
    tracemalloc.start()
    for _ in range(50):
        func_to_run()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_legacy = profile_memory(run_legacy)
peak_new = profile_memory(run_new)
peak_new2 = profile_memory(run_new2)
peak_new3 = profile_memory(run_new3)
peak_new4 = profile_memory(run_new4)

print(f"\n=== Memory Peak Results ===")
print(f"Legacy Peak Memory: {peak_legacy:,} bytes")
print(f"New Peak Memory: {peak_new:,} bytes")
print(f"New2 Peak Memory: {peak_new2:,} bytes")
print(f"New3 Peak Memory: {peak_new3:,} bytes")
print(f"New4 Peak Memory: {peak_new4:,} bytes")
print(f"Memory Saved: {((peak_legacy - peak_new3) / peak_legacy) * 100:.1f}%")

and the results (run on Windows 10, Python 3.16.0a0 at commit 4eae06b):

=== Benchmark Results (2,000,000 total invocations) ===
=== Benchmark Results (2,000,000 total invocations) ===
Dataset Profile: 50% Valid Extra Data, 50% Empty Data
Legacy (_Extra class): 12.1356 seconds
New (static method): 3.5566 seconds
New2 (no early return): 3.6742 seconds
New3 (no memoryview): 3.0106 seconds
New4 (no memoryview, no early return): 3.1620 seconds
Speedup: 4.03x

=== Memory Peak Results ===
Legacy Peak Memory: 2,408 bytes
New Peak Memory: 677 bytes
New2 Peak Memory: 677 bytes
New3 Peak Memory: 245 bytes
New4 Peak Memory: 245 bytes
Memory Saved: 89.8%

The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), and removing memoryview further improves performance (speedup 4.05x, memory saved 89.8%, compared with legacy implementation). Accordingly, I removed the memoryview usage in the latest commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants