gh-152140: Replace `_Extra` class with `ZipFile._strip_extra_fields()` by danny0838 · Pull Request #152141 · python/cpython

danny0838 · 2026-06-25T00:35:16Z

The _Extra class was over-engineered. Its only active usage was its strip() method, called by ZipFile._write_end_record() to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by _decodeExtra() or other methods efficiently. Additionally, its split() method called _Extra directly rather than utilizing cls, which was a suboptimal pattern that hindered extensibility.

Remove the _Extra class entirely and reimplement it as a private static method _strip_extra_fields() that processes a bytearray inside ZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.

Additionally, bind this method as a class property in the tests to simplify the code.

Issue: Tidy and optimize zipfile module #152140

…elds()` The `_Extra` class was over-engineered. Its only active usage was its `strip()` method, called by `ZipFile._write_end_record()` to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by `_decodeExtra()` or other methods efficiently. Additionally, its `split()` method called `_Extra` directly rather than utilizing `cls`, which was a suboptimal pattern that hindered extensibility. Remove the `_Extra` class entirely and reimplement it as a private static method `_strip_extra_fields()` that processes a bytearray inside `ZipFile`, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.

eendebakpt

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

Early exit for empty extra data: does that occur? If so, how often
_strip_extra_fields now returns a bytarray instead of bytes
The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

eendebakpt · 2026-06-25T07:47:03Z

+                result.extend(mv[pos:pos + 4 + xlen])
+            pos += 4 + xlen
+
+        # keep remaining trailing bytes (e.g. truncated or malformed data)


Why was this added? In the old read_one no data was returned for trailing data with < 4 bytes.

This was the original behavior since read_one returns wrapped original bytes with id=None (and as a result will never be stripped) if the input is less than 4 bytes. The current code just add a comment to make it more explicit.

You can check the tests (especially the test_too_short) to confirm that the behavior is not changed.

danny0838 · 2026-06-25T09:06:57Z

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

Early exit for empty extra data: does that occur? If so, how often

It depends. Other ZIP tools may add Linux permission or extended date, while Python zipfile doesn't create extra for non-zip64 entries and ignores most extra fields, and it should be very often for the latter case.

_strip_extra_fields now returns a bytarray instead of bytes

Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.

The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

This is NOT a behavior change, as responsed in another comment.

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing?

Since `struct.unpack_from` is already an offset-based approach at the C level, and `result.extend` requires data copy anyway, avoiding a `memoryview` wrapper prevents redundant Python object allocation and pointer-shifting overhead, yielding optimal runtime memory footprint and CPU performance.

danny0838 · 2026-06-25T11:10:31Z

Here's the benchmark:

import timeit
import struct
import random

random.seed(42)

class _Extra(bytes):
    FIELD_STRUCT = struct.Struct('<HH')

    def __new__(cls, val, id=None):
        return super().__new__(cls, val)

    def __init__(self, val, id=None):
        self.id = id

    @classmethod
    def read_one(cls, raw):
        try:
            xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
        except struct.error:
            xid = None
            xlen = 0
        return cls(raw[:4+xlen], xid), raw[4+xlen:]

    @classmethod
    def split(cls, data):
        # use memoryview for zero-copy slices
        rest = memoryview(data)
        while rest:
            extra, rest = _Extra.read_one(rest)
            yield extra

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        return b''.join(
            ex
            for ex in cls.split(data)
            if ex.id not in xids
        )

def strip_extra_fields(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

s = struct.Struct("<HH")
valid_extra_data = (
    s.pack(0x0001, 24) + b"X" * 24 +  # ZIP64 Extended Information
    s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
    s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
    s.pack(0x0001, 16) + b"Y" * 16 +  # ZIP64
    s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""

dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)

def run_legacy():
    for data in dataset:
        _Extra.strip(data, (1,))

def run_new():
    for data in dataset:
        strip_extra_fields(data, (1,))

def run_new2():
    for data in dataset:
        strip_extra_fields_no_early_return(data, (1,))

def run_new3():
    for data in dataset:
        strip_extra_fields_no_mv(data, (1,))

def run_new4():
    for data in dataset:
        strip_extra_fields_no_mv_no_early_return(data, (1,))

NUMBER = 10000

legacy_time = timeit.timeit(stmt="run_legacy()", globals=globals(), number=NUMBER)
new_time = timeit.timeit(stmt="run_new()", globals=globals(), number=NUMBER)
new_time2 = timeit.timeit(stmt="run_new2()", globals=globals(), number=NUMBER)
new_time3 = timeit.timeit(stmt="run_new3()", globals=globals(), number=NUMBER)
new_time4 = timeit.timeit(stmt="run_new4()", globals=globals(), number=NUMBER)

print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
print(f"Legacy (_Extra class): {legacy_time:.4f} seconds")
print(f"New (static method): {new_time:.4f} seconds")
print(f"New2 (no early return): {new_time2:.4f} seconds")
print(f"New3 (no memoryview): {new_time3:.4f} seconds")
print(f"New4 (no memoryview, no early return): {new_time4:.4f} seconds")
print(f"Speedup: {legacy_time / new_time3:.2f}x")


import tracemalloc

def profile_memory(func_to_run):
    tracemalloc.start()
    for _ in range(50):
        func_to_run()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_legacy = profile_memory(run_legacy)
peak_new = profile_memory(run_new)
peak_new2 = profile_memory(run_new2)
peak_new3 = profile_memory(run_new3)
peak_new4 = profile_memory(run_new4)

print(f"\n=== Memory Peak Results ===")
print(f"Legacy Peak Memory: {peak_legacy:,} bytes")
print(f"New Peak Memory: {peak_new:,} bytes")
print(f"New2 Peak Memory: {peak_new2:,} bytes")
print(f"New3 Peak Memory: {peak_new3:,} bytes")
print(f"New4 Peak Memory: {peak_new4:,} bytes")
print(f"Memory Saved: {((peak_legacy - peak_new3) / peak_legacy) * 100:.1f}%")

and the results (run on Windows 10, Python 3.16.0a0 at commit 4eae06b):

=== Benchmark Results (2,000,000 total invocations) ===
=== Benchmark Results (2,000,000 total invocations) ===
Dataset Profile: 50% Valid Extra Data, 50% Empty Data
Legacy (_Extra class): 12.1356 seconds
New (static method): 3.5566 seconds
New2 (no early return): 3.6742 seconds
New3 (no memoryview): 3.0106 seconds
New4 (no memoryview, no early return): 3.1620 seconds
Speedup: 4.03x

=== Memory Peak Results ===
Legacy Peak Memory: 2,408 bytes
New Peak Memory: 677 bytes
New2 Peak Memory: 677 bytes
New3 Peak Memory: 245 bytes
New4 Peak Memory: 245 bytes
Memory Saved: 89.8%

The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), and removing memoryview further improves performance (speedup 4.05x, memory saved 89.8%, compared with legacy implementation). Accordingly, I removed the memoryview usage in the latest commit.

bedevere-app Bot mentioned this pull request Jun 25, 2026

Tidy and optimize zipfile module #152140

Open

bedevere-app Bot added the awaiting review label Jun 25, 2026

blurb-it Bot and others added 2 commits June 25, 2026 00:42

📜🤖 Added by blurb_it.

f334193

Make strip_extra a direct class property in the test

4eae06b

eendebakpt reviewed Jun 25, 2026

View reviewed changes

Comment thread Lib/zipfile/__init__.py Outdated

eendebakpt reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-152140: Replace `_Extra` class with `ZipFile._strip_extra_fields()`#152141

gh-152140: Replace `_Extra` class with `ZipFile._strip_extra_fields()`#152141
danny0838 wants to merge 4 commits into
python:mainfrom
danny0838:gh-152140

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

eendebakpt left a comment

Uh oh!

Uh oh!

eendebakpt Jun 25, 2026

Uh oh!

danny0838 Jun 25, 2026 •

edited

Loading

Uh oh!

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eendebakpt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

eendebakpt Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

danny0838 Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danny0838 commented Jun 25, 2026 •

edited

Loading

danny0838 Jun 25, 2026 •

edited

Loading

danny0838 commented Jun 25, 2026 •

edited

Loading

danny0838 commented Jun 25, 2026 •

edited

Loading