gh-152140: Replace _Extra class with ZipFile._strip_extra_fields()#152141
gh-152140: Replace _Extra class with ZipFile._strip_extra_fields()#152141danny0838 wants to merge 4 commits into
_Extra class with ZipFile._strip_extra_fields()#152141Conversation
…elds()` The `_Extra` class was over-engineered. Its only active usage was its `strip()` method, called by `ZipFile._write_end_record()` to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by `_decodeExtra()` or other methods efficiently. Additionally, its `split()` method called `_Extra` directly rather than utilizing `cls`, which was a suboptimal pattern that hindered extensibility. Remove the `_Extra` class entirely and reimplement it as a private static method `_strip_extra_fields()` that processes a bytearray inside `ZipFile`, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.
eendebakpt
left a comment
There was a problem hiding this comment.
The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:
- Early exit for empty extra data: does that occur? If so, how often
- _strip_extra_fields now returns a bytarray instead of bytes
- The parsing of the data is implemented different now (e.g. we need an additional
keep remaining trailing bytes)
Can you make benchmarks to show the new code is indeed faster (or at least no regression)?
| result.extend(mv[pos:pos + 4 + xlen]) | ||
| pos += 4 + xlen | ||
|
|
||
| # keep remaining trailing bytes (e.g. truncated or malformed data) |
There was a problem hiding this comment.
Why was this added? In the old read_one no data was returned for trailing data with < 4 bytes.
There was a problem hiding this comment.
This was the original behavior since read_one returns wrapped original bytes with id=None (and as a result will never be stripped) if the input is less than 4 bytes. The current code just add a comment to make it more explicit.
You can check the tests (especially the test_too_short) to confirm that the behavior is not changed.
It depends. Other ZIP tools may add Linux permission or extended date, while Python
Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.
This is NOT a behavior change, as responsed in another comment.
I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing? |
Since `struct.unpack_from` is already an offset-based approach at the C level, and `result.extend` requires data copy anyway, avoiding a `memoryview` wrapper prevents redundant Python object allocation and pointer-shifting overhead, yielding optimal runtime memory footprint and CPU performance.
|
Here's the benchmark: import timeit
import struct
import random
random.seed(42)
class _Extra(bytes):
FIELD_STRUCT = struct.Struct('<HH')
def __new__(cls, val, id=None):
return super().__new__(cls, val)
def __init__(self, val, id=None):
self.id = id
@classmethod
def read_one(cls, raw):
try:
xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
except struct.error:
xid = None
xlen = 0
return cls(raw[:4+xlen], xid), raw[4+xlen:]
@classmethod
def split(cls, data):
# use memoryview for zero-copy slices
rest = memoryview(data)
while rest:
extra, rest = _Extra.read_one(rest)
yield extra
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs."""
return b''.join(
ex
for ex in cls.split(data)
if ex.id not in xids
)
def strip_extra_fields(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# early return for empty extra data
if not data:
return result
# use memoryview for zero-copy slices
mv = memoryview(data)
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_early_return(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# use memoryview for zero-copy slices
mv = memoryview(data)
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_mv(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# early return for empty extra data
if not data:
return result
mv = data
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_mv_no_early_return(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
mv = data
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
s = struct.Struct("<HH")
valid_extra_data = (
s.pack(0x0001, 24) + b"X" * 24 + # ZIP64 Extended Information
s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
s.pack(0x0001, 16) + b"Y" * 16 + # ZIP64
s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""
dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)
def run_legacy():
for data in dataset:
_Extra.strip(data, (1,))
def run_new():
for data in dataset:
strip_extra_fields(data, (1,))
def run_new2():
for data in dataset:
strip_extra_fields_no_early_return(data, (1,))
def run_new3():
for data in dataset:
strip_extra_fields_no_mv(data, (1,))
def run_new4():
for data in dataset:
strip_extra_fields_no_mv_no_early_return(data, (1,))
NUMBER = 10000
legacy_time = timeit.timeit(stmt="run_legacy()", globals=globals(), number=NUMBER)
new_time = timeit.timeit(stmt="run_new()", globals=globals(), number=NUMBER)
new_time2 = timeit.timeit(stmt="run_new2()", globals=globals(), number=NUMBER)
new_time3 = timeit.timeit(stmt="run_new3()", globals=globals(), number=NUMBER)
new_time4 = timeit.timeit(stmt="run_new4()", globals=globals(), number=NUMBER)
print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
print(f"Legacy (_Extra class): {legacy_time:.4f} seconds")
print(f"New (static method): {new_time:.4f} seconds")
print(f"New2 (no early return): {new_time2:.4f} seconds")
print(f"New3 (no memoryview): {new_time3:.4f} seconds")
print(f"New4 (no memoryview, no early return): {new_time4:.4f} seconds")
print(f"Speedup: {legacy_time / new_time3:.2f}x")
import tracemalloc
def profile_memory(func_to_run):
tracemalloc.start()
for _ in range(50):
func_to_run()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
return peak
peak_legacy = profile_memory(run_legacy)
peak_new = profile_memory(run_new)
peak_new2 = profile_memory(run_new2)
peak_new3 = profile_memory(run_new3)
peak_new4 = profile_memory(run_new4)
print(f"\n=== Memory Peak Results ===")
print(f"Legacy Peak Memory: {peak_legacy:,} bytes")
print(f"New Peak Memory: {peak_new:,} bytes")
print(f"New2 Peak Memory: {peak_new2:,} bytes")
print(f"New3 Peak Memory: {peak_new3:,} bytes")
print(f"New4 Peak Memory: {peak_new4:,} bytes")
print(f"Memory Saved: {((peak_legacy - peak_new3) / peak_legacy) * 100:.1f}%")and the results (run on Windows 10, Python 3.16.0a0 at commit 4eae06b): The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), and removing |
The
_Extraclass was over-engineered. Its only active usage was itsstrip()method, called byZipFile._write_end_record()to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by_decodeExtra()or other methods efficiently. Additionally, itssplit()method called_Extradirectly rather than utilizingcls, which was a suboptimal pattern that hindered extensibility.Remove the
_Extraclass entirely and reimplement it as a private static method_strip_extra_fields()that processes a bytearray insideZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.Additionally, bind this method as a class property in the tests to simplify the code.
zipfilemodule #152140