
fix: deduplicate ZipStore central directory on close#3858

Open
nuglifeleoji wants to merge 5 commits into zarr-developers:main from nuglifeleoji:fix/zipstore-dedup-on-close

Conversation

@nuglifeleoji
Contributor

ZIP files are append-only: writing the same key a second time adds a new entry without removing the old one. When an array is resized, zarr.json is rewritten, which accumulates duplicate filenames in the central directory and produces:

UserWarning: Duplicate name: 'zarr.json'

Many zip readers (e.g. KDE Ark, Windows Explorer) show only the first entry for a duplicated name, making the stored data appear stale or corrupt. The duplicates also waste space.

Fix: ZipStore.close() now calls _dedup_central_directory() before closing the file. It scans filelist in reverse and keeps only the last (most recent) entry for each filename, so the on-disk central directory is clean. This follows the approach suggested in the issue by @mkitti.

import zarr
from zarr.storage import ZipStore

store = ZipStore("data.zip", mode="w")
arr = zarr.create_array(store, shape=(5,), dtype="i4")
arr.resize((10,))   # rewrites zarr.json
store.close()

# Before: zarr.json appeared twice in the central directory
# After:  only one entry for zarr.json
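The append-only behavior is easy to reproduce with the standard-library zipfile alone. This is a self-contained sketch (the file names and JSON payloads are illustrative, not zarr's actual metadata):

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings():
    # Python's zipfile emits "UserWarning: Duplicate name" on the second write
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("zarr.json", '{"shape": [5]}')
        zf.writestr("zarr.json", '{"shape": [10]}')  # appended; old entry kept

with zipfile.ZipFile(buf) as zf:
    # Both entries survive in the central directory...
    print(zf.namelist().count("zarr.json"))  # 2
    # ...but reads resolve to the most recent entry
    print(zf.read("zarr.json"))  # b'{"shape": [10]}'
```

Readers that walk the central directory entry by entry (rather than via a name lookup) may surface the stale first entry, which is the symptom reported in the issue.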

Closes #3580

Made with Cursor

ZIP files are append-only: writing the same key (e.g. zarr.json) a
second time adds a new entry without removing the old one.  When an
array is resized, zarr.json is rewritten, which accumulates duplicates
in the central directory and triggers UserWarning: Duplicate name.

On close(), _dedup_central_directory() now scans filelist in reverse
and keeps only the last (most recent) entry for each filename, so the
on-disk central directory is clean.

Closes zarr-developers#3580

Made-with: Cursor
@d-v-b
Contributor

d-v-b commented Mar 31, 2026

what happens if that scan is interrupted?

@nuglifeleoji
Contributor Author

Good point — the original code updated filelist and NameToInfo in two separate steps, so an interrupt between them would leave the object inconsistent. Fixed in the latest commit: both new structures are built upfront, then swapped in together, so if anything raises before the swap the original state is still intact.

Python's zipfile warns when the same filename is written twice.
ZipStore is append-only by design; duplicates are cleaned up by
_dedup_central_directory() on close.  Suppress the warning at the
source so test suites with filterwarnings=error don't treat it as
a failure.

Made-with: Cursor
@nuglifeleoji
Contributor Author

The CI failures are caused by tests that expected the 'Duplicate name' UserWarning to still be emitted. Since this PR deduplicates entries on close, the warning is no longer raised and those tests fail with 'DID NOT WARN'.

Two options:

  1. Keep the dedup silent (current approach) and update all test assertions to not expect the warning
  2. Emit the warning from close() when dedup is needed, so existing tests pass

Which approach would you prefer? Happy to update accordingly.

Contributor

@mkitti mkitti left a comment


NameToInfo already has a unique set of filenames in its keys and the latest zipinfo entries as its values, I believe. Please double check this.

If so, we could just use NameToInfo to update the filelist. By using zipfile.getinfo to retrieve information from NameToInfo we could implement everything using the public API.

return
seen: set[str] = set()
deduped: list[zipfile.ZipInfo] = []
for info in reversed(self._zf.filelist):
Contributor


Suggested change
for info in reversed(self._zf.filelist):
for info in reversed(self._zf.infolist()):

Rather than accessing filelist directly here, use the public API infolist().

Comment on lines +151 to +156
new_filelist = list(reversed(deduped))
new_name_to_info = {info.filename: info for info in new_filelist}
# Swap both attributes together; if anything above raised, the
# original filelist/NameToInfo are still intact.
self._zf.filelist = new_filelist
self._zf.NameToInfo = new_name_to_info
Contributor

@mkitti mkitti Apr 6, 2026


Looking at NameToInfo I have a few questions and observations:

  1. Rather than constructing seen, a set, and deduped, a list, would it not be easier to just construct a dict?
  2. That dict seems equivalent to NameToInfo.
  3. The original NameToInfo seems to have what we want already. It contains a unique set of filenames as its keys and the latest version of each file as its values (double check this).
  4. Perhaps all that we need to do is set the contents of filelist to the values of NameToInfo?
self._zf.filelist = list(self._zf.NameToInfo.values())

There might even be a way to do this completely with the public API.

# zipfile.infolist() is public API
_filelist = self._zf.infolist()
# Make sure that the public API returns the SAME Python object as zipfile.filelist, which is NOT public API
assert _filelist is self._zf.filelist, "The assumption that zipfile.infolist() returns zipfile.filelist is not valid"
# zipfile.namelist() is public API
_namelist = np.unique(self._zf.namelist())
# Manipulate the list in place
_filelist.clear()
# zipfile.getinfo is just NameToInfo.get
_filelist.extend({self._zf.getinfo(name) for name in _namelist})

Contributor Author


Thanks for the review, @mkitti!

I checked the CPython zipfile source: NameToInfo is updated via dict assignment (NameToInfo[name] = zinfo), so later writes always overwrite earlier ones. This means NameToInfo already holds the latest ZipInfo for each filename. I also confirmed that infolist() returns self.filelist directly (not a copy), and getinfo(name) is just a NameToInfo lookup.

What I changed:
Since NameToInfo is already correct, only filelist needs to be rebuilt. I replaced the seen/deduped loop with:
unique_names = dict.fromkeys(self._zf.namelist()) # public API, preserves order, deduplicates
self._zf.filelist = [self._zf.getinfo(name) for name in unique_names] # public API
NameToInfo no longer needs to be updated at all. The new list is computed before assignment, so atomicity is preserved.

One note on your public-API snippet:
The set comprehension {self._zf.getinfo(name) for name in _namelist} loses insertion order, and np.unique() sorts names alphabetically — both could reorder the entries in filelist. I used dict.fromkeys() instead to deduplicate while preserving first-seen order, with no numpy dependency needed.

Thanks so much for your review; I'm grateful for the feedback and happy to help with other issues as well.
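That ordering argument can be checked in isolation. The following standalone sketch (not the PR's code; entry names are arbitrary) shows why dict.fromkeys is the right dedup primitive here:

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # suppress "Duplicate name"
    zf = zipfile.ZipFile(buf, "w")
    for name in ["zarr.json", "c/0", "zarr.json", "c/1"]:
        zf.writestr(name, name)

# A set (or np.unique, which additionally sorts) discards insertion order:
assert sorted(set(zf.namelist())) == ["c/0", "c/1", "zarr.json"]

# dict.fromkeys deduplicates while keeping first-seen order:
unique_names = dict.fromkeys(zf.namelist())
assert list(unique_names) == ["zarr.json", "c/0", "c/1"]

# Rebuild filelist with the latest ZipInfo per name via public getinfo():
zf.filelist = [zf.getinfo(name) for name in unique_names]
assert zf.namelist() == ["zarr.json", "c/0", "c/1"]
zf.close()
```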

NameToInfo is already a {filename -> latest ZipInfo} mapping because
each writestr/write call does NameToInfo[name] = zinfo, so later writes
overwrite earlier ones (confirmed via CPython zipfile source).

Replace the manual seen/deduped loop with a two-liner that uses public API:
- namelist() to get all filenames (preserving insertion order)
- dict.fromkeys() to deduplicate while keeping first-seen order
- getinfo() to retrieve the latest ZipInfo for each unique name

Only filelist needs to be updated; NameToInfo is already correct and
does not need to be rewritten. The new list is computed before assignment
so the ZipFile is never left in an inconsistent state on exception.

Made-with: Cursor
@nuglifeleoji force-pushed the fix/zipstore-dedup-on-close branch from 406e49d to eebe867 on April 7, 2026 07:15


Development

Successfully merging this pull request may close these issues.

Resizing an array in a ZipFileStore will lead to many duplicate entries for zarr.json in the ZipFile.

3 participants