fix(#1469): codec-driven GC reference discovery + file-level orphan matching #1479
Open
dimitri-yatsenko wants to merge 1 commit into
…rphan matching

Garbage collection deleted external-store files belonging to LIVE rows in two ways, both fixed here:

1. Custom-codec blindness. gc.scan recognized schema-addressed columns by hardcoded codec name (object/npy), so a custom SchemaCodec subclass (e.g. a NetCDF codec, #1469) was never scanned; its live files were reported orphaned and collect() deleted them. Recognition is now by type (isinstance(codec, SchemaCodec)), and reference extraction moves to a codec-owned hook, Codec.referenced_paths(stored). Custom SchemaCodec subclasses inherit correct behavior for free; fully custom codecs override.

2. Path-format mismatch (also hit built-in <object@>/<npy@>). A row's stored metadata references an object FILE ({schema}/{table}/{pk}/{field}_{token}), but list_schema_paths enumerated the enclosing DIRECTORY, so the referenced and stored path sets never matched and live objects were flagged orphaned. list_schema_paths now enumerates files (matching the referenced paths, with per-token granularity) and delete_schema_path removes the single orphaned file and prunes empty parent dirs.

Delete-then-GC semantics unchanged (files survive row delete by design; GC reclaims).

Adds Codec.referenced_paths to the base contract (default recognizes standard {path, store} metadata).

Tests: recognition of custom SchemaCodec subclasses; end-to-end guard that collect() keeps a surviving row's file and reclaims only the deleted row's file (both fail on pre-fix code). Existing object/npy/hash suites still pass.

Discovery layer of the trustworthy-GC work (#1478); #1445 builds on it.
Fixes #1469 — garbage collection deleting external-store files that belong to existing rows. Implements the plan from #1469 / #1478.
Two root causes, both fixed
1. Custom-codec blindness (the reported bug).
gc.scan recognized schema-addressed columns by hardcoded codec name (object/npy), so a custom SchemaCodec subclass (the reporter's NetCDF codec) was never scanned; its live files were reported orphaned and collect() deleted them. Recognition is now by type (isinstance(codec, SchemaCodec)), and reference extraction moves to a codec-owned hook, Codec.referenced_paths(stored).
Custom SchemaCodec subclasses inherit correct behavior for free; fully custom codecs can override. scan calls attr.codec.referenced_paths(value) per column instead of switching on name.
2. Path-format mismatch (latent; also hit built-in <object@>/<npy@>).
A row's stored metadata references an object file ({schema}/{table}/{pk}/{field}_{token}), but list_schema_paths enumerated the enclosing directory, so the referenced and stored path sets never matched and every live schema-addressed object was flagged orphaned (existing tests only asserted referenced >= 1, never orphaned == 0, so it was hidden). list_schema_paths now enumerates files (matching referenced paths, with per-token granularity), and delete_schema_path removes the single orphaned file and prunes empty parent dirs.
Not changed
Delete-then-GC semantics are unchanged — external files still survive row delete by design; GC reclaims. No delete-time eager cleanup (unsafe for hash dedup).
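For readers unfamiliar with the first root cause, here is a minimal sketch of type-based recognition plus a codec-owned reference hook. It is not the PR's code: the class bodies, the `(store, path)` tuple shape, and the `scan_referenced` helper are assumptions made for illustration; only the names `Codec`, `SchemaCodec`, `referenced_paths`, and the `{path, store}` metadata keys come from the description above.

```python
from typing import Any, Iterable, Iterator, Optional, Set, Tuple

Ref = Tuple[Optional[str], str]  # assumed shape: (store name, path)


class Codec:
    """Hypothetical stand-in for the base codec contract."""

    def referenced_paths(self, stored: Any) -> Iterator[Ref]:
        # Default hook: recognize standard {path, store} metadata.
        if isinstance(stored, dict) and "path" in stored:
            yield stored.get("store"), stored["path"]


class SchemaCodec(Codec):
    """Hypothetical stand-in for schema-addressed codecs."""


class NetCDFCodec(SchemaCodec):
    """A custom subclass; it inherits referenced_paths unchanged."""


def uses_schema_storage(codec: Codec) -> bool:
    # Recognition by type, not by a hardcoded list of codec names.
    return isinstance(codec, SchemaCodec)


def scan_referenced(columns: Iterable[Tuple[Codec, Iterable[Any]]]) -> Set[Ref]:
    """Collect references from (codec, stored values) pairs.

    Only schema-addressed columns are covered in this sketch.
    """
    referenced: Set[Ref] = set()
    for codec, values in columns:
        if not uses_schema_storage(codec):
            continue
        for value in values:
            referenced.update(codec.referenced_paths(value))
    return referenced
```

With this shape, a name-based check such as `codec.name in ("object", "npy")` would skip `NetCDFCodec` entirely, while the `isinstance` check picks it up without any change to the subclass.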
Tests
- Recognition of a custom SchemaCodec subclass (_uses_schema_storage).
- End-to-end guard: after collect(), the surviving row's file remains and only the deleted row's file is reclaimed, asserted by exact path (robust to the shared test store). Both new tests fail on pre-fix code.
- Existing test_gc.py (32) + object/npy/hash/codec/adapter suites (237) pass locally on MySQL & PostgreSQL.

Discovery layer of the trustworthy-GC work (#1478); #1445 (two-phase quarantine/grace/purge) builds on the now-trustworthy scan.
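The path-format mismatch and the exact-path assertion described above can be illustrated with plain set arithmetic. This is a sketch under assumptions, not the PR's implementation: `find_orphans`, `enclosing_dir`, and the example paths are hypothetical, and the paths only loosely follow the {schema}/{table}/{pk}/{field}_{token} layout.

```python
from typing import Iterable, Set


def enclosing_dir(file_path: str) -> str:
    """Return the directory part of a slash-separated store path."""
    return file_path.rsplit("/", 1)[0]


def find_orphans(stored: Iterable[str], referenced: Iterable[str]) -> Set[str]:
    """Orphans are stored paths that no live row references."""
    return set(stored) - set(referenced)


# Hypothetical store contents: row 1 is live, row 2 was deleted.
live_file = "lab/scan/1/image_a1b2"
dead_file = "lab/scan/2/image_c3d4"
referenced = {live_file}
files_in_store = {live_file, dead_file}

# Pre-fix shape: directories are compared against file references,
# so nothing matches and the live row's directory is flagged too.
dir_level = find_orphans({enclosing_dir(p) for p in files_in_store}, referenced)

# Post-fix shape: files are compared against file references.
file_level = find_orphans(files_in_store, referenced)

assert enclosing_dir(live_file) in dir_level  # live data wrongly flagged
assert file_level == {dead_file}              # only the deleted row's file
```

Asserting on the exact orphan set, rather than only that some references were found, is what makes this kind of mismatch visible in a test.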