Skip to content

Drop cache_loaded_ flag; gate init on metadata.empty()#20245

Open
doggeral wants to merge 3 commits into
pytorch:mainfrom
doggeral:export-D108178967
Open

Drop cache_loaded_ flag; gate init on metadata.empty()#20245
doggeral wants to merge 3 commits into
pytorch:mainfrom
doggeral:export-D108178967

Conversation

@doggeral

Copy link
Copy Markdown
Contributor

Summary:
Follow-up to D106717093. The cache_loaded_ flag was used as the gate
in initialize_for_runtime to decide between the CACHE_LOADED
fast-path (reopen fd only) and the load-or-fresh-write slow path.
But it was only set true by load_packed_cache(), not by the
fresh-write path. After a fresh-write + save_packed_index closed
the fd, the next init found cache_loaded_=false → re-entered
load_packed_cache → push a second whole-file mmap region on top
of the per-entry regions from reserve_space. Same file mapped
twice; redundant VM and bookkeeping.

Replace the flag with !name_to_packed_data_metadata_.empty() as
the gate. Metadata is non-empty after BOTH paths (load and
fresh-write), so the gate is correct for both. Removes 1 field,
3 writes, and the latent double-mmap on the fresh-write-then-save
path. The from_load entry comment is updated to drop the
now-stale reference to cache_loaded_ auto-invalidation.

Differential Revision: D108178967

doggeral added 3 commits June 10, 2026 07:22
Summary:
Add cross-load reuse + multi-PTE safety to the file-backed packed weight cache (D106673663). The first PTE in a session calls `save_packed_index()` to append a trailer; subsequent process launches mmap the file and pre-populate `name_to_packed_data_metadata_` so `look_up()` hits for every saved weight and `xnn_create_runtime` skips packing entirely.

## Cache file format

```
[packed data regions]                          (written by reserve_space)
[index entries]                                (written by save_packed_index)
  each: name_len(4B) | name(N) | file_offset(8B) | data_size(8B)
[footer: 20 bytes]
  index_start(8B) | entry_count(4B) | magic "XPWC"(4B) | version(4B)
```

## Lifecycle invariants

- `cache_loaded_` gate: `load_packed_cache()` runs at most once per process per path. Subsequent PTE inits for the same path reopen the write fd without re-reading the trailer.
- `from_load` flag: persistent entries (loaded from trailer or promoted on save) skip `delete_packed_data` cleanup. This keeps the mmap region and metadata alive across PTE unload/reload, so the next init hits the cache instead of repacking. Without this, every PTE destroy/recreate cycle appended a fresh copy to the file (~450 MB per cycle).
- No-op save short-circuit: `save_packed_index` returns early when no new `reserve_space` happened since the last save, avoiding the mtime churn that previously made the cache file look modified on every model load.

## Multi-PTE behavior

- Multiple PTEs (or methods that don't share weights) in the same model load share one cache file. Each PTE's `reserve_space` extends the file; `finalize_for_runtime` msyncs only newly added regions; `save_packed_index` writes one trailer covering all PTEs at the end of the load.
- Sibling PTEs that opt out of the mmap path (caller passes empty `packed_cache_path`) early-return from `initialize_for_runtime` and fall through to heap allocation, without touching the singleton's PLLM state.
- Cross-model coexistence relies on caller-side discipline: only models that opt in set a non-empty cache path. Setting different non-empty paths concurrently is not supported by this singleton design.

## Caller change

`XNNPACKBackend::init` always calls `set_packed_cache_path` (with empty string for non-opted-in PTEs). This keeps the singleton path in sync with the current PTE instead of inheriting a sibling's path.

## Test Plan

```
buck2 test fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache  # 5 pass
buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backendApple
buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend
buck2 build fbcode//executorch/backends/xnnpack:xnnpack_backend
```

On device (iOS Stella build, PLLM + Llama3 runner):
- Cold start: load `(1184 entries)` from cache, `reserve_mmap=0` for cached weights
- Cache file size stable at ~593 MB across PLLM unload/reload cycles
- `app_peak ~700 MB` (vs ~2.5 GB pre-fix)
- `compressed ~100 MB` (vs ~1.7 GB pre-fix)

Differential Revision: D106717093
Summary:
XNNPACK exposes `xnn_weights_cache_look_up_key.seed` — a per-ukernel value that XNNPACK guarantees is consistent across runs of the same ukernel and changes whenever a ukernel implementation changes. Store this seed per cache entry so a stale cached packing produced by an old XNNPACK ukernel is rejected after upgrade, instead of being handed back to a newer ukernel that expects a different layout.

Changes:
- `PackedDataMeta` gains `uint32_t seed{0}`.
- `look_up` rejects (returns `SIZE_MAX`) when a name hit has a stored seed that doesn't match `cache_key->seed`. This forces `look_up_or_insert` to re-pack with the current ukernel and avoids the slow `memcmp` path catching it later.
- `look_up_or_insert` records `cache_key->seed` on insert.
- On-disk index entry layout extended to `[name_len:u32][name][file_offset:u64][data_size:u64][seed:u32]` (was 16 bytes after the name, now 20).
- `load_packed_cache` reads the per-entry seed and bumps the trailing bytes bound check accordingly.
- `kCacheVersion` bumped 1 → 2 so existing v1 files (which carry no seed) are rejected at load instead of being loaded with `seed=0` and mismatching every fresh `look_up`.

Cleanup of orphaned in-memory and on-disk entries left by an invalidated look-up is a follow-up — this diff only adds the detection.

Differential Revision: D108082431
Summary:
Follow-up to D106717093. The cache_loaded_ flag was used as the gate
in initialize_for_runtime to decide between the CACHE_LOADED
fast-path (reopen fd only) and the load-or-fresh-write slow path.
But it was only set true by load_packed_cache(), not by the
fresh-write path. After a fresh-write + save_packed_index closed
the fd, the next init found cache_loaded_=false → re-entered
load_packed_cache → push a second whole-file mmap region on top
of the per-entry regions from reserve_space. Same file mapped
twice; redundant VM and bookkeeping.

Replace the flag with !name_to_packed_data_metadata_.empty() as
the gate. Metadata is non-empty after BOTH paths (load and
fresh-write), so the gate is correct for both. Removes 1 field,
3 writes, and the latent double-mmap on the fresh-write-then-save
path. The from_load entry comment is updated to drop the
now-stale reference to cache_loaded_ auto-invalidation.

Differential Revision: D108178967
@doggeral doggeral requested a review from digantdesai as a code owner June 12, 2026 20:43
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20245

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 4 New Failures, 1 Pending

As of commit 7889e6b with merge base 0cbece4 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@meta-codesync

meta-codesync Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@doggeral has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108178967.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant