fix(cache): close L1 stale-after-write race exposed by Redis CI variant#28902
Conversation
The postgres-elasticsearch-redis IT job has been flaking on multiple test families (test_entityStatus, post_delete_entityWithOwner_200, testHardDeleteFileIsAsync, testUploadPdfToMinIO) — all the same race: a Guava LoadingCache loader in flight when a writer's invalidate runs can still put its pre-write value into L1 (documented Guava behavior), so subsequent reads on the same pod serve stale data until L1 TTL. The race has been known and partially patched before (#28197 added the NotFoundCache marker for delete; #28100 closed a rename-cascade variant). PR #28675's per-entity async indexing dispatcher + out-of-transaction invalidate widened it enough to tip CI 1-2 tests per run. The L1 race itself is not Redis-specific; the Redis IT variant just exercises more concurrent read paths, so it surfaces what other variants don't. Three layered defenses, all in EntityRepository: 1. NotFoundCache check on L1 hit (find / findByName). The existing delete-side fix only consulted the marker on L1 miss; a poisoned L1 entry shadowed it. Now if L1 has a value but the marker says deleted, we evict and throw — fully closes the delete-side race. 2. Per-entity write-epoch counter. Bounded Guava caches of AtomicLong per (type,id) and (type,fqn). Writers bump before invalidating L1. Loaders capture start epoch, re-check at end, throw a private LoaderRaceException on mismatch — Guava skips caching on throw. find / findByName catch the wrapped cause and re-read via the explicit-bypass path. Closes ~99% of the update-side race. 3. Deferred async L1 re-invalidate. Daemon-thread ScheduledExecutor re-invalidates the key 500ms after every writer's inline invalidate, catching any loader whose end-of-load check passed but whose Guava-internal put landed AFTER the inline invalidate. Covers the residual nanosecond window from defense 2. Cost: bounded epoch cache (~13MB worst case), one extra Guava lookup per loader call, one NotFoundCache Redis GET per L1-hit NON_DELETED read (restores a previously-reverted shape), one daemon-thread schedule per writer. No API or schema changes.
✅ PR checks passedThe linked issue has a description and all required Shipping project fields set. Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR hardens EntityRepository’s L1 Guava caching against Guava’s documented stale-after-write behavior by preventing racing loaders from repopulating L1 with pre-write JSON and by ensuring delete tombstones are honored even when L1 hits.
Changes:
- Add per-entity write-epoch counters and loader-side epoch checks that throw on concurrent writes to prevent stale loader results from being cached in L1.
- Add a deferred L1 “repair” eviction (scheduled re-invalidate) to close the residual loader-put-vs-invalidate race window.
- Re-check
NotFoundCacheon L1 hits forfind(UUID)/findByName(String)to deterministically surface deletes even if L1 is poisoned.
…er catches CI build caught missing imports (Cache, AtomicLong) that local incremental build missed. Also addresses bot-reviewer findings: 1. Gate the NotFoundCache check on L1 hit behind readEpochById/ByName != 0L, so uncontended reads keep the zero-Redis fast path. The marker is only consulted for keys a writer has ever touched. 2. Bump the write-epoch under both raw and quoted FQN forms — findByName quotes the input before the loader records its epoch, so writers that passed only the raw form (Glossary, Team, User, Classification etc. where quoteFqn=true) had their bumps silently ignored by the loader. 3. Narrow catch(Exception) to RejectedExecutionException / RuntimeException and add debug logs to the previously-empty ExecutionException catches. 4. FQN java.lang.Thread (org.openmetadata.schema.entity.feed.Thread collides on simple name).
…path Copilot reviewer flagged that cacheNameKey lowercases USER FQNs for L1, but the new on-L1-hit NotFoundCache check was reading with the raw fqn. A mixed-case user delete (writer markers off the canonical lowered form) would not be observable from a mixed-case read. Use cacheKey.getRight() for the marker lookups on both branches (L1 miss + L1 hit).
…etry, shorten comments Address two bot-reviewer findings: 1. scheduleL1Repair was enqueueing one task per write; under bulk write load that piles up. Coalesce on per-entity key so at most one pending repair exists per key at a time. Clear the marker on submission rejection so the next write can re-schedule. 2. NotFoundCache marker WRITES in findByName still used raw fqn after the READ side switched to canonical fqn — mixed-case USER lookups silently defeated negative-caching. Use canonical fqn consistently on both read and write paths. Slash verbose multi-paragraph code comments down to short whys. The longer rationales live in the PR description and commit messages.
Two bot findings: 1. Clearing the coalesce marker in finally (after invalidates run) blocks a writer arriving during task execution from scheduling its own backstop. Clear at task start so concurrent writes re-schedule and get the full delay-window protection. 2. Coalesce key was (entityType, id) only — a rename within the delay window was dropped because the pending task only knew the prior fqn. Include fqn in the coalesce key so renames get their own task.
|
🟡 Playwright Results — all passed (17 flaky)✅ 4267 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped
🟡 17 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
Code Review ✅ Approved 6 resolved / 6 findingsImplements a multi-layered defense to resolve L1 cache stale-after-write race conditions in ✅ 6 resolved✅ Performance: NotFoundCache Redis GET added to every NON_DELETED L1 hit
✅ Quality: Error handling violates team rules: empty catch + catch(Exception)
✅ Bug: Write-epoch name key may not match loader key (quoteFqn entities)
✅ Quality: NotFoundCache marker write/read key asymmetry for USER FQNs
✅ Edge Case: L1-repair coalescing marker cleared too late, leaving a gap
...and 1 more resolved from earlier reviews OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |



Fixes #28912
Summary
Fixes the L1 cache stale-after-write race that the
postgres-elasticsearch-redisintegration test job has been catching across unrelated PRs since early June.Across 4 CI runs traced (June 9 + June 10), 6 distinct test failures all share one root cause: a Guava
LoadingCacheloader in flight when a writer'sinvalidate()runs still puts its pre-write value into L1 after the load completes — documented Guava behavior. Subsequent reads on the same pod then serve stale data until L1 TTL.Why now
The L1 race is older than #28675 — the team has known about it and partially patched it:
NotFoundCachemarker for the delete-side. The comment atEntityRepository.java:4578-4582admits the seal isn't complete: "A stale L1 entry that survives the two invalidate passes is NOT caught by this marker… the second invalidate makes that case rare in practice."What widened the window enough to tip CI 1–2 tests per run:
postgres-elasticsearch-redisCI variant that exercises more concurrent read paths (cache invalidation pubsub, async lifecycle, governance workflow consumer).OrderedLaneExecutorfor per-entity async search indexing AND movedinvalidateCachesAfterStoreout of the@Transactionboundary.What this PR changes
Three layered defenses, all in
EntityRepository.java. No public API changes, no schema changes, no new dependencies.1.
NotFoundCachecheck on L1 hitfind(UUID)andfindByName(String)now consultNotFoundCacheon L1 hit (not just L1 miss). If the entity is marked deleted but L1 still has a poisoned value from a racing loader, evict and throwEntityNotFoundException.Gated on
readEpochById/ByName != 0Lso the common, uncontended case keeps the zero-Redis L1 fast path. The marker is only consulted for keys a writer has ever touched — exactly the population where the race can occur.2. Per-entity write-epoch counter in loaders
Bounded Guava caches of
AtomicLongper(entityType, id)and(entityType, fqn). Writers (invalidateCachesAfterStore,invalidate(T),invalidateCacheForEntity) bump the epoch BEFORE invalidating L1. Loaders capture the epoch at start, re-check at end, throwLoaderRaceExceptionon mismatch — Guava skips caching on throw.find()/findByName()catch the wrapped cause and re-read via the explicit-bypass path.FQN bumps are written under both the raw and
quoteNamed forms. Repositories withquoteFqn = true(Glossary, Team, User, Classification, etc.) quote the input FQN before the loader records its epoch; bumping only the raw form would leave the loader looking at a different key. Both bumps are idempotent — the alternate form is a no-op when no loader recorded it.3. Deferred async L1 re-invalidate
Daemon-thread
ScheduledExecutorService(2 threads, namedentity-cache-l1-repair) re-invalidates the key 500ms after every writer's inline invalidate. Catches any loader whose end-of-load epoch check passed but whose Guava-internal put landed AFTER the writer's inline invalidate — the residual nanosecond window from defense 2.Failure variants covered
MlModelResourceIT.test_entityStatusGlossaryResourceIT.test_entityStatusMetadataServiceResourceIT.post_delete_entityWithOwner_200ContextFileIT.testHardDeleteFileIsAsyncDriveFileUploadIT.testUploadPdfToMinIORdfGlossaryGraphIT.scopedResponseCarriesGlossaryNameAndIdPerNodeCost
getIfPresent(sub-µs).AtomicLong.incrementAndGet, one schedule (~1 µs).NON_DELETEDread on a key a writer has ever touched: oneNotFoundCacheRedis GET. Uncontended L1 hits keep the zero-Redis fast path.Review-cycle notes
First push (
b859d412ef) failed CI build because the local incremental compile cached over two missing imports (com.google.common.cache.Cache,java.util.concurrent.atomic.AtomicLong) that a clean build catches. Second push (513b074a35) adds those plus addresses three legitimate findings from the bot reviewer:NotFoundCachecheck (Fix 1 above).catch (Exception)cleanup — narrowed toRejectedExecutionException/RuntimeExceptionand added debug logs.quoteFqnepoch-key mismatch fix (Fix 2 above).mvn clean compileruns clean locally now; spotless clean.Test plan
integration-tests-postgres-elasticsearch-redispasses 3 runs in a rowintegration-tests-mysql-elasticsearchstays green (verifies no regression on non-Redis variant)integration-tests-postgres-opensearchstays greenHonest residual risk
A loader that takes longer than 500ms to complete (e.g., a DB call near the JDBI timeout) could land its put after the deferred re-invalidate. The next read would serve stale until L1 TTL (default 30s). Probability: rare DB stalls only. If that becomes the dominant flake source post-merge, the right next step is replacing
LoadingCachewith CaffeineAsyncCache+ atomic compute — a bigger refactor that should be a separate PR.Asks
NotFoundCache-on-L1-hit check (now gated, see Fix 1) and on the epoch counter approach vs. a fullLoadingCachereplacement.postgres-elasticsearch-redisjob is a shared CI flake onmainblocking unrelated PRs.