fix(cache): close L1 stale-after-write race exposed by Redis CI variant by yan-3005 · Pull Request #28902 · open-metadata/OpenMetadata

yan-3005 · 2026-06-10T06:51:32Z

Summary

Fixes the L1 cache stale-after-write race that the postgres-elasticsearch-redis integration test job has been catching across unrelated PRs since early June.

Across 4 CI runs traced (June 9 + June 10), 6 distinct test failures all share one root cause: a Guava LoadingCache loader in flight when a writer's invalidate() runs still puts its pre-write value into L1 after the load completes — documented Guava behavior. Subsequent reads on the same pod then serve stale data until L1 TTL.

Why now

The L1 race is older than #28675 — the team has known about it and partially patched it:

fix(cache): seal hard-delete stale-read window via NotFoundCache marker #28197 (May 16) added NotFoundCache marker for the delete-side. The comment at EntityRepository.java:4578-4582 admits the seal isn't complete: "A stale L1 entry that survives the two invalidate passes is NOT caught by this marker… the second invalidate makes that case rare in practice."
fix(cache): close rename-cascade race and bot-task delete eviction gap #28100 (May 21) closed a rename-cascade variant of the same family.

What widened the window enough to tip CI 1–2 tests per run:

Cache improvements: lineage + search layers, observability, CI gate #28012 (May 13) added the postgres-elasticsearch-redis CI variant that exercises more concurrent read paths (cache invalidation pubsub, async lifecycle, governance workflow consumer).
perf: one-transaction flush + async indexing write path, read-path fixes #28675 (June 5) introduced OrderedLaneExecutor for per-entity async search indexing AND moved invalidateCachesAfterStore out of the @Transaction boundary.

What this PR changes

Three layered defenses, all in EntityRepository.java. No public API changes, no schema changes, no new dependencies.

1. `NotFoundCache` check on L1 hit

find(UUID) and findByName(String) now consult NotFoundCache on L1 hit (not just L1 miss). If the entity is marked deleted but L1 still has a poisoned value from a racing loader, evict and throw EntityNotFoundException.

Gated on readEpochById/ByName != 0L so the common, uncontended case keeps the zero-Redis L1 fast path. The marker is only consulted for keys a writer has ever touched — exactly the population where the race can occur.

2. Per-entity write-epoch counter in loaders

Bounded Guava caches of AtomicLong per (entityType, id) and (entityType, fqn). Writers (invalidateCachesAfterStore, invalidate(T), invalidateCacheForEntity) bump the epoch BEFORE invalidating L1. Loaders capture the epoch at start, re-check at end, throw LoaderRaceException on mismatch — Guava skips caching on throw. find() / findByName() catch the wrapped cause and re-read via the explicit-bypass path.

FQN bumps are written under both the raw and quoteNamed forms. Repositories with quoteFqn = true (Glossary, Team, User, Classification, etc.) quote the input FQN before the loader records its epoch; bumping only the raw form would leave the loader looking at a different key. Both bumps are idempotent — the alternate form is a no-op when no loader recorded it.

3. Deferred async L1 re-invalidate

Daemon-thread ScheduledExecutorService (2 threads, named entity-cache-l1-repair) re-invalidates the key 500ms after every writer's inline invalidate. Catches any loader whose end-of-load epoch check passed but whose Guava-internal put landed AFTER the writer's inline invalidate — the residual nanosecond window from defense 2.

Failure variants covered

Failure	Family	Caught by
`MlModelResourceIT.test_entityStatus`	Update-side	2 + 3
`GlossaryResourceIT.test_entityStatus`	Update-side	2 + 3 (relies on the dual quoteFqn bump)
`MetadataServiceResourceIT.post_delete_entityWithOwner_200`	Delete-side	1 + 3
`ContextFileIT.testHardDeleteFileIsAsync`	Delete-side	1 + 3
`DriveFileUploadIT.testUploadPdfToMinIO`	Update-side (async status)	2 + 3
`RdfGlossaryGraphIT.scopedResponseCarriesGlossaryNameAndIdPerNode`	Likely same family (not traced end-to-end)	1 + 3 if delete-flavor, 2 + 3 if update-flavor

Cost

Memory: bounded epoch caches (~13 MB worst case at 200k entries × 2 caches × ~32 bytes each).
Per loader call: one extra Guava getIfPresent (sub-µs).
Per writer: one AtomicLong.incrementAndGet, one schedule (~1 µs).
Per L1-hit NON_DELETED read on a key a writer has ever touched: one NotFoundCache Redis GET. Uncontended L1 hits keep the zero-Redis fast path.
Two daemon threads on a tiny queue.

Review-cycle notes

First push (b859d412ef) failed CI build because the local incremental compile cached over two missing imports (com.google.common.cache.Cache, java.util.concurrent.atomic.AtomicLong) that a clean build catches. Second push (513b074a35) adds those plus addresses three legitimate findings from the bot reviewer:

Performance gate on the L1-hit NotFoundCache check (Fix 1 above).
Empty-catch / catch (Exception) cleanup — narrowed to RejectedExecutionException / RuntimeException and added debug logs.
quoteFqn epoch-key mismatch fix (Fix 2 above).

mvn clean compile runs clean locally now; spotless clean.

Test plan

CI: integration-tests-postgres-elasticsearch-redis passes 3 runs in a row
CI: integration-tests-mysql-elasticsearch stays green (verifies no regression on non-Redis variant)
CI: integration-tests-postgres-opensearch stays green
Spotless / Java checkstyle clean
Build + unit tests green

Honest residual risk

A loader that takes longer than 500ms to complete (e.g., a DB call near the JDBI timeout) could land its put after the deferred re-invalidate. The next read would serve stale until L1 TTL (default 30s). Probability: rare DB stalls only. If that becomes the dominant flake source post-merge, the right next step is replacing LoadingCache with Caffeine AsyncCache + atomic compute — a bigger refactor that should be a separate PR.

Asks

Cache-write-path owners (perf: one-transaction flush + async indexing write path, read-path fixes #28675 / fix(cache): seal hard-delete stale-read window via NotFoundCache marker #28197 authors): please weigh in on the trade-off of restoring the NotFoundCache-on-L1-hit check (now gated, see Fix 1) and on the epoch counter approach vs. a full LoadingCache replacement.
Until this lands, the postgres-elasticsearch-redis job is a shared CI flake on main blocking unrelated PRs.

The postgres-elasticsearch-redis IT job has been flaking on multiple test families (test_entityStatus, post_delete_entityWithOwner_200, testHardDeleteFileIsAsync, testUploadPdfToMinIO) — all the same race: a Guava LoadingCache loader in flight when a writer's invalidate runs can still put its pre-write value into L1 (documented Guava behavior), so subsequent reads on the same pod serve stale data until L1 TTL. The race has been known and partially patched before (#28197 added the NotFoundCache marker for delete; #28100 closed a rename-cascade variant). PR #28675's per-entity async indexing dispatcher + out-of-transaction invalidate widened it enough to tip CI 1-2 tests per run. The L1 race itself is not Redis-specific; the Redis IT variant just exercises more concurrent read paths, so it surfaces what other variants don't. Three layered defenses, all in EntityRepository: 1. NotFoundCache check on L1 hit (find / findByName). The existing delete-side fix only consulted the marker on L1 miss; a poisoned L1 entry shadowed it. Now if L1 has a value but the marker says deleted, we evict and throw — fully closes the delete-side race. 2. Per-entity write-epoch counter. Bounded Guava caches of AtomicLong per (type,id) and (type,fqn). Writers bump before invalidating L1. Loaders capture start epoch, re-check at end, throw a private LoaderRaceException on mismatch — Guava skips caching on throw. find / findByName catch the wrapped cause and re-read via the explicit-bypass path. Closes ~99% of the update-side race. 3. Deferred async L1 re-invalidate. Daemon-thread ScheduledExecutor re-invalidates the key 500ms after every writer's inline invalidate, catching any loader whose end-of-load check passed but whose Guava-internal put landed AFTER the inline invalidate. Covers the residual nanosecond window from defense 2. Cost: bounded epoch cache (~13MB worst case), one extra Guava lookup per loader call, one NotFoundCache Redis GET per L1-hit NON_DELETED read (restores a previously-reverted shape), one daemon-thread schedule per writer. No API or schema changes.

github-actions · 2026-06-10T06:51:42Z

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

Copilot

Pull request overview

This PR hardens EntityRepository’s L1 Guava caching against Guava’s documented stale-after-write behavior by preventing racing loaders from repopulating L1 with pre-write JSON and by ensuring delete tombstones are honored even when L1 hits.

Changes:

Add per-entity write-epoch counters and loader-side epoch checks that throw on concurrent writes to prevent stale loader results from being cached in L1.
Add a deferred L1 “repair” eviction (scheduled re-invalidate) to close the residual loader-put-vs-invalidate race window.
Re-check NotFoundCache on L1 hits for find(UUID) / findByName(String) to deterministically surface deletes even if L1 is poisoned.

…er catches CI build caught missing imports (Cache, AtomicLong) that local incremental build missed. Also addresses bot-reviewer findings: 1. Gate the NotFoundCache check on L1 hit behind readEpochById/ByName != 0L, so uncontended reads keep the zero-Redis fast path. The marker is only consulted for keys a writer has ever touched. 2. Bump the write-epoch under both raw and quoted FQN forms — findByName quotes the input before the loader records its epoch, so writers that passed only the raw form (Glossary, Team, User, Classification etc. where quoteFqn=true) had their bumps silently ignored by the loader. 3. Narrow catch(Exception) to RejectedExecutionException / RuntimeException and add debug logs to the previously-empty ExecutionException catches. 4. FQN java.lang.Thread (org.openmetadata.schema.entity.feed.Thread collides on simple name).

…path Copilot reviewer flagged that cacheNameKey lowercases USER FQNs for L1, but the new on-L1-hit NotFoundCache check was reading with the raw fqn. A mixed-case user delete (writer markers off the canonical lowered form) would not be observable from a mixed-case read. Use cacheKey.getRight() for the marker lookups on both branches (L1 miss + L1 hit).

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

…etry, shorten comments Address two bot-reviewer findings: 1. scheduleL1Repair was enqueueing one task per write; under bulk write load that piles up. Coalesce on per-entity key so at most one pending repair exists per key at a time. Clear the marker on submission rejection so the next write can re-schedule. 2. NotFoundCache marker WRITES in findByName still used raw fqn after the READ side switched to canonical fqn — mixed-case USER lookups silently defeated negative-caching. Use canonical fqn consistently on both read and write paths. Slash verbose multi-paragraph code comments down to short whys. The longer rationales live in the PR description and commit messages.

Two bot findings: 1. Clearing the coalesce marker in finally (after invalidates run) blocks a writer arriving during task execution from scheduling its own backstop. Clear at task start so concurrent writes re-schedule and get the full delay-window protection. 2. Coalesce key was (entityType, id) only — a rename within the delay window was dropped because the pending task only knew the prior fqn. Include fqn in the coalesce key so renames get their own task.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

sonarqubecloud · 2026-06-10T08:47:35Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-06-10T10:20:26Z

🟡 Playwright Results — all passed (17 flaky)

✅ 4267 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped

Shard	Passed	Flaky	Skipped
🟡 Shard 1	299	2	4
🟡 Shard 2	804	2	9
🟡 Shard 3	805	3	8
🟡 Shard 4	839	4	12
🟡 Shard 5	719	2	47
🟡 Shard 6	801	4	8

🟡 17 flaky test(s) (passed on retry)

Features/CustomizeDetailPage.spec.ts › Container - customization should work (shard 1, 1 retry)
Flow/Metric.spec.ts › Metric creation flow should work (shard 1, 1 retry)
Features/Glossary/GlossaryP3Tests.spec.ts › should handle multiple rapid API calls (shard 2, 1 retry)
Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
Features/IncidentManager.spec.ts › Resolving incident & re-run pipeline (shard 3, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
Flow/ObservabilityAlerts.spec.ts › Alert operations for a user with and without permissions (shard 4, 1 retry)
Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 4, 1 retry)
Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
Pages/Entity.spec.ts › User as Owner Add, Update and Remove (shard 4, 1 retry)
Pages/EntityDataSteward.spec.ts › Team as Owner Add, Update and Remove (shard 5, 1 retry)
Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
Pages/ExploreTree.spec.ts › Explore Tree (shard 6, 1 retry)
Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for mlModel -> dashboard (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
Pages/Users.spec.ts › Reset Password for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

gitar-bot · 2026-06-10T10:21:50Z

Code Review ✅ Approved 6 resolved / 6 findings

Implements a multi-layered defense to resolve L1 cache stale-after-write race conditions in EntityRepository. Addresses the cache key asymmetry for quoted FQNs, replaces generic error handling with specific exceptions, and introduces an epoch-gated NotFoundCache check on L1 hits.

✅ 6 resolved

✅ Performance: NotFoundCache Redis GET added to every NON_DELETED L1 hit

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:1659-1673 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2299-2306
Defense #1 adds notFoundCache.isMarkedNotFoundById/ByName(...) to the L1-hit branch of both find(UUID) and findByName(String). As verified, those NotFoundCache methods delegate to the cache provider and perform a Redis network round trip per call. The hot-path comment immediately above (lines 1641-1644) states that an L1 hit serves the entity with zero Redis traffic; this change negates that for every NON_DELETED read, turning the most frequent read path into one Redis GET per request. On read-heavy workloads this is a meaningful latency/throughput regression on a path that was specifically optimized (the PR notes the unconditional shape was previously reverted for exactly this cost). The correctness benefit only applies to the rare delete-race poison case. Consider gating this extra GET behind a cheaper signal (e.g., only when a recent epoch bump for the key is observed, or sampling/short local TTL on the negative marker) so the common case keeps its zero-Redis fast path while still closing the race.

✅ Quality: Error handling violates team rules: empty catch + catch(Exception)

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:383-385 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:390-392 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:459-462 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:466-470
The team's error-handling conventions forbid empty catch blocks and catch (Exception e). This diff introduces both: bumpWriteEpoch has two effectively-empty catch (ExecutionException e) blocks (only a comment, no statement or rethrow), and scheduleL1Repair uses two broad catch (Exception e) blocks (one inside the scheduled task, one around schedule(...)). While the intent (AtomicLong::new cannot throw; pool rejection is non-fatal) is reasonable, please conform to the conventions: narrow the caught type and/or add at least a debug log so the empty blocks are not silent. The broad catch (Exception e) around schedule should be narrowed to RejectedExecutionException.

✅ Bug: Write-epoch name key may not match loader key (quoteFqn entities)

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:377-391 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:4794-4799 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:10438-10443 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2257-2258
The name-side epoch guard only works if writers and the loader use an identical cacheNameKey value. Writers call bumpWriteEpoch(entityType, id, entity.getFullyQualifiedName()) with the entity's raw stored FQN (e.g. invalidate(T) at 4797), whereas findByName quotes the incoming fqn (fqn = quoteFqn ? quoteName(fqn) : fqn;) before the loader records its epoch via readEpochByName(fqnPair). For the ~10 repositories with quoteFqn = true (Team, User, Glossary, Classification, etc.) the loader's epoch key is built from the quoted form while the writer bumps the unquoted form. If those two strings differ for a given FQN, defense #2 silently no-ops on the name path (the loader's start/end epoch reads both return 0 and never observe the writer's bump), leaving only the deferred repair + TTL. The ID-based epoch (UUID key) is unaffected, so impact is limited, but the name-path guard should key off the same normalized/quoted value the loader uses to be effective.

✅ Quality: NotFoundCache marker write/read key asymmetry for USER FQNs

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2293 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2303 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2325 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2334 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2362
This commit changes only the two L1 read-path checks (lines 2325 and 2334) to look up the NotFoundCache marker under canonicalFqn (the lowercased form cacheNameKey produces for Entity.USER). However, the marker writes this same method performs — markNotFoundByName(entityType, fqn) at line 2362 (ExecutionException/EntityNotFound handler) and line 2303 (bypass path), as well as the bypass-path check at line 2293 — still use the raw fqn.

For a USER entity looked up with a mixed-case FQN (e.g. John.Doe): the loader throws EntityNotFound, line 2362 writes the marker under raw John.Doe, but the next hot-path read at line 2325 consults the marker under john.doe (canonical) and misses, so the loader runs and hits the DB again. The negative cache is silently defeated for exactly the mixed-case USER case the diff comment says it is fixing. This is an efficiency regression (extra DB SELECTs), not a correctness bug, since a marker miss just falls through to a safe DB lookup.

Suggested fix: canonicalize consistently on both sides. Either pass canonicalFqn (i.e. cacheKey.getRight()) to the markNotFoundByName calls and the bypass-path check, or canonicalize once inside NotFoundCache/CacheKeys.notFoundByName. Note the delete-side writer (markEntityNotFound) keys off entity.getFullyQualifiedName(), which only matches canonicalFqn if USER FQNs are always stored lowercase — worth confirming so the three defenses agree on one canonical key form.

✅ Edge Case: L1-repair coalescing marker cleared too late, leaving a gap

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:431-445
In the new coalescing guard, the marker is added before scheduling and removed in the task's finally block (line 452), i.e. only after the deferred invalidate has already executed ~500ms later. This creates two gaps in defense 3:

Writer arriving during task execution is dropped. When the repair task fires at t≈500ms it performs the L1 invalidations and then removes the coalesce key. A writer that bumps the epoch and runs its inline invalidate during the short window after the task's invalidate has run but before the finally removes the marker will get PENDING_L1_REPAIRS.add(...) == false, so no new repair task is scheduled — yet the just-completed repair ran before that writer's invalidate. That writer's racing loader can land a stale put with no deferred backstop, which is exactly the nanosecond window this mechanism exists to close.

Last write of a burst under-protected. Any write coalesced at t=499ms is covered only until the single task fires at t=500ms, well short of the intended full L1_REPAIR_DELAY_MS after that write.

The standard fix is to remove the marker at the start of task execution (before invalidating), so any write arriving during/after the repair re-schedules its own task. The epoch check (defense 2) and L1 TTL remain as backstops, so this is not catastrophic, but it narrows the very guarantee the coalescing change is meant to preserve.

...and 1 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot AI review requested due to automatic review settings June 10, 2026 06:51

yan-3005 added the safe to test Add this label to run secure Github workflows on PRs label Jun 10, 2026

Copilot started reviewing on behalf of yan-3005 June 10, 2026 06:51 View session