Skip to content

fix(cache): close L1 stale-after-write race exposed by Redis CI variant#28902

Merged
yan-3005 merged 5 commits into
mainfrom
fix/l1-cache-stale-after-update
Jun 10, 2026
Merged

fix(cache): close L1 stale-after-write race exposed by Redis CI variant#28902
yan-3005 merged 5 commits into
mainfrom
fix/l1-cache-stale-after-update

Conversation

@yan-3005

@yan-3005 yan-3005 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Fixes #28912

Summary

Fixes the L1 cache stale-after-write race that the postgres-elasticsearch-redis integration test job has been catching across unrelated PRs since early June.

Across 4 CI runs traced (June 9 + June 10), 6 distinct test failures all share one root cause: a Guava LoadingCache loader in flight when a writer's invalidate() runs still puts its pre-write value into L1 after the load completes — documented Guava behavior. Subsequent reads on the same pod then serve stale data until L1 TTL.

Why now

The L1 race is older than #28675 — the team has known about it and partially patched it:

What widened the window enough to tip CI 1–2 tests per run:

What this PR changes

Three layered defenses, all in EntityRepository.java. No public API changes, no schema changes, no new dependencies.

1. NotFoundCache check on L1 hit

find(UUID) and findByName(String) now consult NotFoundCache on L1 hit (not just L1 miss). If the entity is marked deleted but L1 still has a poisoned value from a racing loader, evict and throw EntityNotFoundException.

Gated on readEpochById/ByName != 0L so the common, uncontended case keeps the zero-Redis L1 fast path. The marker is only consulted for keys a writer has ever touched — exactly the population where the race can occur.

2. Per-entity write-epoch counter in loaders

Bounded Guava caches of AtomicLong per (entityType, id) and (entityType, fqn). Writers (invalidateCachesAfterStore, invalidate(T), invalidateCacheForEntity) bump the epoch BEFORE invalidating L1. Loaders capture the epoch at start, re-check at end, throw LoaderRaceException on mismatch — Guava skips caching on throw. find() / findByName() catch the wrapped cause and re-read via the explicit-bypass path.

FQN bumps are written under both the raw and quoteNamed forms. Repositories with quoteFqn = true (Glossary, Team, User, Classification, etc.) quote the input FQN before the loader records its epoch; bumping only the raw form would leave the loader looking at a different key. Both bumps are idempotent — the alternate form is a no-op when no loader recorded it.

3. Deferred async L1 re-invalidate

Daemon-thread ScheduledExecutorService (2 threads, named entity-cache-l1-repair) re-invalidates the key 500ms after every writer's inline invalidate. Catches any loader whose end-of-load epoch check passed but whose Guava-internal put landed AFTER the writer's inline invalidate — the residual nanosecond window from defense 2.

Failure variants covered

Failure Family Caught by
MlModelResourceIT.test_entityStatus Update-side 2 + 3
GlossaryResourceIT.test_entityStatus Update-side 2 + 3 (relies on the dual quoteFqn bump)
MetadataServiceResourceIT.post_delete_entityWithOwner_200 Delete-side 1 + 3
ContextFileIT.testHardDeleteFileIsAsync Delete-side 1 + 3
DriveFileUploadIT.testUploadPdfToMinIO Update-side (async status) 2 + 3
RdfGlossaryGraphIT.scopedResponseCarriesGlossaryNameAndIdPerNode Likely same family (not traced end-to-end) 1 + 3 if delete-flavor, 2 + 3 if update-flavor

Cost

  • Memory: bounded epoch caches (~13 MB worst case at 200k entries × 2 caches × ~32 bytes each).
  • Per loader call: one extra Guava getIfPresent (sub-µs).
  • Per writer: one AtomicLong.incrementAndGet, one schedule (~1 µs).
  • Per L1-hit NON_DELETED read on a key a writer has ever touched: one NotFoundCache Redis GET. Uncontended L1 hits keep the zero-Redis fast path.
  • Two daemon threads on a tiny queue.

Review-cycle notes

First push (b859d412ef) failed CI build because the local incremental compile cached over two missing imports (com.google.common.cache.Cache, java.util.concurrent.atomic.AtomicLong) that a clean build catches. Second push (513b074a35) adds those plus addresses three legitimate findings from the bot reviewer:

  1. Performance gate on the L1-hit NotFoundCache check (Fix 1 above).
  2. Empty-catch / catch (Exception) cleanup — narrowed to RejectedExecutionException / RuntimeException and added debug logs.
  3. quoteFqn epoch-key mismatch fix (Fix 2 above).

mvn clean compile runs clean locally now; spotless clean.

Test plan

  • CI: integration-tests-postgres-elasticsearch-redis passes 3 runs in a row
  • CI: integration-tests-mysql-elasticsearch stays green (verifies no regression on non-Redis variant)
  • CI: integration-tests-postgres-opensearch stays green
  • Spotless / Java checkstyle clean
  • Build + unit tests green

Honest residual risk

A loader that takes longer than 500ms to complete (e.g., a DB call near the JDBI timeout) could land its put after the deferred re-invalidate. The next read would serve stale until L1 TTL (default 30s). Probability: rare DB stalls only. If that becomes the dominant flake source post-merge, the right next step is replacing LoadingCache with Caffeine AsyncCache + atomic compute — a bigger refactor that should be a separate PR.

Asks

The postgres-elasticsearch-redis IT job has been flaking on multiple
test families (test_entityStatus, post_delete_entityWithOwner_200,
testHardDeleteFileIsAsync, testUploadPdfToMinIO) — all the same race:
a Guava LoadingCache loader in flight when a writer's invalidate runs
can still put its pre-write value into L1 (documented Guava behavior),
so subsequent reads on the same pod serve stale data until L1 TTL.

The race has been known and partially patched before (#28197 added the
NotFoundCache marker for delete; #28100 closed a rename-cascade variant).
PR #28675's per-entity async indexing dispatcher + out-of-transaction
invalidate widened it enough to tip CI 1-2 tests per run. The L1 race
itself is not Redis-specific; the Redis IT variant just exercises more
concurrent read paths, so it surfaces what other variants don't.

Three layered defenses, all in EntityRepository:

1. NotFoundCache check on L1 hit (find / findByName). The existing
   delete-side fix only consulted the marker on L1 miss; a poisoned
   L1 entry shadowed it. Now if L1 has a value but the marker says
   deleted, we evict and throw — fully closes the delete-side race.

2. Per-entity write-epoch counter. Bounded Guava caches of AtomicLong
   per (type,id) and (type,fqn). Writers bump before invalidating L1.
   Loaders capture start epoch, re-check at end, throw a private
   LoaderRaceException on mismatch — Guava skips caching on throw.
   find / findByName catch the wrapped cause and re-read via the
   explicit-bypass path. Closes ~99% of the update-side race.

3. Deferred async L1 re-invalidate. Daemon-thread ScheduledExecutor
   re-invalidates the key 500ms after every writer's inline invalidate,
   catching any loader whose end-of-load check passed but whose
   Guava-internal put landed AFTER the inline invalidate. Covers the
   residual nanosecond window from defense 2.

Cost: bounded epoch cache (~13MB worst case), one extra Guava lookup
per loader call, one NotFoundCache Redis GET per L1-hit NON_DELETED
read (restores a previously-reverted shape), one daemon-thread schedule
per writer. No API or schema changes.
Copilot AI review requested due to automatic review settings June 10, 2026 06:51
@yan-3005 yan-3005 added the safe to test Add this label to run secure Github workflows on PRs label Jun 10, 2026
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens EntityRepository’s L1 Guava caching against Guava’s documented stale-after-write behavior by preventing racing loaders from repopulating L1 with pre-write JSON and by ensuring delete tombstones are honored even when L1 hits.

Changes:

  • Add per-entity write-epoch counters and loader-side epoch checks that throw on concurrent writes to prevent stale loader results from being cached in L1.
  • Add a deferred L1 “repair” eviction (scheduled re-invalidate) to close the residual loader-put-vs-invalidate race window.
  • Re-check NotFoundCache on L1 hits for find(UUID) / findByName(String) to deterministically surface deletes even if L1 is poisoned.

yan-3005 added 2 commits June 10, 2026 12:39
…er catches

CI build caught missing imports (Cache, AtomicLong) that local incremental
build missed. Also addresses bot-reviewer findings:

1. Gate the NotFoundCache check on L1 hit behind readEpochById/ByName != 0L,
   so uncontended reads keep the zero-Redis fast path. The marker is only
   consulted for keys a writer has ever touched.
2. Bump the write-epoch under both raw and quoted FQN forms — findByName
   quotes the input before the loader records its epoch, so writers that
   passed only the raw form (Glossary, Team, User, Classification etc.
   where quoteFqn=true) had their bumps silently ignored by the loader.
3. Narrow catch(Exception) to RejectedExecutionException / RuntimeException
   and add debug logs to the previously-empty ExecutionException catches.
4. FQN java.lang.Thread (org.openmetadata.schema.entity.feed.Thread
   collides on simple name).
…path

Copilot reviewer flagged that cacheNameKey lowercases USER FQNs for L1,
but the new on-L1-hit NotFoundCache check was reading with the raw fqn.
A mixed-case user delete (writer markers off the canonical lowered form)
would not be observable from a mixed-case read. Use cacheKey.getRight()
for the marker lookups on both branches (L1 miss + L1 hit).
Copilot AI review requested due to automatic review settings June 10, 2026 07:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

…etry, shorten comments

Address two bot-reviewer findings:

1. scheduleL1Repair was enqueueing one task per write; under bulk write load that
   piles up. Coalesce on per-entity key so at most one pending repair exists per
   key at a time. Clear the marker on submission rejection so the next write can
   re-schedule.
2. NotFoundCache marker WRITES in findByName still used raw fqn after the READ
   side switched to canonical fqn — mixed-case USER lookups silently defeated
   negative-caching. Use canonical fqn consistently on both read and write paths.

Slash verbose multi-paragraph code comments down to short whys. The longer
rationales live in the PR description and commit messages.
Two bot findings:
1. Clearing the coalesce marker in finally (after invalidates run) blocks a
   writer arriving during task execution from scheduling its own backstop.
   Clear at task start so concurrent writes re-schedule and get the full
   delay-window protection.
2. Coalesce key was (entityType, id) only — a rename within the delay window
   was dropped because the pending task only knew the prior fqn. Include
   fqn in the coalesce key so renames get their own task.
Copilot AI review requested due to automatic review settings June 10, 2026 07:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

@sonarqubecloud

Copy link
Copy Markdown

@yan-3005 yan-3005 enabled auto-merge (squash) June 10, 2026 09:16
@yan-3005 yan-3005 merged commit 315f015 into main Jun 10, 2026
67 of 70 checks passed
@yan-3005 yan-3005 deleted the fix/l1-cache-stale-after-update branch June 10, 2026 10:20
@github-actions

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (17 flaky)

✅ 4267 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 299 0 2 4
🟡 Shard 2 804 0 2 9
🟡 Shard 3 805 0 3 8
🟡 Shard 4 839 0 4 12
🟡 Shard 5 719 0 2 47
🟡 Shard 6 801 0 4 8
🟡 17 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Container - customization should work (shard 1, 1 retry)
  • Flow/Metric.spec.ts › Metric creation flow should work (shard 1, 1 retry)
  • Features/Glossary/GlossaryP3Tests.spec.ts › should handle multiple rapid API calls (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Resolving incident & re-run pipeline (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Alert operations for a user with and without permissions (shard 4, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner Add, Update and Remove (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Team as Owner Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/ExploreTree.spec.ts › Explore Tree (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for mlModel -> dashboard (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Users.spec.ts › Reset Password for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@gitar-bot

gitar-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown
Code Review ✅ Approved 6 resolved / 6 findings

Implements a multi-layered defense to resolve L1 cache stale-after-write race conditions in EntityRepository. Addresses the cache key asymmetry for quoted FQNs, replaces generic error handling with specific exceptions, and introduces an epoch-gated NotFoundCache check on L1 hits.

✅ 6 resolved
Performance: NotFoundCache Redis GET added to every NON_DELETED L1 hit

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:1659-1673 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2299-2306
Defense #1 adds notFoundCache.isMarkedNotFoundById/ByName(...) to the L1-hit branch of both find(UUID) and findByName(String). As verified, those NotFoundCache methods delegate to the cache provider and perform a Redis network round trip per call. The hot-path comment immediately above (lines 1641-1644) states that an L1 hit serves the entity with zero Redis traffic; this change negates that for every NON_DELETED read, turning the most frequent read path into one Redis GET per request. On read-heavy workloads this is a meaningful latency/throughput regression on a path that was specifically optimized (the PR notes the unconditional shape was previously reverted for exactly this cost). The correctness benefit only applies to the rare delete-race poison case. Consider gating this extra GET behind a cheaper signal (e.g., only when a recent epoch bump for the key is observed, or sampling/short local TTL on the negative marker) so the common case keeps its zero-Redis fast path while still closing the race.

Quality: Error handling violates team rules: empty catch + catch(Exception)

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:383-385 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:390-392 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:459-462 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:466-470
The team's error-handling conventions forbid empty catch blocks and catch (Exception e). This diff introduces both: bumpWriteEpoch has two effectively-empty catch (ExecutionException e) blocks (only a comment, no statement or rethrow), and scheduleL1Repair uses two broad catch (Exception e) blocks (one inside the scheduled task, one around schedule(...)). While the intent (AtomicLong::new cannot throw; pool rejection is non-fatal) is reasonable, please conform to the conventions: narrow the caught type and/or add at least a debug log so the empty blocks are not silent. The broad catch (Exception e) around schedule should be narrowed to RejectedExecutionException.

Bug: Write-epoch name key may not match loader key (quoteFqn entities)

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:377-391 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:4794-4799 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:10438-10443 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2257-2258
The name-side epoch guard only works if writers and the loader use an identical cacheNameKey value. Writers call bumpWriteEpoch(entityType, id, entity.getFullyQualifiedName()) with the entity's raw stored FQN (e.g. invalidate(T) at 4797), whereas findByName quotes the incoming fqn (fqn = quoteFqn ? quoteName(fqn) : fqn;) before the loader records its epoch via readEpochByName(fqnPair). For the ~10 repositories with quoteFqn = true (Team, User, Glossary, Classification, etc.) the loader's epoch key is built from the quoted form while the writer bumps the unquoted form. If those two strings differ for a given FQN, defense #2 silently no-ops on the name path (the loader's start/end epoch reads both return 0 and never observe the writer's bump), leaving only the deferred repair + TTL. The ID-based epoch (UUID key) is unaffected, so impact is limited, but the name-path guard should key off the same normalized/quoted value the loader uses to be effective.

Quality: NotFoundCache marker write/read key asymmetry for USER FQNs

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2293 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2303 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2325 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2334 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:2362
This commit changes only the two L1 read-path checks (lines 2325 and 2334) to look up the NotFoundCache marker under canonicalFqn (the lowercased form cacheNameKey produces for Entity.USER). However, the marker writes this same method performs — markNotFoundByName(entityType, fqn) at line 2362 (ExecutionException/EntityNotFound handler) and line 2303 (bypass path), as well as the bypass-path check at line 2293 — still use the raw fqn.

For a USER entity looked up with a mixed-case FQN (e.g. John.Doe): the loader throws EntityNotFound, line 2362 writes the marker under raw John.Doe, but the next hot-path read at line 2325 consults the marker under john.doe (canonical) and misses, so the loader runs and hits the DB again. The negative cache is silently defeated for exactly the mixed-case USER case the diff comment says it is fixing. This is an efficiency regression (extra DB SELECTs), not a correctness bug, since a marker miss just falls through to a safe DB lookup.

Suggested fix: canonicalize consistently on both sides. Either pass canonicalFqn (i.e. cacheKey.getRight()) to the markNotFoundByName calls and the bypass-path check, or canonicalize once inside NotFoundCache/CacheKeys.notFoundByName. Note the delete-side writer (markEntityNotFound) keys off entity.getFullyQualifiedName(), which only matches canonicalFqn if USER FQNs are always stored lowercase — worth confirming so the three defenses agree on one canonical key form.

Edge Case: L1-repair coalescing marker cleared too late, leaving a gap

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:431-445
In the new coalescing guard, the marker is added before scheduling and removed in the task's finally block (line 452), i.e. only after the deferred invalidate has already executed ~500ms later. This creates two gaps in defense 3:

  1. Writer arriving during task execution is dropped. When the repair task fires at t≈500ms it performs the L1 invalidations and then removes the coalesce key. A writer that bumps the epoch and runs its inline invalidate during the short window after the task's invalidate has run but before the finally removes the marker will get PENDING_L1_REPAIRS.add(...) == false, so no new repair task is scheduled — yet the just-completed repair ran before that writer's invalidate. That writer's racing loader can land a stale put with no deferred backstop, which is exactly the nanosecond window this mechanism exists to close.

  2. Last write of a burst under-protected. Any write coalesced at t=499ms is covered only until the single task fires at t=500ms, well short of the intended full L1_REPAIR_DELAY_MS after that write.

The standard fix is to remove the marker at the start of task execution (before invalidating), so any write arriving during/after the repair re-schedules its own task. The epoch check (defense 2) and L1 TTL remain as backstops, so this is not catastrophic, but it narrows the very guarantee the coalescing change is meant to preserve.

...and 1 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Redis IT job flakes on L1 cache stale-after-write race

3 participants