Skip to content

feat(search): consumer-field contract test + index mapping health check#28751

Closed
pmbrull wants to merge 21 commits into
mainfrom
pmbrull/search-indexing-audit
Closed

feat(search): consumer-field contract test + index mapping health check#28751
pmbrull wants to merge 21 commits into
mainfrom
pmbrull/search-indexing-audit

Conversation

@pmbrull

@pmbrull pmbrull commented Jun 5, 2026

Copy link
Copy Markdown
Member

Describe your changes:

fix #28849

Search denormalizes a set of fields (tags.tagFQN, tier, certification, owners, domains, fqnParts, testCaseResult.testCaseStatus, …) into every entity document, and RBAC, Data Quality, Incident Manager, Lineage, and Data Insights query them directly — so a mapping change that renames, retypes, or drops one breaks those features silently, with no compile- or boot-time failure. This PR hardens that area with a CI contract guard, a runtime "did we forget to reindex?" health check, and the mapping bug the guards surfaced.

Type of change:

  • Bug fix
  • Improvement

High-level design:

  • SearchConsumerFields — one source of truth for the consumer contract (denormalized leaf fields + required keyword type, and the core data-asset canary set).
  • SearchConsumerFieldContractTest (CI) — over every entity mapping × 4 languages: (1) the denormalized leaf fields keep their keyword type wherever present, and (2) core data-asset indexes expose the full denormalized set. Failures name the affected consumer. This guards the shipped mapping files at build time.
  • "Search Reindex Status" health check — a non-blocking StepValidation in SystemRepository.validateSystem() (the /system/validate Search section). It answers "is a reindex pending?" by comparing each entity's current code-mapping hash against the hash the deployed index was last built from (IndexMappingVersionTracker), and warns "reindex required" for any entity whose deployed index is stale. This replaces the earlier curated-subset live-field check, which only inspected 8 hard-coded fields on 10 hard-coded entities.
    • The hash signal is only trustworthy if the stored hash is written whenever an index is genuinely (re)built. Previously only the CLI reindex stamped it, so a UI or scheduled reindex left the table stale and the signal would false-positive. This PR fixes that root cause by stamping at every successful seam:
      • SearchIndexApp.execute() — stamps the reindexed entities when the run finishes COMPLETED (covers UI, scheduled, and CLI-triggered reindex); never stamps on ACTIVE_ERROR/FAILED/STOPPED.
      • SearchRepository.createIndexes() — stamps each entity actually recreated (fresh install / drop-create), never one whose recreate threw.
      • The now-redundant CLI-only stamp in OpenMetadataOperations is removed (the same SearchIndexApp it triggers now owns stamping).
    • Untracked entities (e.g. an existing deploy that upgrades in before its next reindex) are reported as a soft, non-failing note, not a failure. Missing indexes stay the responsibility of the existing Search-instance check.
    • A pure-function live field-by-field comparison was considered and deliberately dropped: with reliable stamping, a matching hash already implies the deployed fields are correct, and a live diff only catches a "lying stamp" (alias-swap, partial build, manual drift) at meaningful cost and false-positive surface. It can be added later as defense-in-depth.
  • Cleanup: removed the orphaned getIndexFieldNames plumbing (added for the now-replaced live-field check) from SearchRepository, the SearchClient interface, and the Elasticsearch/OpenSearch managers.
  • Fix: the jp/topic index mapping was missing top-level domains (the contract test caught it), so JP-locale topic search had lost domain-based RBAC and DQ/Incident filtering. Added to match en/ru/zh.

Tests:

Use cases covered

  • A mapping change that retypes/renames/drops a shared consumer field fails CI, naming the affected consumer.
  • /system/validate reports a non-blocking warning when a deployed index was built from an older code mapping (a reindex is pending), and reports clean once every index has been (re)built from the current mappings.
  • A failed/partial reindex never stamps, so the check cannot report "clean" on a stale index.

Unit tests

  • Added/updated:
    • IndexMappingVersionTrackerTestcomputeDrift classifies CURRENT/STALE/UNTRACKED; updateMappingVersions(Collection) stamps only the given subset (plus the existing change-detection coverage).
    • SystemRepositoryReindexStatusTest — pure classification (stale→pending, untracked→note, missing→skip, current→none, sorted output) and the operator-facing message.
    • SearchRepositoryBehaviorTest#createIndexesStampsOnlySucceededEntities — a partial recreate (one entity throws) stamps only the succeeded entity.
    • SearchConsumerFieldContractTest — unchanged, still green.
  • The old mock-heavy SystemRepositoryMappingConsistencyTest (asserted the removed canary logic) was deleted.

Backend integration tests

  • Not applicable — no new API endpoint; the validateSystem change is unit-covered. Recommended manual check on a live stack: run a reindex, confirm /system/status "Search Reindex Status" passes and index_mapping_versions is populated; tamper one stored hash and re-check to see that entity reported as reindex-pending.

Ingestion integration tests

  • Not applicable (no ingestion changes).

Playwright (UI) tests

  • Not applicable (no UI changes).

Manual testing performed

  1. mvn -pl openmetadata-service spotless:check → clean.
  2. mvn -pl openmetadata-spec install -DskipTests then mvn -pl openmetadata-service test -Dtest='IndexMappingVersionTrackerTest,SystemRepositoryReindexStatusTest,SearchRepositoryBehaviorTest,SearchConsumerFieldContractTest' → all green (13 + 7 + 110 + 2). (Build openmetadata-spec first so the contract test reads the current mappings, not a stale local jar.)

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: not needed — the only mapping change is an additive Elasticsearch index-mapping field; existing indexes pick it up on the next reindex, which the new health check now surfaces.
  • For UI changes: not applicable.
  • I have added tests (unit) and listed them above.

Guard the denormalized search-index fields that RBAC, Data Quality, Incident
Manager, Lineage, and Data Insights query (tags.tagFQN, tier, certification,
owners, domains, fqnParts, testCaseStatus, ...). Renaming, retyping, or dropping
one silently breaks those consumers with no compile- or boot-time failure.

- SearchConsumerFields: single source of truth for the consumer contract.
- SearchConsumerFieldContractTest: fails CI on a type change or a dropped field
  in any mapping (across all 4 languages), naming the affected consumers.
- "Index Mapping Consistency" StepValidation on /system/validate: non-blocking
  health check that reads the live deployed mapping and warns "reindex required"
  when a core data-asset index is missing these fields (e.g. stale after upgrade).
  Adds a read-only getIndexFieldNames to the SearchClient (Elasticsearch + OpenSearch).
- Fix jp/topic mapping missing top-level `domains` (found by the new test) —
  JP-locale topic search had lost domain-based RBAC/DQ filtering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Jun 5, 2026
…ract

Index a document into the real en/jp/ru/zh index mapping for topic and
test_case, then run the actual search/filter/aggregation each feature uses
(tag/tier/certification/domain term filters, owners nested RBAC query,
testCaseResult.testCaseStatus aggregation, fqnParts term) and assert it returns
the document. A failure reads as a broken feature in a specific language, not
"index is missing a mapping" — and would have caught the jp/topic domains drop.

Runs against a real OpenSearch testcontainer. Text analyzers are stripped before
index creation (the consumer fields are keyword/nested and analyzer-independent)
so each language index is creatable without analysis plugins while the real
field structure is still validated per language.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
30 of 55 jp index mappings referenced an analyzer the file does not define: an
incomplete om_analyzer <-> om_analyzer_jp split left field-level analyzer /
search_analyzer references dangling, so those jp indexes fail to create
("analyzer [...] has not been configured") whenever searchIndexMappingLanguage is
set to jp. It is invisible on en-default deployments, which never load these
files — the same reason the jp/topic domains drop went unnoticed.

Add the missing analyzer definition to each file, preserving every field's
existing reference: 29 files gain the generic om_analyzer (letter + lowercase +
om_stemmer, matching the other jp mappings) used by identifier fields, and
api_collection gains the kuromoji om_analyzer_jp used by its text fields. No field
reference is changed, so prose stays kuromoji-analyzed while identifier fields
(tagFQN, column names, custom-property text) stay on the generic analyzer,
consistent with en and the already-correct jp mappings. jp still requires the
analysis-kuromoji plugin. Surfaced by SearchConsumerFieldBehaviorIT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pmbrull pmbrull force-pushed the pmbrull/search-indexing-audit branch from 6016bad to 73282d0 Compare June 8, 2026 06:56
Build the integration-test OpenSearch container from an image with the language
analysis plugins installed (analysis-kuromoji for Japanese) via a shared
SearchTestImages helper, wired into both TestSuiteBootstrap and ContainerizedServer.
The plugins are installed unconditionally — an English-only run never references
them — so the search IT suite can be run for non-English mapping languages. That is
what would have caught the jp analyzer drift earlier: CI only ever ran en on a
vanilla image, where the jp mappings (kuromoji) could never even be created.

Add OpenSearchLanguageAnalyzerIT: boots the plugin image and creates the real
en/ru/jp topic index mappings with their own analyzers (no analysis stripping),
asserting analysis-kuromoji is installed and that the jp mappings — which reference
kuromoji and previously had dangling analyzer references — now create cleanly. zh
(third-party analysis-ik) is intentionally out of scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pmbrull and others added 15 commits June 8, 2026 13:09
…h IT

Run the per-language consumer-contract checks directly in the search IT, so the
whole integration suite does not need to run once per language. The container now
uses the analysis-plugin image (SearchTestImages), so en/ru/jp index with their
real analyzers (jp = kuromoji) — which also exercises jp analyzer resolution end to
end — and only zh (third-party analysis-ik, not installed) is validated
structurally with its analysis stripped. Fold the plugin-present and real-jp-index
creation checks into this IT and drop the standalone OpenSearchLanguageAnalyzerIT.

Aggregation assertions now compare bucket keys case-insensitively: real keyword
fields carry lowercase_normalizer (testCaseResult.testCaseStatus in every language,
tier.tagFQN in en only), so the bucket key is lowercased while the feature still
returns the asset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
en uses lowercase_normalizer on 45 keyword field paths (tier.tagFQN,
displayName.keyword, owners.name, dataProducts.*, ...) so those fields filter, sort
and aggregate case-insensitively. jp/ru/zh were missing the normalizer on those
fields, so the same Explore facets, RBAC rules and Data Quality widgets behaved
case-sensitively — and inconsistently with en — for non-English deployments.

Add lowercase_normalizer to the matching keyword fields in jp/ru/zh wherever en has
it (216 occurrences across 122 files). The normalizer is already defined in every
file, so this only adds references. Surfaced by SearchConsumerFieldBehaviorIT (tier
aggregation diverged en vs jp/ru).

Side effects: 3 files (jp/zh test_case, jp test_suite) had a duplicate
settings.index key collapsed. 57 further en-only normalizers could not be aligned
because the field is missing or non-keyword in the other language — a separate
structural drift left for follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the third-party IK plugin (analysis-ik, ik_max_word/ik_smart) to the IT
OpenSearch image alongside analysis-kuromoji, with the release URL derived from the
base image tag so it always matches the OpenSearch version under test. Chinese
mappings now index with their real analyzers in CI.

Drop the analysis stripping from SearchConsumerFieldBehaviorIT entirely: all four
languages now create and query with their real analyzers (en/ru built-in, jp
kuromoji, zh IK), and the test asserts both plugins are installed. Every per-language
search-consumer check now runs against real analyzers in one IT, with no need to run
the whole integration suite once per language.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e enum

Replace the hardcoded ["en","jp","ru","zh"] list with IndexMappingLanguage.values()
mapped through the same toString().toLowerCase() the loader uses to pick a per-language
mapping file. A newly supported language is then exercised automatically, and a
declared language with no mapping files fails fast in createIndex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rage

Extend SearchConsumerFieldBehaviorIT with the real queries the DQ dashboard, test
case list and Incident Manager run, validated across every language with real
analyzers:
- test_case index: nested testSuites.id filter, entityLink.nonNormalized per-entity
  execution-summary aggregation, dataQualityDimension and testPlatforms filters.
- test_case_resolution_status index (new): testCaseResolutionStatusType,
  testCaseResolutionStatusDetails.assignee.name, and the
  testCase.fullyQualifiedName.keyword / testCase.entityFQN.keyword filters.

Field paths and query shapes mirror the backend SearchListFilter / TestSuiteRepository
and the UI DQ/incident filter builders. 16 capabilities now run per language.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds MappingDriftState enum (CURRENT/STALE/UNTRACKED) and computeDrift()
method to classify each entity's index mapping against the stored hash,
enabling /system/validate to detect stale deployed indices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce `updateMappingVersions(Collection<String>)` overload and extract
shared `stamp()` helper so only the explicitly provided entity types are
persisted; the no-arg method continues to stamp all entities as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… tokens

Add a per-language _analyze check on the description field (kuromoji for jp, IK for
zh, stemmer for en/ru) asserting native-script text segments into the token the
analyzer must produce (東京 / 销售 / продаж / revenue). Keyword and identifier fields
are language-neutral and never exercise the text analyzers, so this is the only
assertion that the per-language analyzers — the whole reason the per-language
mappings exist — actually work. Tokens were verified against the real field
analyzers (e.g. ru stems продажи -> продаж; jp drops the の particle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… field check

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Track which entity indexes were successfully recreated in createIndexes()
and stamp their mapping versions via IndexMappingVersionTracker, skipping
any that threw during finalizeReindex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…led/CLI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…exApp owns it)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…removal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pmbrull pmbrull marked this pull request as ready for review June 9, 2026 06:42
Copilot AI review requested due to automatic review settings June 9, 2026 06:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Comment on lines +894 to +907
@VisibleForTesting
List<String> findOrphanIndexes(SearchRepository searchRepository) {
List<String> orphans = new ArrayList<>();
try {
OrphanedIndexCleaner cleaner = new OrphanedIndexCleaner();
for (OrphanedIndexCleaner.OrphanedIndex orphan :
cleaner.findOrphanedRebuildIndices(searchRepository.getSearchClient())) {
orphans.add(orphan.indexName());
}
} catch (Exception e) {
LOG.warn("Failed to check for orphan indexes: {}", e.getMessage());
}
return orphans;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: Orphan check labels in-flight rebuild indexes as "safe to clean"

findOrphanIndexes reuses OrphanedIndexCleaner.findOrphanedRebuildIndices, which flags any _rebuild_ index older than 30 minutes (MIN_AGE_MS) that currently has no alias. A long-running reindex (large datasets routinely exceed 30 minutes) produces a rebuild index that is older than the threshold but not yet alias-swapped. During that window the /system/validate message will report it as an "orphan index(es) with no alias (safe to clean)", which is misleading for an operator who may then delete an index that is actively being built. Consider cross-checking against active reindex jobs, or softening the "safe to clean" wording to acknowledge in-progress rebuilds.

Sources: SystemRepository.java:894-907,935-940; OrphanedIndexCleaner.java:39,59-72.

Was this helpful? React with 👍 / 👎

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Comment on lines +856 to +858
boolean reindexNeeded() {
return !stalePending.isEmpty() || !missingIndexes.isEmpty();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: Degraded cluster health does not fail the Reindex Status step

reindexNeeded() (SystemRepository.java:856-858) is defined as !stalePending.isEmpty() || !missingIndexes.isEmpty() and the step result uses step.withPassed(!status.reindexNeeded()) (line 991). clusterHealthy is intentionally excluded, so when the search cluster is degraded the step still reports passed=true while the message appends "WARNING: search cluster health is degraded." (appendClusterState, lines 931-933).

This is defensible for a check named "Reindex Status" (a degraded cluster does not imply a reindex is pending), but operators/automation that key off the boolean passed flag rather than parsing the message text will treat a degraded cluster as fully healthy. If that surfacing is intended, no change is needed; otherwise consider either surfacing cluster health as its own StepValidation or having the consumer of /system/validate treat the WARNING text as non-green. Flagging as minor since the warning is still present in the message.

Was this helpful? React with 👍 / 👎

@pmbrull

pmbrull commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Split into two focused PRs:

Closing this combined draft in favor of those two.

@pmbrull pmbrull closed this Jun 9, 2026
@gitar-bot

gitar-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown
Code Review 👍 Approved with suggestions 5 resolved / 8 findings

Implements a robust search indexing contract and health check, ensuring schema consistency across languages while surfacing pending reindex states. Please address the minor findings regarding cluster health warnings and orphan index classification to ensure they correctly impact the system validation status.

💡 Edge Case: Degraded search cluster reported as warning but step still passes

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:835-838 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:909-919

In getSearchValidation, the step's pass/fail is withPassed(missingIndexes.isEmpty()). A degraded cluster (isSearchClusterHealthy returns false, i.e. Elasticsearch/OpenSearch reports RED) and orphan indexes are only appended to the message text — they never flip passed to false. A RED search cluster is arguably as severe as (or worse than) a missing index, yet /system/validate will still mark the Search step as passed=true. Automated consumers that key off the boolean rather than parsing the message string will miss a degraded cluster. If this is intentional (non-blocking by design, as the PR states), consider documenting it; otherwise factor cluster health into the passed condition.

Sources: SystemRepository.java:835-838, 909-919.

💡 Edge Case: Orphan check labels in-flight rebuild indexes as "safe to clean"

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:894-907 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:935-940

findOrphanIndexes reuses OrphanedIndexCleaner.findOrphanedRebuildIndices, which flags any _rebuild_ index older than 30 minutes (MIN_AGE_MS) that currently has no alias. A long-running reindex (large datasets routinely exceed 30 minutes) produces a rebuild index that is older than the threshold but not yet alias-swapped. During that window the /system/validate message will report it as an "orphan index(es) with no alias (safe to clean)", which is misleading for an operator who may then delete an index that is actively being built. Consider cross-checking against active reindex jobs, or softening the "safe to clean" wording to acknowledge in-progress rebuilds.

Sources: SystemRepository.java:894-907,935-940; OrphanedIndexCleaner.java:39,59-72.

💡 Edge Case: Degraded cluster health does not fail the Reindex Status step

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:856-858 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:931-933 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:989-991

reindexNeeded() (SystemRepository.java:856-858) is defined as !stalePending.isEmpty() || !missingIndexes.isEmpty() and the step result uses step.withPassed(!status.reindexNeeded()) (line 991). clusterHealthy is intentionally excluded, so when the search cluster is degraded the step still reports passed=true while the message appends "WARNING: search cluster health is degraded." (appendClusterState, lines 931-933).

This is defensible for a check named "Reindex Status" (a degraded cluster does not imply a reindex is pending), but operators/automation that key off the boolean passed flag rather than parsing the message text will treat a degraded cluster as fully healthy. If that surfacing is intended, no change is needed; otherwise consider either surfacing cluster health as its own StepValidation or having the consumer of /system/validate treat the WARNING text as non-green. Flagging as minor since the warning is still present in the message.

✅ 5 resolved
Edge Case: Health check treats unreadable/empty live mappings as healthy

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:920-934 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java:67-81
getIndexFieldNames swallows any exception and returns an empty set, and collectMissingConsumerFields skips an index entirely when liveFields.isEmpty(). This intentionally avoids double-reporting a not-yet-created index, but it also means a transient failure reading a deployed index's mapping (cluster timeout, permission error, partial outage) is indistinguishable from "index absent" — both result in the index being skipped and the overall step reporting passed=true ("All core data asset indexes expose the fields..."). A genuinely stale/broken index can therefore be masked behind a green check whenever the mapping read fails. Since this is a non-blocking advisory check the impact is limited, but consider distinguishing "index missing" from "mapping read failed" (e.g. surface a separate "could not verify N indexes" note) so transient errors don't produce a false all-clear.

Edge Case: jp/api_collection repointed to English analyzer, not kuromoji

📄 openmetadata-spec/src/main/resources/elasticsearch/jp/api_collection_index_mapping.json:27-35 📄 openmetadata-spec/src/main/resources/elasticsearch/jp/api_collection_index_mapping.json:93-95 📄 openmetadata-spec/src/main/resources/elasticsearch/jp/api_collection_index_mapping.json:119-121 📄 openmetadata-spec/src/main/resources/elasticsearch/jp/api_collection_index_mapping.json:142-144
This commit repoints analyzer references the right way for almost every jp index: those files define om_analyzer_jp (kuromoji) in settings.analysis.analyzer but had dangling field references to om_analyzer, so they were correctly changed to om_analyzer_jp. jp/api_collection_index_mapping.json is the lone exception — it went the opposite direction (om_analyzer_jp -> om_analyzer) on name, displayName, and description.

Root cause: unlike every other jp mapping, api_collection's settings block was never localized — it still defines om_analyzer with the English kstem stemmer (lines 27-56) rather than om_analyzer_jp with the kuromoji tokenizer/filters that topic, container, glossary, etc. define. Because the only analyzer actually declared in this file is om_analyzer, the fix made the field references consistent with the (un-localized) settings rather than fixing the settings.

Impact: the dangling-reference fix is correct (referencing an undefined om_analyzer_jp would have failed index creation), but the result is that JP full-text search over API Collections (name/displayName/description) silently uses English tokenization instead of Japanese — inconsistent with the rest of the jp locale. The denormalized consumer fields (keyword/nested) are unaffected, so RBAC/DQ still work; this is a search-quality regression for Japanese text on this one index.

Suggested fix: localize api_collection's settings to define om_analyzer_jp (kuromoji) matching the other jp indexes, and keep the three field references at om_analyzer_jp rather than downgrading them to om_analyzer.

Quality: Plugin install adds a build-time network dependency to all IT runs

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/server/SearchTestImages.java:31-44 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/bootstrap/TestSuiteBootstrap.java:353-355 📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/server/ContainerizedServer.java:226-228
SearchTestImages.openSearchWithAnalysisPlugins is now wired into the main IT bootstrap (TestSuiteBootstrap.startSearch) and the full production-stack runner (ContainerizedServer.newOpenSearch), not just the new language-analyzer test. Because it builds an image whose Dockerfile runs opensearch-plugin install --batch analysis-kuromoji, every OpenSearch-based integration test run now requires the OpenSearch plugin artifact repository to be reachable at image-build time. If that registry is unavailable (offline/air-gapped CI, transient outage), the image build fails and the entire OpenSearch IT suite fails to start — not only the per-language analyzer coverage. The docstring documents installing kuromoji unconditionally as intentional, but coupling the whole suite's startup to an external download is a new failure mode. Consider gating the plugin-augmented image behind a flag/system property (defaulting to the language IT only), or pre-baking the plugin into a published base image so routine IT runs don't depend on the plugin registry being online.

Performance: validateSystem issues up to 10 uncached getMapping round-trips

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:871-885
getMappingConsistencyValidation is invoked on every /system/validate call and performs one synchronous getMapping network round-trip per entry in CANARY_DATA_ASSET_ENTITIES (10 indexes), with no caching. This is acceptable for a manually triggered admin endpoint, but if /system/validate is ever polled by monitoring/health tooling the added latency and ES/OS load become noticeable. Consider a short-lived cache of the result (similar to the existing HEALTH_CHECK_CACHE_MS pattern in the search clients) or documenting that this endpoint is admin-only/low-frequency.

Edge Case: Contract test misses denormalized object retyped to a scalar

📄 openmetadata-service/src/test/java/org/openmetadata/service/search/SearchConsumerFieldContractTest.java:105-119 📄 openmetadata-service/src/test/java/org/openmetadata/service/search/SearchConsumerFieldContractTest.java:126-139 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchConsumerFields.java:44-58
The two guards have a blind spot for the most likely "retype" regression on the top-level denormalized objects (tags, tier, domains, certification). coreDataAssetIndexesExposeConsumerFields only checks properties.has(required) — key presence, no type. consumerLeafFieldsKeepTheirTypeWherePresent validates leaf paths like tags.tagFQN / domains.fullyQualifiedName, but findField walks into node.path("properties") for non-terminal segments; if the parent object is retyped to a scalar (e.g. tags changed from an object to {"type":"keyword"}), it has no properties, the loop stops, found is null, and the field is silently treated as "not present" (skipped). Meanwhile the CANARY presence check still passes because the tags key exists. So a mapping change that flattens tags/tier/domains/certification from an object to a scalar — exactly the kind of silent break the contract is meant to catch for RBAC/DQ/Lineage — passes CI green. Consider asserting the parent object's structure (e.g. that the field declares properties / is not typed as a scalar) for these top-level denormalized objects, or distinguishing "path absent" from "path truncated by an unexpected scalar parent" in findField so the latter is reported as a violation.

🤖 Prompt for agents
Code Review: Implements a robust search indexing contract and health check, ensuring schema consistency across languages while surfacing pending reindex states. Please address the minor findings regarding cluster health warnings and orphan index classification to ensure they correctly impact the system validation status.

1. 💡 Edge Case: Degraded search cluster reported as warning but step still passes
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:835-838, openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:909-919

   In `getSearchValidation`, the step's pass/fail is `withPassed(missingIndexes.isEmpty())`. A degraded cluster (`isSearchClusterHealthy` returns false, i.e. Elasticsearch/OpenSearch reports RED) and orphan indexes are only appended to the message text — they never flip `passed` to false. A RED search cluster is arguably as severe as (or worse than) a missing index, yet `/system/validate` will still mark the Search step as passed=true. Automated consumers that key off the boolean rather than parsing the message string will miss a degraded cluster. If this is intentional (non-blocking by design, as the PR states), consider documenting it; otherwise factor cluster health into the passed condition.
   
   Sources: SystemRepository.java:835-838, 909-919.

2. 💡 Edge Case: Orphan check labels in-flight rebuild indexes as "safe to clean"
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:894-907, openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:935-940

   `findOrphanIndexes` reuses `OrphanedIndexCleaner.findOrphanedRebuildIndices`, which flags any `_rebuild_` index older than 30 minutes (`MIN_AGE_MS`) that currently has no alias. A long-running reindex (large datasets routinely exceed 30 minutes) produces a rebuild index that is older than the threshold but not yet alias-swapped. During that window the `/system/validate` message will report it as an "orphan index(es) with no alias (safe to clean)", which is misleading for an operator who may then delete an index that is actively being built. Consider cross-checking against active reindex jobs, or softening the "safe to clean" wording to acknowledge in-progress rebuilds.
   
   Sources: SystemRepository.java:894-907,935-940; OrphanedIndexCleaner.java:39,59-72.

3. 💡 Edge Case: Degraded cluster health does not fail the Reindex Status step
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:856-858, openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:931-933, openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:989-991

   `reindexNeeded()` (SystemRepository.java:856-858) is defined as `!stalePending.isEmpty() || !missingIndexes.isEmpty()` and the step result uses `step.withPassed(!status.reindexNeeded())` (line 991). `clusterHealthy` is intentionally excluded, so when the search cluster is degraded the step still reports `passed=true` while the message appends "WARNING: search cluster health is degraded." (appendClusterState, lines 931-933).
   
   This is defensible for a check named "Reindex Status" (a degraded cluster does not imply a reindex is pending), but operators/automation that key off the boolean `passed` flag rather than parsing the message text will treat a degraded cluster as fully healthy. If that surfacing is intended, no change is needed; otherwise consider either surfacing cluster health as its own StepValidation or having the consumer of `/system/validate` treat the WARNING text as non-green. Flagging as minor since the warning is still present in the message.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud

sonarqubecloud Bot commented Jun 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (12 flaky)

✅ 4268 passed · ❌ 0 failed · 🟡 12 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 301 0 0 4
🟡 Shard 2 803 0 1 9
🟡 Shard 3 802 0 3 8
🟡 Shard 4 852 0 3 12
🟡 Shard 5 720 0 1 47
🟡 Shard 6 790 0 4 8
🟡 12 flaky test(s) (passed on retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/SettingsNavigationPage.spec.ts › should support drag and drop reordering of navigation items (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on pipeline (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Should verify property name is visible for apiCollection in right panel (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Entity Reference (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 2 retries)
  • Pages/ExplorePageRightPanel.spec.ts › Should allow Data Consumer to edit description for searchIndex (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for apiEndpoint -> dashboard (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Search - Add index drift validation & multi-lang tests

2 participants