[fix](variant) Keep first duplicate Variant JSON path#63082
[fix](variant) Keep first duplicate Variant JSON path#63082eldenmoon wants to merge 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a backend config to change how VARIANT parsing handles duplicate JSON object keys: when enabled, it keeps the first occurrence instead of erroring on duplicate paths. It also adds test coverage for this behavior, while removing several existing VARIANT nested regression suites and adding a new export/load regression for wide VARIANT data.
Changes:
- Add BE config
variant_enable_duplicate_json_path_check(default off) and implement “keep first duplicate key” behavior during JSON traversal for VARIANT extraction. - Add BE unit tests and a VARIANT regression test + data file to validate duplicate-key behavior (insert + stream load).
- Add an export/load regression test for VARIANT with many potential subcolumns; remove multiple VARIANT nested regression suites.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/nested2.groovy | Deleted an existing VARIANT nested-type conflict regression suite. |
| regression-test/suites/variant_p0/nested/sql/q01.sql | Deleted SQL outputs for a removed nested VARIANT load suite. |
| regression-test/suites/variant_p0/nested/nested_in_top_array.groovy | Deleted a regression suite covering nested-in-top-array VARIANT behavior. |
| regression-test/suites/variant_p0/nested/load.groovy | Deleted a (mostly commented) nested VARIANT load regression suite. |
| regression-test/suites/variant_p0/nested.groovy | Deleted a large VARIANT nested regression suite. |
| regression-test/suites/variant_p0/duplicate_json_path.groovy | Added regression coverage for keeping the first duplicate JSON key when the new BE config is enabled. |
| regression-test/suites/export_p2/test_export_variant_10k_columns.groovy | Added export→S3→reload regression for wide/sparse VARIANT; currently has correctness and cleanup issues. |
| regression-test/data/variant_p0/duplicate_json_path.json | Added stream-load input with duplicate JSON keys. |
| be/test/storage/segment/variant_util_test.cpp | Added unit tests validating “keep first duplicate key” behavior in both subcolumn-only and doc-mode parsing. |
| be/src/util/json/json_parser.cpp | Added duplicate-key skipping logic during object traversal when the config is enabled. |
| be/src/common/config.h | Declared the new BE config flag. |
| be/src/common/config.cpp | Defined the new BE config flag with default false. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!config::variant_enable_duplicate_json_path_check) { | ||
| for (auto it = object.begin(); it != object.end(); ++it) { | ||
| const auto& [key, value] = *it; | ||
| traverse_object_member(key, value); | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| std::unordered_set<PathInData, PathInData::Hash> visited_member_paths; | ||
| visited_member_paths.reserve(object.size()); | ||
| for (auto it = object.begin(); it != object.end(); ++it) { | ||
| const auto& [key, value] = *it; | ||
| check_key_length(key); | ||
| ctx.builder.append(key, false); | ||
| PathInData member_path(ctx.builder.get_parts()); | ||
| if (!visited_member_paths.insert(std::move(member_path)).second) { | ||
| ctx.builder.pop_back(); | ||
| continue; | ||
| } | ||
| traverse(value, ctx); | ||
| ctx.builder.pop_back(); | ||
| } |
| suite("duplicate_json_path", "p0") { | ||
| def customBeConfig = [ | ||
| variant_enable_duplicate_json_path_check: true | ||
| ] | ||
| setBeConfigTemporary(customBeConfig) { |
a7bc101 to
abc6911
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional blocking correctness issue beyond the already-known review threads. The enabled duplicate-path check still allows flat dotted keys and nested object paths to produce the same downstream Variant path string, so the load/WAL failure this PR is meant to prevent can still occur for inputs such as {"a.b":1,"a":{"b":2}}.
Critical checkpoint conclusions:
- Goal/test: The goal is to optionally keep the first duplicate Variant JSON path and avoid duplicate/conflicting subcolumn paths; current tests cover repeated object member names but not dotted-key versus nested-path conflicts, and the implementation does not fully accomplish the goal.
- Scope: The change is small and focused, but the local duplicate key check is too narrow for Variant path identity.
- Concurrency/lifecycle: No new concurrency or special lifecycle management is introduced; the mutable config follows existing BE config patterns.
- Configuration: A disabled-by-default mutable BE config is added; behavior is observed at parse time.
- Compatibility: No storage/protocol format change was found.
- Parallel paths: Existing review thread already covers the JSONB conversion path separately; this comment covers a distinct subcolumn/doc-value path conflict.
- Tests: Added unit/regression coverage misses the dotted-key conflict case; no tests were run by me in this automated review.
- Observability/performance: No additional observability requirement found; the extra check is not on an obviously problematic hot path when disabled.
- User focus: No additional user-provided review focus was supplied.
961fd01 to
99aa5b6
Compare
|
run buildall |
|
/review |
### What problem does this PR solve?
Issue Number: CIR-19941
Related PR: None
Problem Summary: Variant JSON parsing could emit duplicate or conflicting downstream subcolumn paths when duplicate object keys, flat dotted keys, nested object paths, or array/scalar leaf values normalize to the same Variant path. This could later fail load or WAL replay with subcolumn type size mismatch. Add a disabled-by-default BE config to enable duplicate JSON path checking; when enabled, duplicate leaf paths keep the first value while object entries that expand to distinct child paths are preserved. The duplicate-check setting is captured in ParseConfig so parser and materialization use the same per-call behavior.
### Release note
Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate Variant JSON leaf path when enabled; default is false.
### Check List (For Author)
- Test: Unit Test and Regression test
- Unit Test: ./run-be-ut.sh --run --filter='VariantUtilTest.ParseDuplicateJsonPaths*'
- Build: source ./custom_env.sh; export DORIS_PARALLELISM="${DORIS_PARALLELISM:-39}"; export BUILD_TYPE="${BUILD_TYPE:-ASAN}"; export USE_MEM_TRACKER=ON; ./build.sh --be -j"${DORIS_PARALLELISM}"
- Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,172.19.0.1 no_proxy=127.0.0.1,localhost,172.19.0.1 ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s duplicate_json_path
- Review: GPT-5.5 xhigh subagent final review found no P0/P1 blocking issues
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate downstream leaf paths keep the first value and duplicate object entries can merge distinct child paths; default is false.
- Does this need documentation: No
99aa5b6 to
102aad0
Compare
There was a problem hiding this comment.
Review conclusion: request changes. I did not add duplicate inline comments because the remaining blocking concern is already covered by an existing thread on be/src/util/json/json_parser.cpp: traverseObjectAsJsonb() still preserves duplicate object keys inside JSONB output when variant_enable_duplicate_json_path_check is enabled, so nested arrays/objects converted through the JSONB path do not consistently keep the first duplicate key.
Critical checkpoints:
- Goal/test: The PR aims to make duplicate Variant JSON paths keep the first value when enabled. The added BE unit and regression tests cover subcolumn/doc materialization, flat dotted keys, object/scalar conflicts, insert, stream load, and compaction, but the JSONB conversion path remains uncovered and inconsistent with the config comment.
- Scope/focus: The code is mostly focused on Variant parsing and materialization; no unrelated current PR file changes were found in the GitHub PR diff.
- Concurrency/lifecycle: No new concurrent state or non-trivial lifecycle management is introduced. The new mutable config is read per parse/materialization call into
ParseConfig; no lock-order or thread lifecycle issue was found. - Configuration: A mutable BE config is added and propagated to the main deserialize/materialize paths, but behavior is not applied uniformly to JSONB serialization of nested arrays/objects.
- Compatibility/storage: No serialized format or FE/BE protocol format change was found. Default behavior remains disabled, so compatibility risk is limited.
- Parallel paths: The subcolumn/doc-value path is updated, but the JSONB path remains a parallel path with inconsistent duplicate handling.
- Tests/results: Tests are meaningful for the updated paths and use ordered query output. Missing coverage remains for JSONB duplicate-key handling under the new config.
- Observability/performance: No extra observability appears necessary. The duplicate path checks add per-row hash-set work only when enabled; no additional performance blocker found.
User focus: no additional user-provided review focus was specified.
Existing review context: The earlier flat-dot/nested path-collision concern appears addressed in the current head by parser-level get_path() deduplication plus path normalization/materialization tests. The earlier regression-test deletion concern appears stale for the current GitHub PR file list. The JSONB duplicate-key concern remains applicable and should be resolved before approval.
|
run buildall |
TPC-H: Total hot run time: 29950 ms |
TPC-DS: Total hot run time: 170427 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object key during Variant parsing when enabled.