Skip to content

[bug](iceberg) fix iceberg sink writer with spill report error#62899

Merged
Gabriel39 merged 2 commits into
apache:masterfrom
zhangstar333:iceberg_spill_error
May 25, 2026
Merged

[bug](iceberg) fix iceberg sink writer with spill report error#62899
Gabriel39 merged 2 commits into
apache:masterfrom
zhangstar333:iceberg_spill_error

Conversation

@zhangstar333
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Problem Summary:

spill thread with write thread is different, so add mutex in the write function.

mysql> create table web_sales
    -> order by (ws_sold_date_sk, ws_item_sk, ws_order_number)
    -> PROPERTIES (
    ->   'write-format'='parquet'
    -> )
    -> as select * from tpcds_sf1000_parquet.web_sales;
ERROR 1105 (HY000): errCode = 2, detailMessage = (172.20.49.239)[INTERNAL_ERROR][E6] Size of permutation (40800) is less than required (48960)        0#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> > const&, bool) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/common/exception.cpp:0
        1#  doris::Exception::Exception(int, std::basic_string_view<char, std::char_traits<char> > const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/basic_stri
mysql>

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@zhangstar333
Copy link
Copy Markdown
Contributor Author

run buildall

@Gabriel39
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the concurrency fix. The PR correctly tries to serialize FullSorter operations, but the spill path still reads VIcebergTableWriter::_current_writer without synchronization while the async writer thread assigns to the same shared_ptr, so the race can happen before VIcebergSortWriter::_sorter_mutex is reached.

Critical checkpoint conclusions:

  • Goal/test: The goal is to fix Iceberg sort writer spill errors under memory revocation. The current code only partially accomplishes it; no PR test or reproduction coverage was added.
  • Scope: The code change is small and focused, but the synchronization boundary is incomplete.
  • Concurrency: Yes. The async writer thread performs write/close, while workload memory revocation can invoke revoke_memory concurrently. FullSorter access is now locked, but current_writer lifetime/access is not protected.
  • Lifecycle: The new _closed flag is guarded by the sorter mutex and prevents late spills after close; no separate lifecycle issue found beyond the unsynchronized current_writer access.
  • Config/compatibility/parallel paths: No new configs, storage formats, or FE/BE protocol changes. No additional parallel code path issue found.
  • Error handling and observability: Existing Status propagation is preserved; no new observability requirement identified.
  • Tests: No new test coverage was added for concurrent spill/write/close. Given the race being fixed, a targeted concurrent test or regression coverage would be valuable.
  • Performance: The new mutex serializes spill/write/close and can block revocation behind file I/O, but that appears acceptable for correctness; the blocking issue is the uncovered shared_ptr race.

User focus: no additional user-provided review focus was supplied.


auto* sort_writer = dynamic_cast<VIcebergSortWriter*>(_writer->current_writer().get());
if (!sort_writer || !sort_writer->sorter()) {
if (!sort_writer) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still reads _writer->current_writer() without synchronization before the new VIcebergSortWriter mutex can help. current_writer() returns a reference to _current_writer, while the async writer thread assigns _current_writer = writer in VIcebergTableWriter::_write_prepared_block() after each write. Workload memory revocation can enter this path concurrently with that assignment, so the shared_ptr object itself is read and written at the same time, which is undefined behavior and can still cause the spill-time crash this PR is trying to fix. Please protect _current_writer with the same synchronization boundary (or return a synchronized copy / atomic shared_ptr) before doing the dynamic_cast.


#include <cstdint>
#include <limits>
#include <mutex>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing <limits> here drops the only explicit include that made this component's std::numeric_limits uses visible. viceberg_sort_writer.cpp still calls std::numeric_limits in _get_spill_batch_size() and _calc_max_merge_streams(), but it does not include <limits> itself, so this now relies on unrelated transitive includes and can fail with include-order changes. Please add #include <limits> to the .cpp or keep it in this header.

@zhangstar333
Copy link
Copy Markdown
Contributor Author

/review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

OpenCode automated review failed and did not complete.

Error: Review step was skipped (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/25534927641

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@zhangstar333
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes based on the already-open inline thread that still applies to the latest diff. I did not add a duplicate inline comment for it.

Existing unresolved issue:

  • be/src/exec/sink/writer/iceberg/viceberg_sort_writer.h removed the only explicit <limits> include while viceberg_sort_writer.cpp still uses std::numeric_limits. This remains a compile-hygiene risk and should be fixed by including <limits> in the .cpp or keeping it in the header.

Existing issue that appears addressed by the latest commit:

  • The prior unsynchronized _current_writer access is now changed to atomic_shared_ptr with current_writer() returning a by-value snapshot, and the revoke path keeps that snapshot alive while operating on the sort writer.

Critical checkpoint conclusions:

  • Goal/test: The goal is to avoid races between Iceberg sort writer writes and spill revocation. The mutex plus atomic current-writer snapshot mostly addresses the intended race. No new test or manual verification evidence was added in the PR.
  • Scope: The latest PR diff is focused on the Iceberg spill writer race.
  • Concurrency: FullSorter accesses are serialized by _sorter_mutex, and _current_writer is now loaded/stored atomically. Holding the mutex across spill/write/close is conservative but avoids concurrent sorter mutation.
  • Lifecycle: _closed is protected by the same mutex and makes late revoke spills no-op after close. No new static initialization or circular ownership hazard found.
  • Configuration/compatibility: No config, storage format, or FE/BE protocol compatibility changes found.
  • Parallel paths: The affected spill memory paths now use VIcebergSortWriter accessors instead of raw sorter access.
  • Conditional checks: New null/type checks are straightforward.
  • Test coverage: No test was added for the concurrency fix; residual risk remains because this is a race-prone path.
  • Observability: No new observability needed beyond existing status/log paths for this narrow fix.
  • Transactions/persistence/data writes: The change affects write/spill serialization but does not alter transaction metadata or persistence. The unresolved include issue should be fixed before merge.
  • Performance: The coarse mutex can block revocation memory queries behind write/spill I/O, but I did not find a correctness blocker from that pattern in this PR.

User focus: No additional user-provided review focus was supplied.

@zhangstar333
Copy link
Copy Markdown
Contributor Author

run buildall

@zhangstar333 zhangstar333 force-pushed the iceberg_spill_error branch from 5a8f3bf to 2dfd2dd Compare May 25, 2026 03:03
@zhangstar333
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31560 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2dfd2dda5a6e54a1df731f1932cca516506b054b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17588	3999	4016	3999
q2	q3	10847	1386	811	811
q4	4696	478	350	350
q5	7697	2273	2091	2091
q6	253	179	140	140
q7	941	790	643	643
q8	9375	1789	1621	1621
q9	5204	4980	4983	4980
q10	6428	2229	1856	1856
q11	445	278	250	250
q12	634	425	295	295
q13	18146	3340	2776	2776
q14	267	260	248	248
q15	q16	826	780	716	716
q17	970	984	963	963
q18	6952	5766	5500	5500
q19	1381	1281	1102	1102
q20	632	431	300	300
q21	6040	2794	2609	2609
q22	457	488	310	310
Total cold run time: 99779 ms
Total hot run time: 31560 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4969	4822	4730	4730
q2	q3	4955	5343	4735	4735
q4	2162	2252	1425	1425
q5	4983	4773	4747	4747
q6	228	180	130	130
q7	1866	1829	1601	1601
q8	2463	2144	2156	2144
q9	7937	7609	7413	7413
q10	4770	4731	4288	4288
q11	561	404	371	371
q12	744	756	547	547
q13	3007	3366	2793	2793
q14	274	276	249	249
q15	q16	682	720	625	625
q17	1310	1270	1273	1270
q18	7116	6732	6973	6732
q19	1138	1094	1107	1094
q20	2230	2241	1961	1961
q21	5354	4664	4560	4560
q22	551	464	402	402
Total cold run time: 57300 ms
Total hot run time: 51817 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 173094 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2dfd2dda5a6e54a1df731f1932cca516506b054b, data reload: false

query5	4347	656	517	517
query6	400	217	195	195
query7	4238	554	320	320
query8	335	236	230	230
query9	8841	4152	4133	4133
query10	452	350	291	291
query11	5824	2464	2238	2238
query12	182	129	126	126
query13	1289	617	436	436
query14	6209	5610	5262	5262
query14_1	4536	4536	4516	4516
query15	214	210	186	186
query16	1030	477	441	441
query17	1168	750	626	626
query18	2506	493	367	367
query19	256	217	166	166
query20	146	134	129	129
query21	224	139	119	119
query22	13741	13729	13442	13442
query23	17390	16593	16286	16286
query23_1	16481	16457	16487	16457
query24	7421	1786	1330	1330
query24_1	1337	1333	1331	1331
query25	580	505	449	449
query26	1316	347	179	179
query27	2654	530	346	346
query28	4406	2016	2005	2005
query29	1045	657	518	518
query30	311	239	205	205
query31	1145	1079	961	961
query32	89	81	77	77
query33	568	376	312	312
query34	1209	1180	659	659
query35	792	786	697	697
query36	1393	1403	1306	1306
query37	154	108	91	91
query38	3218	3162	3080	3080
query39	946	920	905	905
query39_1	864	885	874	874
query40	234	151	123	123
query41	66	66	63	63
query42	112	110	113	110
query43	342	342	303	303
query44	
query45	217	201	210	201
query46	1108	1211	741	741
query47	2366	2385	2258	2258
query48	385	424	315	315
query49	641	493	379	379
query50	1002	334	250	250
query51	4451	4349	4286	4286
query52	107	107	95	95
query53	257	288	201	201
query54	315	268	254	254
query55	94	93	86	86
query56	307	305	296	296
query57	1433	1423	1308	1308
query58	298	272	261	261
query59	1576	1661	1456	1456
query60	337	322	314	314
query61	162	159	151	151
query62	706	660	597	597
query63	246	198	210	198
query64	2384	825	634	634
query65	
query66	1719	483	369	369
query67	30060	30120	29994	29994
query68	
query69	478	350	303	303
query70	1032	1038	1013	1013
query71	305	278	259	259
query72	3032	2715	2397	2397
query73	897	780	445	445
query74	5135	4952	4881	4881
query75	2712	2606	2275	2275
query76	2362	1168	809	809
query77	413	425	350	350
query78	12484	12559	11751	11751
query79	1447	1085	756	756
query80	676	560	452	452
query81	451	285	249	249
query82	1372	160	120	120
query83	367	275	254	254
query84	275	141	116	116
query85	894	534	465	465
query86	400	358	352	352
query87	3466	3401	3275	3275
query88	3706	2775	2739	2739
query89	460	392	344	344
query90	1959	188	182	182
query91	181	174	142	142
query92	81	81	77	77
query93	1443	1410	868	868
query94	541	356	309	309
query95	700	377	347	347
query96	1119	830	372	372
query97	2760	2771	2595	2595
query98	240	236	234	234
query99	1216	1154	1033	1033
Total cold run time: 255529 ms
Total hot run time: 173094 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/43) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.78% (20854/38780)
Line Coverage 37.35% (197550/528858)
Region Coverage 33.67% (154790/459730)
Branch Coverage 34.65% (67353/194402)

@zhangstar333
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes because the existing inline include-hygiene issue is still unresolved. I did not add duplicate inline comments.

Existing unresolved issue:

  • be/src/exec/sink/writer/iceberg/viceberg_sort_writer.h removed the only explicit <limits> include, while be/src/exec/sink/writer/iceberg/viceberg_sort_writer.cpp still uses std::numeric_limits in _get_spill_batch_size() and _calc_max_merge_streams(). Please resolve the already-open inline thread by adding #include <limits> to the .cpp or keeping it in the header.

Previously raised concurrency issue status:

  • The earlier _current_writer race appears addressed in the latest commit: _current_writer is now an atomic_shared_ptr, current_writer() returns a by-value snapshot, and revoke_memory() keeps that snapshot alive while calling trigger_spill(). I found no distinct remaining instance of that same race.

Critical checkpoint conclusions:

  • Goal/test: The goal is to fix Iceberg sort-writer races between write and spill revocation. The latest code largely accomplishes the synchronization goal, but the PR still has no new automated or manual test evidence for this race-prone path.
  • Scope: The change is small and focused on Iceberg spill writer synchronization.
  • Concurrency: This PR does involve concurrency between the async writer path and memory revocation. FullSorter access is serialized with _sorter_mutex; _closed is protected by the same mutex; _current_writer is now loaded/stored atomically. Holding the mutex across write/spill/close is conservative but acceptable for correctness in this narrow fix.
  • Lifecycle/static initialization: No new static initialization hazard found. The by-value current-writer snapshot preserves object lifetime during revoke, and _closed prevents late spills after close.
  • Config/compatibility: No new config, storage format, or FE/BE protocol compatibility change.
  • Parallel paths: The affected Iceberg spill memory paths now use VIcebergSortWriter accessors rather than raw sorter() access. I found no other modified parallel path requiring the same change.
  • Conditional checks: The new null/type checks are straightforward; no non-obvious condition requiring more explanation was found.
  • Tests/results: No regression/unit test or result file was added. Residual risk remains because this is a concurrency bug fix.
  • Observability: Existing status/logging paths appear sufficient for this narrow change; no new metric/log requirement found.
  • Transactions/persistence/data writes: The change serializes write/spill operations but does not change transaction metadata or persistence.
  • Performance: The coarse mutex may block memory queries/spill behind write or close work, but I did not find a correctness blocker from that tradeoff.
  • Other issues: The unresolved <limits> include issue remains and should be fixed before merge.

User focus: No additional user-provided review focus was supplied.

@Gabriel39 Gabriel39 merged commit 5cc48ea into apache:master May 25, 2026
32 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants