Skip to content

[fix](load job) rebuild broker load storage properties after Gson replay#63094

Open
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_broker_load_replay
Open

[fix](load job) rebuild broker load storage properties after Gson replay#63094
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_broker_load_replay

Conversation

@sollhui
Copy link
Copy Markdown
Contributor

@sollhui sollhui commented May 9, 2026

What problem does this PR solve?

Problem Summary:

Broker load jobs may fail after FE restart, image load, or edit log replay in cloud mode.

StorageDesc.storageProperties is not persisted by Doris's Gson configuration, so after deserialization
the BrokerDesc inside BrokerLoadJob can keep storageType and raw properties but lose the
derived storageProperties object. When the pending task later tries to create a filesystem and list
source files, FE cannot reconstruct the expected storage backend state and the load may fail with
errors such as Unknown storage type.

This PR fixes the issue by rebuilding storageProperties from storageType, name, and properties
after Gson replay, and also lazily reinitializing it on access as a safety net.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented May 9, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29598 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 67210f896ce79a115db5ddddaa39d2f39d931f80, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17809	3923	3843	3843
q2	q3	10731	877	602	602
q4	4664	456	341	341
q5	7443	1320	1139	1139
q6	208	165	136	136
q7	907	950	739	739
q8	9724	1449	1295	1295
q9	6272	5347	5342	5342
q10	6308	2063	1816	1816
q11	476	260	254	254
q12	693	426	297	297
q13	18218	3312	2690	2690
q14	300	280	259	259
q15	q16	902	855	790	790
q17	1104	998	787	787
q18	6438	5630	5650	5630
q19	1718	1316	1133	1133
q20	513	405	252	252
q21	4811	2376	1922	1922
q22	485	397	331	331
Total cold run time: 99724 ms
Total hot run time: 29598 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4672	4532	4563	4532
q2	q3	4651	4797	4205	4205
q4	2124	2178	1401	1401
q5	5001	4942	5240	4942
q6	195	173	133	133
q7	2069	1819	1618	1618
q8	3349	3122	3109	3109
q9	8433	8562	8509	8509
q10	4572	4486	4253	4253
q11	589	441	412	412
q12	707	755	516	516
q13	3167	3621	2850	2850
q14	299	314	289	289
q15	q16	774	776	724	724
q17	1342	1313	1250	1250
q18	7947	7138	7063	7063
q19	1176	1181	1172	1172
q20	2202	2263	1945	1945
q21	6092	5303	4851	4851
q22	532	476	396	396
Total cold run time: 59893 ms
Total hot run time: 54170 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 171632 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 67210f896ce79a115db5ddddaa39d2f39d931f80, data reload: false

query5	4327	666	511	511
query6	340	242	226	226
query7	4232	537	302	302
query8	328	236	215	215
query9	8862	4126	4102	4102
query10	457	349	312	312
query11	5805	2397	2224	2224
query12	190	133	129	129
query13	1309	613	431	431
query14	6705	5365	5048	5048
query14_1	4360	4388	4336	4336
query15	215	207	186	186
query16	1006	470	376	376
query17	1310	813	647	647
query18	2723	508	371	371
query19	253	209	173	173
query20	145	141	135	135
query21	221	136	118	118
query22	13573	13634	13344	13344
query23	17209	16401	16094	16094
query23_1	16218	16283	16239	16239
query24	7465	1790	1374	1374
query24_1	1353	1380	1363	1363
query25	597	538	473	473
query26	1488	324	184	184
query27	2644	615	363	363
query28	4850	2045	2046	2045
query29	1079	683	543	543
query30	308	245	205	205
query31	1171	1100	980	980
query32	91	88	79	79
query33	563	388	322	322
query34	1210	1196	675	675
query35	798	809	702	702
query36	1420	1449	1317	1317
query37	154	106	94	94
query38	3275	3181	3127	3127
query39	1014	983	908	908
query39_1	888	888	881	881
query40	242	161	134	134
query41	65	61	60	60
query42	111	112	112	112
query43	339	337	286	286
query44	
query45	216	200	204	200
query46	1096	1226	729	729
query47	2345	2347	2194	2194
query48	408	441	290	290
query49	632	551	429	429
query50	740	286	225	225
query51	4335	4260	4233	4233
query52	113	113	96	96
query53	259	277	208	208
query54	310	265	254	254
query55	92	88	82	82
query56	314	302	307	302
query57	1417	1410	1324	1324
query58	304	278	278	278
query59	1585	1650	1463	1463
query60	346	341	337	337
query61	161	156	155	155
query62	684	621	572	572
query63	257	206	203	203
query64	2024	837	674	674
query65	
query66	1654	519	393	393
query67	30021	29906	29792	29792
query68	
query69	508	343	313	313
query70	1038	1017	983	983
query71	304	272	292	272
query72	2958	2700	2431	2431
query73	858	771	437	437
query74	5085	4902	4734	4734
query75	2810	2651	2339	2339
query76	2303	1143	815	815
query77	443	459	355	355
query78	13006	13005	12463	12463
query79	2775	1021	766	766
query80	1776	595	491	491
query81	533	284	244	244
query82	909	159	121	121
query83	328	269	252	252
query84	263	134	116	116
query85	909	530	455	455
query86	477	323	333	323
query87	3437	3363	3211	3211
query88	3620	2706	2688	2688
query89	449	391	343	343
query90	1919	180	182	180
query91	179	166	141	141
query92	81	78	76	76
query93	1258	987	566	566
query94	711	327	300	300
query95	674	482	351	351
query96	984	769	344	344
query97	2709	2722	2568	2568
query98	240	246	232	232
query99	1094	1101	957	957
Total cold run time: 257700 ms
Total hot run time: 171632 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 70.00% (7/10) 🎉
Increment coverage report
Complete coverage report

@zclllyybb
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found in this review.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets rebuilding StorageDesc.storageProperties after Gson replay for broker load descriptors. The added GsonPostProcessable hook plus lazy getter initialization addresses the replay path, and the new unit test covers direct BrokerDesc replay and BrokerLoadJob replay with S3-backed properties.
  • Scope: The change is small and focused on reconstructing derived storage state from already persisted fields.
  • Concurrency/lifecycle: No new concurrency or locking behavior is introduced. Lifecycle impact is limited to Gson deserialization and lazy access after replay.
  • Persistence/edit log: No new persisted field is added. The reconstruction derives state from existing persisted name, storageType, and properties, so image/edit-log compatibility is preserved.
  • Compatibility: Existing serialized jobs should still deserialize because the new logic tolerates null properties and preserves broker-vs-refactored storage behavior for the reviewed paths.
  • Error handling: Reconstruction uses existing StorageProperties/BrokerProperties validation behavior; no swallowed failure path was added.
  • Performance: Only one-time lazy reconstruction is added, guarded by storageProperties != null; no hot-path regression is apparent.
  • Observability: No additional logging appears necessary for this narrow replay fix.
  • Test coverage: The new test is relevant to the bug. I attempted ./run-fe-ut.sh --run org.apache.doris.analysis.StorageDescPersistTest, but the runner failed during generated-code setup because thirdparty/installed/bin/protoc could not be executed by make, before the test ran.

User review focus: No additional user-provided review focus was supplied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants