Skip to content

[Enhancement] (FE) Convert intra FE-to-FE calls to HTTPS when enabled#60921

Open
nsivarajan wants to merge 5 commits intoapache:masterfrom
nsivarajan:https-internal-fe-communication2
Open

[Enhancement] (FE) Convert intra FE-to-FE calls to HTTPS when enabled#60921
nsivarajan wants to merge 5 commits intoapache:masterfrom
nsivarajan:https-internal-fe-communication2

Conversation

@nsivarajan
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Currently, Doris provides an enable_https switch to enforce HTTPS connections. In hardened deployments, HTTP is completely disabled by setting http_port = 0. However, intra FE-to-FE communication still relies on HTTP, causing failures in edit log and checkpoint synchronisation when HTTPS is enabled.

This PR enhances HTTPS support by automatically converting intra FE-to-FE communication to HTTPS when enable_https=true, without introducing any new configuration, leveraging mysql_ssl_default_ca_certificate created for https connection. When enable_https=false, behavior remains unchanged. This ensures secure and seamless FE-to-FE communication without breaking existing workflows.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@nsivarajan nsivarajan marked this pull request as ready for review February 28, 2026 19:18
@nsivarajan
Copy link
Contributor Author

run buildall

@nsivarajan
Copy link
Contributor Author

run buildall

@nsivarajan
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 28728 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 29f45dbe504c8d46be0de681c14f24d3a97c8291, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17684	4460	4289	4289
q2	q3	10648	764	517	517
q4	4680	362	249	249
q5	7562	1191	1015	1015
q6	178	176	149	149
q7	799	835	682	682
q8	9311	1426	1316	1316
q9	4802	4758	4692	4692
q10	6803	1857	1651	1651
q11	476	269	232	232
q12	718	566	470	470
q13	17774	4199	3412	3412
q14	245	236	212	212
q15	928	805	799	799
q16	770	716	681	681
q17	731	857	419	419
q18	5889	5411	5196	5196
q19	1178	964	601	601
q20	507	492	392	392
q21	4756	1996	1453	1453
q22	390	341	301	301
Total cold run time: 96829 ms
Total hot run time: 28728 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4762	4651	4592	4592
q2	q3	1828	2177	1772	1772
q4	879	1187	794	794
q5	4012	4336	4277	4277
q6	180	186	140	140
q7	1770	1634	1517	1517
q8	2488	2690	2508	2508
q9	7478	7402	7416	7402
q10	2622	2818	2368	2368
q11	533	459	437	437
q12	515	611	473	473
q13	4098	4792	3592	3592
q14	283	302	273	273
q15	868	841	802	802
q16	712	755	714	714
q17	1156	1670	1290	1290
q18	7183	6960	6655	6655
q19	931	1009	930	930
q20	2072	2193	1988	1988
q21	3960	3439	3280	3280
q22	462	432	373	373
Total cold run time: 48792 ms
Total hot run time: 46177 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 183423 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 29f45dbe504c8d46be0de681c14f24d3a97c8291, data reload: false

query5	4329	646	528	528
query6	324	213	197	197
query7	4207	462	270	270
query8	334	246	234	234
query9	8698	2764	2751	2751
query10	530	385	340	340
query11	16970	17297	17154	17154
query12	195	147	136	136
query13	1419	479	404	404
query14	6572	3408	3105	3105
query14_1	2910	2895	2898	2895
query15	207	203	176	176
query16	1001	491	473	473
query17	1425	731	640	640
query18	2524	451	356	356
query19	204	203	189	189
query20	150	133	132	132
query21	264	161	119	119
query22	5631	4998	4903	4903
query23	17149	16805	16500	16500
query23_1	16681	16661	16546	16546
query24	7280	1626	1240	1240
query24_1	1259	1257	1216	1216
query25	566	489	426	426
query26	1244	258	162	162
query27	2770	469	289	289
query28	4534	1897	1903	1897
query29	809	578	477	477
query30	318	247	207	207
query31	883	729	656	656
query32	81	73	71	71
query33	525	356	295	295
query34	945	928	564	564
query35	664	695	623	623
query36	1111	1127	929	929
query37	143	95	89	89
query38	2963	2886	2840	2840
query39	884	871	856	856
query39_1	838	836	836	836
query40	240	165	138	138
query41	68	64	63	63
query42	107	102	99	99
query43	376	393	349	349
query44	
query45	206	199	188	188
query46	884	991	612	612
query47	2105	2147	2036	2036
query48	322	345	245	245
query49	642	473	393	393
query50	697	294	220	220
query51	4086	4103	4062	4062
query52	107	111	98	98
query53	291	332	288	288
query54	312	280	301	280
query55	85	88	82	82
query56	349	323	338	323
query57	1375	1344	1250	1250
query58	297	323	271	271
query59	2554	2686	2487	2487
query60	320	336	315	315
query61	145	146	144	144
query62	631	582	547	547
query63	309	282	273	273
query64	4881	1251	965	965
query65	
query66	1453	459	351	351
query67	16657	16396	16348	16348
query68	
query69	394	306	282	282
query70	969	950	979	950
query71	343	306	291	291
query72	2797	2649	2430	2430
query73	543	548	318	318
query74	9998	9947	9763	9763
query75	2842	2746	2429	2429
query76	2290	1024	676	676
query77	366	368	304	304
query78	11093	11278	10626	10626
query79	2353	796	602	602
query80	1755	628	543	543
query81	548	274	242	242
query82	1015	146	114	114
query83	322	283	244	244
query84	268	111	101	101
query85	883	474	427	427
query86	411	299	303	299
query87	3097	3115	2953	2953
query88	3552	2683	2677	2677
query89	419	366	340	340
query90	1998	172	169	169
query91	164	159	137	137
query92	77	75	67	67
query93	991	846	507	507
query94	624	307	299	299
query95	585	403	305	305
query96	641	511	229	229
query97	2479	2481	2422	2422
query98	226	221	217	217
query99	1006	1015	923	923
Total cold run time: 255369 ms
Total hot run time: 183423 ms

@nsivarajan
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 29084 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 29f45dbe504c8d46be0de681c14f24d3a97c8291, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17606	4404	4317	4317
q2	q3	10651	792	533	533
q4	4675	369	255	255
q5	7553	1220	1009	1009
q6	176	178	149	149
q7	784	839	678	678
q8	9743	1467	1330	1330
q9	5243	4798	4744	4744
q10	6825	1896	1648	1648
q11	452	255	260	255
q12	763	576	482	482
q13	17767	4242	3421	3421
q14	243	232	213	213
q15	957	794	793	793
q16	774	739	680	680
q17	730	889	412	412
q18	5983	5317	5312	5312
q19	1466	982	618	618
q20	526	511	413	413
q21	4681	1982	1556	1556
q22	408	337	266	266
Total cold run time: 98006 ms
Total hot run time: 29084 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4725	4574	4697	4574
q2	q3	1819	2229	1779	1779
q4	861	1189	785	785
q5	4055	4394	4364	4364
q6	183	172	139	139
q7	1757	1629	1523	1523
q8	2490	2708	2625	2625
q9	7429	7430	7381	7381
q10	2630	2833	2400	2400
q11	526	433	408	408
q12	507	613	465	465
q13	4042	4584	3667	3667
q14	339	305	277	277
q15	835	799	810	799
q16	719	778	737	737
q17	1175	1564	1392	1392
q18	7295	6767	6768	6767
q19	940	1081	1007	1007
q20	2088	2196	2008	2008
q21	4092	3511	3473	3473
q22	531	448	419	419
Total cold run time: 49038 ms
Total hot run time: 46989 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 184255 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 29f45dbe504c8d46be0de681c14f24d3a97c8291, data reload: false

query5	4327	636	512	512
query6	321	222	211	211
query7	4233	489	282	282
query8	359	255	235	235
query9	9079	2842	2836	2836
query10	512	428	363	363
query11	17107	17503	17564	17503
query12	201	132	129	129
query13	1269	491	360	360
query14	7404	3267	3012	3012
query14_1	2935	2969	2900	2900
query15	216	192	190	190
query16	1026	407	478	407
query17	1179	782	661	661
query18	2873	495	375	375
query19	215	229	191	191
query20	144	132	136	132
query21	217	136	121	121
query22	5997	5318	4970	4970
query23	17227	16838	16732	16732
query23_1	16798	16803	16807	16803
query24	7063	1635	1224	1224
query24_1	1249	1252	1233	1233
query25	547	481	428	428
query26	821	281	160	160
query27	2719	484	280	280
query28	4377	1860	1856	1856
query29	783	562	471	471
query30	311	235	208	208
query31	855	754	618	618
query32	81	71	68	68
query33	516	325	279	279
query34	883	915	572	572
query35	623	672	586	586
query36	1090	1148	972	972
query37	133	93	82	82
query38	3000	2907	2877	2877
query39	899	871	836	836
query39_1	825	836	832	832
query40	223	149	135	135
query41	66	59	58	58
query42	104	104	101	101
query43	369	391	353	353
query44	
query45	198	192	182	182
query46	924	986	603	603
query47	2161	2183	2067	2067
query48	317	312	236	236
query49	615	481	372	372
query50	679	276	215	215
query51	4076	4133	4008	4008
query52	108	108	97	97
query53	290	338	300	300
query54	282	267	259	259
query55	88	83	80	80
query56	343	287	305	287
query57	1359	1371	1283	1283
query58	288	277	284	277
query59	2608	2682	2503	2503
query60	335	330	314	314
query61	147	145	144	144
query62	641	591	550	550
query63	307	277	284	277
query64	4163	1255	1002	1002
query65	
query66	1352	450	358	358
query67	16237	16225	16247	16225
query68	
query69	393	312	281	281
query70	1006	962	963	962
query71	335	318	285	285
query72	2771	2678	2430	2430
query73	527	551	317	317
query74	10015	9921	9765	9765
query75	2854	2733	2455	2455
query76	2322	1029	696	696
query77	355	377	308	308
query78	11162	11329	10664	10664
query79	2678	791	605	605
query80	1780	621	576	576
query81	567	270	244	244
query82	1015	148	119	119
query83	348	260	243	243
query84	251	114	101	101
query85	870	485	439	439
query86	415	300	295	295
query87	3104	3101	2988	2988
query88	3510	2663	2656	2656
query89	427	368	344	344
query90	1951	167	168	167
query91	162	153	141	141
query92	74	77	69	69
query93	1090	823	507	507
query94	628	321	296	296
query95	594	395	315	315
query96	643	505	229	229
query97	2462	2480	2455	2455
query98	227	233	218	218
query99	1048	947	916	916
Total cold run time: 255584 ms
Total hot run time: 184255 ms

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 8.33% (6/72) 🎉
Increment coverage report
Complete coverage report

@Hastyshell
Copy link
Collaborator

/review

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR converts intra FE-to-FE HTTP communication to HTTPS when enable_https=true. The core approach is sound — centralizing URL construction in HttpURLUtil.buildInternalFeUrl() and adding SSL-aware HTTP clients in InternalHttpsUtils. However, I found several issues that should be addressed:

Critical Issues

  1. JVM-wide global state mutation (installTrustManagerForUrlConnection)setDefaultSSLSocketFactory() and setDefaultHostnameVerifier() modify JVM-wide defaults that affect ALL HttpsURLConnection instances in the process, not just FE-to-FE connections. This is called on every getConnectionWithNodeIdent() invocation, which is wasteful and thread-unsafe. Additionally, the hostname verifier (hostname, session) -> true completely disables hostname verification globally, which could affect other HTTPS connections in the JVM (e.g., external catalog connections).

  2. Resource leak in HttpUtils.executeRequest()CloseableHttpClient is created but never closed. The ResponseHandler variant of execute() closes the HTTP response, but NOT the client itself. This leaks connection pool resources. Other callers in the codebase (e.g., MysqlLoadManager, HFUtils) properly use try-with-resources.

  3. Incomplete conversion — Several FE-to-FE communication paths still hardcode http:// and Config.http_port, which will break when enable_https=true and http_port=0. Key missed locations:

    • HttpUtils.concatUrl() (line 75) — hardcodes http://
    • HttpUtils.getFeList() (line 65) — hardcodes Config.http_port
    • SessionController.getOtherSessionInfo() — hardcodes http://
    • NodeAction.concatFeSetConfigUrl() — hardcodes http://
    • NodeAction.getFeList() and handleConfigurationInfo() — hardcodes http:// and Config.http_port

Minor Issues

  1. Removed useful comments — Two comments about using Env.getServingEnv() instead of getCurrentEnv() were deleted. These documented a non-obvious requirement and should be preserved.

  2. Removed LOG.info in MetaHelper.doGet — The debug logging for URL, timeout, and headers was removed. This is useful for troubleshooting connectivity issues.

Checkpoints

  • Correctness: Issues found (global state mutation, incomplete conversion, resource leak)
  • Security: The approach of certificate validation + disabled hostname verification for internal communication is reasonable and documented, but the JVM-wide global mutation is a concern
  • Performance: installTrustManagerForUrlConnection() re-reads the truststore from disk and re-initializes the SSLContext on every call — should be cached/initialized once
  • Tests: Unit tests for buildInternalFeUrl() are good; no tests for the SSL client creation or the HTTPS integration path

sslContext.init(null, tmf.getTrustManagers(), null);

javax.net.ssl.HttpsURLConnection.setDefaultSSLSocketFactory(sslContext.getSocketFactory());
javax.net.ssl.HttpsURLConnection.setDefaultHostnameVerifier((hostname, session) -> true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: JVM-wide global state mutation

setDefaultSSLSocketFactory() and setDefaultHostnameVerifier() modify JVM-wide defaults that affect ALL HttpsURLConnection instances in the entire process — not just the FE-to-FE connection being created. This is problematic because:

  1. Thread safety: Multiple threads calling this concurrently could race. There's no synchronization.
  2. Global hostname verification bypass: (hostname, session) -> true disables hostname verification for ALL future HttpsURLConnection instances, including any external catalog connections or other HTTPS clients using HttpsURLConnection.
  3. Redundant re-initialization: This is called on every getConnectionWithNodeIdent() invocation, re-reading the truststore from disk and re-initializing the SSLContext each time.

Recommendation: Initialize the SSLContext and install the defaults once at startup (e.g., via a static initializer or an init() method called during FE bootstrap). Alternatively, set the SSL factory on the individual HttpsURLConnection instance rather than the global default:

HttpsURLConnection httpsConn = (HttpsURLConnection) url.openConnection();
httpsConn.setSSLSocketFactory(sslContext.getSocketFactory());
httpsConn.setHostnameVerifier(...);

This avoids polluting the JVM-wide defaults.


private static String executeRequest(HttpRequestBase request) throws IOException {
CloseableHttpClient client = getHttpClient();
CloseableHttpClient client;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Resource leak — CloseableHttpClient is never closed

The CloseableHttpClient created here (both the HTTPS and plain HTTP paths) is never closed. The ResponseHandler variant of execute() closes the HTTP response, but NOT the client itself. This leaks the client's internal PoolingHttpClientConnectionManager and associated socket file descriptors.

Other callers in the codebase do this correctly (e.g., MysqlLoadManager.java uses try-with-resources).

Fix:

private static String executeRequest(HttpRequestBase request) throws IOException {
    try (CloseableHttpClient client = request.getURI().getScheme().equalsIgnoreCase("https")
            && Config.enable_https
            ? InternalHttpsUtils.createValidatedHttpClient()
            : HttpClientBuilder.create().build()) {
        return client.execute(request, httpResponse -> EntityUtils.toString(httpResponse.getEntity()));
    }
}

Note: this resource leak also existed before this PR (the old code called getHttpClient() without closing it), but since this PR rewrites the method, it's a good opportunity to fix it.

// because here we need to obtain selfNode through the official service catalog.
HostInfo selfNode = Env.getServingEnv().getSelfNode();
conn.setRequestProperty(Env.CLIENT_NODE_HOST_KEY, selfNode.getHost());
conn.setRequestProperty(Env.CLIENT_NODE_PORT_KEY, selfNode.getPort() + "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted useful comments

The comments // Must use Env.getServingEnv() instead of getCurrentEnv(), // because here we need to obtain selfNode through the official service catalog. were removed here and in getNodeIdentHeaders(). These documented a non-obvious architectural requirement. Please preserve them — they explain why getServingEnv() is used instead of the more common getCurrentEnv().

import java.util.Map;

public class MetaHelper {
public static final Logger LOG = LogManager.getLogger(MetaHelper.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed useful debug logging

The LOG.info("meta helper, url: {}, timeout{}, headers: {}", url, timeout, headers) was removed. While the response wrapping in try-catch is a good improvement, the debug logging for URL, timeout, and headers is valuable for troubleshooting connectivity issues (especially with this HTTPS change). Consider keeping it, perhaps at LOG.debug level.

private static final Logger LOG = LogManager.getLogger(InternalHttpsUtils.class);

public static CloseableHttpClient createValidatedHttpClient() {
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance: Truststore re-read on every call

Both createValidatedHttpClient() and installTrustManagerForUrlConnection() re-read the truststore file from disk and re-initialize the SSLContext on every invocation. For internal FE-to-FE calls (checkpoint sync, journal ID queries, etc.), this could be called frequently.

Consider caching the SSLContext in a static field, initialized lazily or at startup. The truststore file doesn't change at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants