Add levenshtein and hamming_distance functions #60412

KazeBox33 wants to merge 15 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

Pull request overview
This PR adds support for two string distance functions: levenshtein (Hive compatibility) and hamming_distance (Trino/Presto compatibility). These functions compute edit distance and character-difference distance between strings respectively, with proper UTF-8 character handling.
Changes:
- Implemented levenshtein function that calculates the minimum number of single-character edits needed to transform one string into another
- Implemented hamming_distance function that counts character differences between equal-length strings
- Added comprehensive test coverage for both functions, including edge cases, NULL handling, and UTF-8 character support
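To make the two algorithms concrete, here is a minimal reference sketch of both functions. It operates on byte strings (effectively ASCII) and uses plain int/long; the actual BE code additionally decodes UTF-8 code points so that multi-byte characters count as one unit, and returns Int32/Int64 column types.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Minimum number of single-character insertions, deletions, or substitutions
// needed to turn `a` into `b` (classic two-row dynamic programming).
int levenshtein(const std::string& a, const std::string& b) {
    const size_t m = a.size(), n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);
    for (size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);
        for (size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min(std::min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
        }
        std::swap(prev, curr);
    }
    return prev[n];
}

// Number of positions at which two equal-length strings differ; throws on
// unequal lengths, mirroring the InvalidArgument behavior described in the PR.
long hamming_distance(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) throw std::invalid_argument("lengths differ");
    long diff = 0;
    for (size_t i = 0; i < a.size(); ++i) diff += (a[i] != b[i]) ? 1 : 0;
    return diff;
}
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions plus one insertion), and `hamming_distance("abc", "abd")` is 1.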
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test_string_all.groovy | Added regression tests for levenshtein and hamming_distance functions with various input scenarios |
| test_string_function.groovy | Added nereids query tests covering edge cases and error conditions for both functions |
| test_string_all.out | Expected output for levenshtein and hamming_distance test cases |
| test_string_function.out | Expected output for nereids function tests |
| ScalarFunctionVisitor.java | Added visitor methods for HammingDistance and Levenshtein function nodes |
| Levenshtein.java | Implemented Levenshtein scalar function class with signature definitions |
| HammingDistance.java | Implemented HammingDistance scalar function class with signature definitions |
| StringArithmetic.java | Added executable implementations for levenshtein and hamming_distance with dynamic programming algorithm |
| BuiltinScalarFunctions.java | Registered new functions in the builtin scalar functions registry |
| simple_function_factory.h | Added function registration declarations for backend implementations |
| function_string.cpp | Minor whitespace cleanup |
| function_levenshtein.cpp | Backend C++ implementation of levenshtein with UTF-8 support |
| function_hamming_distance.cpp | Backend C++ implementation of hamming_distance with UTF-8 support and length validation |
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")

Test identifier 'qt_hamming_distance_3331' has inconsistent numbering (3331 instead of 333). It should be 'qt_hamming_distance_333' to match the pattern used for other functions, such as 'qt_space_333'.

Suggested change:

qt_hamming_distance_333 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
// HAMMING_DISTANCE tests
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
qt_hamming_distance_3332 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"

Test identifier 'qt_hamming_distance_3332' has inconsistent numbering (3332 instead of 334). It should be 'qt_hamming_distance_334' to follow the sequential pattern.

Suggested change:

qt_hamming_distance_334 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"
auto& res_data = res_column->get_data();

for (size_t i = 0; i < input_rows_count; ++i) {
    const StringRef left = left_col->get_data_at(i);

This is a virtual function call. Cast the column to its concrete type before you use it in the loop.
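The reviewer's point can be illustrated with a simplified stand-in for the column hierarchy (the type names below mimic but are not the actual Doris API): cast to the concrete column type once outside the loop, so the hot path avoids per-row virtual dispatch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the column types; names are illustrative only.
struct StringRef { const char* data; size_t size; };

struct IColumn {
    virtual ~IColumn() = default;
    virtual StringRef get_data_at(size_t i) const = 0;  // virtual per-row call
};

struct ColumnString : IColumn {
    std::vector<std::string> rows;
    StringRef get_data_at(size_t i) const override {
        return {rows[i].data(), rows[i].size()};
    }
};

// Cast once, before the loop; then access the concrete column's storage
// directly instead of paying a virtual dispatch on every row.
size_t total_bytes(const IColumn& col, size_t rows) {
    const auto& str_col = static_cast<const ColumnString&>(col);  // cast once
    size_t total = 0;
    for (size_t i = 0; i < rows; ++i) {
        total += str_col.rows[i].size();  // direct access, no virtual call
    }
    return total;
}
```

In the real code the cast would typically use the codebase's checked-cast helper, and the loop would read raw chars/offsets arrays rather than a vector of strings.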
size_t row) {
    std::vector<size_t> left_offsets;
    std::vector<size_t> right_offsets;
    utf8_char_offsets(left, left_offsets);

You can check whether the input is all ASCII first; if it is, there is no need to do the heavy UTF-8 work.

I have used simd::VStringFunctions::is_ascii.
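The suggested fast path works because in pure-ASCII input every byte is a complete character, so byte offsets and character offsets coincide. A scalar sketch (the PR uses the SIMD helper simd::VStringFunctions::is_ascii; the function names here are illustrative stand-ins):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Scalar ASCII check: every byte must be < 0x80.
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Collect the byte offset of each UTF-8 character start. Start bytes are
// every byte that is not a continuation byte (continuation = 0b10xxxxxx).
void utf8_char_offsets(const std::string& s, std::vector<size_t>& offsets) {
    for (size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            offsets.push_back(i);
        }
    }
}

// Character count with the fast path: no offset vector for ASCII input.
size_t char_count(const std::string& s) {
    if (is_ascii(s)) return s.size();  // fast path: 1 byte == 1 character
    std::vector<size_t> offsets;
    utf8_char_offsets(s, offsets);
    return offsets.size();
}
```

For ASCII rows this skips both the per-row vector allocation and the decoding loop, which is where most of the cost sits for typical workloads.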
namespace doris::vectorized {
#include "common/compile_check_begin.h"

class FunctionLevenshtein : public IFunction {

These two functions could share a common base template; we may already have one.
const size_t len = left_offsets.size();
for (size_t i = 0; i < len; ++i) {
    size_t left_off = left_offsets[i];
    size_t left_next = (i + 1 < len) ? left_offsets[i + 1] : left.size;

Make this branchless. Also, is this loop meaningful for strings of different lengths?
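One way to remove the `(i + 1 < len)` ternary, sketched under the assumption that the offsets vector is built by the function itself: append the total byte size as a sentinel entry, so `offsets[i + 1]` is always valid (the helper names are illustrative, not the PR's exact code).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build UTF-8 character start offsets, plus a final sentinel equal to the
// total byte size. With the sentinel, the "next offset" lookup needs no
// bounds branch.
void utf8_char_offsets_with_sentinel(const std::string& s,
                                     std::vector<size_t>& offsets) {
    for (size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            offsets.push_back(i);  // start byte of a UTF-8 character
        }
    }
    offsets.push_back(s.size());  // sentinel: one-past-the-end offset
}

// Byte length of the i-th character, branchless even for the last index.
size_t char_byte_len(const std::vector<size_t>& offsets, size_t i) {
    return offsets[i + 1] - offsets[i];
}
```

With this layout the character count is `offsets.size() - 1`, and the main loop body reduces to two array reads and a subtraction.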
And remember to format your code and add more test cases that query from a table.
…ToType Refactor two string distance functions from custom IFunction classes to FunctionBinaryToType-based implementations. Keep existing ASCII fast path, UTF-8 behavior and hamming length validation semantics. This addresses review feedback about using shared binary-function template style.
…ryToType and extend regression cases Refactor BE implementations to FunctionBinaryToType template style and keep UTF-8/ASCII semantics. Optimize hamming_distance UTF-8 loop branch handling. Add and align query_p0 + nereids_p0 constant/table test cases and expected outputs.
…unction regression cases
Force-pushed fb0c27f to 5ccb369
run buildall

/review
Code Review Summary for PR #60412: Add levenshtein and hamming_distance functions
Overall this is a well-structured PR that adds two useful string distance functions. The FE-BE type consistency is correct, the constant folding implementations match the BE implementations, and the test coverage is reasonable. Below are the issues found, organized by the critical checkpoint conclusions.
Critical Checkpoint Conclusions
1. Goal and correctness: The PR adds levenshtein (returns Int32) and hamming_distance (returns Int64) string distance functions with UTF-8 support. The core algorithms are correct. FE constant folding, FE scalar function classes, BE implementations, and registration are all present.
2. Modification scope: The change is focused: 13 files, all related to the two new functions. Minor cosmetic changes in function_string.cpp (blank-line changes) are unnecessary but harmless.
3. Concurrency: No concurrency concerns - these are pure stateless scalar functions.
4. Lifecycle management: No special lifecycle concerns.
5. Configuration items: None added. N/A.
6. Incompatible changes: None. New functions only.
7. Parallel code paths: hamming_distance uses a custom FunctionBinaryStringToTypeWithNull wrapper class instead of the standard FunctionBinaryToType used by levenshtein. This is necessary because hamming_distance can return Status::InvalidArgument for unequal-length strings, and the custom class provides proper row-level null handling with explicit nullable return type control. This is an acceptable design choice.
8. Special conditional checks: The string_ref_at helper in both files contains defensive if checks that violate AGENTS.md coding standards. See inline comments.
9. Test coverage: Good regression test coverage for both functions with ASCII, UTF-8, NULL, empty string, and table-based tests. Error case for hamming_distance is tested. However, there is an orphaned .out entry. See inline comments.
10. Observability: N/A for pure scalar functions.
11. Transaction/persistence: N/A.
12. FE-BE variable passing: FE-BE types are consistent (levenshtein: Int32/Int32, hamming_distance: Int64/Int64). PropagateNullable in FE aligns with BE null handling for both functions.
13. Performance: The levenshtein function has O(m*n) time complexity with no input size limit, which is a potential DoS concern for very large strings. See inline comments.
14. Other issues: Alias levenshtein_distance (common Hive name) is not registered - this may be intentional but worth considering for compatibility.
DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {

Coding standards violation (AGENTS.md): This is defensive programming. In a well-formed ColumnString, end > begin always holds (every string has at least a \0 terminator) and end <= data.size() is a structural invariant. Per AGENTS.md: "Assert correctness only—never use defensive programming with if or similar constructs."
Also, the (i == 0) ? 0 : offsets[i - 1] ternary on line 83 is unnecessary — PaddedPODArray guarantees offsets[-1] == 0 (see column_string.h comment: "-1th index is Ok"). All standard code in the codebase simply uses offsets[i - 1].
Recommendation: Replace this entire string_ref_at helper with the standard pattern:
const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
size_t str_size = offsets[i] - offsets[i - 1];

Or use column->get_data_at(i) when the column object is available (as done in the nullable path at line 252).
}

size_t str_size = end - begin;
if (str_size > 0 && data[end - 1] == '\0') {
Non-standard pattern: Stripping the trailing \0 is not part of the standard ColumnString access pattern. ColumnString::get_data_at() returns the size including the terminating zero (see size_at() comment: "Size of i-th element, including terminating zero"). Other string functions in the codebase do not strip \0 from raw offset-based access either — they rely on the StringRef.size being computed as offsets[i] - offsets[i-1], which already accounts for the terminator.
Note that on line 252, get_data_at() is used (which includes the \0 in the size), creating an inconsistency within this same file between how string data is accessed in the non-nullable vs nullable paths.
This inconsistency could lead to subtle off-by-one differences: the non-nullable path (using string_ref_at) strips \0 and compares character content only, while the nullable path (using get_data_at) includes \0 in the comparison. For hamming_distance, this would mean different results depending on whether the input column is nullable or not.
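The off-by-one risk described above can be shown with a simplified kernel (std::optional stands in for the size-mismatch error; the actual BE returns Status::InvalidArgument): if one code path strips the trailing '\0' and the other keeps it, the same logical inputs reach the kernel with lengths differing by one.

```cpp
#include <cstddef>
#include <optional>

// Simplified hamming kernel over raw (pointer, length) pairs.
// nullopt stands in for the unequal-length InvalidArgument error.
std::optional<long> hamming(const char* a, size_t a_len,
                            const char* b, size_t b_len) {
    if (a_len != b_len) return std::nullopt;
    long diff = 0;
    for (size_t i = 0; i < a_len; ++i) diff += (a[i] != b[i]) ? 1 : 0;
    return diff;
}
```

If the non-nullable path passes "abc" with length 3 (terminator stripped) while the nullable path passes it with length 4 (terminator included), comparing a value from each path errors out on a spurious length mismatch; when both paths include the terminator, the '\0' bytes compare equal and the result is unaffected. Consistent access is what prevents the divergence.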
DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {

Same issue as in function_hamming_distance.cpp: defensive if check and non-standard (i == 0) ? 0 : offsets[i - 1] pattern. See the comment on the hamming_distance file for details.
Additionally, FunctionBinaryToType (which FunctionLevenshtein uses) employs use_default_implementation_for_nulls() = true, so the framework strips nullable wrappers and calls vector_vector/vector_scalar/scalar_vector with non-nullable columns. These methods use string_ref_at, but the framework also uses get_data_at() in some paths — maintaining consistency with the standard access pattern would be safer.
std::memcmp(left.data + left_off, right.data + right_off, left_len) == 0;
}

static Int32 levenshtein_distance_ascii(const StringRef& left, const StringRef& right) {

Performance concern, no input size limit: levenshtein_distance is O(m*n) in time and performs per-row std::vector allocations. Two 1 MB strings would require ~10^12 operations. There is no guard against extremely long inputs.
Other functions in the codebase (e.g., functions_multi_string_search.cpp) have explicit size limits for expensive operations. Consider adding a reasonable character-count limit (e.g., 10,000 or 65,535 characters) and returning Status::InvalidArgument when exceeded. This would require changing the return type to Status (similar to hamming_distance).
This is a moderate concern — practical usage rarely involves very long strings, but it's a potential DoS vector via crafted SQL.
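The suggested guard might look like the sketch below. The 65535 threshold is a hypothetical value chosen for illustration, and std::optional stands in for the Status::InvalidArgument the BE function would actually return.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical limit; the real threshold would be chosen by the project.
constexpr size_t kMaxLevenshteinChars = 65535;

// Levenshtein with an input-size guard: reject inputs whose O(m*n) DP cost
// would be prohibitive instead of silently burning CPU.
std::optional<int> levenshtein_limited(const std::string& a,
                                       const std::string& b) {
    if (a.size() > kMaxLevenshteinChars || b.size() > kMaxLevenshteinChars) {
        return std::nullopt;  // stands in for Status::InvalidArgument
    }
    const size_t m = a.size(), n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);
    for (size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);
        for (size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min(std::min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
        }
        std::swap(prev, curr);
    }
    return prev[n];
}
```

At 65535 characters per side the worst case is still ~4*10^9 cell updates, so a production limit might well be lower; the point is that the cost is bounded and the error is explicit.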
6 1
7 2

-- !hamming_distance --
Orphaned output entry: This -- !hamming_distance -- block with \N has no corresponding qt_hamming_distance query in the .groovy test file. After the qt_hamming_distance_tbl query (the table test), the groovy file proceeds to qt_soundex for non-ASCII testing — there is no additional qt_hamming_distance call.
This orphaned entry will cause the regression test to fail because the .out file has more output blocks than the .groovy file produces. The .out file should be auto-generated by running the test, not hand-edited. Please regenerate this file.
};

template <template <typename, typename> typename Impl, typename Name>
class FunctionBinaryStringToTypeWithNull : public IFunction {

Why do we need this new template? Why not FunctionBinaryToType?

Because hamming_distance needs different behavior from FunctionBinaryToType: it must return NULL when any argument is NULL, but still return InvalidArgument for non-NULL inputs of unequal length. I used this wrapper to keep that row-level NULL handling explicit. Without it, hamming_distance('abc', NULL) would return InvalidArgument.
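The row-level semantics the author describes can be sketched as follows, with std::optional standing in for a nullable SQL value and the exception standing in for Status::InvalidArgument (both are simplifications of the actual column-level code):

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// NULL in either argument -> NULL out; two non-NULL strings of unequal
// length -> error. This is the distinction the custom wrapper preserves.
std::optional<long> hamming_nullable(const std::optional<std::string>& a,
                                     const std::optional<std::string>& b) {
    if (!a.has_value() || !b.has_value()) return std::nullopt;  // NULL result
    if (a->size() != b->size()) throw std::invalid_argument("lengths differ");
    long diff = 0;
    for (size_t i = 0; i < a->size(); ++i) diff += ((*a)[i] != (*b)[i]) ? 1 : 0;
    return diff;
}
```

With the default null handling, the framework strips the nullable wrapper and runs the kernel on the nested column, where a NULL row carries a default (empty) value; an empty string against 'abc' is a length mismatch, which is exactly why hamming_distance('abc', NULL) would wrongly surface InvalidArgument instead of NULL.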
TPC-H: Total hot run time: 26861 ms
TPC-DS: Total hot run time: 167910 ms
FE UT Coverage Report: increment line coverage
BE UT Coverage Report: increment line coverage, increment coverage report
BE Regression && UT Coverage Report: increment line coverage, increment coverage report
std::vector<size_t> left_offsets;
std::vector<size_t> right_offsets;
utf8_char_offsets(left, left_offsets);
utf8_char_offsets(right, right_offsets);

Compute the constant column's offsets once (in the vector_scalar/scalar_vector paths), hoist the offset vectors outside the row loop, and reserve capacity based on the maximum per-row length.
Done for both levenshtein and hamming_distance.
const Int32 insert_cost = curr[j - 1] + 1;
const Int32 delete_cost = prev[j] + 1;
const Int32 replace_cost = prev[j - 1] + cost;
curr[j] = std::min({insert_cost, delete_cost, replace_cost});

Use std::min(std::min(a, b), c) to avoid constructing a temporary initializer_list.
run buildall

TPC-H: Total hot run time: 26915 ms
TPC-DS: Total hot run time: 168305 ms
FE UT Coverage Report: increment line coverage
BE UT Coverage Report: increment line coverage, increment coverage report
BE Regression && UT Coverage Report: increment line coverage, increment coverage report
What problem does this PR solve?
Related Issue: #48203
Related PR: #57144 (reference)
Problem Summary: support levenshtein (Hive) and hamming_distance (Trino/Presto).
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?