
Add levenshtein and hamming_distance functions#60412

Open
KazeBox33 wants to merge 15 commits into apache:master from KazeBox33:feature/levenshtein-hamming

Conversation

@KazeBox33

What problem does this PR solve?

Related Issue: #48203
Related PR: #57144 (reference)
Problem Summary: Add support for the levenshtein function (Hive-compatible) and the hamming_distance function (Trino/Presto-compatible).

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
      • ./run-regression-test.sh --run -d regression-test/suites/query_p0/sql_functions/string_functions -s string_functions_all
      • ./run-regression-test.sh --run -d regression-test/suites/nereids_p0/sql_functions/string_functions -s test_string_function
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

@KazeBox33 KazeBox33 requested a review from zclllyybb as a code owner February 1, 2026 16:11
@Thearas
Contributor

Thearas commented Feb 1, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Copilot AI left a comment


Pull request overview

This PR adds support for two string distance functions: levenshtein (Hive compatibility) and hamming_distance (Trino/Presto compatibility). These functions compute edit distance and character-difference distance between strings respectively, with proper UTF-8 character handling.

Changes:

  • Implemented levenshtein function that calculates the minimum number of single-character edits needed to transform one string into another
  • Implemented hamming_distance function that counts character differences between equal-length strings
  • Added comprehensive test coverage for both functions including edge cases, NULL handling, and UTF-8 character support
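The character-level behaviour described above can be illustrated with a minimal C++ sketch. It assumes valid UTF-8 input and uses hypothetical names (`utf8_starts`, `hamming_distance_utf8`) that are not taken from the PR; `std::nullopt` stands in for the error the real function reports on unequal lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Byte offset at which each UTF-8 code point starts (assumes valid UTF-8).
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
std::vector<std::size_t> utf8_starts(std::string_view s) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            starts.push_back(i);
        }
    }
    return starts;
}

// Counts positions whose code points differ.
// Returns std::nullopt when the inputs differ in code-point count
// (an illustrative stand-in for an error status).
std::optional<long long> hamming_distance_utf8(std::string_view a, std::string_view b) {
    const auto sa = utf8_starts(a);
    const auto sb = utf8_starts(b);
    if (sa.size() != sb.size()) {
        return std::nullopt;
    }
    long long diff = 0;
    for (std::size_t i = 0; i < sa.size(); ++i) {
        const std::size_t a_end = (i + 1 < sa.size()) ? sa[i + 1] : a.size();
        const std::size_t b_end = (i + 1 < sb.size()) ? sb[i + 1] : b.size();
        assert(a_end > sa[i] && b_end > sb[i]);  // every code point has at least one byte
        if (a.substr(sa[i], a_end - sa[i]) != b.substr(sb[i], b_end - sb[i])) {
            ++diff;
        }
    }
    return diff;
}
```

Comparing whole code points rather than bytes is what makes a pair such as '你好' and '你们' count as a single difference.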

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test_string_all.groovy | Added regression tests for levenshtein and hamming_distance functions with various input scenarios |
| test_string_function.groovy | Added nereids query tests covering edge cases and error conditions for both functions |
| test_string_all.out | Expected output for levenshtein and hamming_distance test cases |
| test_string_function.out | Expected output for nereids function tests |
| ScalarFunctionVisitor.java | Added visitor methods for HammingDistance and Levenshtein function nodes |
| Levenshtein.java | Implemented Levenshtein scalar function class with signature definitions |
| HammingDistance.java | Implemented HammingDistance scalar function class with signature definitions |
| StringArithmetic.java | Added executable implementations for levenshtein and hamming_distance with dynamic programming algorithm |
| BuiltinScalarFunctions.java | Registered new functions in the builtin scalar functions registry |
| simple_function_factory.h | Added function registration declarations for backend implementations |
| function_string.cpp | Minor whitespace cleanup |
| function_levenshtein.cpp | Backend C++ implementation of levenshtein with UTF-8 support |
| function_hamming_distance.cpp | Backend C++ implementation of hamming_distance with UTF-8 support and length validation |


Comment on lines +763 to +764
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
Copilot AI Feb 2, 2026

Test identifier 'qt_hamming_distance_3331' has inconsistent numbering (3331 instead of 333). Should be 'qt_hamming_distance_333' to match the pattern used for other functions like 'qt_space_333'.

Suggested change
- qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
- testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
+ qt_hamming_distance_333 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
+ testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")

// HAMMING_DISTANCE tests
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
qt_hamming_distance_3332 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"
Copilot AI Feb 2, 2026

Test identifier 'qt_hamming_distance_3332' has inconsistent numbering (3332 instead of 334). Should be 'qt_hamming_distance_334' to follow the sequential pattern.

Suggested change
- qt_hamming_distance_3332 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"
+ qt_hamming_distance_334 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"

auto& res_data = res_column->get_data();

for (size_t i = 0; i < input_rows_count; ++i) {
const StringRef left = left_col->get_data_at(i);
Contributor

This is a virtual function call. Cast the column to its concrete type before using it.
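A minimal sketch of the idea, using hypothetical stand-in types (`Column`, `StringColumn`) rather than the Doris classes: the concrete type is resolved once before the loop, so the per-row access is a plain non-virtual call. The `dynamic_cast` plus `assert` here is only a placeholder for whatever cast helper the codebase provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical column interface with a virtual per-row accessor.
struct Column {
    virtual ~Column() = default;
    virtual std::string_view get_data_at(std::size_t row) const = 0;
};

// Hypothetical concrete string column.
struct StringColumn final : Column {
    std::vector<std::string> rows;
    std::string_view get_data_at(std::size_t row) const override { return rows[row]; }
    // Non-virtual accessor that the compiler can inline inside a hot loop.
    std::string_view value(std::size_t row) const { return rows[row]; }
};

// Sums the byte lengths of the first n rows.
std::size_t total_bytes(const Column& col, std::size_t n) {
    // Resolve the concrete type once, outside the loop.
    const auto* str_col = dynamic_cast<const StringColumn*>(&col);
    assert(str_col != nullptr);
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += str_col->value(i).size();  // no virtual dispatch per row
    }
    return total;
}
```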

size_t row) {
std::vector<size_t> left_offsets;
std::vector<size_t> right_offsets;
utf8_char_offsets(left, left_offsets);
Contributor

You can check whether the input is all ASCII first. If it is, the heavy UTF-8 work is unnecessary.

Author

Done. It now uses simd::VStringFunctions::is_ascii.
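For illustration, a scalar version of such a fast path might look like the sketch below. The helper names are hypothetical and this is not the SIMD routine mentioned in the reply; it only shows why the ASCII case is cheap: one byte is one character, so no offset table is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Scalar ASCII check: every byte must have its high bit clear.
// OR-ing all bytes together avoids a branch per byte.
bool is_ascii(std::string_view s) {
    unsigned char acc = 0;
    for (unsigned char c : s) {
        acc |= c;
    }
    return (acc & 0x80) == 0;
}

// Hamming distance for two equal-length ASCII strings.
// Each byte is one character, so bytes are compared directly.
std::size_t hamming_ascii(std::string_view a, std::string_view b) {
    assert(a.size() == b.size());
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        diff += (a[i] != b[i]);
    }
    return diff;
}
```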

namespace doris::vectorized {
#include "common/compile_check_begin.h"

class FunctionLevenshtein : public IFunction {
Contributor

These two functions could share a base template. We may already have one.

const size_t len = left_offsets.size();
for (size_t i = 0; i < len; ++i) {
size_t left_off = left_offsets[i];
size_t left_next = (i + 1 < len) ? left_offsets[i + 1] : left.size;
Contributor

Make this branchless. Also, is this meaningful for strings of different lengths?
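One possible way to drop the per-character bounds branch, sketched under the assumption that the offset table can carry one extra sentinel entry equal to the string's byte size. The function names are hypothetical and this is not necessarily what the PR ended up doing.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Start offset of every UTF-8 code point, followed by a sentinel equal to s.size().
// With the sentinel, off[i + 1] is valid for every character index i.
std::vector<std::size_t> utf8_offsets_with_sentinel(std::string_view s) {
    std::vector<std::size_t> off;
    off.reserve(s.size() + 1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            off.push_back(i);
        }
    }
    off.push_back(s.size());
    return off;
}

// Byte length of the i-th code point: a single subtraction, no ternary.
std::size_t char_len(const std::vector<std::size_t>& off, std::size_t i) {
    assert(i + 1 < off.size());
    return off[i + 1] - off[i];
}
```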

@zclllyybb
Contributor

Also remember to format your code, and add more test cases that query from a table.

@KazeBox33 KazeBox33 requested a review from zclllyybb March 2, 2026 16:25
…ToType

Refactor two string distance functions from custom IFunction classes to FunctionBinaryToType-based implementations.

Keep existing ASCII fast path, UTF-8 behavior and hamming length validation semantics.

This addresses review feedback about using shared binary-function template style.
…ryToType and extend regression cases

Refactor BE implementations to FunctionBinaryToType template style and keep UTF-8/ASCII semantics.

Optimize hamming_distance UTF-8 loop branch handling.

Add and align query_p0 + nereids_p0 constant/table test cases and expected outputs.
@KazeBox33 KazeBox33 force-pushed the feature/levenshtein-hamming branch from fb0c27f to 5ccb369 Compare March 14, 2026 12:31
@zclllyybb
Contributor

run buildall

@zclllyybb
Contributor

/review

Contributor

@github-actions github-actions bot left a comment

Code Review Summary for PR #60412: Add levenshtein and hamming_distance functions

Overall this is a well-structured PR that adds two useful string distance functions. The FE-BE type consistency is correct, the constant folding implementations match the BE implementations, and the test coverage is reasonable. Below are the issues found, organized by the critical checkpoint conclusions.

Critical Checkpoint Conclusions

1. Goal and correctness: The PR adds levenshtein (returns Int32) and hamming_distance (returns Int64) string distance functions with UTF-8 support. The core algorithms are correct. FE constant folding, FE scalar function classes, BE implementations, and registration are all present.

2. Modification scope: The change is focused - 13 files, all related to the two new functions. Minor cosmetic changes in function_string.cpp (blank line changes) are unnecessary but harmless.

3. Concurrency: No concurrency concerns - these are pure stateless scalar functions.

4. Lifecycle management: No special lifecycle concerns.

5. Configuration items: None added. N/A.

6. Incompatible changes: None. New functions only.

7. Parallel code paths: hamming_distance uses a custom FunctionBinaryStringToTypeWithNull wrapper class instead of the standard FunctionBinaryToType used by levenshtein. This is necessary because hamming_distance can return Status::InvalidArgument for unequal-length strings, and the custom class provides proper row-level null handling with explicit nullable return type control. This is an acceptable design choice.

8. Special conditional checks: The string_ref_at helper in both files contains defensive if checks that violate AGENTS.md coding standards. See inline comments.

9. Test coverage: Good regression test coverage for both functions with ASCII, UTF-8, NULL, empty string, and table-based tests. Error case for hamming_distance is tested. However, there is an orphaned .out entry. See inline comments.

10. Observability: N/A for pure scalar functions.

11. Transaction/persistence: N/A.

12. FE-BE variable passing: FE-BE types are consistent (levenshtein: Int32/Int32, hamming_distance: Int64/Int64). PropagateNullable in FE aligns with BE null handling for both functions.

13. Performance: The levenshtein function has O(m*n) time complexity with no input size limit, which is a potential DoS concern for very large strings. See inline comments.

14. Other issues: Alias levenshtein_distance (common Hive name) is not registered - this may be intentional but worth considering for compatibility.

DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {
Contributor

Coding standards violation (AGENTS.md): This is defensive programming. In a well-formed ColumnString, end > begin always holds (every string has at least a \0 terminator) and end <= data.size() is a structural invariant. Per AGENTS.md: "Assert correctness only—never use defensive programming with if or similar constructs."

Also, the (i == 0) ? 0 : offsets[i - 1] ternary on line 83 is unnecessary — PaddedPODArray guarantees offsets[-1] == 0 (see column_string.h comment: "-1th index is Ok"). All standard code in the codebase simply uses offsets[i - 1].

Recommendation: Replace this entire string_ref_at helper with the standard pattern:

const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
size_t str_size = offsets[i] - offsets[i - 1];

Or use column->get_data_at(i) when the column object is available (as done in the nullable path at line 252).
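To make the offsets convention concrete, here is a toy model of an offsets-based string column. `MiniStringColumn` is hypothetical and is not the Doris `ColumnString`; it stores no terminator bytes and emulates the "index -1 is valid and equals 0" guarantee with one padding slot in front of the offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Toy offsets-based string column.
// storage_[0] is a padding slot holding 0, so ends[-1] is valid and equals 0.
class MiniStringColumn {
public:
    void push_back(std::string_view s) {
        chars_.append(s.data(), s.size());
        storage_.push_back(chars_.size());  // end offset of the new row
    }

    std::size_t size() const { return storage_.size() - 1; }

    // Row access without a special case for i == 0.
    std::string_view at(std::size_t i) const {
        assert(i < size());
        const std::size_t* ends = storage_.data() + 1;  // ends[i] = end of row i
        const std::size_t begin = ends[static_cast<std::ptrdiff_t>(i) - 1];
        return std::string_view(chars_).substr(begin, ends[i] - begin);
    }

private:
    std::string chars_;                     // all rows, concatenated
    std::vector<std::size_t> storage_{0};   // padding slot, then one end offset per row
};
```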

}

size_t str_size = end - begin;
if (str_size > 0 && data[end - 1] == '\0') {
Contributor

Non-standard pattern: Stripping the trailing \0 is not part of the standard ColumnString access pattern. ColumnString::get_data_at() returns the size including the terminating zero (see size_at() comment: "Size of i-th element, including terminating zero"). Other string functions in the codebase do not strip \0 from raw offset-based access either — they rely on the StringRef.size being computed as offsets[i] - offsets[i-1], which already accounts for the terminator.

Note that on line 252, get_data_at() is used (which includes the \0 in the size), creating an inconsistency within this same file between how string data is accessed in the non-nullable vs nullable paths.

This inconsistency could lead to subtle off-by-one differences: the non-nullable path (using string_ref_at) strips \0 and compares character content only, while the nullable path (using get_data_at) includes \0 in the comparison. For hamming_distance, this would mean different results depending on whether the input column is nullable or not.

DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {
Contributor

Same issue as in function_hamming_distance.cpp: defensive if check and non-standard (i == 0) ? 0 : offsets[i - 1] pattern. See the comment on the hamming_distance file for details.

Additionally, FunctionBinaryToType (which FunctionLevenshtein uses) employs use_default_implementation_for_nulls() = true, so the framework strips nullable wrappers and calls vector_vector/vector_scalar/scalar_vector with non-nullable columns. These methods use string_ref_at, but the framework also uses get_data_at() in some paths — maintaining consistency with the standard access pattern would be safer.

std::memcmp(left.data + left_off, right.data + right_off, left_len) == 0;
}

static Int32 levenshtein_distance_ascii(const StringRef& left, const StringRef& right) {
Contributor

Performance concern (no input size limit): levenshtein_distance is O(m*n) in time and performs per-row std::vector allocations. Two 1MB strings would require ~10^12 operations. There is no guard against extremely long inputs.

Other functions in the codebase (e.g., functions_multi_string_search.cpp) have explicit size limits for expensive operations. Consider adding a reasonable character-count limit (e.g., 10,000 or 65,535 characters) and returning Status::InvalidArgument when exceeded. This would require changing the return type to Status (similar to hamming_distance).

This is a moderate concern — practical usage rarely involves very long strings, but it's a potential DoS vector via crafted SQL.
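A rough sketch of what such a guard could look like around a two-row dynamic-programming loop. The function name, the default cap of 10,000, and the use of `std::nullopt` in place of an error status are all assumptions for illustration; the sketch is byte-wise, so it only matches character distance for ASCII input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical default cap, taken from the range floated in the review.
constexpr std::size_t kDefaultMaxLen = 10000;

// Byte-wise Levenshtein distance using two rolling rows (O(n) memory).
// Returns std::nullopt when either input exceeds max_len.
std::optional<int> levenshtein_limited(std::string_view a, std::string_view b,
                                       std::size_t max_len = kDefaultMaxLen) {
    if (a.size() > max_len || b.size() > max_len) {
        return std::nullopt;  // reject before doing O(m*n) work
    }
    std::vector<int> prev(b.size() + 1);
    std::vector<int> curr(b.size() + 1);
    std::iota(prev.begin(), prev.end(), 0);  // distance from the empty prefix of a
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min(std::min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With a cap of 10,000 characters the worst case is bounded at about 10^8 cell updates per row pair, which is the kind of ceiling the comment asks for.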

6 1
7 2

-- !hamming_distance --
Contributor

Orphaned output entry: This -- !hamming_distance -- block with \N has no corresponding qt_hamming_distance query in the .groovy test file. After the qt_hamming_distance_tbl query (the table test), the groovy file proceeds to qt_soundex for non-ASCII testing — there is no additional qt_hamming_distance call.

This orphaned entry will cause the regression test to fail because the .out file has more output blocks than the .groovy file produces. The .out file should be auto-generated by running the test, not hand-edited. Please regenerate this file.

};

template <template <typename, typename> typename Impl, typename Name>
class FunctionBinaryStringToTypeWithNull : public IFunction {
Contributor

Why do we need this new template? Why not use FunctionBinaryToType?

Author

Because hamming_distance needs behavior that FunctionBinaryToType does not provide: it must return NULL when any argument is NULL, but still return InvalidArgument for non-NULL inputs of unequal length. I used this wrapper to keep that row-level NULL handling explicit.
Without the wrapper, hamming_distance("abc", NULL) would return InvalidArgument.
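The ordering the author describes (NULL check first, length validation only on non-NULL rows) can be sketched as follows. This models SQL NULL with `std::optional` and the error with an exception; the names and the error message are illustrative only, and the comparison is byte-wise for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using Cell = std::optional<std::string>;  // std::nullopt models SQL NULL

// Byte-wise Hamming distance per row.
// A NULL in either argument yields NULL for that row; the length check
// runs only on rows where both arguments are non-NULL.
std::vector<std::optional<long long>> hamming_rows(const std::vector<Cell>& left,
                                                   const std::vector<Cell>& right) {
    assert(left.size() == right.size());
    std::vector<std::optional<long long>> out(left.size());
    for (std::size_t i = 0; i < left.size(); ++i) {
        if (!left[i] || !right[i]) {
            continue;  // result stays NULL, no validation on this row
        }
        const std::string& a = *left[i];
        const std::string& b = *right[i];
        if (a.size() != b.size()) {
            throw std::invalid_argument("hamming_distance requires strings of equal length");
        }
        long long d = 0;
        for (std::size_t k = 0; k < a.size(); ++k) {
            d += (a[k] != b[k]);
        }
        out[i] = d;
    }
    return out;
}
```

If the length check ran before the NULL check, a NULL row whose placeholder value happened to have a different length would raise the error, which is the failure mode the reply describes.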

@doris-robot

TPC-H: Total hot run time: 26861 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 68d4ca6033fa03d84cbcccd04c127de1e003f48d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17670	4442	4323	4323
q2	q3	10637	822	524	524
q4	4681	369	251	251
q5	7542	1202	1017	1017
q6	174	172	148	148
q7	797	859	671	671
q8	9301	1476	1343	1343
q9	4919	4808	4726	4726
q10	6232	1904	1664	1664
q11	471	261	261	261
q12	684	579	470	470
q13	18056	2918	2179	2179
q14	233	237	221	221
q15	q16	729	724	672	672
q17	713	842	430	430
q18	6011	5356	5255	5255
q19	1141	1003	609	609
q20	541	493	383	383
q21	4513	1817	1388	1388
q22	348	490	326	326
Total cold run time: 95393 ms
Total hot run time: 26861 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4792	4641	4583	4583
q2	q3	3871	4331	3812	3812
q4	863	1186	777	777
q5	4115	4369	4312	4312
q6	185	176	145	145
q7	1738	1632	1495	1495
q8	2509	2729	2519	2519
q9	7675	7414	7417	7414
q10	3772	3972	3581	3581
q11	506	431	428	428
q12	494	597	436	436
q13	2634	3139	2251	2251
q14	284	286	272	272
q15	q16	739	773	700	700
q17	1158	1407	1400	1400
q18	7210	6738	6576	6576
q19	872	863	921	863
q20	2016	2118	2050	2050
q21	3971	3477	3859	3477
q22	511	412	386	386
Total cold run time: 49915 ms
Total hot run time: 47477 ms

@doris-robot

TPC-DS: Total hot run time: 167910 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 68d4ca6033fa03d84cbcccd04c127de1e003f48d, data reload: false

query5	4363	650	523	523
query6	329	221	210	210
query7	4235	497	267	267
query8	344	257	239	239
query9	8703	2709	2731	2709
query10	504	387	333	333
query11	7000	5109	4885	4885
query12	176	126	124	124
query13	1248	457	337	337
query14	5776	3735	3439	3439
query14_1	2787	2795	2816	2795
query15	203	196	175	175
query16	1000	486	458	458
query17	1120	717	615	615
query18	2445	463	350	350
query19	222	211	186	186
query20	131	127	127	127
query21	214	132	110	110
query22	13223	14056	14624	14056
query23	16423	15859	15720	15720
query23_1	15647	15541	15331	15331
query24	7386	1636	1236	1236
query24_1	1239	1239	1239	1239
query25	559	483	443	443
query26	1231	280	150	150
query27	2754	495	298	298
query28	4475	1854	1837	1837
query29	886	590	502	502
query30	305	232	199	199
query31	1027	936	874	874
query32	88	72	74	72
query33	523	343	300	300
query34	899	926	521	521
query35	654	675	600	600
query36	1079	1099	964	964
query37	142	98	89	89
query38	2934	2943	2863	2863
query39	852	848	848	848
query39_1	811	804	793	793
query40	230	150	136	136
query41	62	57	58	57
query42	271	264	257	257
query43	252	245	215	215
query44	
query45	200	184	178	178
query46	916	1008	611	611
query47	2094	2139	2048	2048
query48	318	321	229	229
query49	623	460	385	385
query50	705	281	211	211
query51	4091	4089	4028	4028
query52	253	271	252	252
query53	302	334	281	281
query54	285	273	252	252
query55	90	85	78	78
query56	322	317	306	306
query57	1728	1787	1813	1787
query58	279	275	267	267
query59	2819	2962	2744	2744
query60	339	333	323	323
query61	154	143	150	143
query62	620	581	542	542
query63	320	280	273	273
query64	5088	1266	975	975
query65	
query66	1454	452	354	354
query67	24354	24429	24204	24204
query68	
query69	425	317	289	289
query70	936	890	955	890
query71	343	303	297	297
query72	2779	2679	2158	2158
query73	551	564	323	323
query74	9633	9585	9402	9402
query75	2912	2752	2468	2468
query76	2273	1083	702	702
query77	376	399	311	311
query78	10896	11058	10435	10435
query79	1152	806	557	557
query80	903	636	542	542
query81	520	255	227	227
query82	1357	146	120	120
query83	344	276	243	243
query84	251	119	99	99
query85	900	494	431	431
query86	417	340	289	289
query87	3132	3144	3027	3027
query88	3589	2675	2666	2666
query89	444	377	343	343
query90	1840	177	175	175
query91	168	169	133	133
query92	75	73	74	73
query93	939	882	503	503
query94	513	321	305	305
query95	579	410	315	315
query96	641	541	235	235
query97	2476	2480	2407	2407
query98	234	221	232	221
query99	1025	984	936	936
Total cold run time: 249265 ms
Total hot run time: 167910 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 3.03% (2/66) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.65% (19733/37482)
Line Coverage 36.25% (184414/508725)
Region Coverage 32.39% (142433/439808)
Branch Coverage 33.55% (62204/185420)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.64% (26281/36687)
Line Coverage 54.41% (275861/507003)
Region Coverage 51.53% (228722/443856)
Branch Coverage 53.06% (98633/185902)

std::vector<size_t> left_offsets;
std::vector<size_t> right_offsets;
utf8_char_offsets(left, left_offsets);
utf8_char_offsets(right, right_offsets);
Contributor

For the vector_scalar/scalar_vector cases, it would be better to compute the constant column's offsets only once. Also hoist the offset vectors out of the per-row loop, and reserve their capacity based on the maximum row length.

Author

Done for both levenshtein and hamming_distance.
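A sketch of the hoisting pattern for a column-versus-constant call. The body of `utf8_char_offsets` here is an assumption (only its name appears in the diff), and `same_char_count` is a hypothetical stand-in for the real per-row work; it just reports whether each row has as many code points as the constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Fills `out` with the start offset of every UTF-8 code point in `s`.
// clear() keeps the capacity, so a reused buffer stops allocating once it has grown.
void utf8_char_offsets(std::string_view s, std::vector<std::size_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            out.push_back(i);
        }
    }
}

// For each row, reports whether it has the same code-point count as `constant`.
std::vector<bool> same_char_count(const std::vector<std::string>& rows,
                                  std::string_view constant) {
    std::vector<std::size_t> const_offsets;
    utf8_char_offsets(constant, const_offsets);  // computed once, not once per row

    std::size_t max_len = 0;
    for (const auto& r : rows) {
        max_len = std::max(max_len, r.size());
    }
    std::vector<std::size_t> row_offsets;
    row_offsets.reserve(max_len);  // at most one code point per byte

    std::vector<bool> out;
    out.reserve(rows.size());
    for (const auto& r : rows) {
        utf8_char_offsets(r, row_offsets);  // reuses the same buffer
        out.push_back(row_offsets.size() == const_offsets.size());
    }
    return out;
}
```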


std::vector<size_t> left_offsets;
std::vector<size_t> right_offsets;
utf8_char_offsets(left, left_offsets);
Contributor

ditto

Author

done

const Int32 insert_cost = curr[j - 1] + 1;
const Int32 delete_cost = prev[j] + 1;
const Int32 replace_cost = prev[j - 1] + cost;
curr[j] = std::min({insert_cost, delete_cost, replace_cost});
Contributor

Use min(min(a, b), c) instead of the initializer-list overload to avoid constructing a temporary object.

Author

done

@KazeBox33 KazeBox33 marked this pull request as draft March 15, 2026 09:07
@KazeBox33 KazeBox33 marked this pull request as ready for review March 15, 2026 11:16
@KazeBox33
Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26915 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7f5e716f67e18a0cec437e795b6bfcdfc7bdb6e0, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17660	4447	4272	4272
q2	q3	10642	798	517	517
q4	4711	363	255	255
q5	7826	1203	1034	1034
q6	198	180	148	148
q7	812	836	668	668
q8	10402	1481	1365	1365
q9	5566	4773	4792	4773
q10	6354	1940	1632	1632
q11	492	266	232	232
q12	738	570	468	468
q13	18036	2925	2171	2171
q14	230	227	226	226
q15	q16	736	723	663	663
q17	733	846	447	447
q18	5872	5351	5240	5240
q19	1336	983	600	600
q20	541	486	381	381
q21	4575	2019	1514	1514
q22	367	347	309	309
Total cold run time: 97827 ms
Total hot run time: 26915 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4754	4648	4531	4531
q2	q3	3894	4342	3803	3803
q4	875	1206	790	790
q5	4077	4357	4396	4357
q6	192	179	149	149
q7	1797	1630	1517	1517
q8	2502	2752	2563	2563
q9	7501	7499	7425	7425
q10	3700	3912	3672	3672
q11	514	486	406	406
q12	474	615	443	443
q13	2842	3453	2335	2335
q14	283	305	285	285
q15	q16	743	768	755	755
q17	1219	1387	1426	1387
q18	7274	6761	6725	6725
q19	952	878	852	852
q20	2112	2175	2035	2035
q21	4189	3490	3364	3364
q22	487	416	388	388
Total cold run time: 50381 ms
Total hot run time: 47782 ms

@doris-robot

TPC-DS: Total hot run time: 168305 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7f5e716f67e18a0cec437e795b6bfcdfc7bdb6e0, data reload: false

query5	4337	629	519	519
query6	327	225	219	219
query7	4207	473	273	273
query8	347	240	226	226
query9	8747	2729	2727	2727
query10	504	407	356	356
query11	6988	5045	4857	4857
query12	187	128	124	124
query13	1261	443	355	355
query14	5700	3705	3443	3443
query14_1	2838	2815	2799	2799
query15	210	198	179	179
query16	978	455	444	444
query17	894	717	631	631
query18	2448	446	350	350
query19	221	217	182	182
query20	134	132	129	129
query21	212	135	112	112
query22	13242	14310	14608	14310
query23	16250	15646	15589	15589
query23_1	15651	15784	15509	15509
query24	7133	1611	1194	1194
query24_1	1214	1218	1234	1218
query25	537	478	418	418
query26	1271	265	153	153
query27	2784	476	287	287
query28	4526	1854	1842	1842
query29	822	557	475	475
query30	294	222	186	186
query31	1012	964	882	882
query32	83	72	70	70
query33	507	338	274	274
query34	951	881	529	529
query35	637	667	596	596
query36	1112	1147	1018	1018
query37	130	91	82	82
query38	2964	2885	2845	2845
query39	861	823	802	802
query39_1	801	797	789	789
query40	229	153	137	137
query41	62	62	60	60
query42	251	250	257	250
query43	245	248	228	228
query44	
query45	194	241	191	191
query46	886	998	614	614
query47	2126	2156	2070	2070
query48	322	311	223	223
query49	623	481	397	397
query50	671	280	209	209
query51	4069	4061	3998	3998
query52	261	264	256	256
query53	296	338	280	280
query54	298	276	257	257
query55	99	89	79	79
query56	317	330	309	309
query57	1940	1776	1619	1619
query58	285	271	263	263
query59	2767	2926	2758	2758
query60	337	336	320	320
query61	154	156	155	155
query62	628	578	532	532
query63	307	277	273	273
query64	5047	1275	984	984
query65	
query66	1472	448	365	365
query67	24133	24151	24116	24116
query68	
query69	396	312	278	278
query70	974	971	951	951
query71	337	307	296	296
query72	2912	2875	2595	2595
query73	543	550	323	323
query74	9594	9556	9385	9385
query75	2862	2739	2522	2522
query76	2296	1038	664	664
query77	371	432	341	341
query78	10907	11000	10429	10429
query79	2931	797	580	580
query80	1751	628	538	538
query81	576	259	226	226
query82	993	148	118	118
query83	337	265	244	244
query84	301	121	98	98
query85	892	470	436	436
query86	501	291	292	291
query87	3131	3063	3061	3061
query88	3534	2623	2644	2623
query89	425	366	342	342
query90	2237	173	166	166
query91	165	160	143	143
query92	88	76	69	69
query93	2433	838	497	497
query94	650	313	300	300
query95	585	337	315	315
query96	642	513	238	238
query97	2487	2518	2412	2412
query98	239	222	219	219
query99	1032	995	905	905
Total cold run time: 252881 ms
Total hot run time: 168305 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 3.03% (2/66) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.63% (19726/37479)
Line Coverage 36.21% (184217/508686)
Region Coverage 32.35% (142166/439444)
Branch Coverage 33.55% (62171/185335)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.59% (26263/36685)
Line Coverage 54.36% (275609/506973)
Region Coverage 51.69% (229228/443493)
Branch Coverage 53.05% (98567/185817)


7 participants