Add levenshtein and hamming_distance functions #60412

KazeBox33 wants to merge 15 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

Pull request overview
This PR adds support for two string distance functions: levenshtein (Hive compatibility) and hamming_distance (Trino/Presto compatibility). These functions compute edit distance and character-difference distance between strings respectively, with proper UTF-8 character handling.
Changes:
- Implemented levenshtein function that calculates the minimum number of single-character edits needed to transform one string into another
- Implemented hamming_distance function that counts character differences between equal-length strings
- Added comprehensive test coverage for both functions, including edge cases, NULL handling, and UTF-8 character support
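To make the two algorithms concrete, here is a minimal reference sketch of both functions. It operates on byte strings (effectively ASCII) and uses plain int/long; the actual BE code additionally decodes UTF-8 code points so that multi-byte characters count as one unit, and returns Int32/Int64 column types.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Minimum number of single-character insertions, deletions, or substitutions
// needed to turn `a` into `b` (classic two-row dynamic programming).
int levenshtein(const std::string& a, const std::string& b) {
    const size_t m = a.size(), n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);
    for (size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);
        for (size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min(std::min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
        }
        std::swap(prev, curr);
    }
    return prev[n];
}

// Number of positions at which two equal-length strings differ; throws on
// unequal lengths, mirroring the InvalidArgument behavior described in the PR.
long hamming_distance(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) throw std::invalid_argument("lengths differ");
    long diff = 0;
    for (size_t i = 0; i < a.size(); ++i) diff += (a[i] != b[i]) ? 1 : 0;
    return diff;
}
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions plus one insertion), and `hamming_distance("abc", "abd")` is 1.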
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test_string_all.groovy | Added regression tests for levenshtein and hamming_distance functions with various input scenarios |
| test_string_function.groovy | Added nereids query tests covering edge cases and error conditions for both functions |
| test_string_all.out | Expected output for levenshtein and hamming_distance test cases |
| test_string_function.out | Expected output for nereids function tests |
| ScalarFunctionVisitor.java | Added visitor methods for HammingDistance and Levenshtein function nodes |
| Levenshtein.java | Implemented Levenshtein scalar function class with signature definitions |
| HammingDistance.java | Implemented HammingDistance scalar function class with signature definitions |
| StringArithmetic.java | Added executable implementations for levenshtein and hamming_distance with dynamic programming algorithm |
| BuiltinScalarFunctions.java | Registered new functions in the builtin scalar functions registry |
| simple_function_factory.h | Added function registration declarations for backend implementations |
| function_string.cpp | Minor whitespace cleanup |
| function_levenshtein.cpp | Backend C++ implementation of levenshtein with UTF-8 support |
| function_hamming_distance.cpp | Backend C++ implementation of hamming_distance with UTF-8 support and length validation |
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")

Test identifier 'qt_hamming_distance_3331' has inconsistent numbering (3331 instead of 333). It should be 'qt_hamming_distance_333' to match the pattern used for other functions, such as 'qt_space_333'.

Suggested change:

qt_hamming_distance_333 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
// HAMMING_DISTANCE tests
qt_hamming_distance_3331 "SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');"
testFoldConst("SELECT hamming_distance('', ''), hamming_distance('abc', 'abc'), hamming_distance('abc', 'abd'), hamming_distance('你好', '你们');")
qt_hamming_distance_3332 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"

Test identifier 'qt_hamming_distance_3332' has inconsistent numbering (3332 instead of 334). It should be 'qt_hamming_distance_334' to follow the sequential pattern.

Suggested change:

qt_hamming_distance_334 "SELECT hamming_distance('abc', 'abc'), hamming_distance(NULL, 'abc'), hamming_distance('abc', NULL);"
auto& res_data = res_column->get_data();

for (size_t i = 0; i < input_rows_count; ++i) {
    const StringRef left = left_col->get_data_at(i);

This is a virtual function call. Cast the column to its concrete type before you use it in the loop.
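The reviewer's point can be illustrated with a simplified stand-in for the column hierarchy (the type names below mimic but are not the actual Doris API): cast to the concrete column type once outside the loop, so the hot path avoids per-row virtual dispatch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the column types; names are illustrative only.
struct StringRef { const char* data; size_t size; };

struct IColumn {
    virtual ~IColumn() = default;
    virtual StringRef get_data_at(size_t i) const = 0;  // virtual per-row call
};

struct ColumnString : IColumn {
    std::vector<std::string> rows;
    StringRef get_data_at(size_t i) const override {
        return {rows[i].data(), rows[i].size()};
    }
};

// Cast once, before the loop; then access the concrete column's storage
// directly instead of paying a virtual dispatch on every row.
size_t total_bytes(const IColumn& col, size_t rows) {
    const auto& str_col = static_cast<const ColumnString&>(col);  // cast once
    size_t total = 0;
    for (size_t i = 0; i < rows; ++i) {
        total += str_col.rows[i].size();  // direct access, no virtual call
    }
    return total;
}
```

In the real code the cast would typically use the codebase's checked-cast helper, and the loop would read raw chars/offsets arrays rather than a vector of strings.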
size_t row) {
    std::vector<size_t> left_offsets;
    std::vector<size_t> right_offsets;
    utf8_char_offsets(left, left_offsets);

You can check whether the input is all ASCII first; if it is, there is no need to do the heavy UTF-8 work.

I have used simd::VStringFunctions::is_ascii.
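The suggested fast path works because in pure-ASCII input every byte is a complete character, so byte offsets and character offsets coincide. A scalar sketch (the PR uses the SIMD helper simd::VStringFunctions::is_ascii; the function names here are illustrative stand-ins):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Scalar ASCII check: every byte must be < 0x80.
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Collect the byte offset of each UTF-8 character start. Start bytes are
// every byte that is not a continuation byte (continuation = 0b10xxxxxx).
void utf8_char_offsets(const std::string& s, std::vector<size_t>& offsets) {
    for (size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            offsets.push_back(i);
        }
    }
}

// Character count with the fast path: no offset vector for ASCII input.
size_t char_count(const std::string& s) {
    if (is_ascii(s)) return s.size();  // fast path: 1 byte == 1 character
    std::vector<size_t> offsets;
    utf8_char_offsets(s, offsets);
    return offsets.size();
}
```

For ASCII rows this skips both the per-row vector allocation and the decoding loop, which is where most of the cost sits for typical workloads.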
namespace doris::vectorized {
#include "common/compile_check_begin.h"

class FunctionLevenshtein : public IFunction {

These two functions could share a common base template; we may already have one.
const size_t len = left_offsets.size();
for (size_t i = 0; i < len; ++i) {
    size_t left_off = left_offsets[i];
    size_t left_next = (i + 1 < len) ? left_offsets[i + 1] : left.size;

Make this branchless. Also, is this loop meaningful for strings of different lengths?
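One way to remove the `(i + 1 < len)` ternary, sketched under the assumption that the offsets vector is built by the function itself: append the total byte size as a sentinel entry, so `offsets[i + 1]` is always valid (the helper names are illustrative, not the PR's exact code).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build UTF-8 character start offsets, plus a final sentinel equal to the
// total byte size. With the sentinel, the "next offset" lookup needs no
// bounds branch.
void utf8_char_offsets_with_sentinel(const std::string& s,
                                     std::vector<size_t>& offsets) {
    for (size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            offsets.push_back(i);  // start byte of a UTF-8 character
        }
    }
    offsets.push_back(s.size());  // sentinel: one-past-the-end offset
}

// Byte length of the i-th character, branchless even for the last index.
size_t char_byte_len(const std::vector<size_t>& offsets, size_t i) {
    return offsets[i + 1] - offsets[i];
}
```

With this layout the character count is `offsets.size() - 1`, and the main loop body reduces to two array reads and a subtraction.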
And remember to format your code and add more test cases that query from a table.
…ToType Refactor two string distance functions from custom IFunction classes to FunctionBinaryToType-based implementations. Keep existing ASCII fast path, UTF-8 behavior and hamming length validation semantics. This addresses review feedback about using shared binary-function template style.
…ryToType and extend regression cases Refactor BE implementations to FunctionBinaryToType template style and keep UTF-8/ASCII semantics. Optimize hamming_distance UTF-8 loop branch handling. Add and align query_p0 + nereids_p0 constant/table test cases and expected outputs.
…unction regression cases
Force-pushed fb0c27f to 5ccb369
run buildall

/review
Code Review Summary for PR #60412: Add levenshtein and hamming_distance functions
Overall this is a well-structured PR that adds two useful string distance functions. The FE-BE type consistency is correct, the constant folding implementations match the BE implementations, and the test coverage is reasonable. Below are the issues found, organized by the critical checkpoint conclusions.
Critical Checkpoint Conclusions
1. Goal and correctness: The PR adds levenshtein (returns Int32) and hamming_distance (returns Int64) string distance functions with UTF-8 support. The core algorithms are correct. FE constant folding, FE scalar function classes, BE implementations, and registration are all present.
2. Modification scope: The change is focused: 13 files, all related to the two new functions. Minor cosmetic changes in function_string.cpp (blank-line changes) are unnecessary but harmless.
3. Concurrency: No concurrency concerns - these are pure stateless scalar functions.
4. Lifecycle management: No special lifecycle concerns.
5. Configuration items: None added. N/A.
6. Incompatible changes: None. New functions only.
7. Parallel code paths: hamming_distance uses a custom FunctionBinaryStringToTypeWithNull wrapper class instead of the standard FunctionBinaryToType used by levenshtein. This is necessary because hamming_distance can return Status::InvalidArgument for unequal-length strings, and the custom class provides proper row-level null handling with explicit nullable return type control. This is an acceptable design choice.
8. Special conditional checks: The string_ref_at helper in both files contains defensive if checks that violate AGENTS.md coding standards. See inline comments.
9. Test coverage: Good regression test coverage for both functions with ASCII, UTF-8, NULL, empty string, and table-based tests. Error case for hamming_distance is tested. However, there is an orphaned .out entry. See inline comments.
10. Observability: N/A for pure scalar functions.
11. Transaction/persistence: N/A.
12. FE-BE variable passing: FE-BE types are consistent (levenshtein: Int32/Int32, hamming_distance: Int64/Int64). PropagateNullable in FE aligns with BE null handling for both functions.
13. Performance: The levenshtein function has O(m*n) time complexity with no input size limit, which is a potential DoS concern for very large strings. See inline comments.
14. Other issues: Alias levenshtein_distance (common Hive name) is not registered - this may be intentional but worth considering for compatibility.
DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {

Coding standards violation (AGENTS.md): This is defensive programming. In a well-formed ColumnString, end > begin always holds (every string has at least a \0 terminator) and end <= data.size() is a structural invariant. Per AGENTS.md: "Assert correctness only—never use defensive programming with if or similar constructs."
Also, the (i == 0) ? 0 : offsets[i - 1] ternary on line 83 is unnecessary — PaddedPODArray guarantees offsets[-1] == 0 (see column_string.h comment: "-1th index is Ok"). All standard code in the codebase simply uses offsets[i - 1].
Recommendation: Replace this entire string_ref_at helper with the standard pattern:
const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
size_t str_size = offsets[i] - offsets[i - 1];

Or use column->get_data_at(i) when the column object is available (as done in the nullable path at line 252).
}

size_t str_size = end - begin;
if (str_size > 0 && data[end - 1] == '\0') {
Non-standard pattern: Stripping the trailing \0 is not part of the standard ColumnString access pattern. ColumnString::get_data_at() returns the size including the terminating zero (see size_at() comment: "Size of i-th element, including terminating zero"). Other string functions in the codebase do not strip \0 from raw offset-based access either — they rely on the StringRef.size being computed as offsets[i] - offsets[i-1], which already accounts for the terminator.
Note that on line 252, get_data_at() is used (which includes the \0 in the size), creating an inconsistency within this same file between how string data is accessed in the non-nullable vs nullable paths.
This inconsistency could lead to subtle off-by-one differences: the non-nullable path (using string_ref_at) strips \0 and compares character content only, while the nullable path (using get_data_at) includes \0 in the comparison. For hamming_distance, this would mean different results depending on whether the input column is nullable or not.
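The off-by-one risk described above can be shown with a simplified kernel (std::optional stands in for the size-mismatch error; the actual BE returns Status::InvalidArgument): if one code path strips the trailing '\0' and the other keeps it, the same logical inputs reach the kernel with lengths differing by one.

```cpp
#include <cstddef>
#include <optional>

// Simplified hamming kernel over raw (pointer, length) pairs.
// nullopt stands in for the unequal-length InvalidArgument error.
std::optional<long> hamming(const char* a, size_t a_len,
                            const char* b, size_t b_len) {
    if (a_len != b_len) return std::nullopt;
    long diff = 0;
    for (size_t i = 0; i < a_len; ++i) diff += (a[i] != b[i]) ? 1 : 0;
    return diff;
}
```

If the non-nullable path passes "abc" with length 3 (terminator stripped) while the nullable path passes it with length 4 (terminator included), comparing a value from each path errors out on a spurious length mismatch; when both paths include the terminator, the '\0' bytes compare equal and the result is unaffected. Consistent access is what prevents the divergence.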
DCHECK_LT(i, offsets.size());
const size_t begin = (i == 0) ? 0 : offsets[i - 1];
const size_t end = offsets[i];
if (end <= begin || end > data.size()) {

Same issue as in function_hamming_distance.cpp: defensive if check and non-standard (i == 0) ? 0 : offsets[i - 1] pattern. See the comment on the hamming_distance file for details.
Additionally, FunctionBinaryToType (which FunctionLevenshtein uses) employs use_default_implementation_for_nulls() = true, so the framework strips nullable wrappers and calls vector_vector/vector_scalar/scalar_vector with non-nullable columns. These methods use string_ref_at, but the framework also uses get_data_at() in some paths — maintaining consistency with the standard access pattern would be safer.
std::memcmp(left.data + left_off, right.data + right_off, left_len) == 0;
}

static Int32 levenshtein_distance_ascii(const StringRef& left, const StringRef& right) {

Performance concern, no input size limit: levenshtein_distance is O(m*n) in time and performs per-row std::vector allocations. Two 1 MB strings would require ~10^12 operations. There is no guard against extremely long inputs.
Other functions in the codebase (e.g., functions_multi_string_search.cpp) have explicit size limits for expensive operations. Consider adding a reasonable character-count limit (e.g., 10,000 or 65,535 characters) and returning Status::InvalidArgument when exceeded. This would require changing the return type to Status (similar to hamming_distance).
This is a moderate concern — practical usage rarely involves very long strings, but it's a potential DoS vector via crafted SQL.
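The suggested guard might look like the sketch below. The 65535 threshold is a hypothetical value chosen for illustration, and std::optional stands in for the Status::InvalidArgument the BE function would actually return.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical limit; the real threshold would be chosen by the project.
constexpr size_t kMaxLevenshteinChars = 65535;

// Levenshtein with an input-size guard: reject inputs whose O(m*n) DP cost
// would be prohibitive instead of silently burning CPU.
std::optional<int> levenshtein_limited(const std::string& a,
                                       const std::string& b) {
    if (a.size() > kMaxLevenshteinChars || b.size() > kMaxLevenshteinChars) {
        return std::nullopt;  // stands in for Status::InvalidArgument
    }
    const size_t m = a.size(), n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);
    for (size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
    for (size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);
        for (size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min(std::min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
        }
        std::swap(prev, curr);
    }
    return prev[n];
}
```

At 65535 characters per side the worst case is still ~4*10^9 cell updates, so a production limit might well be lower; the point is that the cost is bounded and the error is explicit.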
6 1
7 2

-- !hamming_distance --
Orphaned output entry: This -- !hamming_distance -- block with \N has no corresponding qt_hamming_distance query in the .groovy test file. After the qt_hamming_distance_tbl query (the table test), the groovy file proceeds to qt_soundex for non-ASCII testing — there is no additional qt_hamming_distance call.
This orphaned entry will cause the regression test to fail because the .out file has more output blocks than the .groovy file produces. The .out file should be auto-generated by running the test, not hand-edited. Please regenerate this file.
};

template <template <typename, typename> typename Impl, typename Name>
class FunctionBinaryStringToTypeWithNull : public IFunction {

Why do we need this new template? Why not FunctionBinaryToType?

Because hamming_distance needs different behavior from FunctionBinaryToType: it must return NULL when any argument is NULL, but still return InvalidArgument for non-NULL inputs of unequal length. I used this wrapper to keep that row-level NULL handling explicit. Without it, hamming_distance('abc', NULL) would return InvalidArgument.
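The row-level semantics the author describes can be sketched as follows, with std::optional standing in for a nullable SQL value and the exception standing in for Status::InvalidArgument (both are simplifications of the actual column-level code):

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// NULL in either argument -> NULL out; two non-NULL strings of unequal
// length -> error. This is the distinction the custom wrapper preserves.
std::optional<long> hamming_nullable(const std::optional<std::string>& a,
                                     const std::optional<std::string>& b) {
    if (!a.has_value() || !b.has_value()) return std::nullopt;  // NULL result
    if (a->size() != b->size()) throw std::invalid_argument("lengths differ");
    long diff = 0;
    for (size_t i = 0; i < a->size(); ++i) diff += ((*a)[i] != (*b)[i]) ? 1 : 0;
    return diff;
}
```

With the default null handling, the framework strips the nullable wrapper and runs the kernel on the nested column, where a NULL row carries a default (empty) value; an empty string against 'abc' is a length mismatch, which is exactly why hamming_distance('abc', NULL) would wrongly surface InvalidArgument instead of NULL.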
TPC-H: Total hot run time: 26861 ms
TPC-DS: Total hot run time: 167910 ms
FE UT Coverage Report: increment line coverage
BE UT Coverage Report: increment line coverage, increment coverage report
BE Regression && UT Coverage Report: increment line coverage, increment coverage report
std::vector<size_t> left_offsets;
std::vector<size_t> right_offsets;
utf8_char_offsets(left, left_offsets);
utf8_char_offsets(right, right_offsets);

Compute the constant column's offsets once (in the vector_scalar/scalar_vector paths), hoist the offset vectors outside the row loop, and reserve capacity based on the maximum per-row length.
Done for both levenshtein and hamming_distance.
const Int32 insert_cost = curr[j - 1] + 1;
const Int32 delete_cost = prev[j] + 1;
const Int32 replace_cost = prev[j - 1] + cost;
curr[j] = std::min({insert_cost, delete_cost, replace_cost});

Use std::min(std::min(a, b), c) to avoid constructing a temporary initializer_list.
run buildall

TPC-H: Total hot run time: 26915 ms
TPC-DS: Total hot run time: 168305 ms
FE UT Coverage Report: increment line coverage
BE UT Coverage Report: increment line coverage, increment coverage report
BE Regression && UT Coverage Report: increment line coverage, increment coverage report
What problem does this PR solve?
Related Issue: #48203
Related PR: #57144 (reference)
Problem Summary: support levenshtein (Hive) and hamming_distance (Trino/Presto).
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?