fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188
fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188oc7o wants to merge 2 commits into
Conversation
`SIMILAR TO` previously passed the pattern straight to Arrow's regex
engine, so SQL wildcards were never translated and matches were
unanchored:
SELECT 'abc' SIMILAR TO 'a%'; -- returned false
SELECT 'x' SIMILAR TO '_'; -- returned false
Translate `%` to `.*` and `_` to `.`, then wrap the pattern in
`^(?:...)$` so the regex matches the entire string. Other regex
metacharacters (`|`, `(`, `)`, `*`, `+`, `?`) pass through unchanged,
matching `SIMILAR TO`'s superset-of-regex semantics.
The translation only fires for literal `Utf8`, `LargeUtf8`, and
`Utf8View` patterns. Non-literal patterns return a `not_impl_err!` —
silently wrong results are worse than an honest error, and this mirrors
how DataFusion already handles the unsupported `ESCAPE` clause. NULL
patterns pass through unchanged.
Existing tests in `binary.rs` were relying on the bug by passing raw
regex strings as `SIMILAR TO` patterns; they have been rewritten to use
SQL wildcard syntax, and new cases cover `%`, `_`, full-string
anchoring, and regex-metacharacter passthrough. End-to-end coverage
added in `strings.slt`.
|
@huaxingao @viirya @wesm Could one of you trigger CI for me please? Thanks! |
|
@oc7o Triggered. CI is running now. |
There was a problem hiding this comment.
@oc7o
Thanks for working on this. The direction looks good, but I think there are still a couple of correctness issues in the SIMILAR TO translation that should be addressed before this can be merged. I also have one small test coverage suggestion.
| match ch { | ||
| '%' => result.push_str(".*"), | ||
| '_' => result.push('.'), | ||
| c => result.push(c), |
There was a problem hiding this comment.
Thanks for tackling this. I think there is still one correctness issue here.
SIMILAR TO currently copies every non-% and non-_ character directly into the Arrow regex. That means regex metacharacters like ., ^, and $ are still treated as regex operators even though they are literals in SQL SIMILAR TO patterns.
For example, the SQL pattern a. should only match the literal string a., but the current translation produces ^(?:a.)$, so SELECT 'ab' SIMILAR TO 'a.' incorrectly returns true.
Could we translate the SQL pattern grammar explicitly instead? SQL literals should be escaped for the regex, and only the metacharacters that SIMILAR TO actually defines should be emitted as regex syntax.
| result.push_str("^(?:"); | ||
| for ch in pattern.chars() { | ||
| match ch { | ||
| '%' => result.push_str(".*"), |
There was a problem hiding this comment.
One other correctness issue I noticed is that % and _ are translated to .* and ., but Rust and Arrow regexes do not let . match newlines by default.
SQL wildcards are expected to match any character, including newlines, so values containing \n still behave incorrectly. For example, 'a\nb' SIMILAR TO 'a%b' should match, but the generated ^(?:a.*b)$ does not.
Could we use a dot-all form such as (?s:.*) and (?s:.), or an equivalent character class? It would also be great to add a regression test for this case.
| P1m1e1 | ||
| e1 | ||
|
|
||
| # SIMILAR TO with `%` wildcard (zero or more characters) |
There was a problem hiding this comment.
The new SLT coverage covers the common %, _, and anchoring cases well. After updating the translator, could we also add a small regression test showing that regex metacharacters are treated as literals? For example, SIMILAR TO 'a.' should match 'a.' but not 'ab'. That should help prevent accidental regressions back to raw regex semantics.
SIMILAR TOpreviously passed the pattern straight to Arrow's regex engine, so SQL wildcards were never translated and matches were unanchored:Translate
%to.*and_to., then wrap the pattern in^(?:...)$so the regex matches the entire string. Other regex metacharacters (|,(,),*,+,?) pass through unchanged, matchingSIMILAR TO's superset-of-regex semantics.The translation only fires for literal
Utf8,LargeUtf8, andUtf8Viewpatterns. Non-literal patterns return anot_impl_err!— silently wrong results are worse than an honest error, and this mirrors how DataFusion already handles the unsupportedESCAPEclause. NULL patterns pass through unchanged.Which issue does this PR close?
SIMILAR TOshould treat%as a wildcard #22263.Rationale for this change
SIMILAR TOis a SQL standard operator with well-defined wildcard semantics (%= any sequence,_= single character, full-string match). DataFusion's current behavior silently produces wrong results for the most basic patterns, which is a correctness bug for anyone porting queries from Postgres or other SQL engines.What changes are included in this PR?
sql_similar_to_regexhelper indatafusion/physical-expr/src/expressions/binary.rsthat translates%/_and anchors the pattern with^(?:...)$.similar_to()now translates the pattern for literalUtf8/LargeUtf8/Utf8Viewvalues, passesNULLthrough unchanged, and returnsnot_impl_err!for non-literal patterns.Are these changes tested?
Yes:
test_similar_toinbinary.rswas relying on the bug by passing raw regex strings; rewritten to use SQL wildcard syntax.%,_, full-string anchoring, regex-metacharacter passthrough, NULL pattern, and the non-literal-pattern error path.datafusion/sqllogictest/test_files/strings.slt.Are there any user-facing changes?
Yes —
SIMILAR TOnow produces correct results for queries that were previously returning wrong answers. Queries that happened to rely on the buggy behavior (passing raw regex throughSIMILAR TO) will change. No public API changes.