fix: translate SQL wildcards in SIMILAR TO patterns (#22263)#23188

Open
oc7o wants to merge 2 commits into
apache:mainfrom
oc7o:bugfix/similar-to-wildcard

Conversation

oc7o commented Jun 25, 2026
SIMILAR TO previously passed the pattern straight to Arrow's regex engine, so SQL wildcards were never translated and matches were unanchored:

SELECT 'abc' SIMILAR TO 'a%';  -- returned false
SELECT 'x'   SIMILAR TO '_';   -- returned false

Translate % to .* and _ to ., then wrap the pattern in ^(?:...)$ so the regex matches the entire string. Other regex metacharacters (|, (, ), *, +, ?) pass through unchanged, matching SIMILAR TO's superset-of-regex semantics.
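The translation described above can be sketched as a small standalone function. The name `sql_similar_to_regex` comes from this PR, but the signature and body here are an illustrative assumption assembled from the description, not a copy of the actual diff:

```rust
/// Sketch of the wildcard translation described in this PR.
/// The signature is an assumption; the real helper in binary.rs may differ.
fn sql_similar_to_regex(pattern: &str) -> String {
    // Reserve room for the pattern plus the "^(?:" and ")$" wrapper.
    let mut result = String::with_capacity(pattern.len() + 6);
    // Anchor at the start and open a non-capturing group so that
    // alternations like `a|b` are anchored as a whole.
    result.push_str("^(?:");
    for ch in pattern.chars() {
        match ch {
            // SQL `%` matches any sequence of characters.
            '%' => result.push_str(".*"),
            // SQL `_` matches exactly one character.
            '_' => result.push('.'),
            // Everything else is copied through unchanged.
            c => result.push(c),
        }
    }
    // Close the group and anchor at the end.
    result.push_str(")$");
    result
}
```

With this sketch, `sql_similar_to_regex("a%")` yields `^(?:a.*)$` and `sql_similar_to_regex("_")` yields `^(?:.)$`, which is why both example queries would then match the full input string.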

The translation only fires for literal Utf8, LargeUtf8, and Utf8View patterns. Non-literal patterns return a not_impl_err! — silently wrong results are worse than an honest error, and this mirrors how DataFusion already handles the unsupported ESCAPE clause. NULL patterns pass through unchanged.

Which issue does this PR close?

Rationale for this change

SIMILAR TO is a SQL standard operator with well-defined wildcard semantics (% = any sequence, _ = single character, full-string match). DataFusion's current behavior silently produces wrong results for the most basic patterns, which is a correctness bug for anyone porting queries from Postgres or other SQL engines.

What changes are included in this PR?

  • New sql_similar_to_regex helper in datafusion/physical-expr/src/expressions/binary.rs that translates %/_ and anchors the pattern with ^(?:...)$.
  • similar_to() now translates the pattern for literal Utf8 / LargeUtf8 / Utf8View values, passes NULL through unchanged, and returns not_impl_err! for non-literal patterns.

Are these changes tested?

Yes:

  • Existing test_similar_to in binary.rs was relying on the bug by passing raw regex strings; rewritten to use SQL wildcard syntax.
  • New unit tests cover %, _, full-string anchoring, regex-metacharacter passthrough, NULL pattern, and the non-literal-pattern error path.
  • End-to-end coverage added in datafusion/sqllogictest/test_files/strings.slt.

Are there any user-facing changes?

Yes — SIMILAR TO now produces correct results for queries that were previously returning wrong answers. Queries that happened to rely on the buggy behavior (passing raw regex through SIMILAR TO) will change. No public API changes.

`SIMILAR TO` previously passed the pattern straight to Arrow's regex
engine, so SQL wildcards were never translated and matches were
unanchored:

    SELECT 'abc' SIMILAR TO 'a%';  -- returned false
    SELECT 'x'   SIMILAR TO '_';   -- returned false

Translate `%` to `.*` and `_` to `.`, then wrap the pattern in
`^(?:...)$` so the regex matches the entire string. Other regex
metacharacters (`|`, `(`, `)`, `*`, `+`, `?`) pass through unchanged,
matching `SIMILAR TO`'s superset-of-regex semantics.

The translation only fires for literal `Utf8`, `LargeUtf8`, and
`Utf8View` patterns. Non-literal patterns return a `not_impl_err!` —
silently wrong results are worse than an honest error, and this mirrors
how DataFusion already handles the unsupported `ESCAPE` clause. NULL
patterns pass through unchanged.

Existing tests in `binary.rs` were relying on the bug by passing raw
regex strings as `SIMILAR TO` patterns; they have been rewritten to use
SQL wildcard syntax, and new cases cover `%`, `_`, full-string
anchoring, and regex-metacharacter passthrough. End-to-end coverage
added in `strings.slt`.
@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Jun 25, 2026
oc7o commented Jun 25, 2026

Author

@huaxingao @viirya @wesm Could one of you trigger CI for me please? Thanks!

viirya commented Jun 25, 2026

Member

@oc7o Triggered. CI is running now.

@kosiew kosiew left a comment

Contributor

@oc7o
Thanks for working on this. The direction looks good, but I think there are still a couple of correctness issues in the SIMILAR TO translation that should be addressed before this can be merged. I also have one small test coverage suggestion.

match ch {
    '%' => result.push_str(".*"),
    '_' => result.push('.'),
    c => result.push(c),

Thanks for tackling this. I think there is still one correctness issue here.

The translation currently copies every character other than % and _ directly into the Arrow regex. That means regex metacharacters like ., ^, and $ are still treated as regex operators even though they are literals in SQL SIMILAR TO patterns.

For example, the SQL pattern a. should only match the literal string a., but the current translation produces ^(?:a.)$, so SELECT 'ab' SIMILAR TO 'a.' incorrectly returns true.

Could we translate the SQL pattern grammar explicitly instead? SQL literals should be escaped for the regex, and only the metacharacters that SIMILAR TO actually defines should be emitted as regex syntax.
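One possible shape for such an explicit translation is sketched below. The function name `similar_to_regex_escaped` is hypothetical, and the operator set is an assumption limited to the characters this PR names (`|`, `(`, `)`, `*`, `+`, `?`). Bracket expressions, `{m,n}` repetition, and the `ESCAPE` clause are deliberately left out of this sketch:

```rust
/// Hypothetical sketch: escape regex-only metacharacters so they match
/// literally, and emit regex syntax only for SIMILAR TO operators.
/// The operator set below is an assumption, not the full SQL grammar.
fn similar_to_regex_escaped(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 8);
    out.push_str("^(?:");
    for ch in pattern.chars() {
        match ch {
            // SQL wildcards become their regex equivalents.
            '%' => out.push_str(".*"),
            '_' => out.push('.'),
            // Operators that SIMILAR TO shares with regex pass through.
            '|' | '(' | ')' | '*' | '+' | '?' => out.push(ch),
            // Regex metacharacters that are plain literals in SQL
            // are backslash-escaped.
            '.' | '^' | '$' | '\\' => {
                out.push('\\');
                out.push(ch);
            }
            // Remaining characters are copied as-is (bracket expressions
            // and `{}` are not handled specially in this sketch).
            c => out.push(c),
        }
    }
    out.push_str(")$");
    out
}
```

Under these assumptions the SQL pattern `a.` becomes `^(?:a\.)$`, which matches the literal string `a.` but not `ab`.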

result.push_str("^(?:");
for ch in pattern.chars() {
    match ch {
        '%' => result.push_str(".*"),

One other correctness issue I noticed is that % and _ are translated to .* and ., but Rust and Arrow regexes do not let . match newlines by default.

SQL wildcards are expected to match any character, including newlines, so values containing \n still behave incorrectly. For example, 'a\nb' SIMILAR TO 'a%b' should match, but the generated ^(?:a.*b)$ does not.

Could we use a dot-all form such as (?s:.*) and (?s:.), or an equivalent character class? It would also be great to add a regression test for this case.
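A minimal sketch of the dot-all variant follows. The function name `similar_to_regex_dotall` is hypothetical, and only the wildcard handling is changed relative to the translation in this PR. It relies on the scoped-flag group syntax `(?s:...)`, which the Rust `regex` crate supports:

```rust
/// Hypothetical sketch: emit dot-all groups for the SQL wildcards so
/// that `%` and `_` also match newline characters.
fn similar_to_regex_dotall(pattern: &str) -> String {
    let mut out = String::from("^(?:");
    for ch in pattern.chars() {
        match ch {
            // `(?s:...)` enables "dot matches newline" only inside the group.
            '%' => out.push_str("(?s:.*)"),
            '_' => out.push_str("(?s:.)"),
            // Other characters are copied unchanged in this sketch.
            c => out.push(c),
        }
    }
    out.push_str(")$");
    out
}
```

For the pattern `a%b` this produces `^(?:a(?s:.*)b)$`, so a value such as `a`, newline, `b` would be matched by the wildcard.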

# SIMILAR TO with `%` wildcard (zero or more characters)

The new SLT coverage covers the common %, _, and anchoring cases well. After updating the translator, could we also add a small regression test showing that regex metacharacters are treated as literals? For example, SIMILAR TO 'a.' should match 'a.' but not 'ab'. That should help prevent accidental regressions back to raw regex semantics.

Development

Successfully merging this pull request may close these issues.

PostgreSQL compatibility: SIMILAR TO should treat % as a wildcard