perf: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs #20581

Open
petern48 wants to merge 8 commits into apache:main from petern48:regexp_optim_utf8view

@petern48 petern48 commented Feb 26, 2026

Which issue does this PR close?

Rationale for this change

I ran into a bug that prevented some of the regexp optimizations introduced in #15299 from being applied. After #16290, some SQL types were changed to Utf8View, and as part of that PR some expected query plans in sqllogictest were updated to expect the unoptimized version.

I need this fixed to avoid additional test failures while implementing a new regexp optimization for #20579.

What changes are included in this PR?

  • Add support for Utf8View and LargeUtf8 in regex.rs.
  • Properly return Transformed::no() when the plan isn't modified (previously, it always returned Transformed::yes()).
  • Update the tests to expect the optimized query plans again.
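
The first two bullets can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the code in regex.rs: `StrKind`, `Expr`, `Transformed`, and `simplify_match` below are hypothetical stand-ins, loosely modeled on DataFusion's types so that the example compiles without the crate. It shows a rewrite that accepts all three string kinds, keeps the literal's kind in the output, and reports `Transformed::no` when nothing changed.

```rust
// Standalone model; none of these types are DataFusion's own.

/// Which string type a literal carries (stand-in for the Arrow data types).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StrKind {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// A tiny expression tree, just enough to express the rewrite.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    StrLit(StrKind, String),
    RegexMatch(Box<Expr>, Box<Expr>),
    Like(Box<Expr>, Box<Expr>),
}

/// Result wrapper that records whether a rewrite changed its input.
#[derive(Debug, PartialEq)]
pub struct Transformed<T> {
    pub data: T,
    pub transformed: bool,
}

impl<T> Transformed<T> {
    pub fn yes(data: T) -> Self {
        Self { data, transformed: true }
    }
    pub fn no(data: T) -> Self {
        Self { data, transformed: false }
    }
}

/// Rewrite `col ~ 'abc'` to `col LIKE '%abc%'` when the pattern is a plain
/// alphanumeric literal of any string kind. Everything else is returned
/// unchanged and flagged as not transformed.
pub fn simplify_match(expr: Expr) -> Transformed<Expr> {
    match expr {
        Expr::RegexMatch(left, right) => {
            if let Expr::StrLit(kind, pat) = &*right {
                let plain = !pat.is_empty()
                    && pat.chars().all(|c| c.is_ascii_alphanumeric());
                if plain {
                    // Keep the same string kind on the new literal.
                    let like = Expr::StrLit(*kind, format!("%{pat}%"));
                    return Transformed::yes(Expr::Like(left, Box::new(like)));
                }
            }
            Transformed::no(Expr::RegexMatch(left, right))
        }
        other => Transformed::no(other),
    }
}
```

Reporting `no` for untouched input matters because callers typically use the flag to decide whether another optimizer pass is needed; always reporting `yes` hides the fact that a rule did nothing.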

Are these changes tested?

Yes. Existing tests that previously expected the unoptimized plans were fixed; they now show the optimization being applied.

Are there any user-facing changes?

No. Just applying the optimizations to more cases.

@github-actions github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 26, 2026
@petern48 petern48 changed the title from "bug: Apply logical regexp optimizations to utf8view inputs" to "bug: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs" Feb 26, 2026
@petern48 petern48 force-pushed the regexp_optim_utf8view branch from 1c8836b to e1f661b Feb 26, 2026 21:41
@petern48 petern48 changed the title from "bug: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs" to "perf: Apply logical regexp optimizations to Utf8View and LargeUtf8 inputs" Feb 26, 2026
@petern48 petern48 marked this pull request as ready for review February 26, 2026 22:16
@alamb alamb added the performance (Make DataFusion faster) label Feb 27, 2026

@alamb alamb left a comment

Thanks @petern48 -- this looks like an improvement to me

I do think we should add documentation for the newly added pub API (or I have some ideas on how to improve it too)

----
logical_plan
-01)Projection: test.column1_utf8view ~ Utf8View("an") AS c1
+01)Projection: test.column1_utf8view LIKE Utf8View("%an%") AS c1

this is certainly a better plan
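
For a pattern with no regex metacharacters, an unanchored regex match is a substring test, which is why `~ "an"` can become `LIKE "%an%"`. A conservative sketch of that mapping follows. It assumes default regex flags and handles only alphanumeric literals with optional `^` and `$` anchors; the function name `literal_regex_to_like` is hypothetical, and the actual rule may cover more cases than this.

```rust
/// Map a regex made of an alphanumeric literal, optionally anchored with
/// `^` and/or `$`, to an equivalent LIKE pattern. Returns `None` for
/// anything else so the caller can leave the expression unchanged.
pub fn literal_regex_to_like(pattern: &str) -> Option<String> {
    let (body, anchored_start) = match pattern.strip_prefix('^') {
        Some(rest) => (rest, true),
        None => (pattern, false),
    };
    let (body, anchored_end) = match body.strip_suffix('$') {
        Some(rest) => (rest, true),
        None => (body, false),
    };
    // Bail out on regex metacharacters, LIKE wildcards (`%`, `_`),
    // or an empty literal.
    if body.is_empty() || !body.chars().all(|c| c.is_ascii_alphanumeric()) {
        return None;
    }
    // An unanchored side may match anything, which LIKE spells as `%`.
    let prefix = if anchored_start { "" } else { "%" };
    let suffix = if anchored_end { "" } else { "%" };
    Some(format!("{prefix}{body}{suffix}"))
}
```

Rejecting `%` and `_` in the literal avoids having to escape them in the LIKE pattern, at the cost of skipping some patterns that could be rewritten.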

}

-fn as_string_scalar(expr: &Expr) -> Option<(DataType, &Option<String>)> {
+pub fn as_string_scalar(expr: &Expr) -> Option<(DataType, &Option<String>)> {

If we are going to make this a pub API, we should at least document it, and maybe we can clean it up a bit?

Maybe the simplest thing would be to just make it pub(crate) rather than pub

What do you think about something like

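pub enum StringScalar<'a> {
  Utf8(&'a ScalarValue, &'a str),
  LargeUtf8(&'a ScalarValue, &'a str),
  Utf8View(&'a ScalarValue, &'a str),
}

impl<'a> StringScalar<'a> {
  /// Returns `None` if `scalar` is not a string scalar.
  fn try_from_scalar(scalar: &'a ScalarValue) -> Option<Self> {
    ...
  }

  /// Builds a literal of the same string type holding `val`.
  fn to_scalar(&self, val: &str) -> Expr {
    ...
  }
}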

That would:

  1. Avoid creating DataTypes
  2. Put some better documentation in place
  3. Encapsulate the logic a bit more
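
One way the proposal could be fleshed out is sketched below, under explicit assumptions. `ScalarValue` here is a four-variant stand-in so the block compiles on its own (DataFusion's real enum has many more variants). The sketch stores only the borrowed string rather than the `(&ScalarValue, &str)` pair from the comment, `to_scalar` returns a `ScalarValue` instead of an `Expr`, and `as_str` is an extra helper. Treating NULL strings as "not a string scalar" is also a choice made here, not something the review specifies.

```rust
// Stand-in so the sketch compiles without the datafusion crate.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

/// A borrowed, non-null string literal tagged with its string kind.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StringScalar<'a> {
    Utf8(&'a str),
    LargeUtf8(&'a str),
    Utf8View(&'a str),
}

impl<'a> StringScalar<'a> {
    /// Returns `None` for non-string scalars and for NULL strings.
    pub fn try_from_scalar(scalar: &'a ScalarValue) -> Option<Self> {
        match scalar {
            ScalarValue::Utf8(Some(s)) => Some(Self::Utf8(s.as_str())),
            ScalarValue::LargeUtf8(Some(s)) => Some(Self::LargeUtf8(s.as_str())),
            ScalarValue::Utf8View(Some(s)) => Some(Self::Utf8View(s.as_str())),
            _ => None,
        }
    }

    /// The string contents, independent of the string kind.
    pub fn as_str(&self) -> &'a str {
        match *self {
            Self::Utf8(s) | Self::LargeUtf8(s) | Self::Utf8View(s) => s,
        }
    }

    /// Build a new scalar of the same string kind holding `val`,
    /// without going through a `DataType`.
    pub fn to_scalar(&self, val: &str) -> ScalarValue {
        let v = Some(val.to_string());
        match self {
            Self::Utf8(_) => ScalarValue::Utf8(v),
            Self::LargeUtf8(_) => ScalarValue::LargeUtf8(v),
            Self::Utf8View(_) => ScalarValue::Utf8View(v),
        }
    }
}
```

With this shape, a rewrite can read the pattern via `as_str()` and emit the replacement literal via `to_scalar(...)`, so the string kind is carried by the enum variant rather than by a separately constructed `DataType`.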

Development

Successfully merging this pull request may close these issues.

bug: regexp simplify optimizations don't work on utf8view