perf: Use batched row conversion for `array_has_any`, `array_has_all` by neilconway · Pull Request #20588 · apache/datafusion

neilconway · 2026-02-27T02:35:23Z

Which issue does this PR close?

Closes #20587 (Use batched row conversion for array_has_any, array_has_all).

Rationale for this change

array_has_any and array_has_all called RowConverter::convert_columns twice for every input row. convert_columns has a lot of per-call overhead: allocating a new Rows buffer, performing various schema checks, and so on.

It is considerably more efficient to use RowConverter twice up front and convert all of the haystack and needle inputs in bulk. We can then implement the has_any / has_all predicate comparison by indexing into the converted rows.

array_has_any / array_has_all had a special-case for strings, but it had an analogous problem: it iterated over rows, materialized each row's inner list, and then called string_array_to_vec twice per row. That does a lot of per-row work; it is significantly faster to call string_array_to_vec on all input rows at once, and then index into the results to implement the per-row comparisons.
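Both paths share the same shape: convert the flattened child values once, then evaluate each row by slicing into the converted buffer using the list offsets. The following is a minimal std-only sketch of that shape, not the code in this PR. `convert_all` and `has_any_or_all` are hypothetical names, the fixed-width key encoding merely stands in for what RowConverter produces, and null handling is ignored.

```rust
/// Stand-in for the bulk conversion step: every element is encoded exactly
/// once, in a single pass, into a fixed-width byte-comparable key.
/// (Hypothetical helper; the real code would use RowConverter here.)
fn convert_all(values: &[i64]) -> Vec<[u8; 8]> {
    values
        .iter()
        // Flipping the sign bit makes big-endian byte order match numeric order.
        .map(|v| ((*v as u64) ^ (1u64 << 63)).to_be_bytes())
        .collect()
}

/// Evaluate the predicate for every row by slicing into pre-converted keys.
/// Row `i` occupies `keys[offsets[i]..offsets[i + 1]]`, as in a list array.
/// `require_all = true` gives has_all semantics, `false` gives has_any.
/// Assumes both inputs have the same number of rows.
fn has_any_or_all<K: PartialEq>(
    h_keys: &[K],
    h_offsets: &[usize],
    n_keys: &[K],
    n_offsets: &[usize],
    require_all: bool,
) -> Vec<bool> {
    let num_rows = h_offsets.len().saturating_sub(1);
    (0..num_rows)
        .map(|i| {
            // No per-row conversion: just two slices into shared buffers.
            let h = &h_keys[h_offsets[i]..h_offsets[i + 1]];
            let n = &n_keys[n_offsets[i]..n_offsets[i + 1]];
            if require_all {
                n.iter().all(|x| h.contains(x))
            } else {
                n.iter().any(|x| h.contains(x))
            }
        })
        .collect()
}
```

Because `has_any_or_all` is generic over the key type, the same function works on a slice of `&str` values, which mirrors the string special case: materialize all strings once, then index by offsets.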

What changes are included in this PR?

Implement optimization
Improve test coverage for sliced arrays; not strictly related to this PR but more coverage for this codepath made me feel more comfortable

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway · 2026-02-27T03:00:51Z

Benchmarks:

  RowConverter path (i64)

  ┌─────────────────────────────┬──────┬──────────┬──────────┬────────┐
  │          Benchmark          │ Size │   Main   │  Branch  │ Change │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 10   │ 3.03 ms  │ 0.65 ms  │ -78%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 10   │ 2.74 ms  │ 0.47 ms  │ -83%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 10   │ 2.83 ms  │ 0.57 ms  │ -80%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 10   │ 3.35 ms  │ 1.04 ms  │ -69%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 100  │ 7.47 ms  │ 4.86 ms  │ -35%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 100  │ 6.78 ms  │ 4.07 ms  │ -40%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 100  │ 7.11 ms  │ 4.60 ms  │ -35%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 100  │ 11.95 ms │ 9.47 ms  │ -21%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 500  │ 33.35 ms │ 32.10 ms │ -4%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 500  │ 29.57 ms │ 28.59 ms │ -3%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 500  │ 30.80 ms │ 29.64 ms │ -4%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 500  │ 59.78 ms │ 59.89 ms │ ~0%    │
  └─────────────────────────────┴──────┴──────────┴──────────┴────────┘

  String path

  ┌─────────────────────────────────────┬──────┬──────────┬──────────┬────────┐
  │              Benchmark              │ Size │   Main   │  Branch  │ Change │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 10   │ 1.92 ms  │ 1.17 ms  │ -39%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 10   │ 1.41 ms  │ 0.71 ms  │ -50%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 10   │ 1.74 ms  │ 1.04 ms  │ -40%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 10   │ 1.96 ms  │ 1.28 ms  │ -35%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 100  │ 6.35 ms  │ 5.50 ms  │ -13%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 100  │ 5.55 ms  │ 4.73 ms  │ -15%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 100  │ 5.75 ms  │ 4.97 ms  │ -14%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 100  │ 9.65 ms  │ 8.79 ms  │ -9%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 500  │ 28.05 ms │ 26.96 ms │ -4%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 500  │ 30.38 ms │ 29.47 ms │ -3%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 500  │ 25.01 ms │ 24.04 ms │ -4%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 500  │ 58.31 ms │ 57.32 ms │ -2%    │
  └─────────────────────────────────────┴──────┴──────────┴──────────┴────────┘

It's a significant win for short arrays, and a small win for large arrays. For large arrays, the N*M comparison cost probably dominates. We should probably be able to do something smarter by hashing; I'll look at that shortly, but in a separate PR.
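To illustrate why hashing could help with the N*M cost, here is a rough sketch of a per-row comparison that builds a hash set from the haystack and probes it with each needle element, for roughly O(N + M) work per row. This is only one possible approach and an assumption on my part, not the design of the follow-up PR; `has_any_hashed` and `has_all_hashed` are hypothetical names and nulls are not considered.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// True if at least one needle element occurs in the haystack.
fn has_any_hashed<T: Hash + Eq>(haystack: &[T], needle: &[T]) -> bool {
    // Build the set once per row: O(N).
    let set: HashSet<&T> = haystack.iter().collect();
    // Probe once per needle element: O(M) expected.
    needle.iter().any(|x| set.contains(x))
}

/// True if every needle element occurs in the haystack.
fn has_all_hashed<T: Hash + Eq>(haystack: &[T], needle: &[T]) -> bool {
    let set: HashSet<&T> = haystack.iter().collect();
    needle.iter().all(|x| set.contains(x))
}
```

For very short arrays the set construction would likely cost more than a linear scan, so a real implementation would probably need a size threshold.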

perf: Use batched row conversion for array_has_any, array_has_all

396bec0

github-actions bot added the functions (Changes to functions implementation) label on Feb 27, 2026

neilconway wants to merge 1 commit into apache:main from neilconway:neilc/optimize-array-has-any-all-rowconvert