perf: Use batched row conversion for array_has_any, array_has_all #20588

Open
neilconway wants to merge 1 commit into apache:main from neilconway:neilc/optimize-array-has-any-all-rowconvert

@neilconway neilconway commented Feb 27, 2026

Which issue does this PR close?

Rationale for this change

array_has_any and array_has_all called RowConverter::convert_columns twice for every input row. Each convert_columns call carries a lot of overhead: it allocates a new Rows buffer, performs schema checks, and so on.

It is considerably more efficient to use RowConverter twice up front, converting all of the haystack inputs and all of the needle inputs in bulk. The has_any / has_all predicate can then be evaluated for each row by indexing into the converted rows.
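A minimal, standard-library-only sketch of the idea follows. It is not the PR's code: `ListColumn`, `convert_all`, and `eval_rows` are hypothetical stand-ins (a flattened values buffer plus offsets in place of an Arrow list array, and a byte encoding in place of RowConverter), and nulls are ignored entirely.

```rust
/// Hypothetical stand-in for a list column: all child values stored flat,
/// with offsets[i]..offsets[i + 1] marking the values that belong to row i.
pub struct ListColumn {
    pub values: Vec<i64>,
    pub offsets: Vec<usize>,
}

/// Stand-in for a converted row. Only equality matters in this sketch.
type Row = [u8; 8];

/// Stand-in for one bulk conversion call: encode every child value once.
fn convert_all(values: &[i64]) -> Vec<Row> {
    values.iter().map(|v| v.to_be_bytes()).collect()
}

/// Convert both inputs once up front, then evaluate `pred` per row by
/// slicing into the converted rows with the list offsets.
fn eval_rows(
    haystack: &ListColumn,
    needle: &ListColumn,
    pred: impl Fn(&[Row], &[Row]) -> bool,
) -> Vec<bool> {
    let h_rows = convert_all(&haystack.values); // conversion #1
    let n_rows = convert_all(&needle.values); // conversion #2
    let num_rows = haystack.offsets.len() - 1;
    (0..num_rows)
        .map(|i| {
            let h = &h_rows[haystack.offsets[i]..haystack.offsets[i + 1]];
            let n = &n_rows[needle.offsets[i]..needle.offsets[i + 1]];
            pred(h, n)
        })
        .collect()
}

/// True for a row if at least one needle element appears in the haystack.
pub fn array_has_any(haystack: &ListColumn, needle: &ListColumn) -> Vec<bool> {
    eval_rows(haystack, needle, |h, n| n.iter().any(|x| h.contains(x)))
}

/// True for a row if every needle element appears in the haystack.
pub fn array_has_all(haystack: &ListColumn, needle: &ListColumn) -> Vec<bool> {
    eval_rows(haystack, needle, |h, n| n.iter().all(|x| h.contains(x)))
}
```

The per-row work shrinks to two slice operations and the comparison itself; the conversion cost is paid twice per batch rather than twice per row.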

array_has_any / array_has_all had a special case for strings, but it had an analogous problem: it iterated over rows, materialized each row's inner list, and then called string_array_to_vec twice per row. That is a lot of per-row work; it is significantly faster to call string_array_to_vec once on all input rows, and then index into the results to implement the per-row comparisons.
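The string path can be sketched the same way. This is an illustration under stated assumptions, not the PR's implementation: `StringListColumn` and `strings_to_vec` are hypothetical stand-ins for an Arrow list-of-strings array and for string_array_to_vec, and nulls are not modeled.

```rust
/// Hypothetical stand-in for a list-of-strings column.
pub struct StringListColumn {
    /// All string bytes stored back to back.
    pub data: String,
    /// Byte boundaries of each string inside `data`.
    pub value_offsets: Vec<usize>,
    /// Boundaries of each row's list, counted in strings (not bytes).
    pub list_offsets: Vec<usize>,
}

/// Materialize every string in the column once, as borrowed slices.
fn strings_to_vec(col: &StringListColumn) -> Vec<&str> {
    col.value_offsets
        .windows(2)
        .map(|w| &col.data[w[0]..w[1]])
        .collect()
}

/// True for a row if every needle string appears in that row's haystack.
pub fn array_has_all_strings(
    haystack: &StringListColumn,
    needle: &StringListColumn,
) -> Vec<bool> {
    // One materialization per input column, not per row.
    let h_all = strings_to_vec(haystack);
    let n_all = strings_to_vec(needle);
    let num_rows = haystack.list_offsets.len() - 1;
    (0..num_rows)
        .map(|i| {
            let h = &h_all[haystack.list_offsets[i]..haystack.list_offsets[i + 1]];
            let n = &n_all[needle.list_offsets[i]..needle.list_offsets[i + 1]];
            n.iter().all(|x| h.contains(x))
        })
        .collect()
}
```

Swapping `all` for `any` in the closure gives the corresponding has_any check.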

What changes are included in this PR?

  • Implement the batched conversion described above, for both the RowConverter path and the string path
  • Improve test coverage for sliced arrays. This is not strictly related to this PR, but more coverage for this code path made me more comfortable with the change

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

github-actions bot added the "functions" label (Changes to functions implementation) on Feb 27, 2026
neilconway commented Feb 27, 2026

Benchmarks:

  RowConverter path (i64)

  ┌─────────────────────────────┬──────┬──────────┬──────────┬────────┐
  │          Benchmark          │ Size │   Main   │  Branch  │ Change │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 10   │ 3.03 ms  │ 0.65 ms  │ -78%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 10   │ 2.74 ms  │ 0.47 ms  │ -83%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 10   │ 2.83 ms  │ 0.57 ms  │ -80%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 10   │ 3.35 ms  │ 1.04 ms  │ -69%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 100  │ 7.47 ms  │ 4.86 ms  │ -35%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 100  │ 6.78 ms  │ 4.07 ms  │ -40%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 100  │ 7.11 ms  │ 4.60 ms  │ -35%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 100  │ 11.95 ms │ 9.47 ms  │ -21%   │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/all_found     │ 500  │ 33.35 ms │ 32.10 ms │ -4%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all/not_all_found │ 500  │ 29.57 ms │ 28.59 ms │ -3%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/some_match    │ 500  │ 30.80 ms │ 29.64 ms │ -4%    │
  ├─────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any/no_match      │ 500  │ 59.78 ms │ 59.89 ms │ ~0%    │
  └─────────────────────────────┴──────┴──────────┴──────────┴────────┘

  String path

  ┌─────────────────────────────────────┬──────┬──────────┬──────────┬────────┐
  │              Benchmark              │ Size │   Main   │  Branch  │ Change │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 10   │ 1.92 ms  │ 1.17 ms  │ -39%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 10   │ 1.41 ms  │ 0.71 ms  │ -50%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 10   │ 1.74 ms  │ 1.04 ms  │ -40%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 10   │ 1.96 ms  │ 1.28 ms  │ -35%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 100  │ 6.35 ms  │ 5.50 ms  │ -13%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 100  │ 5.55 ms  │ 4.73 ms  │ -15%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 100  │ 5.75 ms  │ 4.97 ms  │ -14%   │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 100  │ 9.65 ms  │ 8.79 ms  │ -9%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/all_found     │ 500  │ 28.05 ms │ 26.96 ms │ -4%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_all_strings/not_all_found │ 500  │ 30.38 ms │ 29.47 ms │ -3%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/some_match    │ 500  │ 25.01 ms │ 24.04 ms │ -4%    │
  ├─────────────────────────────────────┼──────┼──────────┼──────────┼────────┤
  │ array_has_any_strings/no_match      │ 500  │ 58.31 ms │ 57.32 ms │ -2%    │
  └─────────────────────────────────────┴──────┴──────────┴──────────┴────────┘

It's a significant win for short arrays (roughly 70-80% faster at size 10 on the RowConverter path, 35-50% on the string path) and a small win for large arrays (within a few percent at size 500). For large arrays, the N*M comparison cost probably dominates. We should be able to do something smarter by hashing; I'll look at that shortly, but in a separate PR.
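For illustration only, here is one way a hashing approach could look for a single row. It is a sketch of the general technique, not the design of the follow-up PR; the function names are hypothetical and plain i64 slices stand in for the real array types.

```rust
use std::collections::HashSet;

/// Build a hash set from one row's haystack so each needle lookup is
/// expected O(1), giving O(N + M) per row instead of O(N * M).
fn haystack_set(haystack: &[i64]) -> HashSet<i64> {
    haystack.iter().copied().collect()
}

/// True if at least one needle element is present in the haystack.
pub fn has_any_hashed(haystack: &[i64], needle: &[i64]) -> bool {
    let set = haystack_set(haystack);
    needle.iter().any(|v| set.contains(v))
}

/// True if every needle element is present in the haystack.
pub fn has_all_hashed(haystack: &[i64], needle: &[i64]) -> bool {
    let set = haystack_set(haystack);
    needle.iter().all(|v| set.contains(v))
}
```

Building the set has its own cost, so for very short arrays a linear scan may well remain faster; where the crossover lies would need measuring.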
