Skip to content

perf(hash-join): eliminate intermediate array allocations in probe-side collision filter#23209

Draft
LiaCastaneda wants to merge 1 commit into
apache:mainfrom
LiaCastaneda:perf/hash-join-no-take-in-probe
Draft

perf(hash-join): eliminate intermediate array allocations in probe-side collision filter#23209
LiaCastaneda wants to merge 1 commit into
apache:mainfrom
LiaCastaneda:perf/hash-join-no-take-in-probe

Conversation

@LiaCastaneda

@LiaCastaneda LiaCastaneda commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

When DataFusion executes a HashJoin, for each probe batch it collects all candidate (build_idx, probe_idx) pairs from the hash table, then runs a key equality check to filter out hash collisions. This check existed because two different key values can hash to the same bucket (although it's unusual) -- so a hash match alone is not sufficient to confirm a true join match.

The current implementation (equal_rows_arr) does this confirmation by materializing intermediate arrays sized to the total number of matched pairs

  1. take(build_key_col, matched_indices) -- gather all matched build-side key values into a new array
  2. take(probe_key_col, matched_indices) -- same for the probe side
  3. eq_dyn_null(...) -- compare element-wise into a boolean array
  4. FilterBuilder -- build a selection bitmask and filter both index arrays

This is O(matched_pairs) in both allocations and comparisons. For a typical star-schema join with fanout ≈ 1, matched_pairs ~ probe_rows and the cost is acceptable. But for joins with high build-side fanout (many build rows per key value), matched_pairs = probe_rows × fanout, and the intermediate arrays grow proportionally. At fanout 78 over 136M output rows, this means allocating and discarding ~4 arrays of 136M elements per probe batch -- the majority of which pass the filter unchanged since true hash collisions are rare.

This cost is most visible when:

  • The build side has duplicate keys
  • The join output is large (high fanout × large probe side)
  • Hash partition skew concentrates most matched pairs onto a single partition, making the per-partition allocation cost directly observable as query latency

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

…be path

Replace the take() + eq_dyn_null() + FilterBuilder collision-filter in
lookup_join_hashmap with a JoinKeyComparator-based in-place loop.

The old equal_rows_arr allocated O(matched_pairs) intermediate arrays per
probe batch (one take-gathered key array per key column per side, plus a
BooleanArray and FilterBuilder result). At high fanout these dominate:
fanout=78 over 136M output rows means ~2 × N_key_cols × 136M array
elements allocated and discarded every probe batch.

JoinKeyComparator builds make_comparator closures once per (build_batch,
probe_batch) column pair, then walks the candidate index pairs with a
plain Rust loop — no intermediate arrays, no Arrow kernel dispatch per
row. Null semantics (NullEqualsNull / NullEqualsNothing) are baked into
the closures at construction time, preserving correctness.

Add JoinKeyComparator::for_equality as a convenience constructor for
callers that only need equality (no sort options required).

Closes apache#12131

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LiaCastaneda LiaCastaneda changed the title perf(hash-join): replace equal_rows_arr with JoinKeyComparator in pro… perf(hash-join): eliminate intermediate array allocations in probe-side collision filter Jun 26, 2026
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant