Skip to content

dictionary encoded group by's#20563

Draft
alexanderbianchi wants to merge 1 commit intoapache:mainfrom
alexanderbianchi:bianchi/repartitiondictencoding
Draft

dictionary encoded group by's#20563
alexanderbianchi wants to merge 1 commit intoapache:mainfrom
alexanderbianchi:bianchi/repartitiondictencoding

Conversation

@alexanderbianchi
Copy link
Contributor

Which issue does this PR close?

Partially addresses #7647

Rationale for this change

Dictionary-encoded GROUP BY columns currently fall back to GroupValuesRows (Arrow's RowConverter), which:

  1. Hydrates dictionary values into row format — losing the dictionary encoding entirely
  2. Re-encodes back to dictionary via cast() on emit
  3. Misses the vectorized comparison/append optimizations in the GroupValuesColumn fast path

This is the same problem that PR #8291 attempted to solve by materializing dictionaries at the AggregateExec level. That approach was reverted because it changed the output schema (Dictionary(Int32, Utf8)Utf8), which broke UNION queries with a RowConverter column schema mismatch error (#8738).

See also the related arrow-rs discussion on stateless row conversion (arrow-rs #4811, arrow-rs #4813, arrow-rs #4819).

What changes are included in this PR?

This PR takes a different approach: instead of changing schemas or materializing dictionaries at the plan level, it adds dictionary support internally to GroupValuesColumn via a thin wrapper that resolves dictionary keys on-demand.

How it differs from PR #8291

PR #8291 (reverted) This PR
Where AggregateExec level — changes output schema Inside GroupValuesColumn — completely internal
Schema impact Output changed from Dictionary to value type Output schema is unchanged — still Dictionary
Risk Broke UNION queries due to schema mismatch No schema changes visible outside GroupValuesColumn

Implementation

DictionaryGroupValueBuilder<K> — a new GroupColumn wrapper (~180 lines) that:

  • Receives DictionaryArray<K> inputs
  • Resolves dictionary keys to value-array indices on-demand (keys.value(row).as_usize())
  • Delegates to an inner value-type builder (e.g., ByteGroupValueBuilder for Utf8) for storage and comparison
  • Only copies data for rows that become new groups — no O(batch_size) materialization

The wrapper has fast paths for the common case (no null dictionary keys) where it remaps indices in bulk and delegates to the inner builder's vectorized operations.

make_group_column() — replaces the old instantiate_primitive! macro with a function that returns Result<Box<dyn GroupColumn>>. This naturally supports the recursive Dictionary wrapping (create inner builder for value type, wrap with DictionaryGroupValueBuilder) and simplifies intern() to a 3-line loop.

supported_type() — now accepts Dictionary(_, vt) where the value type is itself supported.

No changes to the emit path — it already re-encodes value-type arrays back to Dictionary via cast().

Files changed

File Change
multi_group_by/dictionary.rs New DictionaryGroupValueBuilder<K> wrapper
multi_group_by/mod.rs supported_type(), make_group_column(), simplified intern(), 3 new tests
group_values/mod.rs mod rowpub mod row (benchmark visibility)
benches/dict_group_values.rs New benchmark
Cargo.toml Register benchmark

Are these changes tested?

Yes:

  • test_supported_type_dictionary — verifies supported_type accepts/rejects dictionary types correctly
  • test_intern_with_dictionary_columns — multi-column grouping with Dictionary(Int32, Utf8) + Int64, overlapping groups across batches, verifies group indices and emit output types
  • test_dictionary_null_handling — null dictionary keys produce null groups, verified across batches
  • All 29 existing group_values unit tests continue to pass
  • Existing dictionary integration tests (aggregates::dict_nulls) pass

Are there any user-facing changes?

No. The output schema, output types, and query results are identical. Dictionary-encoded GROUP BY columns now route through GroupValuesColumn instead of GroupValuesRows, which is purely an internal performance optimization.

Benchmark results

Benchmark compares three paths for GROUP BY on Dictionary(Int32, Utf8) columns:

  • column_utf8: GroupValuesColumn with plain Utf8 (fast-path baseline)
  • column_dict: GroupValuesColumn with Dictionary — this PR
  • rows_dict: GroupValuesRows with Dictionary — current main

Parameters: 10 batches per iteration, cardinalities 50 / 1,000 / 10,000.

Scenario                                       column_utf8   column_dict (new)   rows_dict (current)   Speedup vs current
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
card_50    / batch_8192                           1.82 ms        1.51 ms              1.42 ms              0.94x
card_50    / batch_65536                         15.17 ms       12.21 ms             11.96 ms              0.98x
card_1000  / batch_8192                           1.84 ms        1.57 ms              1.83 ms              1.16x
card_1000  / batch_65536                         15.25 ms       12.72 ms             13.75 ms              1.08x
card_10000 / batch_8192                           2.25 ms        2.31 ms              3.68 ms              1.59x
card_10000 / batch_65536                         16.08 ms       14.59 ms             20.77 ms              1.42x

At low cardinality (50 distinct values), rows_dict and column_dict are comparable — the RowConverter is efficient when it has few values to intern. At moderate-to-high cardinality (1,000–10,000), column_dict is 1.1–1.6x faster than the current GroupValuesRows path, with the advantage growing as cardinality increases.

column_dict is also consistently faster than column_utf8 (plain strings) because the dictionary's compact values array is more cache-friendly than scanning the full batch.

Run with: cargo bench -p datafusion-physical-plan --bench dict_group_values --features test_utils

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant