dictionary encoded group by's by alexanderbianchi · Pull Request #20563 · apache/datafusion

alexanderbianchi · 2026-02-26T03:17:45Z

Which issue does this PR close?

Partially addresses #7647

Rationale for this change

Dictionary-encoded GROUP BY columns currently fall back to GroupValuesRows (Arrow's RowConverter), which:

Hydrates dictionary values into row format — losing the dictionary encoding entirely
Re-encodes back to dictionary via cast() on emit
Misses the vectorized comparison/append optimizations in the GroupValuesColumn fast path

This is the same problem that PR #8291 attempted to solve by materializing dictionaries at the AggregateExec level. That approach was reverted because it changed the output schema (Dictionary(Int32, Utf8) → Utf8), which broke UNION queries with a RowConverter column schema mismatch error (#8738).

See also the related arrow-rs discussion on stateless row conversion (arrow-rs #4811, arrow-rs #4813, arrow-rs #4819).

What changes are included in this PR?

This PR takes a different approach: instead of changing schemas or materializing dictionaries at the plan level, it adds dictionary support internally to GroupValuesColumn via a thin wrapper that resolves dictionary keys on-demand.

How it differs from PR #8291

	PR #8291 (reverted)	This PR
Where	`AggregateExec` level — changes output schema	Inside `GroupValuesColumn` — completely internal
Schema impact	Output changed from `Dictionary` to value type	Output schema is unchanged — still `Dictionary`
Risk	Broke UNION queries due to schema mismatch	No schema changes visible outside `GroupValuesColumn`

Implementation

DictionaryGroupValueBuilder<K> — a new GroupColumn wrapper (~180 lines) that:

Receives DictionaryArray<K> inputs
Resolves dictionary keys to value-array indices on-demand (keys.value(row).as_usize())
Delegates to an inner value-type builder (e.g., ByteGroupValueBuilder for Utf8) for storage and comparison
Only copies data for rows that become new groups — no O(batch_size) materialization

The wrapper has fast paths for the common case (no null dictionary keys) where it remaps indices in bulk and delegates to the inner builder's vectorized operations.

make_group_column() — replaces the old instantiate_primitive! macro with a function that returns Result<Box<dyn GroupColumn>>. This naturally supports the recursive Dictionary wrapping (create inner builder for value type, wrap with DictionaryGroupValueBuilder) and simplifies intern() to a 3-line loop.

supported_type() — now accepts Dictionary(_, vt) where the value type is itself supported.

No changes to the emit path — it already re-encodes value-type arrays back to Dictionary via cast().

Files changed

File	Change
`multi_group_by/dictionary.rs`	New `DictionaryGroupValueBuilder<K>` wrapper
`multi_group_by/mod.rs`	`supported_type()`, `make_group_column()`, simplified `intern()`, 3 new tests
`group_values/mod.rs`	`mod row` → `pub mod row` (benchmark visibility)
`benches/dict_group_values.rs`	New benchmark
`Cargo.toml`	Register benchmark

Are these changes tested?

Yes:

test_supported_type_dictionary — verifies supported_type accepts/rejects dictionary types correctly
test_intern_with_dictionary_columns — multi-column grouping with Dictionary(Int32, Utf8) + Int64, overlapping groups across batches, verifies group indices and emit output types
test_dictionary_null_handling — null dictionary keys produce null groups, verified across batches
All 29 existing group_values unit tests continue to pass
Existing dictionary integration tests (aggregates::dict_nulls) pass

Are there any user-facing changes?

No. The output schema, output types, and query results are identical. Dictionary-encoded GROUP BY columns now route through GroupValuesColumn instead of GroupValuesRows, which is purely an internal performance optimization.

Benchmark results

Benchmark compares three paths for GROUP BY on Dictionary(Int32, Utf8) columns:

column_utf8: GroupValuesColumn with plain Utf8 (fast-path baseline)
column_dict: GroupValuesColumn with Dictionary — this PR
rows_dict: GroupValuesRows with Dictionary — current main

Parameters: 10 batches per iteration, cardinalities 50 / 1,000 / 10,000.

Scenario                                       column_utf8   column_dict (new)   rows_dict (current)   Speedup vs current
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
card_50    / batch_8192                           1.82 ms        1.51 ms              1.42 ms              0.94x
card_50    / batch_65536                         15.17 ms       12.21 ms             11.96 ms              0.98x
card_1000  / batch_8192                           1.84 ms        1.57 ms              1.83 ms              1.16x
card_1000  / batch_65536                         15.25 ms       12.72 ms             13.75 ms              1.08x
card_10000 / batch_8192                           2.25 ms        2.31 ms              3.68 ms              1.59x
card_10000 / batch_65536                         16.08 ms       14.59 ms             20.77 ms              1.42x

At low cardinality (50 distinct values), rows_dict and column_dict are comparable — the RowConverter is efficient when it has few values to intern. At moderate-to-high cardinality (1,000–10,000), column_dict is 1.1–1.6x faster than the current GroupValuesRows path, with the advantage growing as cardinality increases.

column_dict is also consistently faster than column_utf8 (plain strings) because the dictionary's compact values array is more cache-friendly than scanning the full batch.

Run with: cargo bench -p datafusion-physical-plan --bench dict_group_values --features test_utils

dictionary encoded group by's

c775004

github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dictionary encoded group by's#20563

dictionary encoded group by's#20563
alexanderbianchi wants to merge 1 commit intoapache:mainfrom
alexanderbianchi:bianchi/repartitiondictencoding

alexanderbianchi commented Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alexanderbianchi commented Feb 26, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How it differs from PR #8291

Implementation

Files changed

Are these changes tested?

Are there any user-facing changes?

Benchmark results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant