feat: Add expression-keyed statistics for struct field pruning #20589

Draft
adriangb wants to merge 1 commit into apache:main from pydantic:expression-keyed-statistics-struct-pruning
@adriangb
Contributor

Summary

  • Adds StatisticsKey enum and expression_statistics HashMap to Statistics for expression-keyed statistics lookup
  • Teaches PruningPredicate to recognize get_field chains and rewrite them to synthetic dotted column names for statistics-based pruning
  • Populates struct leaf column statistics from Parquet metadata into expression_statistics
  • Enables file-level pruning for predicates on struct fields like WHERE col['a'] > 5

Motivation

Parquet stores struct fields (e.g., col.a inside struct col) as separate physical leaf columns with their own min/max/null statistics. DataFusion's Statistics struct uses column_statistics: Vec<ColumnStatistics> indexed 1:1 with schema fields, so struct field stats had nowhere to live. This blocked file-level pruning for predicates like WHERE get_field(col, 'a') > 5.
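
To make the mismatch concrete, here is a minimal sketch of how a nested schema expands into leaf columns. The `DataType`, `Field`, `leaf_paths`, and `collect` names are hypothetical stand-ins written for this illustration, not Arrow's or DataFusion's real types.

```rust
/// Minimal stand-ins for schema types (hypothetical, not the Arrow API).
#[derive(Debug, Clone)]
enum DataType {
    Int64,
    Utf8,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Lists the dotted paths of all leaf columns, depth first.
fn leaf_paths(fields: &[Field]) -> Vec<String> {
    let mut out = Vec::new();
    for field in fields {
        collect(field, "", &mut out);
    }
    out
}

/// Recursive helper: descends into structs, records everything else as a leaf.
fn collect(field: &Field, prefix: &str, out: &mut Vec<String>) {
    let path = if prefix.is_empty() {
        field.name.clone()
    } else {
        format!("{}.{}", prefix, field.name)
    };
    match &field.data_type {
        DataType::Struct(children) => {
            for child in children {
                collect(child, &path, out);
            }
        }
        _ => out.push(path),
    }
}
```

For a schema with an `id` column and a struct `col` holding `a` and `b`, there are two top-level fields but three leaves (`id`, `col.a`, `col.b`), so a vector with one slot per top-level field has no slot for `col.a` or `col.b`.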

Design

Statistics are stored keyed by expression (StatisticsKey::FieldPath), so the pruning system does not need to understand each expression type. It only checks whether a given sub-expression has stats and, if so, uses them directly via dotted column name resolution.
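
A rough sketch of the shape this could take is below. Only the names `StatisticsKey`, `Column`, `FieldPath`, `expression_statistics`, `column_statistics`, and `ColumnStatistics` come from this PR description. The variant payloads, the simplified integer-only fields, and the `stats_for` helper are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// The PR names the variants `Column` and `FieldPath`; the payloads
/// chosen here are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum StatisticsKey {
    /// A top-level column, identified by name.
    Column(String),
    /// A path into a struct column, e.g. ["col", "a"] for col['a'].
    FieldPath(Vec<String>),
}

/// Simplified stand-in for `ColumnStatistics` (integers only).
#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
}

/// Simplified stand-in for `Statistics`.
#[derive(Debug, Default)]
struct Statistics {
    /// One entry per top-level schema field, in schema order.
    column_statistics: Vec<ColumnStatistics>,
    /// Statistics for expressions that have no slot in `column_statistics`.
    expression_statistics: HashMap<StatisticsKey, ColumnStatistics>,
}

impl Statistics {
    /// Hypothetical helper: "does this sub-expression have stats?"
    /// becomes a single map lookup.
    fn stats_for(&self, key: &StatisticsKey) -> Option<&ColumnStatistics> {
        self.expression_statistics.get(key)
    }
}
```

In this sketch a missing key simply yields `None`, so a caller can treat "no statistics" the same way for every kind of expression.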

Key changes

| File | Changes |
| --- | --- |
| datafusion/common/src/stats.rs | StatisticsKey enum, expression_statistics field, updated helper methods |
| datafusion/common/src/pruning.rs | PrunableStatistics falls back to expression_statistics for dotted column names |
| datafusion/datasource-parquet/src/metadata.rs | Populates expression_statistics for struct leaf columns from Parquet metadata |
| datafusion/pruning/src/pruning_predicate.rs | get_field chain recognition, dotted name rewriting, schema hierarchy resolution |
| datafusion/proto-common/src/from_proto/mod.rs | Backward-compatible deserialization |
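
The get_field chain rewriting could look roughly like the sketch below. The `Expr` enum and the `dotted_name` function are hypothetical and written for this illustration. They are not DataFusion's physical expression types or the code in this PR.

```rust
/// Toy expression tree (hypothetical stand-in for real physical expressions).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    GetField { base: Box<Expr>, field: String },
    Literal(i64),
}

/// Returns the synthetic dotted name for a chain such as
/// get_field(get_field(col, 'a'), 'b'), or None if the chain
/// does not bottom out in a plain column.
fn dotted_name(expr: &Expr) -> Option<String> {
    match expr {
        Expr::Column(name) => Some(name.clone()),
        Expr::GetField { base, field } => {
            // Resolve the inner part first, then append this field.
            let prefix = dotted_name(base)?;
            Some(format!("{}.{}", prefix, field))
        }
        Expr::Literal(_) => None,
    }
}
```

Note that a dotted name is ambiguous when a field name itself contains a dot. This sketch does not handle that case.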

Test plan

  • All existing tests pass (cargo test -p datafusion-common && cargo test -p datafusion-pruning && cargo test -p datafusion-datasource-parquet)
  • Full workspace compilation clean
  • Integration test: Create Parquet file with struct column, verify expression_statistics is populated, verify file pruning based on struct field predicates
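
For reference, the file-level decision the integration test would exercise can be sketched as follows for a predicate like WHERE col['a'] > 5. The `may_match_gt` and `files_to_scan` functions are hypothetical helpers, and the file names in the usage note are made up. Real pruning goes through PruningPredicate rather than code like this.

```rust
/// Returns true if a file may contain rows with `value > threshold`,
/// given the file-level max of the leaf column. Missing statistics
/// keep the file, because nothing can be proven about it.
fn may_match_gt(file_max: Option<i64>, threshold: i64) -> bool {
    match file_max {
        Some(max) => max > threshold,
        None => true,
    }
}

/// Keeps only the files that cannot be ruled out by their max statistic.
fn files_to_scan<'a>(files: &[(&'a str, Option<i64>)], threshold: i64) -> Vec<&'a str> {
    files
        .iter()
        .filter(|(_, max)| may_match_gt(*max, threshold))
        .map(|(name, _)| *name)
        .collect()
}
```

With a threshold of 5, a file whose max for `col.a` is 3 is skipped, a file whose max is 10 is scanned, and a file with no statistics is also scanned.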

🤖 Generated with Claude Code

Parquet stores struct fields as separate physical leaf columns with their
own min/max/null statistics. DataFusion's Statistics struct uses
column_statistics indexed 1:1 with schema fields, so struct field stats
had nowhere to live. This blocked file-level pruning for predicates like
WHERE get_field(col, 'a') > 5.

This adds StatisticsKey-keyed statistics so any expression (columns,
field paths, etc.) can carry ColumnStatistics, then populates struct
field stats from Parquet metadata and teaches PruningPredicate to use
them.

Changes:
- Add StatisticsKey enum (Column, FieldPath) to stats.rs
- Add expression_statistics HashMap to Statistics struct
- Update Statistics methods (project, to_inexact, try_merge, with_fetch)
- Add get_field chain recognition in PruningPredicate
- Add dotted-name resolution for nested schema fields
- Teach PrunableStatistics to fall back to expression_statistics
- Populate struct field stats from Parquet leaf column metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the physical-expr, core, common, proto, datasource, and physical-plan labels Feb 27, 2026
@adriangb adriangb marked this pull request as draft February 27, 2026 11:44