feat: Add expression-keyed statistics for struct field pruning #20589
Draft
adriangb wants to merge 1 commit into apache:main from
Parquet stores struct fields as separate physical leaf columns with their own min/max/null statistics. DataFusion's Statistics struct uses column_statistics indexed 1:1 with schema fields, so struct field stats had nowhere to live. This blocked file-level pruning for predicates like WHERE get_field(col, 'a') > 5.

This adds StatisticsKey-keyed statistics so any expression (columns, field paths, etc.) can carry ColumnStatistics, then populates struct field stats from Parquet metadata and teaches PruningPredicate to use them.

Changes:
- Add StatisticsKey enum (Column, FieldPath) to stats.rs
- Add expression_statistics HashMap to Statistics struct
- Update Statistics methods (project, to_inexact, try_merge, with_fetch)
- Add get_field chain recognition in PruningPredicate
- Add dotted-name resolution for nested schema fields
- Teach PrunableStatistics to fall back to expression_statistics
- Populate struct field stats from Parquet leaf column metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
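To make the data-structure change concrete, here is a minimal, self-contained sketch of expression-keyed statistics. It is not the PR's code: `ColumnStats` is a hypothetical, simplified stand-in for DataFusion's `ColumnStatistics`, the payloads of the `StatisticsKey` variants are assumptions (the PR text only names the variants `Column` and `FieldPath`), and `field_stats` and `example_statistics` are helpers invented for this illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for DataFusion's `ColumnStatistics`.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<usize>,
}

/// Key for expression-keyed statistics. The variant names come from the PR
/// description; the payload types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum StatisticsKey {
    /// A top-level column, identified by name.
    Column(String),
    /// A path into a struct column, e.g. column `col`, path `["a"]`.
    FieldPath { column: String, path: Vec<String> },
}

/// Simplified `Statistics`: per-field stats stay index-aligned with the
/// schema, while struct leaf stats live in a side map keyed by expression.
#[derive(Debug, Default)]
struct Statistics {
    column_statistics: Vec<ColumnStats>,
    expression_statistics: HashMap<StatisticsKey, ColumnStats>,
}

impl Statistics {
    /// Hypothetical helper: look up the stats recorded for a struct field path.
    fn field_stats(&self, column: &str, path: &[&str]) -> Option<&ColumnStats> {
        let key = StatisticsKey::FieldPath {
            column: column.to_string(),
            path: path.iter().map(|s| s.to_string()).collect(),
        };
        self.expression_statistics.get(&key)
    }
}

/// Builds statistics for a file with one struct column `col` whose leaf
/// `col.a` has min = 1 and max = 10 (made-up values for the example).
fn example_statistics() -> Statistics {
    let mut stats = Statistics::default();
    // The struct column itself has no meaningful min/max.
    stats.column_statistics.push(ColumnStats {
        min: None,
        max: None,
        null_count: Some(0),
    });
    stats.expression_statistics.insert(
        StatisticsKey::FieldPath {
            column: "col".to_string(),
            path: vec!["a".to_string()],
        },
        ColumnStats { min: Some(1), max: Some(10), null_count: Some(0) },
    );
    stats
}
```

In this sketch `column_statistics` keeps its 1:1 alignment with the schema, so existing index-based consumers are unaffected, and a lookup for a path with no recorded stats simply returns `None`.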
Summary
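- Add `StatisticsKey` enum and `expression_statistics` HashMap to `Statistics` for expression-keyed statistics lookup
- Teach `PruningPredicate` to recognize `get_field` chains and rewrite them to synthetic dotted column names for statistics-based pruning
- Populate `expression_statistics` for struct leaf columns from Parquet metadata
- Enable file-level pruning for predicates like `WHERE col['a'] > 5`

Motivation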
Parquet stores struct fields (e.g., `col.a` inside struct `col`) as separate physical leaf columns with their own min/max/null statistics. DataFusion's `Statistics` struct uses `column_statistics: Vec<ColumnStatistics>` indexed 1:1 with schema fields, so struct field stats had nowhere to live. This blocked file-level pruning for predicates like `WHERE get_field(col, 'a') > 5`.

Design
By storing statistics keyed by expressions (`StatisticsKey::FieldPath`), the pruning system doesn't need to understand each expression type; it just checks "does this sub-expression have stats?" and uses them directly via dotted column name resolution.

Key changes
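As a rough illustration of the dotted-name rewriting described in the design, the sketch below flattens a `get_field` chain into a synthetic dotted column name and then applies a min/max check for a `> threshold` predicate. Everything here is an assumption made for illustration: the `Expr` enum is a hypothetical miniature expression tree rather than DataFusion's expression types, and `dotted_name` and `can_prune_gt` are invented helper names, not functions from the PR.

```rust
/// Hypothetical miniature expression tree; DataFusion's real expression
/// types are richer than this.
#[derive(Debug, Clone)]
enum Expr {
    /// A top-level column reference.
    Column(String),
    /// `get_field(base, name)`: access a named field of a struct value.
    GetField(Box<Expr>, String),
    /// An integer literal.
    Literal(i64),
}

/// Flattens a chain of `get_field` calls rooted at a column into a dotted
/// name, e.g. `get_field(get_field(col, 'a'), 'b')` becomes "col.a.b".
/// Returns `None` when the chain does not bottom out in a column.
fn dotted_name(expr: &Expr) -> Option<String> {
    match expr {
        Expr::Column(name) => Some(name.clone()),
        Expr::GetField(base, field) => {
            let prefix = dotted_name(base)?;
            Some(format!("{}.{}", prefix, field))
        }
        Expr::Literal(_) => None,
    }
}

/// Decides whether a file can be skipped for the predicate `x > threshold`,
/// given the known maximum of `x` in that file. Without a known maximum the
/// file must be kept.
fn can_prune_gt(max: Option<i64>, threshold: i64) -> bool {
    matches!(max, Some(m) if m <= threshold)
}
```

For `WHERE get_field(col, 'a') > 5`, the chain resolves to `col.a`; if the statistics stored under that name report a maximum of 5 or less, no row in the file can match and the file can be skipped. Note that this sketch does not handle a top-level column whose name already contains a dot, which would collide with a synthetic name.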
| File | Changes |
| --- | --- |
| datafusion/common/src/stats.rs | `StatisticsKey` enum, `expression_statistics` field, updated helper methods |
| datafusion/common/src/pruning.rs | `PrunableStatistics` falls back to `expression_statistics` for dotted column names |
| datafusion/datasource-parquet/src/metadata.rs | `expression_statistics` for struct leaf columns from Parquet metadata |
| datafusion/pruning/src/pruning_predicate.rs | `get_field` chain recognition, dotted name rewriting, schema hierarchy resolution |
| datafusion/proto-common/src/from_proto/mod.rs | |

Test plan
- `cargo test -p datafusion-common && cargo test -p datafusion-pruning && cargo test -p datafusion-datasource-parquet`
- Verify `expression_statistics` is populated; verify file pruning based on struct field predicates

🤖 Generated with Claude Code