Describe the bug
Title
Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias with statistics collection enabled
Description
I hit a panic in DataFusion 54.0.0 when querying a Parquet dataset from S3 and ordering by a computed projection alias.
The query is valid SQL and runs successfully in DuckDB on the same Parquet file. In DataFusion CLI, it panics during
statistics handling.
Environment
- DataFusion version: 54.0.0
- Interface: DataFusion CLI and Rust API
- Source: Parquet files on S3
- Dataset layout: Hive-style path segment, e.g.
s3:////partition_col=some_value/
The query does not reference partition_col.
The physical Parquet schema contains an id column.
Minimal SQL
CREATE EXTERNAL TABLE profile
STORED AS PARQUET
LOCATION 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/';
SELECT
(((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM profile
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;
Observed panic
thread 'main' panicked at .../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
index out of bounds: the len is 0 but the index is 0
The relevant source location appears to be:
// datafusion-datasource-54.0.0/src/statistics.rs
if i < s.column_statistics.len() {
...
} else {
let partition_value = &pv[i - s.column_statistics.len()];
...
}
It looks like this path assumes that when a sort/statistics column index exceeds column_statistics.len(), the column must be
a partition column and therefore indexes into partition_values. In this case partition_values is empty, causing the panic.
Additional observation
The physical id column does have Parquet min/max statistics. I checked with DuckDB metadata:
SELECT
count(*) AS row_groups,
count(*) FILTER (WHERE stats_min IS NULL OR stats_max IS NULL) AS missing_minmax,
count(*) FILTER (WHERE stats_min IS NOT NULL AND stats_max IS NOT NULL) AS has_minmax
FROM parquet_metadata('s3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/*.parquet')
WHERE path_in_schema = 'id';
Result:
row_groups = 75
missing_minmax = 0
has_minmax = 75
So this does not appear to be caused by missing min/max statistics for id.
DuckDB comparison
The same query shape works in DuckDB:
SELECT
(((CAST(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM 's3://<redacted-bucket>/<redacted-prefix>/partition_col=some_value/<redacted-file>.parquet'
ORDER BY computed_bucket, CAST(id AS BIGINT)
LIMIT 10;
Expected behavior
DataFusion should not panic.
If statistics cannot be derived for the computed expression computed_bucket, I would expect DataFusion to fall back to
unknown/absent column statistics and continue executing the query.
To Reproduce
No response
Expected behavior
No response
Additional context
No response
Describe the bug
Title
Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias with statistics collection enabled
Description
I hit a panic in DataFusion 54.0.0 when querying a Parquet dataset from S3 and ordering by a computed projection alias.
The query is valid SQL and runs successfully in DuckDB on the same Parquet file. In DataFusion CLI, it panics during
statistics handling.
Environment
s3:////partition_col=some_value/
The query does not reference partition_col.
The physical Parquet schema contains an id column.
Minimal SQL
Observed panic
thread 'main' panicked at .../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
index out of bounds: the len is 0 but the index is 0
The relevant source location appears to be:
It looks like this path assumes that when a sort/statistics column index exceeds column_statistics.len(), the column must be
a partition column and therefore indexes into partition_values. In this case partition_values is empty, causing the panic.
Additional observation
The physical id column does have Parquet min/max statistics. I checked with DuckDB metadata:
Result:
row_groups = 75
missing_minmax = 0
has_minmax = 75
So this does not appear to be caused by missing min/max statistics for id.
DuckDB comparison
The same query shape works in DuckDB:
Expected behavior
DataFusion should not panic.
If statistics cannot be derived for the computed expression computed_bucket, I would expect DataFusion to fall back to
unknown/absent column statistics and continue executing the query.
To Reproduce
No response
Expected behavior
No response
Additional context
No response