Conversation
This PR implements morsel-driven execution for Parquet files in DataFusion, enabling row-group level work sharing across partitions to mitigate data skew. Key changes: - Introduced `WorkQueue` in `datafusion/datasource/src/file_stream.rs` for shared pool of work. - Added `morselize` method to `FileOpener` trait to allow dynamic splitting of files into morsels. - Implemented `morselize` for `ParquetOpener` to split files into individual row groups. - Cached `ParquetMetaData` in `ParquetMorsel` extensions to avoid redundant I/O. - Modified `FileStream` to support work stealing from the shared queue. - Implemented `Weak` pointer pattern for `WorkQueue` in `FileScanConfig` to support plan re-executability. - Added `MorselizingGuard` to ensure shared state consistency on cancellation. - Added `allow_morsel_driven` configuration option (enabled by default for Parquet). - Implemented row-group pruning during the morselization phase for better efficiency. Tests: - Added `parquet_morsel_driven_execution` test to verify work distribution and re-executability. - Added `parquet_morsel_driven_enabled_by_default` to verify the default configuration. Co-authored-by: Dandandan <163737+Dandandan@users.noreply.github.com>
…en-execution-237164415184908839
|
🤖 |
|
Benchmark script failed with exit code 101. Last 10 lines of output: Click to expand |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
These tests directly query per-partition statistics from DataSourceExec, which returns unknown stats when wrapped by MergePartitionsExec. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
show benchmark queue |
|
🤖 Hi @Dandandan, you asked to view the benchmark queue (#20600 (comment)).
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark clickbench_extended |
|
run benchmarks |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?