feat: add shuffle size comparison benchmark [do not merge]#3909

Draft
andygrove wants to merge 2 commits into apache:main from andygrove:shuffle-size-benchmark

Conversation

@andygrove
Member

Which issue does this PR close?

Relates to #3882.

Rationale for this change

Issue #3882 reports that Comet shuffle files can be significantly larger than Spark shuffle files due to per-batch Arrow IPC format overhead. To investigate and measure this, we need a benchmark that compares actual shuffle write bytes between Spark and Comet.

What changes are included in this PR?

Adds a shuffle-size PySpark benchmark that:

  • Runs a scan → repartition → write pipeline
  • Queries the Spark REST API to report shuffle write bytes and bytes/record
  • Integrates with the existing benchmark framework (run_benchmark.py)
  • Includes a convenience shell script (run_shuffle_size_benchmark.sh) that runs the benchmark in both Spark and Comet native modes for easy comparison
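The REST API lookup described in the second bullet can be sketched as below. The endpoint paths follow Spark's monitoring REST API (`/api/v1/applications/{app-id}/stages`), and `shuffleWriteBytes`/`shuffleWriteRecords` are fields of the stage JSON; the helper names and the `localhost:4040` driver UI address are assumptions for illustration, not the actual `shuffle_size.py` code.

```python
import json
import urllib.request


def summarize_stages(stages):
    """Total shuffle write bytes/records across stages and derive bytes/record."""
    total_bytes = sum(s.get("shuffleWriteBytes", 0) for s in stages)
    total_records = sum(s.get("shuffleWriteRecords", 0) for s in stages)
    per_record = total_bytes / total_records if total_records else 0.0
    return total_bytes, total_records, per_record


def shuffle_write_metrics(ui_base="http://localhost:4040"):
    """Query the driver UI's monitoring REST API for shuffle write metrics."""
    with urllib.request.urlopen(f"{ui_base}/api/v1/applications") as resp:
        app_id = json.load(resp)[0]["id"]
    with urllib.request.urlopen(
        f"{ui_base}/api/v1/applications/{app_id}/stages"
    ) as resp:
        return summarize_stages(json.load(resp))
```

Summing across all stages keeps the helper simple; a real benchmark run might instead filter to the stages belonging to the repartition job.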

Usage:

# Generate test data
$SPARK_HOME/bin/spark-submit benchmarks/pyspark/generate_data.py --output /tmp/data --rows 200000000

# Run comparison
./benchmarks/pyspark/run_shuffle_size_benchmark.sh /tmp/data
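The generate_data.py step above produces the test dataset. For reference, one row of the short-strings schema from issue #3882 (seven short string columns plus one timestamp) might look like the output of this sketch; `make_row` and its parameters are hypothetical, not the actual generate_data.py code.

```python
import uuid
from datetime import datetime, timezone


def make_row(n_strings=7, width=8):
    """Build one row of the benchmark schema: short random strings + a timestamp."""
    strings = tuple(uuid.uuid4().hex[:width] for _ in range(n_strings))
    return strings + (datetime.now(timezone.utc),)
```

Short high-cardinality strings like these are a worst case for per-batch Arrow IPC overhead, which is why this schema is useful for reproducing the size gap.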

How are these changes tested?

This is a benchmark script, not production code. It was tested manually by running the benchmark with 204M rows (7 string columns + 1 timestamp column) and comparing Spark vs Comet shuffle write sizes.

Add a PySpark benchmark that measures shuffle write bytes via the
Spark REST API, making it easy to compare shuffle file sizes between
Spark and Comet shuffle implementations.
@andygrove changed the title from "feat: add shuffle size comparison benchmark" to "feat: add shuffle size comparison benchmark [do not merge]" on Apr 8, 2026
- Add --schema short-strings option to generate_data.py that produces
  7 short random UUID string columns + 1 timestamp, matching the schema
  from issue apache#3882
- Update shuffle_size.py to measure actual shuffle .data file sizes on
  disk via spark.local.dir, in addition to the REST API metric
- Update run_shuffle_size_benchmark.sh with dedicated local dirs per run,
  driver memory, and shuffle enable config
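The on-disk measurement mentioned above can be sketched as a directory walk that sums shuffle `.data` file sizes under `spark.local.dir`; `shuffle_data_bytes` is a hypothetical helper name, and the `.data` suffix matches the files Spark's sort-based shuffle writer produces.

```python
import os


def shuffle_data_bytes(local_dir):
    """Total the size of shuffle .data files under a Spark local directory."""
    total = 0
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if name.endswith(".data"):
                total += os.path.getsize(os.path.join(root, name))
    return total
```

Using a dedicated local dir per run (as the shell script does) keeps this measurement from picking up files left over from the other engine's run.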