feat: add shuffle size comparison benchmark [do not merge]#3909

Draft
andygrove wants to merge 2 commits into apache:main from andygrove:shuffle-size-benchmark

Conversation

@andygrove
Member

Which issue does this PR close?

Relates to #3882.

Rationale for this change

Issue #3882 reports that Comet shuffle files can be significantly larger than Spark shuffle files due to per-batch Arrow IPC format overhead. To investigate and measure this, we need a benchmark that compares actual shuffle write bytes between Spark and Comet.

What changes are included in this PR?

Adds a shuffle-size PySpark benchmark that:

  • Runs a scan → repartition → write pipeline
  • Queries the Spark REST API to report shuffle write bytes and bytes/record
  • Integrates with the existing benchmark framework (run_benchmark.py)
  • Includes a convenience shell script (run_shuffle_size_benchmark.sh) that runs the benchmark in both Spark and Comet native modes for easy comparison
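The REST API lookup described in the second bullet can be sketched as below. The endpoint paths follow Spark's monitoring REST API (`/api/v1/applications/{app-id}/stages`), and `shuffleWriteBytes`/`shuffleWriteRecords` are fields of the stage JSON; the helper names and the `localhost:4040` driver UI address are assumptions for illustration, not the actual `shuffle_size.py` code.

```python
import json
import urllib.request


def summarize_stages(stages):
    """Total shuffle write bytes/records across stages and derive bytes/record."""
    total_bytes = sum(s.get("shuffleWriteBytes", 0) for s in stages)
    total_records = sum(s.get("shuffleWriteRecords", 0) for s in stages)
    per_record = total_bytes / total_records if total_records else 0.0
    return total_bytes, total_records, per_record


def shuffle_write_metrics(ui_base="http://localhost:4040"):
    """Query the driver UI's monitoring REST API for shuffle write metrics."""
    with urllib.request.urlopen(f"{ui_base}/api/v1/applications") as resp:
        app_id = json.load(resp)[0]["id"]
    with urllib.request.urlopen(
        f"{ui_base}/api/v1/applications/{app_id}/stages"
    ) as resp:
        return summarize_stages(json.load(resp))
```

Summing across all stages keeps the helper simple; a real benchmark run might instead filter to the stages belonging to the repartition job.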

Usage:

# Generate test data
$SPARK_HOME/bin/spark-submit benchmarks/pyspark/generate_data.py --output /tmp/data --rows 200000000

# Run comparison
./benchmarks/pyspark/run_shuffle_size_benchmark.sh /tmp/data
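The generate_data.py step above produces the test dataset. For reference, one row of the short-strings schema from issue #3882 (seven short string columns plus one timestamp) might look like the output of this sketch; `make_row` and its parameters are hypothetical, not the actual generate_data.py code.

```python
import uuid
from datetime import datetime, timezone


def make_row(n_strings=7, width=8):
    """Build one row of the benchmark schema: short random strings + a timestamp."""
    strings = tuple(uuid.uuid4().hex[:width] for _ in range(n_strings))
    return strings + (datetime.now(timezone.utc),)
```

Short high-cardinality strings like these are a worst case for per-batch Arrow IPC overhead, which is why this schema is useful for reproducing the size gap.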

How are these changes tested?

This is a benchmark script, not production code. It was tested manually by running the benchmark with 204M rows (7 string columns + 1 timestamp column) and comparing Spark vs Comet shuffle write sizes.

Add a PySpark benchmark that measures shuffle write bytes via the
Spark REST API, making it easy to compare shuffle file sizes between
Spark and Comet shuffle implementations.
@andygrove changed the title from "feat: add shuffle size comparison benchmark" to "feat: add shuffle size comparison benchmark [do not merge]" on Apr 8, 2026
- Add --schema short-strings option to generate_data.py that produces
  7 short random UUID string columns + 1 timestamp, matching the schema
  from issue apache#3882
- Update shuffle_size.py to measure actual shuffle .data file sizes on
  disk via spark.local.dir, in addition to the REST API metric
- Update run_shuffle_size_benchmark.sh with dedicated local dirs per run,
  driver memory, and shuffle enable config
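The on-disk measurement mentioned above can be sketched as a directory walk that sums shuffle `.data` file sizes under `spark.local.dir`; `shuffle_data_bytes` is a hypothetical helper name, and the `.data` suffix matches the files Spark's sort-based shuffle writer produces.

```python
import os


def shuffle_data_bytes(local_dir):
    """Total the size of shuffle .data files under a Spark local directory."""
    total = 0
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if name.endswith(".data"):
                total += os.path.getsize(os.path.join(root, name))
    return total
```

Using a dedicated local dir per run (as the shell script does) keeps this measurement from picking up files left over from the other engine's run.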