23 changes: 0 additions & 23 deletions benchmarks/Dockerfile

This file was deleted.

227 changes: 158 additions & 69 deletions benchmarks/README.md

# Comet Benchmark Suite

Unified benchmark infrastructure for Apache DataFusion Comet. Supports
TPC-H/TPC-DS, shuffle, and microbenchmark suites across multiple engines
(Spark, Comet, Gluten) with composable configuration and optional memory profiling.

## Quick Start

```bash
# Run TPC-H with Comet on a standalone cluster
python benchmarks/run.py \
--engine comet --profile standalone-tpch --restart-cluster \
-- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
--output . --iterations 1

# Preview the spark-submit command without executing
python benchmarks/run.py \
--engine comet --profile standalone-tpch --dry-run \
-- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
--output . --iterations 1
```

## Directory Layout

```
benchmarks/
├── run.py # Entry point — builds and runs spark-submit
├── conf/
│ ├── engines/ # Per-engine configs (comet, spark, gluten, ...)
│ └── profiles/ # Per-environment configs (local, standalone, docker)
├── runner/
│ ├── cli.py # Python CLI passed to spark-submit (subcommands: tpc, shuffle, micro)
│ ├── config.py # Config file loader and merger
│ ├── spark_session.py # SparkSession builder
│ └── profiling.py # Level 1 JVM metrics via Spark REST API
├── suites/
│ ├── tpc.py # TPC-H / TPC-DS benchmark suite
│ ├── shuffle.py # Shuffle benchmark suite (hash, round-robin)
│ └── micro.py # Microbenchmark suite (string expressions, ...)
├── analysis/
│ ├── compare.py # Generate comparison charts from result JSON
│ └── memory_report.py # Generate memory reports from profiling CSV
├── infra/
│   └── docker/ # Dockerfile, docker-compose, metrics collector
├── create-iceberg-tpch.py # Utility: convert TPC-H Parquet to Iceberg tables
└── drop-caches.sh # Utility: drop OS page caches before benchmarks
```
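
The utilities at the bottom of the tree are standalone helpers. For example, to reduce
run-to-run variance you can clear the OS page cache before each timed run (a sketch;
assumes the script needs root privileges):

```bash
# Drop OS page caches before a timed benchmark run (assumes root is required)
sudo bash benchmarks/drop-caches.sh
```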

## How It Works

`run.py` is the single entry point. It:

1. Reads a **profile** config (cluster shape, memory, master URL)
2. Reads an **engine** config (plugin JARs, shuffle manager, engine-specific settings)
3. Applies any `--conf key=value` CLI overrides (highest precedence)
4. Builds and executes the `spark-submit` command

The merge order is: **profile < engine < CLI overrides**, so engine configs
can override profile defaults (e.g., an engine can set `offHeap.enabled=false`
even though the profile enables it).
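
A CLI override wins over both layers. For example, the following sketch (using the
standard Spark off-heap setting) forces off-heap memory off for a single run,
regardless of what the profile or engine config sets:

```bash
# --conf has the highest precedence, overriding both profile and engine configs
python benchmarks/run.py \
    --engine comet --profile standalone-tpch \
    --conf spark.memory.offHeap.enabled=false \
    -- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
    --output . --iterations 1
```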

### Wrapper arguments (before `--`)

| Flag | Description |
| ------------------- | ----------------------------------------------- |
| `--engine NAME` | Engine config from `conf/engines/NAME.conf` |
| `--profile NAME` | Profile config from `conf/profiles/NAME.conf` |
| `--conf key=value` | Extra Spark/runner config override (repeatable) |
| `--restart-cluster` | Stop/start Spark standalone master + worker |
| `--dry-run` | Print spark-submit command without executing |
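
These flags compose. For example, combining `--conf` (which may be repeated) with
`--dry-run` prints the fully merged `spark-submit` command without running anything:

```bash
# Preview the merged command with two extra Spark overrides applied
python benchmarks/run.py \
    --engine comet --profile standalone-tpch \
    --conf spark.executor.memory=16g \
    --conf spark.sql.shuffle.partitions=200 \
    --dry-run \
    -- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
    --output . --iterations 1
```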

### Suite arguments (after `--`)

Everything after `--` is passed to `runner/cli.py`. See per-suite docs:

- [TPC-H / TPC-DS](suites/TPC.md)
- [Shuffle](suites/SHUFFLE.md)
- [Microbenchmarks](suites/MICRO.md)

## Available Engines

| Engine | Config file | Description |
| ---------------------- | ----------------------------------- | --------------------------------- |
| `spark` | `engines/spark.conf` | Vanilla Spark (no accelerator) |
| `comet` | `engines/comet.conf` | DataFusion Comet with native scan |
| `comet-iceberg` | `engines/comet-iceberg.conf` | Comet + native Iceberg scanning |
| `gluten` | `engines/gluten.conf` | Gluten (Velox backend) — Java 8 |
| `spark-shuffle` | `engines/spark-shuffle.conf` | Spark baseline for shuffle tests |
| `comet-jvm-shuffle` | `engines/comet-jvm-shuffle.conf` | Comet with JVM shuffle mode |
| `comet-native-shuffle` | `engines/comet-native-shuffle.conf` | Comet with native shuffle |
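
A common pattern is to run the same suite arguments against two engines and compare the
results. A sketch (here `--name` sets the output file prefix picked up by the comparison
tooling shown below):

```bash
# Collect a vanilla Spark baseline and a Comet run over the same data
for ENGINE in spark comet; do
    python benchmarks/run.py \
        --engine $ENGINE --profile standalone-tpch --restart-cluster \
        -- tpc --name $ENGINE --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
        --output . --iterations 1
done
```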

## Available Profiles

| Profile | Config file | Description |
| ------------------ | -------------------------------- | ------------------------------ |
| `local` | `profiles/local.conf` | `local[*]` mode, no cluster |
| `standalone-tpch` | `profiles/standalone-tpch.conf` | 1 executor, 8 cores, S3A |
| `standalone-tpcds` | `profiles/standalone-tpcds.conf` | 2 executors, 16 cores, S3A |
| `docker` | `profiles/docker.conf` | For docker-compose deployments |
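
For a quick functional check before committing to a full cluster run, the `local`
profile runs everything in a single `local[*]` JVM. A sketch using the same suite
arguments as the Quick Start:

```bash
# Smoke-test a single TPC-H iteration in local[*] mode (no cluster required)
python benchmarks/run.py \
    --engine comet --profile local \
    -- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
    --output . --iterations 1
```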

## Environment Variables

The config files use `${VAR}` references that are expanded from the
environment at load time:

| Variable | Used by | Description |
| -------------- | -------------------- | --------------------------------- |
| `SPARK_HOME` | `run.py` | Path to Spark installation |
| `SPARK_MASTER` | standalone profiles | Spark master URL |
| `COMET_JAR` | comet engines | Path to Comet JAR |
| `GLUTEN_JAR` | gluten engine | Path to Gluten JAR |
| `ICEBERG_JAR` | comet-iceberg engine | Path to Iceberg Spark runtime JAR |
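
For example, a Comet run on a standalone cluster needs something like the following
before invoking `run.py` (paths and versions are illustrative):

```bash
# Illustrative environment setup; adjust paths and versions for your machine
export SPARK_HOME=/opt/spark
export SPARK_MASTER=spark://localhost:7077
export COMET_JAR=/path/to/comet-spark-spark3.5_2.12-<version>.jar
```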

## Profiling

Add the suite-level `--profile` flag (not the wrapper's `--profile NAME` option) to any
suite command to enable Level 1 JVM metrics collection via the Spark REST API:

```bash
python benchmarks/run.py --engine comet --profile standalone-tpch \
-- tpc --benchmark tpch --data $TPCH_DATA --queries $TPCH_QUERIES \
--output . --iterations 1 --profile --profile-interval 1.0
```

This writes a `{name}-{benchmark}-metrics.csv` alongside the result JSON.
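
For example, with `--name comet --benchmark tpch` the output directory ends up holding
both the timing results and the metrics file (names shown here illustrate the pattern,
not exact filenames):

```bash
ls comet-tpch-*
# comet-tpch-....json      <- timing results consumed by analysis/compare.py
# comet-tpch-metrics.csv   <- Level 1 JVM metrics consumed by analysis/memory_report.py
```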

For container-level memory profiling, use the constrained docker-compose
overlay — see [Docker infrastructure](infra/docker/).

## Generating Charts

```bash
# Compare two result JSON files
python -m benchmarks.analysis.compare \
comet-tpch-*.json spark-tpch-*.json \
--labels Comet Spark --benchmark tpch \
--title "TPC-H SF100" --output-dir ./charts

# Generate memory reports
python -m benchmarks.analysis.memory_report \
--spark-csv comet-tpch-metrics.csv \
--container-csv container-metrics.csv \
--output-dir ./charts
```

## Running in Docker

See [infra/docker/](infra/docker/) for docker-compose setup with optional
memory-constrained overlays and cgroup metrics collection.

The Docker image includes both Java 8 and Java 17 runtimes. Java 17 is the
default (`JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64`), which is required
by Comet. Gluten requires Java 8, so override `JAVA_HOME` for all containers
when running Gluten benchmarks:

```bash
# Start the cluster (see the note below about running the worker itself on Java 8)
docker compose -f benchmarks/infra/docker/docker-compose.yml up -d

# Run Gluten benchmark (override JAVA_HOME on all containers)
docker compose -f benchmarks/infra/docker/docker-compose.yml run --rm \
-e JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
-e GLUTEN_JAR=/jars/gluten.jar \
bench bash -c 'python3 /opt/benchmarks/run.py \
--engine gluten --profile docker \
-- tpc --name gluten --benchmark tpch --data /data \
--queries /queries --output /results --iterations 1'
```

> **Note:** The Spark worker must also run Java 8 for Gluten. Use a
> docker-compose override file to set `JAVA_HOME` on `spark-master` and
> `spark-worker` services before starting the cluster, or restart the
> cluster between engine switches.
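
One way to apply that override, sketched below, is a small docker-compose override file
(the file name and exact service layout are assumptions; match them to the services
defined in `docker-compose.yml`):

```bash
# Hypothetical override file pinning the cluster services to Java 8 for Gluten
cat > /tmp/java8-override.yml <<'EOF'
services:
  spark-master:
    environment:
      - JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  spark-worker:
    environment:
      - JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
EOF

# Start (or restart) the cluster with the override layered on top
docker compose -f benchmarks/infra/docker/docker-compose.yml \
    -f /tmp/java8-override.yml up -d
```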

16 changes: 16 additions & 0 deletions benchmarks/analysis/__init__.py
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.