
perf: Add additional sharding benchmarks #3712

Merged
d-v-b merged 5 commits into zarr-developers:main from mkitti:mkitti-morton-order-shard-indexing-benchmarks on Feb 18, 2026

Conversation

@mkitti (Contributor) commented on Feb 16, 2026

Summary

Added benchmarks for monitoring Morton order computation in sharded arrays. These benchmarks help assess the impact of Morton order optimizations in the context of I/O operations.

Benchmarks Added

  • test_sharded_morton_indexing - Sharded array indexing with power-of-2 chunks per shard
  • test_sharded_morton_indexing_large - Large shard with 32^3 = 32,768 chunks
  • test_sharded_morton_single_chunk - Reading a single chunk from a large shard
  • test_morton_order_iter - Direct benchmark of morton_order_iter (no I/O)
  • test_sharded_morton_write_single_chunk - Writing a single chunk to a large shard (best end-to-end test)
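
As a rough illustration, here is a minimal sketch of the single-chunk write benchmark. It assumes the pytest-benchmark `benchmark` fixture (as used by CodSpeed) and zarr-python 3's `zarr.create_array` / `MemoryStore` API; the parametrization and exact fixtures in this PR may differ.

```python
import pytest
import zarr
from zarr.storage import MemoryStore


@pytest.mark.parametrize("chunks_per_shard", [(32, 32, 32)])
def test_sharded_morton_write_single_chunk(benchmark, chunks_per_shard):
    # One shard containing 32**3 = 32,768 chunks of shape (1, 1, 1).
    arr = zarr.create_array(
        store=MemoryStore(),
        shape=chunks_per_shard,
        chunks=(1, 1, 1),
        shards=chunks_per_shard,
        dtype="uint8",
    )

    def write_one_chunk() -> None:
        # Writing a single chunk still walks the full Morton order of the
        # shard, which is exactly the cost this benchmark measures.
        arr[0, 0, 0] = 1

    benchmark(write_one_chunk)
```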

Benchmark Results

Single Chunk Write (Best End-to-End Test)

Writing a single 1x1x1 chunk to a shard with 32^3 = 32,768 chunks:

| Branch | Mean time | Improvement |
| --- | --- | --- |
| main (no optimization) | 425 ms | – |
| Optimized (PR #3708) | 261 ms | 164 ms (39% faster) |

Morton Order Computation (Micro-benchmark)

Direct morton_order_iter benchmark without I/O:

| Shape | Main branch | Optimized | Speedup |
| --- | --- | --- | --- |
| (8, 8, 8) | 2.73 ms | 0.85 ms | 3.2x |
| (16, 16, 16) | 25.53 ms | 6.31 ms | 4.0x |
| (32, 32, 32) | 229.25 ms | 51.31 ms | 4.5x |
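
For reference, the no-I/O micro-benchmark can be sketched roughly as below. The import path for `morton_order_iter` (assumed here to be `zarr.core.indexing`) and the parametrization are assumptions, not taken verbatim from this PR.

```python
import pytest
from zarr.core.indexing import morton_order_iter  # import path assumed


@pytest.mark.parametrize("shape", [(8, 8, 8), (16, 16, 16), (32, 32, 32)])
def test_morton_order_iter(benchmark, shape):
    # Exhausting the iterator measures pure Morton-order computation,
    # with no store access or codec work involved.
    benchmark(lambda: list(morton_order_iter(shape)))
```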

Profiling Analysis

Profile of single chunk write benchmark showing where time is spent:

Main Branch (977ms total)

| Function | Time | Calls | % of total |
| --- | --- | --- | --- |
| decode_morton (scalar) | 289 ms | 32,768 | 30% |
| get_chunk_slice | 104 ms | 32,768 | 11% |
| _localize_chunk | 103 ms | 32,768 | 11% |
| _morton_order | 99 ms | 1 | 10% |
| Generator expressions | 94 ms | 262k | 10% |
| all() / len() | 87 ms | 263k | 9% |

Optimized Branch (456ms total)

| Function | Time | Calls | % of total |
| --- | --- | --- | --- |
| get_chunk_slice | 110 ms | 32,768 | 24% |
| _localize_chunk | 105 ms | 32,768 | 23% |
| _morton_order | 66 ms | 1 | 14% |
| Generator expressions | 38 ms | 131k | 8% |
| decode_morton_vectorized | 9 ms | 1 | 2% |

Key Optimization Wins

  1. Vectorized decoding: Eliminates 32,768 scalar decode_morton calls (289ms → 9ms)
  2. Reduced bounds checking: Hypercube optimization eliminates all() checks for in-bounds coordinates
  3. Fewer function calls: 1.1M calls reduced to 299k calls
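
To make the first point concrete, below is a generic NumPy sketch of vectorized Morton decoding. This is not the code from PR #3708; the function name, signature, and bit-interleaving convention are illustrative. The idea is to de-interleave all 32,768 codes with array-wide bit operations instead of one Python-level call per code.

```python
import numpy as np


def decode_morton_vectorized(codes: np.ndarray, ndim: int, nbits: int) -> np.ndarray:
    """Decode Morton codes into an (n, ndim) array of chunk coordinates.

    Assumes the convention that bit k of dimension d is stored at bit
    position d + k * ndim of the code.
    """
    codes = np.asarray(codes, dtype=np.uint64)
    coords = np.zeros((codes.size, ndim), dtype=np.uint64)
    for d in range(ndim):
        for k in range(nbits):
            # Extract the interleaved bit for (dimension d, bit k) from every
            # code at once and place it at position k of coordinate d.
            bit = (codes >> np.uint64(d + k * ndim)) & np.uint64(1)
            coords[:, d] |= bit << np.uint64(k)
    return coords


# Decode every chunk position of a (32, 32, 32) shard in one vectorized call
# instead of 32,768 scalar decode_morton calls.
coords = decode_morton_vectorized(np.arange(32**3), ndim=3, nbits=5)
```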

Remaining Optimization Opportunity

get_chunk_slice and _localize_chunk are called 32,768 times even when writing a single chunk due to line 508 in sharding.py:

```python
shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}
```

This builds a dict of ALL chunks before writing. Optimizing this read-modify-write pattern could save an additional ~215ms.
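
One possible direction, sketched below with a hypothetical `LazyShardDict` (this is not zarr's API; the class and its parameters are illustrative), is to replace the eager dict with a mapping that only calls `shard_reader.get` when a key is actually accessed, so a single-chunk write no longer reads all 32,768 entries up front.

```python
from collections.abc import Iterator, Mapping
from typing import Callable


class LazyShardDict(Mapping):
    """Read-only mapping that fetches chunk bytes on access instead of eagerly."""

    def __init__(
        self,
        keys: list[tuple[int, ...]],
        get_bytes: Callable[[tuple[int, ...]], bytes | None],
    ) -> None:
        self._keys = keys
        self._get_bytes = get_bytes

    def __getitem__(self, key: tuple[int, ...]) -> bytes | None:
        # Deferred read: only chunks the writer actually touches are pulled
        # from the shard.
        return self._get_bytes(key)

    def __iter__(self) -> Iterator[tuple[int, ...]]:
        return iter(self._keys)

    def __len__(self) -> int:
        return len(self._keys)


# Hypothetical replacement for the eager comprehension above:
# shard_dict = LazyShardDict(list(morton_order_iter(chunks_per_shard)), shard_reader.get)
```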

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) on Feb 16, 2026
@github-actions bot removed the needs release notes label on Feb 16, 2026

mkitti commented Feb 17, 2026

If we wanted to minimize this pull request, I would reduce it to just "test_sharded_morton_write_single_chunk".


mkitti commented Feb 17, 2026

@d-v-b merge or add benchmark label, please.

@d-v-b added the benchmark label (code will be benchmarked in a CI job) on Feb 18, 2026

codspeed-hq bot commented Feb 18, 2026

Merging this PR will not alter performance

✅ 48 untouched benchmarks
🆕 8 new benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| 🆕 WallTime | test_sharded_morton_indexing[(16, 16, 16)-memory] | N/A | 149.1 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing[(32, 32, 32)-memory] | N/A | 1.2 s | N/A |
| 🆕 WallTime | test_morton_order_iter[(32, 32, 32)] | N/A | 498 ms | N/A |
| 🆕 WallTime | test_morton_order_iter[(8, 8, 8)] | N/A | 6.2 ms | N/A |
| 🆕 WallTime | test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] | N/A | 948.8 ms | N/A |
| 🆕 WallTime | test_sharded_morton_single_chunk[(32, 32, 32)-memory] | N/A | 1.9 ms | N/A |
| 🆕 WallTime | test_morton_order_iter[(16, 16, 16)] | N/A | 56.1 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing_large[(32, 32, 32)-memory] | N/A | 9.4 s | N/A |

Comparing mkitti:mkitti-morton-order-shard-indexing-benchmarks (1fc17c7) with main (306e480)


Footnotes

  1. 6 benchmarks were skipped, so their baseline results were used instead. Benchmarks that have been deleted from the codebase can be archived in CodSpeed to remove them from the performance reports.


d-v-b commented Feb 18, 2026

gpu test failures are unrelated (looks like cupy is not happy with the numpy void dtype)

@d-v-b merged commit 36caf1f into zarr-developers:main on Feb 18, 2026
25 of 26 checks passed
@d-v-b mentioned this pull request on Feb 18, 2026