
perf: Add additional sharding benchmarks #3712

Merged
d-v-b merged 5 commits into zarr-developers:main from mkitti:mkitti-morton-order-shard-indexing-benchmarks on Feb 18, 2026

Conversation

@mkitti (Contributor) commented on Feb 16, 2026

Summary

Added benchmarks for monitoring Morton order computation in sharded arrays. These benchmarks help assess the impact of Morton order optimizations in the context of I/O operations.

Benchmarks Added

  • test_sharded_morton_indexing - Sharded array indexing with power-of-2 chunks per shard
  • test_sharded_morton_indexing_large - Large shard with 32^3 = 32,768 chunks
  • test_sharded_morton_single_chunk - Reading a single chunk from a large shard
  • test_morton_order_iter - Direct benchmark of morton_order_iter (no I/O)
  • test_sharded_morton_write_single_chunk - Writing a single chunk to a large shard (best end-to-end test)
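
As a rough illustration, here is a minimal sketch of the single-chunk write benchmark. It assumes the pytest-benchmark `benchmark` fixture (as used by CodSpeed) and zarr-python 3's `zarr.create_array` / `MemoryStore` API; the parametrization and exact fixtures in this PR may differ.

```python
import pytest
import zarr
from zarr.storage import MemoryStore


@pytest.mark.parametrize("chunks_per_shard", [(32, 32, 32)])
def test_sharded_morton_write_single_chunk(benchmark, chunks_per_shard):
    # One shard containing 32**3 = 32,768 chunks of shape (1, 1, 1).
    arr = zarr.create_array(
        store=MemoryStore(),
        shape=chunks_per_shard,
        chunks=(1, 1, 1),
        shards=chunks_per_shard,
        dtype="uint8",
    )

    def write_one_chunk() -> None:
        # Writing a single chunk still walks the full Morton order of the
        # shard, which is exactly the cost this benchmark measures.
        arr[0, 0, 0] = 1

    benchmark(write_one_chunk)
```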

Benchmark Results

Single Chunk Write (Best End-to-End Test)

Writing a single 1x1x1 chunk to a shard with 32^3 = 32,768 chunks:

| Branch | Mean time | Improvement |
| --- | --- | --- |
| main (no optimization) | 425 ms | – |
| Optimized (PR #3708) | 261 ms | 164 ms (39% faster) |

Morton Order Computation (Micro-benchmark)

Direct morton_order_iter benchmark without I/O:

| Shape | Main branch | Optimized | Speedup |
| --- | --- | --- | --- |
| (8, 8, 8) | 2.73 ms | 0.85 ms | 3.2x |
| (16, 16, 16) | 25.53 ms | 6.31 ms | 4.0x |
| (32, 32, 32) | 229.25 ms | 51.31 ms | 4.5x |
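
For reference, the no-I/O micro-benchmark can be sketched roughly as below. The import path for `morton_order_iter` (assumed here to be `zarr.core.indexing`) and the parametrization are assumptions, not taken verbatim from this PR.

```python
import pytest
from zarr.core.indexing import morton_order_iter  # import path assumed


@pytest.mark.parametrize("shape", [(8, 8, 8), (16, 16, 16), (32, 32, 32)])
def test_morton_order_iter(benchmark, shape):
    # Exhausting the iterator measures pure Morton-order computation,
    # with no store access or codec work involved.
    benchmark(lambda: list(morton_order_iter(shape)))
```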

Profiling Analysis

Profile of single chunk write benchmark showing where time is spent:

Main Branch (977ms total)

| Function | Time | Calls | % of total |
| --- | --- | --- | --- |
| decode_morton (scalar) | 289 ms | 32,768 | 30% |
| get_chunk_slice | 104 ms | 32,768 | 11% |
| _localize_chunk | 103 ms | 32,768 | 11% |
| _morton_order | 99 ms | 1 | 10% |
| Generator expressions | 94 ms | 262k | 10% |
| all() / len() | 87 ms | 263k | 9% |

Optimized Branch (456ms total)

| Function | Time | Calls | % of total |
| --- | --- | --- | --- |
| get_chunk_slice | 110 ms | 32,768 | 24% |
| _localize_chunk | 105 ms | 32,768 | 23% |
| _morton_order | 66 ms | 1 | 14% |
| Generator expressions | 38 ms | 131k | 8% |
| decode_morton_vectorized | 9 ms | 1 | 2% |

Key Optimization Wins

  1. Vectorized decoding: Eliminates 32,768 scalar decode_morton calls (289ms → 9ms)
  2. Reduced bounds checking: Hypercube optimization eliminates all() checks for in-bounds coordinates
  3. Fewer function calls: 1.1M calls reduced to 299k calls
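
To make the first point concrete, below is a generic NumPy sketch of vectorized Morton decoding. This is not the code from PR #3708; the function name, signature, and bit-interleaving convention are illustrative. The idea is to de-interleave all 32,768 codes with array-wide bit operations instead of one Python-level call per code.

```python
import numpy as np


def decode_morton_vectorized(codes: np.ndarray, ndim: int, nbits: int) -> np.ndarray:
    """Decode Morton codes into an (n, ndim) array of chunk coordinates.

    Assumes the convention that bit k of dimension d is stored at bit
    position d + k * ndim of the code.
    """
    codes = np.asarray(codes, dtype=np.uint64)
    coords = np.zeros((codes.size, ndim), dtype=np.uint64)
    for d in range(ndim):
        for k in range(nbits):
            # Extract the interleaved bit for (dimension d, bit k) from every
            # code at once and place it at position k of coordinate d.
            bit = (codes >> np.uint64(d + k * ndim)) & np.uint64(1)
            coords[:, d] |= bit << np.uint64(k)
    return coords


# Decode every chunk position of a (32, 32, 32) shard in one vectorized call
# instead of 32,768 scalar decode_morton calls.
coords = decode_morton_vectorized(np.arange(32**3), ndim=3, nbits=5)
```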

Remaining Optimization Opportunity

get_chunk_slice and _localize_chunk are called 32,768 times even when writing a single chunk due to line 508 in sharding.py:

```python
shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}
```

This builds a dict of ALL chunks before writing. Optimizing this read-modify-write pattern could save an additional ~215ms.
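
One possible direction, sketched below with a hypothetical `LazyShardDict` (this is not zarr's API; the class and its parameters are illustrative), is to replace the eager dict with a mapping that only calls `shard_reader.get` when a key is actually accessed, so a single-chunk write no longer reads all 32,768 entries up front.

```python
from collections.abc import Iterator, Mapping
from typing import Callable


class LazyShardDict(Mapping):
    """Read-only mapping that fetches chunk bytes on access instead of eagerly."""

    def __init__(
        self,
        keys: list[tuple[int, ...]],
        get_bytes: Callable[[tuple[int, ...]], bytes | None],
    ) -> None:
        self._keys = keys
        self._get_bytes = get_bytes

    def __getitem__(self, key: tuple[int, ...]) -> bytes | None:
        # Deferred read: only chunks the writer actually touches are pulled
        # from the shard.
        return self._get_bytes(key)

    def __iter__(self) -> Iterator[tuple[int, ...]]:
        return iter(self._keys)

    def __len__(self) -> int:
        return len(self._keys)


# Hypothetical replacement for the eager comprehension above:
# shard_dict = LazyShardDict(list(morton_order_iter(chunks_per_shard)), shard_reader.get)
```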

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) on Feb 16, 2026
@github-actions bot removed the needs release notes label on Feb 16, 2026

mkitti commented Feb 17, 2026

If we wanted to minimize this pull request, I would reduce it to just "test_sharded_morton_write_single_chunk".


mkitti commented Feb 17, 2026

@d-v-b merge or add benchmark label, please.

@d-v-b added the benchmark label (code will be benchmarked in a CI job) on Feb 18, 2026

codspeed-hq bot commented Feb 18, 2026

Merging this PR will not alter performance

✅ 48 untouched benchmarks
🆕 8 new benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| 🆕 WallTime | test_sharded_morton_indexing[(16, 16, 16)-memory] | N/A | 149.1 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing[(32, 32, 32)-memory] | N/A | 1.2 s | N/A |
| 🆕 WallTime | test_morton_order_iter[(32, 32, 32)] | N/A | 498 ms | N/A |
| 🆕 WallTime | test_morton_order_iter[(8, 8, 8)] | N/A | 6.2 ms | N/A |
| 🆕 WallTime | test_sharded_morton_write_single_chunk[(32, 32, 32)-memory] | N/A | 948.8 ms | N/A |
| 🆕 WallTime | test_sharded_morton_single_chunk[(32, 32, 32)-memory] | N/A | 1.9 ms | N/A |
| 🆕 WallTime | test_morton_order_iter[(16, 16, 16)] | N/A | 56.1 ms | N/A |
| 🆕 WallTime | test_sharded_morton_indexing_large[(32, 32, 32)-memory] | N/A | 9.4 s | N/A |

Comparing mkitti:mkitti-morton-order-shard-indexing-benchmarks (1fc17c7) with main (306e480)


Footnotes

  1. 6 benchmarks were skipped, so their baseline results were used instead. Benchmarks that have been deleted from the codebase can be archived in CodSpeed to remove them from the performance reports.


d-v-b commented Feb 18, 2026

gpu test failures are unrelated (looks like cupy is not happy with the numpy void dtype)

@d-v-b merged commit 36caf1f into zarr-developers:main on Feb 18, 2026
25 of 26 checks passed
@d-v-b mentioned this pull request on Feb 18, 2026