perf: Vectorize get_chunk_slice for faster sharded writes #3713

Open

mkitti wants to merge 20 commits into zarr-developers:main from mkitti:mkitti-get-chunk-slice-vectorization

Conversation


mkitti commented on Feb 17, 2026

Summary

This PR adds vectorized methods to _ShardIndex and _ShardReader for batch chunk slice lookups, significantly reducing per-chunk function call overhead when writing to shards.

Changes

New Methods

_ShardIndex.get_chunk_slices_vectorized: Batch lookup of chunk slices using NumPy vectorized operations instead of per-chunk Python calls.

_ShardReader.to_dict_vectorized: Build a chunk dictionary using vectorized lookup instead of iterating with individual get() calls.
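
For illustration, a minimal standalone sketch of the batch-lookup idea. It assumes the shard index stores an offsets_and_lengths array of shape chunks_per_shard + (2,), with the maximum uint64 value marking absent chunks, and that coordinates are already localized to the shard; these details are assumptions, not the PR's exact implementation.

import numpy as np

MAX_UINT_64 = np.iinfo(np.uint64).max  # sentinel for "chunk not present" (assumed layout)

def get_chunk_slices_vectorized(offsets_and_lengths, chunk_coords):
    # chunk_coords: (N, ndim) integer array of chunk coordinates local to the shard.
    # A single advanced-indexing gather replaces N per-chunk Python lookups.
    entries = offsets_and_lengths[tuple(chunk_coords.T)]  # shape (N, 2)
    offsets, lengths = entries[:, 0], entries[:, 1]
    valid = offsets != MAX_UINT_64                        # mask of non-empty chunks
    return offsets, offsets + lengths, valid              # starts, stops, mask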

Modified Code Path

In _encode_partial_single, replaced:

shard_dict = {k: shard_reader.get(k) for k in morton_order_iter(chunks_per_shard)}

With the vectorized approach:

morton_coords = _morton_order(chunks_per_shard)
chunk_coords_array = np.array(morton_coords, dtype=np.uint64)
shard_dict = shard_reader.to_dict_vectorized(chunk_coords_array, morton_coords)
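
The comprehension above makes one Python-level get() call per chunk in Morton order (32,768 for a 32³-chunk shard); the replacement gathers every chunk's slice in one NumPy operation and then only slices the shard buffer. For illustration, a dict could be assembled from such a batch lookup roughly like this, reusing the sketch above (the buffer handling and the index's offsets_and_lengths attribute are assumptions, not the PR's exact code):

def to_dict_vectorized(shard_bytes, index, chunk_coords_array, morton_coords):
    # One vectorized gather for every chunk's (start, stop, valid) triple ...
    starts, stops, valid = get_chunk_slices_vectorized(
        index.offsets_and_lengths, chunk_coords_array
    )
    # ... then a plain loop that only slices the shard buffer.
    return {
        coords: (shard_bytes[int(start):int(stop)] if ok else None)
        for coords, start, stop, ok in zip(morton_coords, starts, stops, valid)
    }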

Benchmark Results

Single Chunk Write to Large Shard

Writing a single 1x1x1 chunk to a shard with 32³ chunks (using test_sharded_morton_write_single_chunk from PR #3712):

Optimization                         Time     Speedup vs Main
Main branch (original)               422 ms   -
+ Morton optimization (PR #3708)     261 ms   1.6x
+ Vectorized get_chunk_slice          95 ms   4.4x
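
For context, a rough way to reproduce this scenario with the public API (this is not the test from PR #3712; the array parameters are assumptions inferred from the description above):

import time
import zarr

# One shard covering the whole array, holding 32**3 = 32,768 chunks of 1x1x1.
store = zarr.storage.MemoryStore()
arr = zarr.create_array(
    store,
    shape=(32, 32, 32),
    chunks=(1, 1, 1),
    shards=(32, 32, 32),
    dtype="uint8",
)

t0 = time.perf_counter()
arr[0, 0, 0] = 1  # a one-element write touches a single chunk inside the shard
print(f"single-chunk write: {(time.perf_counter() - t0) * 1e3:.1f} ms")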

Profile Breakdown

Function                            Before    After
get_chunk_slice + _localize_chunk   215 ms    3 ms
to_dict_vectorized loop              81 ms    9 ms
Total function calls                299k      37k

Checklist

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

mkitti and others added 20 commits February 13, 2026 00:15
Add benchmarks that clear the _morton_order LRU cache before each
iteration to measure the full Morton computation cost:

- test_sharded_morton_indexing: 512-4096 chunks per shard
- test_sharded_morton_indexing_large: 32768 chunks per shard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
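
For reference, the clear-before-timing pattern described in this commit might look roughly like this under pytest-benchmark (the import path and fixture usage here are assumptions, not the actual benchmark code):

from zarr.codecs.sharding import _morton_order  # import path is an assumption

def test_sharded_morton_indexing(benchmark):
    def run():
        _morton_order.cache_clear()  # drop the LRU cache so each iteration
        _morton_order((8, 8, 8))     # pays the full Morton computation cost
    benchmark(run)
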
Add vectorized methods to _ShardIndex and _ShardReader for batch
chunk slice lookups, reducing per-chunk function call overhead
when writing to shards.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>