
sketch out sync codecs + threadpool #3715

Draft
d-v-b wants to merge 17 commits into zarr-developers:main from d-v-b:perf/faster-codecs

Conversation


d-v-b (Contributor) commented Feb 18, 2026

This is a work in progress, with all the heavy lifting done by Claude. The goal is to improve the performance of our codecs by avoiding overhead in to_thread and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.
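For context, here is a minimal, self-contained sketch of the overhead in question. It is illustrative only; the payload size and iteration count are arbitrary assumptions, not numbers from this PR. Dispatching a cheap CPU-bound call such as gzip compression through asyncio.to_thread pays event-loop scheduling and thread-handoff costs on every call, which a direct synchronous call avoids:

```python
import asyncio
import gzip
import time

payload = bytes(8_000)  # one small chunk, roughly the 1000-element case below

async def encode_via_to_thread(n: int) -> float:
    # Each iteration pays for task scheduling plus a thread handoff.
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.to_thread(gzip.compress, payload)
    return time.perf_counter() - start

def encode_sync(n: int) -> float:
    # Same CPU-bound work, called directly with no async machinery.
    start = time.perf_counter()
    for _ in range(n):
        gzip.compress(payload)
    return time.perf_counter() - start

print("to_thread:", asyncio.run(encode_via_to_thread(1000)))
print("sync     :", encode_sync(1000))
```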

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Feb 18, 2026
@d-v-b added the "benchmark" label (code will be benchmarked in a CI job) and removed the "needs release notes" label on Feb 18, 2026
@github-actions bot added the "needs release notes" label on Feb 18, 2026

codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×5

⚡ 50 improved benchmarks
✅ 6 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,031.6 ms | 270.8 ms | ×3.8 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 554.3 ms | 181.7 ms | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 1,551.5 ms | 684.4 ms | ×2.3 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 2,111.7 ms | 791.6 ms | ×2.7 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 1,204.9 ms | 552.4 ms | ×2.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 2.7 s | 1.3 s | ×2 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory] | 1,831.3 µs | 662.2 µs | ×2.8 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 278.1 ms | 66.7 ms | ×4.2 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 1,315 ms | 532.1 ms | ×2.5 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,631.2 ms | 639.4 ms | ×2.6 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 6 s | 1.4 s | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 619.7 ms | 143.9 ms | ×4.3 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 2,886.8 ms | 604.5 ms | ×4.8 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 419.2 ms | 99.4 ms | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 952.5 ms | 228.6 ms | ×4.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 3.2 s | 1.5 s | ×2.2 |
| … | … | … | … | … |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/faster-codecs (9d77ca5) with main (f8b3d38)

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

d-v-b commented:

@rabernat @dcherian have a look, this is Claude's summary of the perf blockers addressed in this PR.


d-v-b commented Feb 19, 2026

performance impact ranges from "good" to "amazing", so I think we want to learn from this PR. IMO this is NOT a merge candidate; rather, it should function as a proof of concept for what we can get if we rethink our current codec API.

Some key points:

  • Wrapping CPU-bound routines like gzip encode / decode in async machinery adds needless latency. We get a lot of perf by using a sync fast path whenever possible. We need to bake this "sync is faster when available" lesson into both our codec API and store API; for example, there is no reason that reading or writing an in-memory dict should be async (see the sketch after this list).
  • We should design the chunk encoding process so that IO bound and CPU-bound routines are logically separated in the codebase. That means modelling sharding as a codec is probably wrong. Sharding is declared as a codec in array metadata, but we don't need to model it as a codec internally. Sharding changes how we do IO, but it should not change when we do IO.
  • I haven't looked at memory use at all; that's probably a separate effort.
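To make the first point concrete, here is a minimal sketch of what a "sync fast path" codec API could look like. The class and method names (GzipCodec, encode_sync) are hypothetical illustrations, not the API in this PR:

```python
import asyncio
import gzip

class GzipCodec:
    def encode_sync(self, data: bytes) -> bytes:
        # CPU-bound and cheap to call directly; no await needed.
        return gzip.compress(data)

    async def encode(self, data: bytes) -> bytes:
        # Async wrapper kept for callers that need it; delegates to the sync path.
        return await asyncio.to_thread(self.encode_sync, data)

def encode_chunks(codec: GzipCodec, chunks: list[bytes]) -> list[bytes]:
    # Sync fast path: no event loop, no task scheduling, no thread handoff.
    return [codec.encode_sync(c) for c in chunks]
```

The design choice is that the sync method is the primary implementation and the async method is a thin adapter, rather than the other way around, so synchronous callers never pay for async machinery they don't use.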


d-v-b commented Feb 19, 2026

the current performance improvements are without any parallelism. I'm adding that now.


d-v-b commented Feb 19, 2026

the latest commit adds thread-based parallelism to the synchronous codec pipeline. we compute an estimated compute cost from the chunk size, codecs, and operation (encode / decode), and use that estimate to choose a parallelism strategy, ranging from no threads to full use of a thread pool.
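A hedged sketch of what such cost-based strategy selection could look like. The cost weights, threshold, and function names below are illustrative assumptions, not the values or code in this PR:

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-byte cost weights for each codec (heavier = more CPU work).
CODEC_COST = {"gzip": 5.0, "bytes": 0.1}

def estimated_cost(nbytes: int, codecs: list[str], op: str) -> float:
    # Scale work by chunk size, codec weight, and operation; encoding is
    # assumed to be more expensive than decoding here.
    weight = sum(CODEC_COST.get(c, 1.0) for c in codecs)
    return nbytes * weight * (1.5 if op == "encode" else 1.0)

def run_codecs(
    chunks: list[bytes],
    codecs: list[str],
    op: str,
    apply_one: Callable[[bytes], bytes],
    pool: ThreadPoolExecutor,
) -> list[bytes]:
    cost = sum(estimated_cost(len(c), codecs, op) for c in chunks)
    if cost < 1_000_000:  # illustrative threshold
        # Cheap work: run inline and avoid thread handoff overhead entirely.
        return [apply_one(c) for c in chunks]
    # Expensive work: fan out across the thread pool.
    return list(pool.map(apply_one, chunks))
```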
