
sketch out sync codecs + threadpool #3715

Draft
d-v-b wants to merge 17 commits into zarr-developers:main from d-v-b:perf/faster-codecs

Conversation


d-v-b (Contributor) commented Feb 18, 2026

This is a work in progress, with all the heavy lifting done by Claude. The goal is to improve the performance of our codecs by avoiding overhead in to_thread and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.
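For context, here is a minimal, self-contained sketch of the overhead in question. It is illustrative only; the payload size and iteration count are arbitrary assumptions, not numbers from this PR. Dispatching a cheap CPU-bound call such as gzip compression through asyncio.to_thread pays event-loop scheduling and thread-handoff costs on every call, which a direct synchronous call avoids:

```python
import asyncio
import gzip
import time

payload = bytes(8_000)  # one small chunk, roughly the 1000-element case below

async def encode_via_to_thread(n: int) -> float:
    # Each iteration pays for task scheduling plus a thread handoff.
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.to_thread(gzip.compress, payload)
    return time.perf_counter() - start

def encode_sync(n: int) -> float:
    # Same CPU-bound work, called directly with no async machinery.
    start = time.perf_counter()
    for _ in range(n):
        gzip.compress(payload)
    return time.perf_counter() - start

print("to_thread:", asyncio.run(encode_via_to_thread(1000)))
print("sync     :", encode_sync(1000))
```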

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Feb 18, 2026
@d-v-b added the "benchmark" label (code will be benchmarked in a CI job) and removed the "needs release notes" label on Feb 18, 2026
@github-actions bot added the "needs release notes" label on Feb 18, 2026

codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×5

⚡ 50 improved benchmarks
✅ 6 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,031.6 ms | 270.8 ms | ×3.8 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 554.3 ms | 181.7 ms | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 1,551.5 ms | 684.4 ms | ×2.3 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 2,111.7 ms | 791.6 ms | ×2.7 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 1,204.9 ms | 552.4 ms | ×2.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 2.7 s | 1.3 s | ×2 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory] | 1,831.3 µs | 662.2 µs | ×2.8 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 278.1 ms | 66.7 ms | ×4.2 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 1,315 ms | 532.1 ms | ×2.5 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,631.2 ms | 639.4 ms | ×2.6 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 6 s | 1.4 s | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 619.7 ms | 143.9 ms | ×4.3 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 2,886.8 ms | 604.5 ms | ×4.8 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 419.2 ms | 99.4 ms | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 952.5 ms | 228.6 ms | ×4.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 3.2 s | 1.5 s | ×2.2 |
| … | … | … | … | … |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/faster-codecs (9d77ca5) with main (f8b3d38)

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

d-v-b commented:

@rabernat @dcherian have a look, this is Claude's summary of the perf blockers addressed in this PR.


d-v-b commented Feb 19, 2026

performance impact ranges from "good" to "amazing", so I think we want to learn from this PR. IMO this is NOT a merge candidate; rather, it should function as a proof of concept for what we can get if we rethink our current codec API.

Some key points:

  • Wrapping CPU-bound routines like gzip encode / decode in async machinery adds needless latency. We get a lot of perf by using a sync fast path whenever possible. We need to bake this "sync is faster when available" lesson into both our codec API and store API; for example, there is no reason that reading or writing an in-memory dict should be async (see the sketch after this list).
  • We should design the chunk encoding process so that IO bound and CPU-bound routines are logically separated in the codebase. That means modelling sharding as a codec is probably wrong. Sharding is declared as a codec in array metadata, but we don't need to model it as a codec internally. Sharding changes how we do IO, but it should not change when we do IO.
  • I haven't looked at memory use at all; that's probably a separate effort.
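To make the first point concrete, here is a minimal sketch of what a "sync fast path" codec API could look like. The class and method names (GzipCodec, encode_sync) are hypothetical illustrations, not the API in this PR:

```python
import asyncio
import gzip

class GzipCodec:
    def encode_sync(self, data: bytes) -> bytes:
        # CPU-bound and cheap to call directly; no await needed.
        return gzip.compress(data)

    async def encode(self, data: bytes) -> bytes:
        # Async wrapper kept for callers that need it; delegates to the sync path.
        return await asyncio.to_thread(self.encode_sync, data)

def encode_chunks(codec: GzipCodec, chunks: list[bytes]) -> list[bytes]:
    # Sync fast path: no event loop, no task scheduling, no thread handoff.
    return [codec.encode_sync(c) for c in chunks]
```

The design choice is that the sync method is the primary implementation and the async method is a thin adapter, rather than the other way around, so synchronous callers never pay for async machinery they don't use.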


d-v-b commented Feb 19, 2026

the current performance improvements are without any parallelism. I'm adding that now.


d-v-b commented Feb 19, 2026

the latest commit adds thread-based parallelism to the synchronous codec pipeline. we compute an estimated compute cost from the chunk size, codecs, and operation (encode / decode), and use that estimate to choose a parallelism strategy, ranging from no threads to full use of a thread pool.
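A hedged sketch of what such cost-based strategy selection could look like. The cost weights, threshold, and function names below are illustrative assumptions, not the values or code in this PR:

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-byte cost weights for each codec (heavier = more CPU work).
CODEC_COST = {"gzip": 5.0, "bytes": 0.1}

def estimated_cost(nbytes: int, codecs: list[str], op: str) -> float:
    # Scale work by chunk size, codec weight, and operation; encoding is
    # assumed to be more expensive than decoding here.
    weight = sum(CODEC_COST.get(c, 1.0) for c in codecs)
    return nbytes * weight * (1.5 if op == "encode" else 1.0)

def run_codecs(
    chunks: list[bytes],
    codecs: list[str],
    op: str,
    apply_one: Callable[[bytes], bytes],
    pool: ThreadPoolExecutor,
) -> list[bytes]:
    cost = sum(estimated_cost(len(c), codecs, op) for c in chunks)
    if cost < 1_000_000:  # illustrative threshold
        # Cheap work: run inline and avoid thread handoff overhead entirely.
        return [apply_one(c) for c in chunks]
    # Expensive work: fan out across the thread pool.
    return list(pool.map(apply_one, chunks))
```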
