tl;dr: I think we should add in-memory arrays to `ManifestStore`, probably by generalizing the wrapped type to `ManifestArray | np.ndarray`.
## Context
This library has evolved from managing only "virtual" xarray datasets into a design where `ManifestStore` really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and `ManifestStore` should be able to stand alone as independently useful.
However, `ManifestStore` currently lacks a way to represent "inlined" chunks. Comparing the different representations of virtual and non-virtual chunks, the absence is conspicuous:
| | ManifestStore | "virtual" xarray.Dataset | Icechunk format | Kerchunk format |
| --- | --- | --- | --- | --- |
| external references | `ManifestArray` | `xr.Variable` wrapping `ManifestArray` (a so-called "virtual variable") | virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | ❌ | `xr.Variable` wrapping (lazy) `np.ndarray` (a so-called "loadable variable") | native chunk ref (also potentially inlined into manifests, but that's an internal optimization) | inlined data stored directly in the reference JSON (e.g. base64-encoded) |
This causes multiple problems:

- Roundtripping kerchunk -> `ManifestStore` -> virtual `xarray.Dataset` -> kerchunk fails if the original kerchunk file contains inlined references (Error reading inlined reference data when trying to roundtrip virtual dataset #489).
- It limits the utility of the proposed `.to_icechunk`/`.to_kerchunk` methods on `ManifestStore`, as they can't handle all the cases that `ds.vz.to_icechunk`/`to_kerchunk` can (Add `.to_icechunk()` method to `ManifestGroup` #591). Without this, xarray can't really be an optional dependency.
## Proposal: Add `numpy.ndarray` as a wrapped type in `ManifestStore`
Currently `ManifestStore` holds `ManifestGroup`s, which have `._members: Mapping[str, "ManifestArray | ManifestGroup"]`. If we generalize this to `._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"]`, we can store inlined chunks in memory in the `ManifestStore` as numpy arrays.
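A minimal sketch of the generalized typing (using a stand-in `ManifestArray` class, since only the member type matters here):

```python
from typing import Mapping, Union

import numpy as np


class ManifestArray: ...  # stand-in for virtualizarr.manifests.ManifestArray


class ManifestGroup:
    # Before: Mapping[str, "ManifestArray | ManifestGroup"]
    # After:  plain numpy arrays are also allowed as members, representing
    #         chunks whose data lives in memory ("inlined" chunks) rather
    #         than behind a virtual chunk reference.
    _members: Mapping[str, Union[np.ndarray, ManifestArray, "ManifestGroup"]]
```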
We would need to ensure that `xr.open_dataset(ManifestStore)` could still read these arrays, which might require something clever during `ManifestStore.get`.
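As a hypothetical sketch of what that dispatch could look like (the toy class, its flat key scheme, and the single-chunk simplification are all assumptions for illustration, not the real `ManifestStore.get`):

```python
import numpy as np


class ToyManifestStore:
    """Toy stand-in showing how `get` might serve in-memory members."""

    def __init__(self, members: dict):
        self._members = members

    def get(self, key: str) -> bytes:
        # e.g. key == "time/c/0": array name, then a zarr-style chunk key
        name, _, chunk_key = key.partition("/")
        member = self._members[name]
        if isinstance(member, np.ndarray):
            # Inlined member: treat the whole array as a single
            # uncompressed chunk and return its raw buffer.
            return member.tobytes()
        # ManifestArray members would be fetched via the chunk manifest,
        # exactly as ManifestStore does today.
        raise NotImplementedError


store = ToyManifestStore({"time": np.arange(4, dtype="int64")})
assert store.get("time/c/0") == np.arange(4, dtype="int64").tobytes()
```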
`vz.open_virtual_dataset` (really just a backwards-compatibility wrapper around `ManifestStore.to_virtual_dataset()`) also needs to know that any array stored as a numpy array is to become a loadable variable.
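For instance (a hedged sketch assuming xarray is available; `members_to_variables` and the hard-coded dims are hypothetical, not the real conversion code):

```python
import numpy as np
import xarray as xr


def members_to_variables(members: dict[str, object]) -> dict[str, xr.Variable]:
    # Any member stored as an in-memory numpy array is, by definition, a
    # "loadable variable": a plain numpy-backed xr.Variable that xarray
    # can build indexes from. ManifestArray members would instead be
    # wrapped as "virtual variables" (elided here).
    variables = {}
    for name, member in members.items():
        if isinstance(member, np.ndarray):
            variables[name] = xr.Variable(dims=(name,), data=member)
    return variables


vars_ = members_to_variables({"time": np.arange(4)})
print(vars_["time"])  # a numpy-backed, index-capable variable
```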
This also means parsers would effectively now have the power to decide to create "loadable variables". That actually makes some sense: `open_virtual_dataset` has a problem in that its `loadable_variables` kwarg mixes concerns. Sometimes you want a variable loaded so it can be used for xarray indexing; sometimes you want it loaded as a storage optimization for when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob controlling two things.
Note that storing a numpy array directly means there is no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.
## Alternative considered (but I vote against this): Use `obstore.MemoryStore`
In #636 and #794 it was suggested that we store inlined chunks in the `ManifestStore` by adding an `obstore.MemoryStore` to the `ObjectStoreRegistry` and referencing the bytes via virtual chunk references of the form `memory://{location}`. This is neat in that you can read the bytes back directly using obstore and zarr-python.
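For concreteness, the rejected approach would look roughly like this (a sketch under the assumption of obstore's `MemoryStore`/`put` API; the key and the manifest-entry shape are illustrative):

```python
import obstore
from obstore.store import MemoryStore

# Write the inlined chunk bytes into an in-memory obstore store...
memory_store = MemoryStore()
obstore.put(memory_store, "some-array/c/0/0", b"\x00" * 64)

# ...then point a virtual chunk reference at it, so readers resolve the
# memory:// URL through the ObjectStoreRegistry like any other store.
chunk_ref = {"path": "memory://some-array/c/0/0", "offset": 0, "length": 64}
```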
The reason I don't think this is a good idea is that it breaks the store abstraction. Our current design has the very neat property that the `ObjectStoreRegistry` can hold any `obstore.Store`, and we treat all of them identically, no matter their contents. We would instead be special-casing `obstore.MemoryStore` relative to other obstore store types. We also want to be able to treat all icechunk stores identically (note Icechunk has `in_memory_storage` and supports virtual references beginning with `memory://` too). Special-casing `memory://`, or some namespace within `memory://`, breaks this abstraction.
Some other problems special-casing `MemoryStore` would cause:

- When calling `ManifestStore.to_icechunk` on an icechunk store that supports virtual references pointing at in-memory locations, how would we confidently distinguish a `memory://` reference that represents an inlined chunk from one that represents a virtual chunk?
- How would we avoid name collisions inside the `memory://` namespace? Support in-lined Kerchunk references #794 does this by having the kerchunk parser write inlined data into `"memory://kerchunk/"`, but that seems like a leaky abstraction. It also means the `ManifestStore` contents depend on the parser that created them, which is equally leaky.
cc @maxrjones @sharkinsspatial @ilan-gold @paraseba