tl;dr: I think we should add in-memory arrays to `ManifestStore`, probably by generalizing the wrapped type to `ManifestArray | np.ndarray`.
## Context
This library has evolved from managing only "virtual" xarray datasets into a design where `ManifestStore` really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and `ManifestStore` should be able to stand alone as independently useful.
However, `ManifestStore` currently lacks a way to represent "inlined" chunks. Comparing the different representations of virtual and non-virtual chunks, the absence is conspicuous:
| | ManifestStore | "virtual" xarray.Dataset | Icechunk format | Kerchunk format |
| --- | --- | --- | --- | --- |
| external references | `ManifestArray` | `xr.Variable` wrapping `ManifestArray` (a so-called "virtual variable") | virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | ❌ | `xr.Variable` wrapping (lazy) `np.ndarray` (a so-called "loadable variable") | native chunk ref (also potentially inlined into manifests, but that's an internal optimization) | inlined data stored directly in the reference JSON (e.g. base64-encoded) |
This causes multiple problems:

- Roundtripping kerchunk -> `ManifestStore` -> virtual `xarray.Dataset` -> kerchunk fails if the original kerchunk file contains inlined references (Error reading inlined reference data when trying to roundtrip virtual dataset #489).
- It limits the utility of the proposed `.to_icechunk`/`.to_kerchunk` methods on `ManifestStore`, as they can't handle all the cases that `ds.vz.to_icechunk`/`to_kerchunk` can (Add `.to_icechunk()` method to `ManifestGroup` #591). Without this, xarray can't really be an optional dependency.
## Proposal: Add `numpy.ndarray` as a wrapped type in `ManifestStore`
Currently `ManifestStore` holds `ManifestGroup`s, which have `._members: Mapping[str, "ManifestArray | ManifestGroup"]`. If we generalize this to `._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"]`, we can store inlined chunks in memory in the `ManifestStore` as numpy arrays.
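A minimal sketch of the generalized typing (using a stand-in `ManifestArray` class, since only the member type matters here):

```python
from typing import Mapping, Union

import numpy as np


class ManifestArray: ...  # stand-in for virtualizarr.manifests.ManifestArray


class ManifestGroup:
    # Before: Mapping[str, "ManifestArray | ManifestGroup"]
    # After:  plain numpy arrays are also allowed as members, representing
    #         chunks whose data lives in memory ("inlined" chunks) rather
    #         than behind a virtual chunk reference.
    _members: Mapping[str, Union[np.ndarray, ManifestArray, "ManifestGroup"]]
```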
We would need to ensure that `xr.open_dataset(ManifestStore)` could still read these arrays, which might require something clever during `ManifestStore.get`.
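As a hypothetical sketch of what that dispatch could look like (the toy class, its flat key scheme, and the single-chunk simplification are all assumptions for illustration, not the real `ManifestStore.get`):

```python
import numpy as np


class ToyManifestStore:
    """Toy stand-in showing how `get` might serve in-memory members."""

    def __init__(self, members: dict):
        self._members = members

    def get(self, key: str) -> bytes:
        # e.g. key == "time/c/0": array name, then a zarr-style chunk key
        name, _, chunk_key = key.partition("/")
        member = self._members[name]
        if isinstance(member, np.ndarray):
            # Inlined member: treat the whole array as a single
            # uncompressed chunk and return its raw buffer.
            return member.tobytes()
        # ManifestArray members would be fetched via the chunk manifest,
        # exactly as ManifestStore does today.
        raise NotImplementedError


store = ToyManifestStore({"time": np.arange(4, dtype="int64")})
assert store.get("time/c/0") == np.arange(4, dtype="int64").tobytes()
```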
`vz.open_virtual_dataset` (really just a backwards-compatibility wrapper around `ManifestStore.to_virtual_dataset()`) also needs to know that any array stored as a numpy array is to become a loadable variable.
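For instance (a hedged sketch assuming xarray is available; `members_to_variables` and the hard-coded dims are hypothetical, not the real conversion code):

```python
import numpy as np
import xarray as xr


def members_to_variables(members: dict[str, object]) -> dict[str, xr.Variable]:
    # Any member stored as an in-memory numpy array is, by definition, a
    # "loadable variable": a plain numpy-backed xr.Variable that xarray
    # can build indexes from. ManifestArray members would instead be
    # wrapped as "virtual variables" (elided here).
    variables = {}
    for name, member in members.items():
        if isinstance(member, np.ndarray):
            variables[name] = xr.Variable(dims=(name,), data=member)
    return variables


vars_ = members_to_variables({"time": np.arange(4)})
print(vars_["time"])  # a numpy-backed, index-capable variable
```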
This also means parsers would effectively now have the power to decide to create "loadable variables". That actually makes some sense: `open_virtual_dataset` has a problem in that its `loadable_variables` kwarg mixes concerns. Sometimes you want a variable loaded so it can be used for xarray indexing; sometimes you want it loaded as a storage optimization for when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob controlling two things.
Note that storing a numpy array directly means there is no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.
## Alternative considered (but I vote against this): Use `obstore.MemoryStore`
In #636 and #794 it was suggested that we store inlined chunks in the `ManifestStore` by adding an `obstore.MemoryStore` to the `ObjectStoreRegistry` and referencing the bytes via virtual chunk references of the form `memory://{location}`. This is neat in that you can read the bytes back directly using obstore and zarr-python.
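For concreteness, the rejected approach would look roughly like this (a sketch under the assumption of obstore's `MemoryStore`/`put` API; the key and the manifest-entry shape are illustrative):

```python
import obstore
from obstore.store import MemoryStore

# Write the inlined chunk bytes into an in-memory obstore store...
memory_store = MemoryStore()
obstore.put(memory_store, "some-array/c/0/0", b"\x00" * 64)

# ...then point a virtual chunk reference at it, so readers resolve the
# memory:// URL through the ObjectStoreRegistry like any other store.
chunk_ref = {"path": "memory://some-array/c/0/0", "offset": 0, "length": 64}
```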
The reason I don't think this is a good idea is that it breaks the store abstraction. Our current design has the very neat property that the `ObjectStoreRegistry` can hold any `obstore.Store`, and we treat all of them identically, no matter their contents. We would instead be special-casing `obstore.MemoryStore` relative to other obstore store types. We also want to be able to treat all icechunk stores identically (note Icechunk has `in_memory_storage` and supports virtual references beginning with `memory://` too). Special-casing `memory://`, or some namespace within `memory://`, breaks this abstraction.
Some other problems special-casing `MemoryStore` would cause:

- When calling `ManifestStore.to_icechunk` on an icechunk store that supports virtual references pointing at in-memory locations, how would we confidently distinguish a `memory://` reference that represents an inlined chunk from one that represents a virtual chunk?
- How would we avoid name collisions inside the `memory://` namespace? Support in-lined Kerchunk references #794 does this by having the kerchunk parser write inlined data into `"memory://kerchunk/"`, but that seems like a leaky abstraction. It also means the `ManifestStore` contents depend on the parser that created them, which is equally leaky.
cc @maxrjones @sharkinsspatial @ilan-gold @paraseba