Support glob patterns in open_datatree(group_filter=...) for selective group loading#11302
aladinor wants to merge 32 commits into
Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter to common.py for detecting and applying glob patterns to group paths.
Use _resolve_group_and_filter in open_groups_as_dict to support glob patterns in the group parameter for selective group loading.
Use _resolve_group_and_filter in open_groups_as_dict to support glob patterns in the group parameter for selective group loading.
Use _resolve_group_and_filter in open_groups_as_dict to support glob patterns in the group parameter for selective group loading.
Update docstrings for the group kwarg in open_datatree and open_groups to describe glob metacharacter behavior.
Add integration tests for netCDF4, h5netcdf, and zarr backends, plus unit tests for _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter covering *, ?, and [] metacharacters.
e892524 to 5fb46e1
|
@aladinor Thanks, that's a great feature. I'd instantly use it. There might be some pitfalls if group names contain one or more of the glob metacharacters. Will this be handled, too? |
|
XRef: h5py/h5py#2059 for discussion of adding globbing in h5py |
|
@kmuehlbauer, thanks for taking the time to check this out.
This seems to be a strange way to name a group, but yes. It will work via a character-class escape. For example, if we have something like this:
paths = ['/my_nifty_group_with_a_star_*_01',
         '/my_nifty_group_with_a_star_*_11',
         '/my_nifty_group_with_a_star_*_12']
We can use this pattern to get those groups |
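A minimal sketch of the character-class escape, using the standard-library `fnmatch` module as a stand-in for the PR's matcher. The two patterns and the extra `/plain_01` path are illustrative assumptions, not taken from the comment above, and the PR's own helper may behave differently at path-segment boundaries.

```python
from fnmatch import fnmatchcase

paths = [
    "/my_nifty_group_with_a_star_*_01",
    "/my_nifty_group_with_a_star_*_11",
    "/my_nifty_group_with_a_star_*_12",
    "/plain_01",  # assumed extra group with no metacharacter in its name
]

# "[*]" is a one-character class holding a literal star;
# the trailing "*" is still an ordinary wildcard.
all_starred = "/my_nifty_group_with_a_star_[*]_*"
matched = [p for p in paths if fnmatchcase(p, all_starred)]

# "?" matches exactly one character, so this keeps only the "_1x" groups.
only_tens = "/my_nifty_group_with_a_star_[*]_1?"
narrowed = [p for p in paths if fnmatchcase(p, only_tens)]

print(matched)
print(narrowed)
```

The same bracket trick applies to the other metacharacters: `[?]` for a literal question mark and `[[]` for a literal opening bracket.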
Add coverage for group names containing literal ``*`` / ``?`` / ``[``. These are reachable with ``[*]`` / ``[?]`` / ``[[]`` character-class escaping (inherited from ``fnmatch`` / ``PurePath.match`` semantics).
New tests:
- ``test_open_datatree_glob_char_class_escape_literal_metachar`` on ``NetCDFIOBase`` and ``TestZarrDatatreeIO``: end-to-end verification that groups with literal metacharacters in their names can be targeted across all supported backends.
- ``test_filter_group_paths_literal_metachar_via_char_class`` on ``TestGlobPatternUtilities``: unit-level check of the filter.
Explain that matching follows ``fnmatch`` / :py:meth:`pathlib.PurePath.match` semantics and that literal ``*`` / ``?`` / ``[`` in group names can be targeted via character-class escapes (``[*]``, ``[?]``, ``[[]``), with a short example. Applied to both :py:func:`open_datatree` and :py:func:`open_groups` for consistency.
Add ``/plain_01`` to the zarr ``test_open_datatree_glob_char_class_escape_literal_metachar`` fixture so it matches the NetCDF version and confirms plain (no-metachar) group names are excluded when the pattern targets literal-metachar names.
Windows forbids ``*`` and ``?`` in filesystem directory/file names, and zarr stores each group as an on-disk directory. That makes writing the fixture impossible before the test can exercise the filter. NetCDF4/H5 store groups inside the HDF5 container so they are unaffected. Skip the zarr variant on Windows with a clear reason; the NetCDF variants still cover the escape behavior on all platforms.
The previous commit skipped the zarr variant on Windows because the filesystem rejects ``*`` and ``?`` in directory names. Using ``zarr.storage.MemoryStore`` side-steps the filesystem entirely, so the test now runs on every platform and still exercises the escape logic. This is also a more realistic target for the feature on Windows — users who hit group names with glob metacharacters are likely reading from cloud/icechunk stores (dict-keyed like ``MemoryStore``), not an on-disk zarr directory tree.
``open_datatree``'s static signature doesn't list zarr store objects (``MemoryStore`` etc.) among its accepted first-argument types, but the zarr backend handles them correctly at runtime. Apply a narrow ``# type: ignore[arg-type]`` on the three test calls rather than widening the public signature.
|
@aladinor Thanks for adding the glob escapes. Is this ready from your side? |
|
Yep, it is ready to merge @kmuehlbauer |
kmuehlbauer left a comment
This is looking good to me. Can't say much wrt typing, though.
|
@pydata/xarray Another set of eyes much appreciated here. If there are no concerns, I'd move on and merge early next week. Thanks! |
|
I like the idea of this feature, but worry about ambiguity with the existing A safer strategy would be to make a new argument, something like |
|
Thanks Stephan, yes, we can't resolve the ambiguity. Unfortunately, this is also a breaking change. But, from all experience those glob characters are not that frequent. Nevertheless they are valid for HDF5 and zarr AFAIK. We could use a new argument as @shoyer suggested, like this:
xr.open_datatree(
    "file.zarr",
    group="*/historical/tas",
    group_filter="glob"  # defaults to "exact" or the like
)
We also might think about a new group selector, like this:
xr.open_datatree(
    "file.zarr",
    group=GroupSelector("*/historical/tas", mode="glob"),
)
So how should this be moved forward? |
|
I was thinking something like group_filter="*/historical/tas" instead of
group=, which is not too verbose.
On Tue, May 12, 2026 at 1:06 AM Kai Mühlbauer wrote:
Thanks Stephan, yes, we can't resolve the ambiguity. Unfortunately, this
is also a breaking change. But, from all experience those glob characters
are not that frequent. Nevertheless they are valid for HDF5 and zarr AFAIK.
We could use a new argument as @shoyer <https://github.com/shoyer>
suggested, like this:
xr.open_datatree(
"file.zarr",
group="*/historical/tas",
group_filter="glob" # defaults to "exact" or the like
)
We also might think about a new group selector, like this:
xr.open_datatree(
"file.zarr",
group=GroupSelector("*/historical/tas", mode="glob"),
)
GroupSelector would be for power users. And there will likely follow more
use cases, exclude, max_depth immediately come to mind.
So how should this be moved forward?
|
|
Thanks for clarifying my misunderstanding. So |
Remove the dispatcher ``_resolve_group_and_filter`` and the ``_is_glob_pattern`` glob detector (the latter only existed to feed the auto-detect logic in backends). Keep two narrow helpers:
- ``_check_group_filter_mutex(group, group_filter)``: raises ``ValueError`` if both are set. Single-purpose validator.
- ``_filter_group_paths(paths, pattern)``: returns the subset of paths matching ``pattern`` plus every ancestor needed to keep the resulting tree connected. The root path ``"/"`` is always included.
Backends now call these directly: a mutex check at the top of each ``open_groups_as_dict`` (covers both public-API and backend-direct callers), then a conditional ``_filter_group_paths`` call after the group walk.
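The two helpers described above can be sketched roughly as follows. This is not the PR's code: the function names drop the leading underscore to mark them as illustrative, the bodies are assumptions built only from the commit message, and `pathlib.PurePosixPath` stands in for xarray's `NodePath`.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def check_group_filter_mutex(group: str | None, group_filter: str | None) -> None:
    """Raise ValueError when both arguments are set (illustrative only)."""
    if group is not None and group_filter is not None:
        raise ValueError("group and group_filter are mutually exclusive")


def filter_group_paths(paths: list[str], pattern: str) -> list[str]:
    """Keep matching paths, every ancestor of a match, and the root."""
    keep = {"/"}  # the root is always part of the result
    for path in paths:
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(str(node))
            # Ancestors keep the resulting tree connected.
            keep.update(str(parent) for parent in node.parents)
    return sorted(keep)


paths = ["/", "/A", "/A/leaf_0", "/A/leaf_1", "/B", "/B/leaf_0", "/B/leaf_1"]
selected = filter_group_paths(paths, "*/leaf_0")
print(selected)
```

Returning a sorted list is a choice made for this sketch so that parents come before children; the real helper may preserve the input order instead.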
Replace the glob-detection-in-``group=`` docstring with explicit ``group_filter`` documentation. The two parameters are mutually exclusive: ``group`` opens an exact group path; ``group_filter`` is a glob pattern matched against every group path in the file. The mutex check lives in the backend (so backend-direct callers are also protected); no api.py-level enforcement needed.
Replace the previous auto-detect (``_is_glob_pattern(group)``) with an explicit ``group_filter`` parameter. ``group`` keeps its current meaning of an exact path to re-root the tree at; ``group_filter`` is a separate glob pattern. Mutex check sits at the top of the method so backend-direct callers get the same guarantee as public-API callers. When ``group_filter`` is set, the store is opened at root and the discovered paths are filtered before group iteration.
Mirror the h5netcdf change: replace the auto-detect with an explicit ``group_filter`` parameter, do the mutex check at the top, and apply the filter to the discovered paths after walking from the chosen parent group.
Same shape as the netCDF backends: explicit ``group_filter`` parameter, mutex check at the top of ``open_groups_as_dict``, and the discovered paths from ``ZarrStore.open_store`` are filtered when a pattern is set. ``open_datatree`` forwards ``group_filter`` to the inner ``open_groups_as_dict`` call.
Tracks the refactor in PR pydata#11302 from auto-detected glob in `group=` to an explicit `group_filter=` kwarg (mutually exclusive with `group=`).
- Rename and migrate the 8 backend glob tests (`group="*/sweep_0"` → `group_filter="*/sweep_0"`) in `NetCDFIOBase` and `TestZarrDatatreeIO`.
- Rename and migrate the character-class escape tests to `group_filter=`.
- Add `test_open_datatree_group_with_literal_metachar`: write groups literally named `weird_*_name` alongside siblings a glob would also match (`weird_X_name`, `weird_Y_name`), then open with exact `group=` and assert the data payload to prove literal-path semantics.
- Add `test_open_datatree_group_and_group_filter_mutually_exclusive` covering both `open_datatree` and `open_groups` entry points.
- Rename `TestGlobPatternUtilities` → `TestGroupFilterHelpers`; drop `test_is_glob_pattern` and the three `_resolve_group_and_filter` tests (helpers removed in Phase 1).
- Add parametrized `_check_group_filter_mutex` tests, including empty-string cases that pin the `is not None` semantics and guard against a refactor to plain truthiness.
Update the PR pydata#11302 entry under "New Features" to reflect the post-review API: an explicit `group_filter` kwarg (mutually exclusive with `group`) replaces auto-detected globs in `group=`. Add a one-line note about character-class escapes (`[*]`, `[?]`) for group names that literally contain `*` or `?`.
Add group_filter: str | None = None to H5netcdfBackendEntrypoint.open_datatree so the signature mirrors the zarr backend and is visible to IDE / inspect.signature. The kwarg already flowed through **kwargs to open_groups_as_dict, so this is purely a discoverability fix. Drop the group=None if group_filter else group ternary at the H5NetCDFStore.open call. The mutex check earlier in the function already guarantees the two cannot both be set, making the conditional unreachable and asymmetric with the zarr backend.
Mirror the h5netcdf change: surface group_filter on NetCDF4BackendEntrypoint.open_datatree (was reaching the backend via **kwargs only), and drop the unreachable group=None if group_filter else group ternary at the NetCDF4DataStore.open call — the mutex check upstream rules out the both-set case.
Move the group_filter filter step from open_groups_as_dict into ZarrStore.open_store so paths are pruned *before* the per-group zarr_group[rel_path] lookup at the materialization loop. Each lookup triggers metadata I/O against the Zarr store; for large hierarchies where only a handful of groups match the filter, opening the store-per-group up-front was making group_filter cost as much as opening the whole tree. With the push-down, only matched paths (plus their ancestors, via _filter_group_paths) trigger the materialization lookup. The caller-side filter in open_groups_as_dict becomes redundant and is removed.
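To make the cost argument concrete, here is a toy comparison of filtering after materialization versus pushing the filter down. `CountingStore`, `open_then_filter`, `filter_then_open`, and the `model_N` hierarchy are all hypothetical; the counter merely models "one metadata read per group lookup" and says nothing about real Zarr I/O.

```python
from pathlib import PurePosixPath


class CountingStore:
    """Hypothetical store whose per-group lookup stands for metadata I/O."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.lookups = 0

    def lookup(self, path):
        self.lookups += 1  # each call models one metadata read
        return {"path": path}


def keep_matches_and_ancestors(paths, pattern):
    keep = {"/"}
    for path in paths:
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(path)
            keep.update(str(parent) for parent in node.parents)
    return [p for p in paths if p in keep]


def open_then_filter(store, pattern):
    # Caller-side filter: every group is materialized, then most are dropped.
    groups = {p: store.lookup(p) for p in store.paths}
    wanted = keep_matches_and_ancestors(store.paths, pattern)
    return {p: groups[p] for p in wanted}


def filter_then_open(store, pattern):
    # Push-down: prune the path list first, then materialize the survivors.
    wanted = keep_matches_and_ancestors(store.paths, pattern)
    return {p: store.lookup(p) for p in wanted}


# 50 models with two leaves each, plus the root: 151 group paths in total.
all_paths = ["/"]
for m in range(50):
    all_paths += [f"/model_{m}", f"/model_{m}/leaf_0", f"/model_{m}/leaf_1"]

eager, lazy = CountingStore(all_paths), CountingStore(all_paths)
eager_result = open_then_filter(eager, "model_7/leaf_0")
lazy_result = filter_then_open(lazy, "model_7/leaf_0")
print(eager.lookups, lazy.lookups)
```

Both strategies return the same groups; only the number of lookups differs, which is the point of moving the filter ahead of the materialization loop.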
The previous wording claimed ``group_filter="*/sweep_0"`` loaded matches "one level deep". That is wrong: ``pathlib.PurePath.match`` is anchored on the right, so ``*/leaf_0`` matches at any depth where the trailing two segments line up. Rewrite the bullet to describe the real semantics and note that the pattern must be non-empty (now enforced in ``_check_group_filter_mutex``).
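A short illustration of the right-anchored behavior, using `pathlib.PurePosixPath` directly rather than xarray's `NodePath`. The paths are made up for the example and deliberately sit at depth two or deeper, since root-level matches can differ between Python versions.

```python
from pathlib import PurePosixPath

pattern = "*/leaf_0"
paths = ["/A/leaf_0", "/x/y/z/leaf_0", "/A/leaf_1", "/A/leaf_0/child"]

# A relative pattern is compared against the trailing segments of the path,
# so the depth of the match does not matter.
results = {p: PurePosixPath(p).match(pattern) for p in paths}
for path, ok in results.items():
    print(path, ok)

# A pattern with a leading slash is anchored on both ends.
anchored_same = PurePosixPath("/A/leaf_0").match("/A/leaf_0")
anchored_deeper = PurePosixPath("/x/A/leaf_0").match("/A/leaf_0")
print(anchored_same, anchored_deeper)
```

Here `/x/y/z/leaf_0` matches `*/leaf_0` because only its last two segments are compared, while `/A/leaf_0/child` does not because its final segment is `child`.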
Three audit-driven cleanups to the helpers in
``xarray/backends/common.py``:
- ``_check_group_filter_mutex`` now also raises ``ValueError`` when
``group_filter`` is the empty string. Previously
``group_filter=""`` passed the mutex check but then crashed
inside ``NodePath.match("")`` with an opaque ``pathlib`` error;
reject up-front with a clear message.
- ``_filter_group_paths`` parameter retyped from ``Iterable[str]``
to ``Sequence[str]``. The body iterates the parameter twice, so
passing a generator would silently return ``[]``; ``Sequence``
pins the actual contract.
- Drop the dead ``if str(p)`` guard inside the parents loop. For
absolute paths, ``NodePath(...).parents`` never yields empty
segments.
- Extend the docstring to spell out the right-anchored
``pathlib.PurePath.match`` semantics with a worked example.
Note: this commit transiently breaks
``test_check_group_filter_mutex_passes[None-]`` until the test file
is updated in the next commit (per-file commit cadence).
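The ``Iterable`` versus ``Sequence`` point can be reproduced with a toy two-pass function. `two_pass_filter` is a hypothetical stand-in, not the PR's helper; it only shares the property of iterating its argument twice.

```python
from collections.abc import Iterable


def two_pass_filter(paths: Iterable[str], suffix: str) -> list[str]:
    """Hypothetical helper that scans its input twice."""
    matched = {p for p in paths if p.endswith(suffix)}  # first pass
    return [p for p in paths if p in matched]  # second pass


names = ["/A/leaf_0", "/A/leaf_1", "/B/leaf_0"]

from_list = two_pass_filter(names, "leaf_0")
# A generator is exhausted by the first pass, so the second pass sees nothing.
from_generator = two_pass_filter((p for p in names), "leaf_0")

print(from_list)
print(from_generator)
```

Annotating the parameter as ``Sequence[str]`` lets a type checker flag the generator call, which is the contract the commit pins down.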
Audit-driven test improvements for the group_filter refactor:
Tightening
- Replace ``in`` / ``not in`` membership checks with ``==`` set equality so over-inclusion regressions are caught.
New coverage
- ``test_open_datatree_group_filter_match_is_right_anchored``: fixture with nested ``/x/y/z/leaf_0``-style depths to pin the documented right-anchored ``NodePath.match`` semantics.
- ``test_filter_group_paths_match_is_right_anchored``, ``test_filter_group_paths_leading_slash_pattern``, and ``test_filter_group_paths_recursive_glob``: helper-level pins for fully anchored patterns (``/A/leaf_0``) and ``**`` match-everything.
- ``test_check_group_filter_mutex_rejects_empty_pattern``: pins the new ``group_filter=""`` rejection (paired with the helper change in the previous commit).
Zarr parity
- Add ``_preserves_data``, ``_with_literal_metachar``, and ``_and_group_filter_mutually_exclusive`` to ``TestZarrDatatreeIO``. NetCDFIOBase already had these via inheritance.
Rename
- ``sweep_*`` → ``leaf_*`` throughout. The original naming was a carryover from the radar use case that motivated the PR; ``leaf_*`` is domain-agnostic and self-describing for the test hierarchy.
Mutex parametrize
- Drop ``(None, "")`` from the passes parametrize (now raises).
Domain-agnostic example matching the test naming.
|
Thanks @kmuehlbauer and @shoyer for your feedback. I've already changed to |
``NodePath('/leaf_0').match('*/leaf_0')`` returns ``True`` on
Python 3.11 (pathlib treats the leading ``/`` as a segment the
``*`` can consume) but ``False`` on Python 3.13. The test was
probing exactly this corner case and broke CI on py311.
Drop the root-level ``/leaf_0`` from the fixture so the test only
asserts the version-stable behavior: ``*/leaf_0`` matches at depth
2+, where the ``*`` consistently consumes a non-root parent
segment.
Summary
When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.
Use cases
- xr.open_datatree("radar.nc", group="*/sweep_0"): load only the lowest elevation sweep from each volume scan
- xr.open_datatree("cmip.zarr", group="*/historical/tas"): load only temperature across all models
Changes
- _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
- DataTree.match() (PurePosixPath.match)
- Root (/) and all ancestors of matched nodes are always included to form a valid tree
Behavior summary
group value
- None
- "VCP-34" (no glob chars)
- "*/sweep_0" (glob chars)
open_datatree(group=...) for selective group loading #11196
whats-new.rst
api.rst
Test plan
- _is_glob_pattern, _filter_group_paths, _resolve_group_and_filter with *, ?, []
- open_groups API
- test_backends_datatree.py suite passes (228 passed, 0 failures)