Support glob patterns in open_datatree(group_filter=...) for selective group loading#11302

Open
aladinor wants to merge 32 commits into
pydata:mainfrom
aladinor:glob-group-filtering-standalone

Conversation

@aladinor
Contributor

Summary

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases

  • Radar data: xr.open_datatree("radar.nc", group="*/sweep_0") — load only the lowest elevation sweep from each volume scan
  • CMIP archives: xr.open_datatree("cmip.zarr", group="*/historical/tas") — load only temperature across all models

Changes

  • Added shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
  • Updated NetCDF4, H5NetCDF, and Zarr backends to use a discover → filter → open pipeline
  • Uses the same matching engine as DataTree.match() (PurePosixPath.match)
  • Root (/) and all ancestors of matched nodes are always included to form a valid tree

Behavior summary

| group value | Behavior |
| --- | --- |
| None | Load all groups (unchanged) |
| "VCP-34" (no glob chars) | Root selection (unchanged) |
| "*/sweep_0" (glob chars) | Filter mode: only matched groups + ancestors |
| Pattern matches nothing | Root-only tree |
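
The dispatch in the table above can be sketched roughly as follows. This is an illustration only, not the PR's code: `is_glob_pattern` and `describe_mode` are hypothetical stand-ins for the private helpers listed under Changes, and the metacharacter set is the one named in the Summary.

```python
from __future__ import annotations

# Metacharacters listed in the Summary: *, ?, [
GLOB_CHARS = frozenset("*?[")


def is_glob_pattern(group: str | None) -> bool:
    """Return True if ``group`` contains at least one glob metacharacter."""
    return group is not None and any(ch in GLOB_CHARS for ch in group)


def describe_mode(group: str | None) -> str:
    """Map a ``group`` value to the behavior row it falls under (hypothetical)."""
    if group is None:
        return "load all groups"
    if is_glob_pattern(group):
        return "filter mode"
    return "root selection"


for value in (None, "VCP-34", "*/sweep_0"):
    print(repr(value), "->", describe_mode(value))
```

A name that literally contains `*`, `?`, or `[` would land in filter mode under this rule, which is the ambiguity raised later in the thread.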

Test plan

  • 27 new tests covering all backends (netCDF4, h5netcdf, zarr v2/v3)
  • Unit tests for _is_glob_pattern, _filter_group_paths, _resolve_group_and_filter with *, ?, []
  • Integration tests: glob match, no-match, data preservation, open_groups API
  • Full test_backends_datatree.py suite passes (228 passed, 0 failures)
  • Pre-commit checks pass

@github-actions github-actions Bot added topic-backends topic-zarr Related to zarr storage library io labels Apr 16, 2026
Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter
to common.py for detecting and applying glob patterns to group paths.
Use _resolve_group_and_filter in open_groups_as_dict to support glob
patterns in the group parameter for selective group loading.
Use _resolve_group_and_filter in open_groups_as_dict to support glob
patterns in the group parameter for selective group loading.
Use _resolve_group_and_filter in open_groups_as_dict to support glob
patterns in the group parameter for selective group loading.
Update docstrings for the group kwarg in open_datatree and open_groups
to describe glob metacharacter behavior.
Add integration tests for netCDF4, h5netcdf, and zarr backends, plus
unit tests for _is_glob_pattern, _filter_group_paths, and
_resolve_group_and_filter covering *, ?, and [] metacharacters.
@aladinor aladinor force-pushed the glob-group-filtering-standalone branch from e892524 to 5fb46e1 Compare April 16, 2026 17:09
@kmuehlbauer
Contributor

@aladinor Thanks, that's a great feature. I'd instantly use it.

There might be some pitfalls if group names contain one or more of the glob metacharacters. Will this be handled, too?

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

@kmuehlbauer
Contributor

XRef: h5py/h5py#2059 for discussion of adding globbing in h5py

@aladinor
Contributor Author

aladinor commented Apr 22, 2026

@kmuehlbauer, thanks for taking the time to check this out.

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

This seems to be a strange way to name a group, but yes. It will work via the same character-class escape that fnmatch / PurePath.match supports.

For example, if we have something like this

  paths = ['/my_nifty_group_with_a_star_*_01',
           '/my_nifty_group_with_a_star_*_11',
           '/my_nifty_group_with_a_star_*_12']

The pattern "*star_[*]_*" matches all three groups: the character class [*] matches a literal *.
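
A small standalone check of that escape, using only `pathlib` from the standard library. The two extra paths (`..._X_01` and `/plain_01`) are hypothetical siblings added here to show what the pattern excludes; this does not call into xarray.

```python
from pathlib import PurePosixPath

paths = [
    "/my_nifty_group_with_a_star_*_01",
    "/my_nifty_group_with_a_star_*_11",
    "/my_nifty_group_with_a_star_*_12",
    "/my_nifty_group_with_a_star_X_01",  # hypothetical: no literal "*" in the name
    "/plain_01",                         # hypothetical: unrelated group
]

# "[*]" is a one-character class, so it only matches a literal "*".
pattern = "*star_[*]_*"

matched = [p for p in paths if PurePosixPath(p).match(pattern)]
for p in matched:
    print(p)
```

Only the three names with a literal `*` are selected; the `X` sibling and `/plain_01` are not.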

aladinor and others added 8 commits April 22, 2026 08:40
Add coverage for group names containing literal ``*`` / ``?`` / ``[``.
These are reachable with ``[*]`` / ``[?]`` / ``[[]`` character-class
escaping (inherited from ``fnmatch`` / ``PurePath.match`` semantics).

New tests:
- ``test_open_datatree_glob_char_class_escape_literal_metachar`` on
  ``NetCDFIOBase`` and ``TestZarrDatatreeIO`` — end-to-end verification
  that groups with literal metacharacters in their names can be
  targeted across all supported backends.
- ``test_filter_group_paths_literal_metachar_via_char_class`` on
  ``TestGlobPatternUtilities`` — unit-level check of the filter.
Explain that matching follows ``fnmatch`` / :py:meth:`pathlib.PurePath.match`
semantics and that literal ``*`` / ``?`` / ``[`` in group names can be
targeted via character-class escapes (``[*]``, ``[?]``, ``[[]``), with a
short example. Applied to both :py:func:`open_datatree` and
:py:func:`open_groups` for consistency.
Add ``/plain_01`` to the zarr ``test_open_datatree_glob_char_class_escape_literal_metachar``
fixture so it matches the NetCDF version and confirms plain (no-metachar)
group names are excluded when the pattern targets literal-metachar names.
Windows forbids ``*`` and ``?`` in filesystem directory/file names, and
zarr stores each group as an on-disk directory. That makes writing the
fixture impossible before the test can exercise the filter. NetCDF4/H5
store groups inside the HDF5 container so they are unaffected.

Skip the zarr variant on Windows with a clear reason; the NetCDF
variants still cover the escape behavior on all platforms.
The previous commit skipped the zarr variant on Windows because the
filesystem rejects ``*`` and ``?`` in directory names. Using
``zarr.storage.MemoryStore`` side-steps the filesystem entirely, so the
test now runs on every platform and still exercises the escape logic.

This is also a more realistic target for the feature on Windows — users
who hit group names with glob metacharacters are likely reading from
cloud/icechunk stores (dict-keyed like ``MemoryStore``), not an on-disk
zarr directory tree.
``open_datatree``'s static signature doesn't list zarr store objects
(``MemoryStore`` etc.) among its accepted first-argument types, but the
zarr backend handles them correctly at runtime. Apply a narrow
``# type: ignore[arg-type]`` on the three test calls rather than
widening the public signature.
@kmuehlbauer
Contributor

@aladinor Thanks for adding the glob escapes. Is this ready from your side?

@aladinor
Contributor Author

aladinor commented May 8, 2026

Yep, it is ready to merge @kmuehlbauer

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label May 8, 2026
Contributor

@kmuehlbauer kmuehlbauer left a comment

This is looking good to me. Can't say much wrt typing, though.

@kmuehlbauer
Contributor

@pydata/xarray Another set of eyes much appreciated here. If there are no concerns, I'd move on and merge early next week. Thanks!

@shoyer
Member

shoyer commented May 11, 2026

I like the idea of this feature, but worry about ambiguity with the existing group argument -- are we sure that names with these characters are invalid in netCDF/Zarr?

A safer strategy would be to make a new argument, something like group_filter.

@kmuehlbauer kmuehlbauer removed the plan to merge Final call for comments label May 12, 2026
@kmuehlbauer
Contributor

Thanks Stephan, yes, we can't resolve the ambiguity. Unfortunately, this is also a breaking change. In my experience, glob characters in group names are rare. Nevertheless, they are valid in HDF5 and Zarr, AFAIK.

We could use a new argument as @shoyer suggested, like this:

xr.open_datatree(
    "file.zarr",
    group="*/historical/tas",
    group_filter="glob" # defaults to "exact" or the like
)

We also might think about a new group selector, like this:

xr.open_datatree(
    "file.zarr",
    group=GroupSelector("*/historical/tas", mode="glob"),
)

GroupSelector would be for power users. More use cases will likely follow; exclude and max_depth immediately come to mind.

So how should this be moved forward?

@shoyer
Member

shoyer commented May 12, 2026 via email

@kmuehlbauer
Contributor

Thanks for clarifying my misunderstanding. So group and group_filter would be mutually exclusive.

aladinor added 15 commits May 28, 2026 10:30
Remove the dispatcher ``_resolve_group_and_filter`` and the
``_is_glob_pattern`` glob detector (the latter only existed to feed
the auto-detect logic in backends).

Keep two narrow helpers:

- ``_check_group_filter_mutex(group, group_filter)`` — raises
  ``ValueError`` if both are set. Single-purpose validator.
- ``_filter_group_paths(paths, pattern)`` — returns the subset of
  paths matching ``pattern`` plus every ancestor needed to keep the
  resulting tree connected. The root path ``"/"`` is always included.

Backends now call these directly: a mutex check at the top of each
``open_groups_as_dict`` (covers both public-API and backend-direct
callers), then a conditional ``_filter_group_paths`` call after the
group walk.
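
A minimal sketch of the path-filtering helper described above, written against `pathlib.PurePosixPath` rather than xarray's `NodePath`. The function name and body are assumptions made for illustration; the real `_filter_group_paths` may differ in ordering and edge cases.

```python
from collections.abc import Sequence
from pathlib import PurePosixPath


def filter_group_paths(paths: Sequence[str], pattern: str) -> list[str]:
    """Keep paths matching ``pattern`` plus every ancestor, always including "/"."""
    keep = {"/"}
    for path in paths:
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(path)
            # Ancestors keep the resulting tree connected up to the root.
            keep.update(str(parent) for parent in node.parents)
    return sorted(keep)


paths = ["/", "/A", "/A/leaf_0", "/A/leaf_1", "/B", "/B/leaf_0"]
print(filter_group_paths(paths, "*/leaf_0"))
print(filter_group_paths(paths, "no_such_group"))
```

With a pattern that matches nothing, only the root path is returned, which corresponds to the root-only tree in the behavior summary.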
Replace the glob-detection-in-``group=`` docstring with explicit
``group_filter`` documentation. The two parameters are mutually
exclusive: ``group`` opens an exact group path; ``group_filter`` is
a glob pattern matched against every group path in the file.

The mutex check lives in the backend (so backend-direct callers are
also protected); no api.py-level enforcement needed.
Replace the previous auto-detect (``_is_glob_pattern(group)``) with
an explicit ``group_filter`` parameter. ``group`` keeps its current
meaning of an exact path to re-root the tree at; ``group_filter`` is
a separate glob pattern.

Mutex check sits at the top of the method so backend-direct callers
get the same guarantee as public-API callers. When ``group_filter``
is set, the store is opened at root and the discovered paths are
filtered before group iteration.
Mirror the h5netcdf change: replace the auto-detect with an explicit
``group_filter`` parameter, do the mutex check at the top, and apply
the filter to the discovered paths after walking from the chosen
parent group.
Same shape as the netCDF backends: explicit ``group_filter``
parameter, mutex check at the top of ``open_groups_as_dict``, and
the discovered paths from ``ZarrStore.open_store`` are filtered when
a pattern is set. ``open_datatree`` forwards ``group_filter`` to the
inner ``open_groups_as_dict`` call.
Tracks the refactor in PR pydata#11302 from auto-detected glob in `group=`
to an explicit `group_filter=` kwarg (mutually exclusive with `group=`).

- Rename and migrate the 8 backend glob tests (`group="*/sweep_0"` →
  `group_filter="*/sweep_0"`) in `NetCDFIOBase` and `TestZarrDatatreeIO`.
- Rename and migrate the character-class escape tests to `group_filter=`.
- Add `test_open_datatree_group_with_literal_metachar` — write groups
  literally named `weird_*_name` alongside siblings a glob would also
  match (`weird_X_name`, `weird_Y_name`), then open with exact `group=`
  and assert the data payload to prove literal-path semantics.
- Add `test_open_datatree_group_and_group_filter_mutually_exclusive`
  covering both `open_datatree` and `open_groups` entry points.
- Rename `TestGlobPatternUtilities` → `TestGroupFilterHelpers`; drop
  `test_is_glob_pattern` and the three `_resolve_group_and_filter`
  tests (helpers removed in Phase 1).
- Add parametrized `_check_group_filter_mutex` tests, including
  empty-string cases that pin the `is not None` semantics and guard
  against a refactor to plain truthiness.
Update the PR pydata#11302 entry under "New Features" to reflect the
post-review API: an explicit `group_filter` kwarg (mutually exclusive
with `group`) replaces auto-detected globs in `group=`. Add a one-line
note about character-class escapes (`[*]`, `[?]`) for group names that
literally contain `*` or `?`.
Add group_filter: str | None = None to
H5netcdfBackendEntrypoint.open_datatree so the signature mirrors the
zarr backend and is visible to IDE / inspect.signature. The kwarg
already flowed through **kwargs to open_groups_as_dict, so this
is purely a discoverability fix.

Drop the group=None if group_filter else group ternary at the
H5NetCDFStore.open call. The mutex check earlier in the function
already guarantees the two cannot both be set, making the conditional
unreachable and asymmetric with the zarr backend.
Mirror the h5netcdf change: surface group_filter on
NetCDF4BackendEntrypoint.open_datatree (was reaching the backend via
**kwargs only), and drop the unreachable
group=None if group_filter else group ternary at the
NetCDF4DataStore.open call — the mutex check upstream rules out
the both-set case.
Move the group_filter filter step from open_groups_as_dict
into ZarrStore.open_store so paths are pruned *before* the
per-group zarr_group[rel_path] lookup at the materialization loop.
Each lookup triggers metadata I/O against the Zarr store; for large
hierarchies where only a handful of groups match the filter, opening
the store-per-group up-front was making group_filter cost as much
as opening the whole tree.

With the push-down, only matched paths (plus their ancestors, via
_filter_group_paths) trigger the materialization lookup. The
caller-side filter in open_groups_as_dict becomes redundant and
is removed.
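
The effect of the push-down can be illustrated with a toy model. `CountingStore` below is a hypothetical stand-in that only counts per-group lookups; it is not the Zarr API, and the two functions are simplified versions of the "open everything, then filter" and "filter, then open" orderings described above.

```python
from pathlib import PurePosixPath


class CountingStore:
    """Toy hierarchical store that counts how many groups get opened."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.lookups = 0

    def open_group(self, path):
        self.lookups += 1  # stands in for one round of metadata I/O
        return {"path": path}


def paths_to_keep(paths, pattern):
    """Matched paths plus their ancestors; the root is always kept."""
    keep = {"/"}
    for path in paths:
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(path)
            keep.update(str(parent) for parent in node.parents)
    return keep


def open_then_filter(store, pattern):
    opened = {p: store.open_group(p) for p in store.paths}
    keep = paths_to_keep(store.paths, pattern)
    return {p: g for p, g in opened.items() if p in keep}


def filter_then_open(store, pattern):
    keep = paths_to_keep(store.paths, pattern)
    return {p: store.open_group(p) for p in store.paths if p in keep}


all_paths = (
    ["/"]
    + [f"/model_{i}" for i in range(50)]
    + [f"/model_{i}/tas" for i in range(50)]
)

late, early = CountingStore(all_paths), CountingStore(all_paths)
late_result = open_then_filter(late, "model_7/tas")
early_result = filter_then_open(early, "model_7/tas")
print(late.lookups, early.lookups)
```

Both orderings return the same three groups (`/`, `/model_7`, `/model_7/tas`), but the late filter opens every group in the hierarchy first.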
The previous wording claimed ``group_filter="*/sweep_0"`` loaded
matches "one level deep". That is wrong: ``pathlib.PurePath.match``
is anchored on the right, so ``*/leaf_0`` matches at any depth where
the trailing two segments line up. Rewrite the bullet to describe the
real semantics and note that the pattern must be non-empty (now enforced
in ``_check_group_filter_mutex``).
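
The right-anchored behavior can be checked directly with `pathlib`. The paths below are made up for illustration, and a root-level `/leaf_0` is deliberately left out of the candidates.

```python
from pathlib import PurePosixPath

candidates = ["/A/leaf_0", "/x/y/z/leaf_0", "/A/leaf_1", "/A/leaf_0/child"]

# Relative pattern: compared against the trailing segments of each path.
relative = {p: PurePosixPath(p).match("*/leaf_0") for p in candidates}

# Pattern with a leading slash: must account for the whole path.
anchored = {p: PurePosixPath(p).match("/A/leaf_0") for p in candidates}

for p in candidates:
    print(f"{p:18} relative={relative[p]!s:5} anchored={anchored[p]}")
```

The relative pattern matches both `/A/leaf_0` and the deeper `/x/y/z/leaf_0`, while the leading-slash pattern matches only `/A/leaf_0`.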
Four audit-driven cleanups to the helpers in
``xarray/backends/common.py``:

- ``_check_group_filter_mutex`` now also raises ``ValueError`` when
  ``group_filter`` is the empty string. Previously
  ``group_filter=""`` passed the mutex check but then crashed
  inside ``NodePath.match("")`` with an opaque ``pathlib`` error;
  reject up-front with a clear message.
- ``_filter_group_paths`` parameter retyped from ``Iterable[str]``
  to ``Sequence[str]``. The body iterates the parameter twice, so
  passing a generator would silently return ``[]``; ``Sequence``
  pins the actual contract.
- Drop the dead ``if str(p)`` guard inside the parents loop. For
  absolute paths, ``NodePath(...).parents`` never yields empty
  segments.
- Extend the docstring to spell out the right-anchored
  ``pathlib.PurePath.match`` semantics with a worked example.

Note: this commit transiently breaks
``test_check_group_filter_mutex_passes[None-]`` until the test file
is updated in the next commit (per-file commit cadence).
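
A rough sketch of the two contracts described in this commit message: the validator (mutual exclusion plus empty-pattern rejection) and the reason a generator is unsafe for a function that iterates its input twice. The names and error messages here are assumptions; only the behavior is taken from the text above.

```python
from __future__ import annotations


def check_group_filter_mutex(group: str | None, group_filter: str | None) -> None:
    """Raise ValueError for invalid ``group`` / ``group_filter`` combinations."""
    if group_filter is None:
        return
    # ``is not None`` rather than truthiness: group="" still counts as "set".
    if group is not None:
        raise ValueError("group and group_filter are mutually exclusive")
    if group_filter == "":
        raise ValueError("group_filter must be a non-empty glob pattern")


def count_two_passes(paths):
    """Iterate ``paths`` twice and report how many items each pass saw."""
    first = sum(1 for _ in paths)
    second = sum(1 for _ in paths)
    return first, second


items = ["/", "/A", "/A/leaf_0"]
print(count_two_passes(items))                  # a list can be re-iterated
print(count_two_passes(p for p in items))       # a generator is exhausted after one pass
```

The second call shows why typing the parameter as `Sequence[str]` is the safer contract: the generator yields nothing on the second pass.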
Audit-driven test improvements for the group_filter refactor:

Tightening
- Replace ``in`` / ``not in`` membership checks with ``==`` set
  equality so over-inclusion regressions are caught.

New coverage
- ``test_open_datatree_group_filter_match_is_right_anchored``:
  fixture with nested ``/x/y/z/leaf_0``-style depths to pin the
  documented right-anchored ``NodePath.match`` semantics.
- ``test_filter_group_paths_match_is_right_anchored``,
  ``test_filter_group_paths_leading_slash_pattern``, and
  ``test_filter_group_paths_recursive_glob``: helper-level pins
  for fully anchored patterns (``/A/leaf_0``) and ``**``
  match-everything.
- ``test_check_group_filter_mutex_rejects_empty_pattern``: pins the
  new ``group_filter=""`` rejection (paired with the helper change
  in the previous commit).

Zarr parity
- Add ``_preserves_data``, ``_with_literal_metachar``, and
  ``_and_group_filter_mutually_exclusive`` to
  ``TestZarrDatatreeIO``. NetCDFIOBase already had these via
  inheritance.

Rename
- ``sweep_*`` → ``leaf_*`` throughout. The original naming was a
  carryover from the radar use case that motivated the PR; ``leaf_*``
  is domain-agnostic and self-describing for the test hierarchy.

Mutex parametrize
- Drop ``(None, "")`` from the passes parametrize (now raises).
Domain-agnostic example matching the test naming.
@aladinor
Contributor Author

Thanks @kmuehlbauer and @shoyer for your feedback. I've already changed to group_filter. Let me know if you have additional comments

@aladinor aladinor changed the title Support glob patterns in open_datatree(group=...) for selective group loading Support glob patterns in open_datatree(group_filter=...) for selective group loading May 28, 2026
``NodePath('/leaf_0').match('*/leaf_0')`` returns ``True`` on
Python 3.11 (pathlib treats the leading ``/`` as a segment the
``*`` can consume) but ``False`` on Python 3.13. The test was
probing exactly this corner case and broke CI on py311.

Drop the root-level ``/leaf_0`` from the fixture so the test only
asserts the version-stable behavior: ``*/leaf_0`` matches at depth
2+, where the ``*`` consistently consumes a non-root parent
segment.

Development

Successfully merging this pull request may close these issues.

Support glob patterns in open_datatree(group=...) for selective group loading

3 participants