Expand `Dataset.from_files` so it works properly with derived variables by schlunma · Pull Request #2777 · ESMValGroup/ESMValCore

schlunma · 2025-07-16T16:08:56Z

Description

This PR expands Dataset.from_files so it works properly with derived variables. In addition, a new attribute Dataset.input_datasets is available which returns the datasets necessary for derivation (or simply the dataset itself if no derivation is required). This can also be used within the derive preprocessor function.

This PR is the second step to make Dataset.load work with derived variables.

Example

dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets")  # Found 36 datasets

dataset = datasets[0]
dataset.files  # []

for d in dataset.input_datasets:
    print(d["short_name"])
    print(d.files)

# rlut
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlut/gn/v20200623/rlut_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]
# rlutcs
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlutcs/gn/v20200623/rlutcs_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de']]
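
The rule described above (return the datasets needed for derivation, or the dataset itself if no derivation is required) can be illustrated with a minimal, self-contained sketch. It works on plain facet dictionaries rather than on the real Dataset class. The names `TOY_REQUIRED` and `input_facets` are hypothetical, and the condition used to decide whether derivation is necessary is an assumption, not the code in this PR.

```python
# Hypothetical lookup table (assumption): the real requirements are defined
# by ESMValCore's derivation scripts and may differ per project.
TOY_REQUIRED = {"lwcre": ["rlut", "rlutcs"]}


def input_facets(facets: dict, has_files: bool) -> list[dict]:
    """Return the facets of the datasets needed to produce ``facets``.

    ``has_files`` states whether files exist for the variable itself.
    """
    derive = facets.get("derive", False)
    force = facets.get("force_derivation", False)
    # No derivation requested, or the variable is available and derivation
    # is not forced: the dataset is its own (only) input.
    if not derive or (has_files and not force):
        return [dict(facets)]
    # Otherwise return one set of facets per required variable.
    base = {
        key: value
        for key, value in facets.items()
        if key not in ("derive", "force_derivation")
    }
    return [
        {**base, "short_name": name}
        for name in TOY_REQUIRED[facets["short_name"]]
    ]
```

For `lwcre` with `derive=True` and `force_derivation=True`, this sketch returns facets for `rlut` and `rlutcs`, mirroring the output of the example above.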

Related to #2769.

Link to documentation:

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository


bouweandela

Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.

I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?

If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?

bouweandela · 2025-12-19T14:01:56Z

esmvalcore/dataset.py

+        return input_datasets
+
+    @property
+    def input_datasets(self) -> list[Dataset]:


Can we rename this to derived_from or something similar?

Renamed to required_datasets in ff0cdd5. Would that be okay for you? I think derived_from might be misleading for non-derived variables.

bouweandela · 2025-12-19T14:18:24Z

esmvalcore/_recipe/to_datasets.py

-    return not copy.files
-
-
 def _get_input_datasets(dataset: Dataset) -> list[Dataset]:


Is this function still needed now that the dataset provides these as an attribute?

Yes. This function removes non-existent optional required datasets prior to loading them. This can/will be moved to the Dataset.load() (see #2769).
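
As a rough illustration of the filtering step described in this reply, the sketch below drops required datasets that are optional and have no files, while keeping mandatory ones so that a later check can still report them as missing. `ToyDataset` and `drop_missing_optional` are hypothetical stand-ins, and the `optional` facet name is an assumption. This is not the actual `_get_input_datasets` implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset with facets and files."""

    facets: dict
    files: list = field(default_factory=list)


def drop_missing_optional(required: list[ToyDataset]) -> list[ToyDataset]:
    """Keep every dataset unless it is optional and has no files."""
    return [
        dataset
        for dataset in required
        if dataset.files or not dataset.facets.get("optional", False)
    ]
```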

bouweandela · 2025-12-19T14:18:44Z

esmvalcore/_recipe/to_datasets.py

+    return input_datasets


 def _representative_datasets(dataset: Dataset) -> list[Dataset]:


This function seems no longer needed either

Again, this can be removed once Dataset.load() works properly for derived variables.

esmvalcore/dataset.py

bouweandela · 2025-12-19T15:11:42Z

esmvalcore/dataset.py

+            all_datasets: list[list[tuple[dict, Dataset]]] = []
+            for input_dataset in self._get_input_datasets():
+                all_datasets.append([])
+                for expanded_ds in self._get_available_datasets(
+                    input_dataset,
+                ):
+                    updated_facets = {}
+                    for key, value in self.facets.items():
+                        if _isglob(value):
+                            if key in expanded_ds.facets and not _isglob(
+                                expanded_ds[key],
+                            ):
+                                updated_facets[key] = expanded_ds.facets[key]
+                    new_ds = self.copy()
+                    new_ds.facets.update(updated_facets)
+                    new_ds.supplementaries = self.supplementaries
+
+                    all_datasets[-1].append((updated_facets, new_ds))
+
+            # Only consider those datasets that contain all input variables
+            # necessary for derivation
+            for updated_facets, new_ds in all_datasets[0]:
+                other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]]
+                if all(updated_facets in facets for facets in other_facets):
+                    yield new_ds
+                else:
+                    logger.debug(
+                        "Not all necessary input variables to derive '%s' are "
+                        "available for %s with facets %s",
+                        self["short_name"],
+                        new_ds.summary(shorten=True),
+                        updated_facets,
+                    )


This code is difficult to understand. I believe it is meant to yield a new dataset if the globs can be expanded in the same way for all input datasets that are required to derive the dataset. Did I get that right?

If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I also have some concerns about how reliable it is. What happens if some facets are different from one input dataset to another, e.g. institute or version?

I tried to simplify this in d8f5d08 and 0962489. I think it should be robust. If there's any mismatch in the facets at all (apart from the variable names and other obvious ones), those are not considered.
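
The matching idea discussed in this thread can be sketched as a set intersection: a wildcard expansion is kept only if every required variable was found with the same facets. The sketch below is self-contained and uses plain dictionaries. `IGNORED`, `comparable`, and `common_expansions` are hypothetical names, and the choice of facets that are allowed to differ is an assumption. It is not the code from d8f5d08 or 0962489.

```python
from collections.abc import Iterator

# Facets that may differ between required variables (assumption).
IGNORED = frozenset({"short_name", "version", "timerange"})


def comparable(facets: dict[str, str]) -> frozenset[tuple[str, str]]:
    """Return a hashable view of the facets used for matching."""
    return frozenset(
        (key, value) for key, value in facets.items() if key not in IGNORED
    )


def common_expansions(
    available: dict[str, list[dict[str, str]]],
) -> Iterator[dict[str, str]]:
    """Yield facets for which every required variable has data.

    ``available`` maps each required short_name to the facets found for it
    (a toy stand-in for the real file search).
    """
    per_variable = [
        {comparable(facets) for facets in found}
        for found in available.values()
    ]
    if not per_variable:
        return
    shared = set.intersection(*per_variable)
    # Walk the first variable's results to keep a stable output order.
    first = next(iter(available.values()))
    seen = set()
    for facets in first:
        key = comparable(facets)
        if key in shared and key not in seen:
            seen.add(key)
            yield dict(sorted(key))
```

With this approach, any mismatch in a compared facet (for example a different institute) means the expansion is skipped rather than partially matched.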

bouweandela · 2025-12-19T15:23:04Z

esmvalcore/dataset.py

+    def _get_all_available_datasets(self) -> Iterator[Dataset]:  # noqa: C901
+        """Yield datasets based on the available files.
+
+        This function requires that self.facets['mip'] is not a glob pattern.


Is this still the case?

I honestly don't know, just copy-pasted this from the existing code. Since the parent function calling this method makes sure that mip does not contain wildcards, it might be the case.

schlunma · 2026-01-07T17:24:55Z

Thanks for reviewing @bouweandela!

> I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?
>
> If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Very good question. I think by providing the method Dataset.load we are already mixing defining and processing input data. To me, the derivation of a variable is not really a preprocessor function like the countless others we have, simply because it requires multiple CMOR input variables. In that sense, it really is closer to the definition of a variable rather than preprocessing. But this opinion is fully subjective of course.

> Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?

Yes, I did. However, I found that it was not really practical to have that. The main advantage of this PR is the ability to load derived variables with wildcards (i.e., data where not all required variables are available are skipped). Including this logic into the preprocessor module in some kind of get_required function seemed conceptually wrong (finding data is not preprocessing) and overly complicated. Thus, I opted to include this into the Dataset class. Again, this is a very subjective decision.

I think this easy access to derived variables loaded from arbitrary datasets would be a great feature of ESMValTool. AFAIK, no other package provides that.


bouweandela · 2026-01-22T09:40:17Z

My apologies for being slow with looking at this. I agree that it would be a great feature, but I don't know if this is the right way to implement it. I would like to investigate if we can find a way to do it without making the Dataset class more complicated. I'll try to find time to do that soon.

schlunma · 2026-02-02T09:59:55Z

Thanks for your answer. I am sorry to hear that this is not the "right" way of implementing it.

It would have been nice to receive this kind of feedback after I opened the corresponding issue in July 2025, after opening an associated PR that also clearly outlined this plan in July 2025, after opening this PR in July 2025, or at least after my answer to your comments last month. This would have saved me at least 3 days of work (adapting this to the new data sources configuration alone took me a full day 2 weeks ago).

schlunma added 30 commits July 13, 2025 15:37

Remove all new features, just keep no-op changes

4b989d3

Further no-op changes

b0c44f6

force_derivation=True without derive=True does not make sense

1dd5671

Add tests

8989549

Add type hints to check.py

1f6dfa3

Added type hints for recipe.py

b6a6651

Added type hints for to_datasets.py

6793e0c

Added type hints for dataset.py

878e310

Add type hints to local.py

be6e55d

Add type hints to preprocessor/__init__.py

b1caf65

Add type hints to compare_with_refs.py

19dbff9

Add type hints to _derive/__init__.py

d8ea7d9

Add type hints to some derive functions

367bfe7

Add type hints to _regrid.py

5bbe6ce

Make new dataset methods private

d10de1e

Small fix

7323866

Fix test

3ab2cdf

Fix mock

099349f

100% test coverage

86b308b

Clean doc

369a811

100% diff coverage

c2a3d81

Try to please Codacy

a3dab12

Make tests work without ESMValTool installation

001eafa

100% diff coverage for real

debd589

Added Dataset.input_datasets

c3df13e

Shorter code

e794817

Merge remote-tracking branch 'origin/type_hints_derive' into from_fil…

7c1bfd7

…es_with_derived_vars

Dataset.set_version can handle derived variables now

b971d50

Dataset._input_datasets is always list[Dataset]

f6b6d22

Make changes fully backwards-compatible

1f4de86

jlenh modified the milestones: v2.13.0, v2.14.0 Aug 21, 2025

bouweandela requested changes Dec 19, 2025


schlunma and others added 23 commits January 8, 2026 17:01

Merge remote-tracking branch 'origin/main' into from_files_with_deriv…

cc91794

…ed_vars

Load default data sources in global session fixture and fix first tests

7ec3281

Fixed recipe test

2bfc1fa

We don't need to raise an error if no files are found when updating t…

f0f2b6e

…ime ranges

Fixed existing tests and add one for data with unavailable years

cffdeea

Use static methods to make sure that original Dataset instance is not…

dec25bc

… overwritten

input_datasets -> required_datasets

ff0cdd5

Use bools for facet values of appropriate

de27a4b

Merge remote-tracking branch 'origin/main' into from_files_with_deriv…

58b21ef

…ed_vars

Simplify _get_all_available_datasets

d8f5d08

Simplify _get_all_available_datasets

0962489

Using wildcards for derived variables with only optional required var…

81da6e7

…iables is not possible

Explicitly cast tuple[tuple] to dict

ade0bce

Do not return any files for required variables if no facets match at all

d725654

Add supplementaries to required datasets

d5234a7

Add test cases for derived variables with optional variable

2226ebd

Update esmvalcore/dataset.py

bb72b6a

Co-authored-by: Bouwe Andela <b.andela@esciencecenter.nl>

Fix indentation

8dd2fdf

Merge remote-tracking branch 'origin/main' into from_files_with_deriv…

f3477c0

…ed_vars

FIrst update of notebook

9b28c0f

Update example notebook

6d8ba22

Required datasets don't need supplementaries

d3adbce

Make _derivation_necessary faster by avoiding extra calls to dataset.…

8c055a5

…files

schlunma closed this Feb 2, 2026



4 participants
