Expand Dataset.from_files so it works properly with derived variables#2777
Expand Dataset.from_files so it works properly with derived variables#2777
Dataset.from_files so it works properly with derived variables#2777Conversation
…es_with_derived_vars
There was a problem hiding this comment.
Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.
I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?
If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.
Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?
esmvalcore/dataset.py
Outdated
| return input_datasets | ||
|
|
||
| @property | ||
| def input_datasets(self) -> list[Dataset]: |
There was a problem hiding this comment.
Can we rename this to derived_from or something similar?
There was a problem hiding this comment.
Renamed to required_datasets in ff0cdd5. Would that be okay for you? I think derived_from might be misleading for non-derived variables.
esmvalcore/_recipe/to_datasets.py
Outdated
| return not copy.files | ||
|
|
||
|
|
||
| def _get_input_datasets(dataset: Dataset) -> list[Dataset]: |
There was a problem hiding this comment.
Is this function still needed now that the dataset provides these as an attribute?
There was a problem hiding this comment.
Yes. This function removes non-existent optional required datasets prior to loading them. This can/will be moved to the Dataset.load() (see #2769).
| return input_datasets | ||
|
|
||
|
|
||
| def _representative_datasets(dataset: Dataset) -> list[Dataset]: |
There was a problem hiding this comment.
This function seems no longer needed either
There was a problem hiding this comment.
Again, this can be removed once Dataset.load() works properly for derived variables.
esmvalcore/dataset.py
Outdated
| all_datasets: list[list[tuple[dict, Dataset]]] = [] | ||
| for input_dataset in self._get_input_datasets(): | ||
| all_datasets.append([]) | ||
| for expanded_ds in self._get_available_datasets( | ||
| input_dataset, | ||
| ): | ||
| updated_facets = {} | ||
| for key, value in self.facets.items(): | ||
| if _isglob(value): | ||
| if key in expanded_ds.facets and not _isglob( | ||
| expanded_ds[key], | ||
| ): | ||
| updated_facets[key] = expanded_ds.facets[key] | ||
| new_ds = self.copy() | ||
| new_ds.facets.update(updated_facets) | ||
| new_ds.supplementaries = self.supplementaries | ||
|
|
||
| all_datasets[-1].append((updated_facets, new_ds)) | ||
|
|
||
| # Only consider those datasets that contain all input variables | ||
| # necessary for derivation | ||
| for updated_facets, new_ds in all_datasets[0]: | ||
| other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]] | ||
| if all(updated_facets in facets for facets in other_facets): | ||
| yield new_ds | ||
| else: | ||
| logger.debug( | ||
| "Not all necessary input variables to derive '%s' are " | ||
| "available for %s with facets %s", | ||
| self["short_name"], | ||
| new_ds.summary(shorten=True), | ||
| updated_facets, | ||
| ) |
There was a problem hiding this comment.
This code is difficult to understand. I believe that what it intends to do, is yield a new dataset if the globs can be expanded in a similar way for all input datasets that are required to derive the dataset, did I get that right?
If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I have some concerns about how reliable it is too. What happens if some facets are different from one input dataset to another, e.g. institute or version?
esmvalcore/dataset.py
Outdated
| def _get_all_available_datasets(self) -> Iterator[Dataset]: # noqa: C901 | ||
| """Yield datasets based on the available files. | ||
|
|
||
| This function requires that self.facets['mip'] is not a glob pattern. |
There was a problem hiding this comment.
I honestly don't know, just copy-pasted this from the existing code. Since the parent function calling this method makes sure that mip does not contain wildcards, it might be the case.
|
Thanks for reviewing @bouweandela!
Very good question. I think by providing the method
Yes, I did. However, I found that it was not really practical to have that. The main advantage of this PR is the ability to load derived variables with wildcards (i.e., data where not all required variables are available are skipped). Including this logic into the preprocessor module in some kind of I think this easy access to derived variables loaded from arbitrary dataset would be a great feature of ESMValTool. AFAIK, no other packages provides that. |
…iables is not possible
Co-authored-by: Bouwe Andela <b.andela@esciencecenter.nl>
|
My apologies for being slow with looking at this. I agree that it would be a great feature, but I don't know if this is the right way to implement it. I would like to investigate if we can find a way to do it without making the Dataset class more complicated. I'll try to find time to do that soon. |
|
Thanks for your answer. I am sorry to hear that this is not the "right" way of implementing it. It would have been nice to receive this kind of feedback after I opened the corresponding issue in July 2025, after opening an associated PR that also clearly outlined this plan in July 2025, after opening this PR in July 2025, or at least after my answer to your comments last month. This would have saved me at least 3 days of work (adapting this to the new data sources configuration alone took me a full day 2 weeks ago). |
Description
This PR expands
Dataset.from_filesso it works properly with derived variables. In addition, a new attributeDataset.input_datasetsis available which returns the datasets necessary for derivation (or simply the dataset itself is no derivation is required). This can also be used within thederivepreprocessor function.This PR is the second step to make
Dataset.loadwork with derived variables.Example
Related to #2769.
Link to documentation:
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
To help with the number pull requests: