add CalAdapt WRF download function for CMIP6 hourly met data #3967
divine7022 wants to merge 19 commits into
Conversation
This is a great first pass. I've made a few comments; a few key points:
It seems `download.CalAdaptWRF()` is doing both the download and the conversion/standardization (q2 transformation, derived wind_speed).
It isn't clear whether this is an intentional design choice, but the standard way to handle met ingest is to download first and then standardize, i.e. `download.CalAdaptWRF()` responsible for downloading and `met2CF.CalAdaptWRF()` responsible for conversion to PEcAn standard met. That is more consistent with PEcAn's pattern of deriving standardized variables in the met conversion layer rather than in the data interface itself.
Two other points:
1. It seems like it would be useful to put conversion functions for q = f(q2), wind = f(uwind, vwind), and precip = f(rainc, rainnc) into metutils.R and call those.
2. I think the conversion precip = f(rainc, rainnc) can be removed from caladaptaer in order to keep that package's scope to data access (with apologies for previously suggesting otherwise; that was before we were developing a dedicated package for the Cal-Adapt AE ecosystem).
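For concreteness, hypothetical sketches of what those metutils.R helpers could look like (function names are suggestions, not existing PEcAn functions; WRF's Q2 is the 2 m water vapor mixing ratio, and RAINC/RAINNC are assumed already differenced to per-timestep amounts):

```r
# Hypothetical helper sketches for metutils.R -- names are placeholders.

# specific humidity (kg/kg) from WRF Q2 (2 m water vapor mixing ratio, kg/kg)
qair_from_mixing_ratio <- function(q2) {
  q2 / (1 + q2)
}

# scalar wind speed (m/s) from U and V wind components
wind_speed_from_uv <- function(uwind, vwind) {
  sqrt(uwind^2 + vwind^2)
}

# total precipitation from convective (RAINC) + non-convective (RAINNC) parts
total_precip <- function(rainc, rainnc) {
  rainc + rainnc
}
```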
| Availability: 1980--2100 |
| Notes: CMIP6 dynamically downscaled projections from the Cal-Adapt Analytics Engine (WUS-D3 dataset, Rahimi et al. 2024). Eight GCMs are available under SSP3-7.0; CESM2 also has SSP2-4.5 and SSP5-8.5. Data are publicly available on AWS S3 (no authentication required). Requires the `caladaptaer` package from GitHub. To use this option, set `source` to `CalAdaptWRF` and specify `model` and `scenario` in the `met` section of `pecan.xml`. |
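Based on the Notes row above, the `met` block would presumably look something like this (tag layout and values inferred from the prose, not verified against the PR):

```xml
<met>
  <source>CalAdaptWRF</source>
  <model>CESM2</model>
  <scenario>ssp370</scenario>
</met>
```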
| Available GCMs: CESM2, CNRM-ESM2-1, EC-Earth3, EC-Earth3-Veg, FGOALS-g3, MPI-ESM1-2-HR, MIROC6, TaiESM1. See `caladaptaer::cae_models(activity = "WRF")` for the current list. |
- Be more specific, e.g. "See `caladaptaer::cae_models(activity = "WRF")` for the current list of available climate models."
- What does `activity` do?
- Add a reference to the availability of reanalysis data.
| #' WRF grids are cached in tempdir() so that when met.process calls this for |
| #' multiple sites in the same R session, each grid is only fetched from S3 once. |
| #' For 200 sites x 8 vars x 20 years that cuts S3 round trips from 32,000 to 160. |
- add references, including links to the docs and the Rahimi et al. (2024) paper
- document availability of the downscaled ERA5 reanalysis
| #' @param verbose Extra debug output? Default FALSE |
| #' @param ... further arguments, currently ignored |
| #' |
| #' @return invisible data.frame with file info for BETY registration |
| #' @return invisible data.frame with file info for BETY registration |

Suggested change:

| #' @return invisible data.frame with file information. |
I think it can return the standardized table without implying that the only intent is for BETY registration.
| #' @param outfolder Directory for storing output |
| #' @param start_date Start date for met data |
| #' @param end_date End date for met data |
| #' @param site_id BETY site id |
I think this param can be defined as a unique identifier for the site, without reference to BETYdb.
Is it still necessary/useful to explicitly support BETYdb with this?
| lat_dim <- ncdf4::ncdim_def("latitude", "degree_north", |
|   lat.in, create_dimvar = TRUE) |
| lon_dim <- ncdf4::ncdim_def("longitude", "degree_east", |
|   lon.in, create_dimvar = TRUE) |
These write the site lat/lon rather than the lat/lon from the source data. If we want to support potentially multiple sites mapping to the same met input, should this be handled differently here?
| lat.in <- as.numeric(lat.in) |
| lon.in <- as.numeric(lon.in) |
Is it worthwhile to add a check that these are inside the domain of the dataset? I'm not sure whether that is worth the computational cost of a point-in-polygon check, but if there is a bounding box it should be efficient to do `lat < maxlat & lat > minlat`-type checks. If there isn't a bounding box, perhaps caladaptaer should create one. Either way, out-of-domain lat/lon should be handled gracefully.
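If a bounding box is available, the guard could be as simple as this sketch (the WUS-D3 extent values here are placeholders and should come from the dataset or caladaptaer; `logger.severe` is PEcAn's usual way to fail loudly):

```r
# Placeholder bounding box -- in practice, query the dataset/caladaptaer
# for the real WUS-D3 extent rather than hard-coding it.
domain <- list(lat_min = 25, lat_max = 50, lon_min = -130, lon_max = -100)

if (lat.in < domain$lat_min || lat.in > domain$lat_max ||
    lon.in < domain$lon_min || lon.in > domain$lon_max) {
  PEcAn.logger::logger.severe(
    "Site (", lat.in, ", ", lon.in, ") is outside the CalAdapt WRF domain"
  )
}
```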
| saveRDS(grid, cache_file) |
| } |
| # grab time dimension and build the projected point once |
Does CalAdapt guarantee that, for a given model/scenario/resolution combination, time_vals and grid will be consistent? Is the grid the same across all data at the same resolution?
| if (is.null(model)) model <- "CESM2" |
| if (is.null(scenario)) scenario <- "ssp370" |
| if (is.null(resolution)) resolution <- "d01" |

Suggested change:

| model <- model %||% "CESM2" |
| scenario <- scenario %||% "ssp370" |
| resolution <- resolution %||% "d01" |
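One caveat: `%||%` only ships in base R as of 4.4.0; on older versions it would need to come from rlang or be defined locally, e.g.:

```r
# null-coalescing fallback for R < 4.4.0 (base R exports `%||%` from 4.4.0;
# rlang also provides it)
`%||%` <- function(x, y) if (is.null(x)) y else x

model <- NULL
model <- model %||% "CESM2"  # "CESM2"
```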
| #' |
| #' Fetches hourly WRF dynamically downscaled data from the Cal-Adapt Analytics |
| #' Engine (CADCAT S3 bucket) via caladaptaer, extracts the nearest grid cell to |
| #' the site, converts units to CF-1.8, and writes one NetCDF per year. |
To be strictly CF-1.8 compliant, files need a global metadata attribute `Conventions = "CF-1.8"`.
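With ncdf4 that should be a single extra call after the file is created; `varid = 0` targets global attributes (the file and variable-list names below are placeholders for the function's actual objects):

```r
# add the global Conventions attribute to a freshly created NetCDF file
nc <- ncdf4::nc_create(out_file, nc_vars)  # out_file / nc_vars: placeholders
ncdf4::ncatt_put(nc, varid = 0, attname = "Conventions", attval = "CF-1.8")
ncdf4::nc_close(nc)
```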
| start_time = year_start, |
| end_time = year_end |
| ) |
| saveRDS(grid, cache_file) |
Is this safe if run in parallel?
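If concurrent workers can share the cache path, a write-to-temp-then-rename pattern would keep readers from ever seeing a half-written file, since `file.rename` is atomic when both paths are on the same filesystem (sketch only, reusing `grid` and `cache_file` from the quoted code):

```r
# write to a sibling temp file, then atomically rename into place
tmp <- tempfile(tmpdir = dirname(cache_file), fileext = ".rds")
saveRDS(grid, tmp)
file.rename(tmp, cache_file)
```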
Description
Adds `download.CalAdaptWRF()` to `PEcAn.data.atmosphere`: a new met driver that pulls hourly WRF dynamically downscaled CMIP6 projections from the Cal-Adapt Analytics Engine (WUS-D3 dataset, Rahimi et al. 2024). The data are on public AWS S3, so no auth is needed. This is implemented as part of CCMMF, where we need future climate forcing at ~200 California sites for SIPNET runs under multiple GCMs and SSPs.

I looked at how PEcAn handles met downloads, and this follows the same pattern as CRUNCEP/GFDL: the download function does everything (fetch, extract, convert, write CF) in one shot, so we skip the met2CF and extract.nc stages. Main reason: WRF uses a Lambert Conformal grid, and `extract.nc`/`closest_xy` assume lat-lon grids with NARR-style bounds, so they can't handle this projection. Adding CalAdaptWRF to the skip list in met.process.R was the right direction.

The tricky part is that PEcAn's `papply` calls `download.CalAdaptWRF` once per site; it never sees the full site list. Naively that means re-reading the same WRF grid from S3 for every site. The 45 km grid is small (~20-30 MB per variable per year), so we cache the full grid as RDS in `tempdir()` on the first site. Sites 2 through N just do `readRDS()` and extract their grid cell locally. For 200 sites x 8 vars x 20 years, that cuts S3 round trips from 32,000 to 160. The cache auto-cleans when R exits.

Added a `caladapt_wrf` column to `pecan_standard_met_table`; 9 output variables total:

Data coverage:

The pipeline that orchestrates R (caladaptaer) and .sh scripts, along with the data, is at /projectnb/dietzelab/ccmmf/ensemble/CalAdapt_runs/ for 198 design points × 3 GCM/SSP scenarios (CESM2.ssp245, CESM2.ssp370, MPI-ESM1-2-HR.ssp370) × 2025-2045 hourly, if anyone wants to poke at the outputs.
Motivation and Context
Review Time Estimate
Types of changes
Checklist: