Revised config and code quality improvements by clessig · Pull Request #1541 · ecmwf/WeatherGenerator

clessig · 2025-12-30T22:28:54Z

Description

Revise the config into a nested dict, and simplify the code where possible in places that had to change anyway.

This PR also enables a more flexible combination of different loss terms, e.g. a physical-space loss and a latent loss, as demonstrated in the default config. It also decouples training, validation, and testing as much as possible, so that each can have a different objective.
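
A minimal sketch of what a nested config with separate training and validation objectives could look like. Only the names `training_config`, `validation_config`, and `losses` come from this PR; the loss names, the `weight` key, and the `combine_losses` helper are hypothetical and are not the repository's actual schema or API.

```python
# Hypothetical nested config: the per-loss keys and weights are illustrative only.
config = {
    "training_config": {
        "losses": {
            "physical": {"weight": 1.0},  # loss in physical space
            "latent": {"weight": 0.5},    # loss in latent space
        },
    },
    "validation_config": {
        # Validation can use a different objective than training.
        "losses": {
            "physical": {"weight": 1.0},
        },
    },
}


def combine_losses(loss_values, losses_cfg):
    """Weighted sum over the loss terms enabled in losses_cfg (hypothetical helper)."""
    return sum(cfg["weight"] * loss_values[name] for name, cfg in losses_cfg.items())


# Stand-in per-term loss values, as a model step might produce them.
loss_values = {"physical": 2.0, "latent": 4.0}

train_loss = combine_losses(loss_values, config["training_config"]["losses"])
val_loss = combine_losses(loss_values, config["validation_config"]["losses"])
print(train_loss, val_loss)
```

Keeping the loss terms under each mode's own sub-dict is one way to let training combine several terms while validation reports only one.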

Issue Number

Closes #1534
Closes #1535

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig · 2026-01-13T18:11:50Z

Default integration test is working. train_continue and inference are also working.

* Partially revised config; model is still missing but proper setup of training_config and validation_config
* Changes necessary due to changed position of time keys and of run_id
* Handling of multiple loss terms / target_aux_calculators and non-LossPhysical ones.
* Changed position of run_id in config
* Add function to extract batch size from mode_cfg
* Changed position of run_id in config
* Changes due to revised config. Also proper handling of target_aux_calculator and various other details cleaned up
* Revised config structure, in particular for losses, and related changes
* Add missing copyright and minor changes to to_device()
* Moved sanity checking from trainer here. Also learning_rate sub_part of config is passed to LRScheduler, which leads to major simplifications
* Minor cleanups
* Changes due to changed structure of losses in config
* Changes due to changed structure of losses in config
* Minor changes due to changed position of run_id in config
* Minor changes to accommodate new config, in particular target_aux_calculator config
* Support batch_size > 1. Clean up of various smaller parts
* Clean up and implementation for batch_size > 1.
* Fix to sharding problem with FSDP2
* Removed scatter offset computation which now happens on the fly in the model
* Changes for revised config, simplify overall where possible
* Fix issues with source-target sample generation and matching. Work in progress
* Linting
* Linting
* Linting
* Linting
* Type hint
* Linting
* Linting
* Linting
* Renamed loss keys for consistency
* implement reader merge
* Long list of fixes and improvements
* Enabled support for minimal configs without rate
* Fixed validation. validation_io still broken
* Fixed linting
* Fixed problem with target filtering for loss computation for SSL losses
* working version of merge reader
* linter
* lint
* fix lead time
* Re-instantiated per loss-fct source/target correspondences. Introduced idx and correspondence fields to per sample meta-data which makes correct correspondence for loss computation much easier.
* Fixed problem with undefined variable
* Revised config
* Fixed bug with forecasting
* Added sanity check for config
* Fix bug with duplicate targets
* Linting
* Fixed problem when losses is not specified in validation config
* Fix DINOv2
* Removed temporary patches; fixed properly in 10b7a28
* Linting
* Patched validation IO. Needs to be fixed properly.
* Removed unused function
* Improved variable naming
* Improved encapsulation of functionality: total_batch_size
* Fixed broken inference
* Fixed problem with test where incorrect config was used
* Fixed processing and handling of spoof flag in loss calculation
* Fixed problem with pure masking where forecast_steps were 0. Removed duplicate function introduced through merge problem
* Fixed bug when output_streams is specified explicitly
* Corrected config param for number of samples
* Fixed bug in handling of spoof weight
* Improved clarity of logging statements
* Improved logging msgs
* Fix sinkhorn knopp
* Fix sinkhorn in multi-GPU mode
* Removed some old comments
* Fixed inference overwrites
* Fixing empty output when masking
* Intermediate stage to re-enable integration test
* Adjusted thresholds
* Renaming
* Removing old config files
* Adding copyright
* Revised default_config. This is a minimal example config for simple training towards forecasting
* Changed multiprocessing param
* Adaptation for new position of multiprocessing param
* Adding example config that combines an SSL and physical loss term
* More cleanup
* Restoring some default values
* Restoring default for decoder_type
* update to develop
* Fixed problem where parameter was expected in old config place
* Fixed linting
* Simplified interface
* Re-enabled forecast step and location weighting
* Linting
* Using new option to have validate_before_training as an int arg that allows specifying the number of samples; Added copyright statement
* Added option to have validate_before_training as int argument (specifying the number of samples). Fixed some minor subtle problems in validate() to fully distinguish validation and testing.
* Refactored correspondence parsing
* Sophiex/dev/teacher overrides (ecmwf#1557)
* Add option to modify teacher TODO fix ema update
* Fix EMA under teacher and student model differences
* Attempt to revert newline
* Raise error if teacher has weights not in student
* Clessig/sophiex/dev/teacher overrides (ecmwf#1585)
* Simplified error message
* Added support for target_and_aux configs
* Fix bug that validation EMA params are not used
* Removing unused/superfluous function
* Removed debug statement
* Changed config so that target_aux params are specified as dict at the appropriate place
---------
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
* Fixed missing default value
* Bilinear decoder: adapt code for batchsize > 1 (ecmwf#1592)
* Adapt code for batchsize > 1
* Fixed comment
---------
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
* Changed defaults
* Linting
* Fixed linting issue
* Reverting to ERA5-only as default
* Fixed problem with train_continue
---------
Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com>
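
The commit list above mentions that validate_before_training can now be given as an int that specifies the number of samples. A minimal sketch of how such an option might be interpreted, assuming (this is not confirmed by the PR text) that a boolean form is also accepted; the function name and the `default_num_samples` parameter are hypothetical.

```python
def resolve_num_validation_samples(validate_before_training, default_num_samples):
    """Return how many samples to validate on before training starts (0 means skip).

    Hypothetical helper: accepts either a bool (assumed legacy form) or an int.
    """
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(validate_before_training, bool):
        return default_num_samples if validate_before_training else 0
    if isinstance(validate_before_training, int):
        if validate_before_training < 0:
            raise ValueError("validate_before_training must be non-negative")
        return validate_before_training
    raise TypeError("validate_before_training must be a bool or an int")


print(resolve_num_validation_samples(32, default_num_samples=8))
```

Checking for `bool` before `int` avoids silently treating `True` as "one sample".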

* Remove mini_epoch backward compatibility
* Update eval_config.yml (#1584)
* Repeat flag on develop (#1562)
* Squashed commit of the following:
commit 9336fe1 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Fri Dec 12 20:10:50 2025 +0100 requested changes
commit dadde23 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 18:54:44 2025 +0100 remove 1 line
commit c871f9c Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 18:16:50 2025 +0100 remove unnecessary statement
commit e3e46eb Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 12:49:03 2025 +0100 lint
commit 559add7 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 12:47:35 2025 +0100 rename flag and simplify cases
commit f6e1c39 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Thu Dec 4 21:07:42 2025 +0100 reset config and lint
commit 27cb0c8 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Thu Dec 4 20:57:14 2025 +0100 repeat flag
commit bf17bfe Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 16:53:51 2025 +0100 Updated config
commit 7745e47 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 16:35:19 2025 +0100 Switched to lists of model / target strategies
commit 12bae15 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 15:01:07 2025 +0100 Fixes for diffusion
commit 9065219 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:33:42 2025 +0100 Changed that model takes sample as input
commit 3f52a8d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:32:14 2025 +0100 Changed core functions to take sample as arg
commit d36367a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:31:55 2025 +0100 Changed args to embedding
commit b69b743 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:30:41 2025 +0100 Cleaned up comments and return values a bit
commit 59510dd Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:01:50 2025 +0100 Fixed problem with non_blocking=True
commit 69b53a6 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:00:42 2025 +0100 Removed old comments
commit 51754fa Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:00:20 2025 +0100 Fixed missing non_blocking=True in to_device()
commit 2cd3971 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 23:56:41 2025 +0100 Completed migration to new batch class by removing reference to old list of lists
commit 402b8de Author: Julian Kuehnert <Jubeku@users.noreply.github.com> Date: Wed Dec 3 17:11:15 2025 +0100 1390 - Adapt forward pass of new batch object (#1391)
    * Add to device to ModelBatch, etc & adapt model TODO adapt validate and inference TODO test forecasting and multiple stream because predict changed substantially
    * Rename view to sample and fix validate
    * Revert predict function and fix inference
    * Fix invalid access with mask
    * Linting
    * Fixed handling of target_idxs and other minor issues
    ---------
    Co-authored-by: sophiex <24638638+sophie-xhonneux@users.noreply.github.com>
    Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
commit 9a1a6a9 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 13:12:52 2025 +0100 Re-enabled multi-source training
commit 3641e1f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:20:42 2025 +0100 Fix for integration test
commit 9f5e49c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:20:25 2025 +0100 Fixed uv.lock
commit 33d9d8d Merge: 23e0267 c8a2aad Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:13:05 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit 23e0267 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:11:48 2025 +0100 Update
commit c8a26d7 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:11:37 2025 +0100 Commit
commit 2599ec2 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:10:13 2025 +0100 Restructured code so that mask generation and application is cleanly separated
commit c8a2aad Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 17:06:56 2025 +0100 commenting tests
commit 2b2c977 Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 17:03:41 2025 +0100 linter warnings
commit dc736e5 Merge: 6fe8561 7ff6e0b Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 16:48:24 2025 +0100 merge with dev
commit 6fe8561 Merge: 15b46e9 f136d60 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 14:16:41 2025 +0100 Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit 15b46e9 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 13:30:54 2025 +0100 fix indentation of else: assert False in _get_sample msds
commit 4281aff Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 12:40:24 2025 +0100 restore loader_num_workers to 8
commit 6ea07e7 Author: Seb Hickman <56727418+shmh40@users.noreply.github.com> Date: Fri Nov 28 11:34:41 2025 +0000 restore masking_strategy to random Had placeholder for testing, now back to "random" for masking strategy in the base level of default_config
commit 1a37dd1 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 10:31:43 2025 +0100 remove unused mask generation in diffusion_forecast
commit 657094a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:59:39 2025 +0100 Fixed problem in engines introduced in recent commits merging develop. This fixes masking training
commit d526dfc Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:37:02 2025 +0100 Restored masking as training mode. Not working due to NaN in prediction
commit 6289959 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:36:38 2025 +0100 Removed duplicate lines due to merging
commit bc8d23e Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:18:01 2025 +0100 More linting
commit 47750a5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:10:09 2025 +0100 Restoring masking as training_mode in default_config
commit 0db8b62 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:41 2025 +0100 Linting
commit e41a575 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:28 2025 +0100 Linting
commit 03166a2 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:10 2025 +0100 Linting
commit 652500a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:53 2025 +0100 Linting
commit d8998a9 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:38 2025 +0100 Linting
commit 8ef3a4c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:04 2025 +0100 Simplified and clarified handling of default target_aux_calculator
commit 3e4de7a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:07:51 2025 +0100 Linting
commit 5f803e5 Merge: b47b0fa 0e2801b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:03:02 2025 +0100 Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit b47b0fa Merge: 9b702c5 26f7b5b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 07:09:19 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit 26f7b5b Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Thu Nov 27 15:33:22 2025 +0100 add diffusion forecast option for the data sampling, and with noise_level_rn in the metadata. The Trainer needs to be copied from Sophie's branch, currently we only get so far
commit 6d909d6 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Thu Nov 27 11:32:32 2025 +0100 add mask to SampleMetaData and add forecast_dt to Sample so it is accessible. Can specify the loss in the default config with student-teacher views
commit e0d7346 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Wed Nov 26 14:31:52 2025 +0100 remove prints, pdb
commit c27156c Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Wed Nov 26 12:35:03 2025 +0100 add SampleMetaData integration and functionality, and update masker to use SampleMetadata. Pass through source_cell_lens and target_coords_idx to student_teacher_batch in iter, and hence pass through to trainer. source_cell_lens and target_coords_idx are now part of Sample, which is itself the components of ModelBatch. To tidy
commit 4f8f62b Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Tue Nov 25 18:56:56 2025 +0100 instructions for sophie
commit fa24fc1 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Tue Nov 25 16:36:52 2025 +0100 very hacky first pass of full masking_strategy_config for the student and teacher views. Much to fix up
commit b193a50 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Mon Nov 24 17:13:37 2025 +0100 updated configs so code runs. Note default config to be overhauled still
commit af9a3c1 Merge: 2905cb0 b452bd2 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Mon Nov 24 16:37:55 2025 +0100 merge with develop, include trainer idx_inv_rt, merged default_config, rm tokenizer_forecast
commit 2905cb0 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Sat Nov 22 13:59:37 2025 +0000 fix masking for NPP-ATMS by correctly selecting final timestep mask and aligning between source and target. working for num_input_steps = 1, broken for > 1, compute_offsets_scatter_embed not working
commit b9a60f3 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 18:38:40 2025 +0000 tidy up, remove unused arguments, types
commit ece1dd0 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 16:22:27 2025 +0000 move build_views_for_stream into masker
commit 1a418bf Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 12:54:33 2025 +0000 add max_num_samples functionality to tokenizer_masking and pass through in multi_stream_data_sampler. coords_per_cell is a bit nasty
commit 91c3d7a Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 12:53:31 2025 +0000 add max_num_targets to era5
commit 647e4b2 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 18:31:45 2025 +0000 multiple idxs for each teacher, need to confirm for not student case, and updated ModelBatch for this
commit 1806ae5 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 16:28:30 2025 +0000 tidy up, remove unused build_stream_views in tokenizer_masking
commit 9b702c5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 14:34:34 2025 +0100 Re-enabling inversion of target ordering.
commit 87ad45f Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:10:34 2025 +0000 add teacher num_views parameter to config
commit b34b6da Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:09:19 2025 +0000 collect num_source_samples and num_target_samples, add loop over teacher masks hence allowing multiple teacher views, and add source_target_idx to keep track of which student belongs to which teacher
commit b2be982 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:07:47 2025 +0000 fix typo in ModelBatch
commit d18cf86 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:26:40 2025 +0100 Added todo
commit e8ccb8d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:22:26 2025 +0100 Added required reflexivity between source and target samples to Batch
commit 5d5e999 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:21:31 2025 +0100 Linting problems but removed unused ViewMetaData dependence
commit 3bca490 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:21:13 2025 +0100 linting
commit 6a96065 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:20:42 2025 +0100 Linting
commit c1d32fb Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:20:21 2025 +0100 linting
commit 1b1654c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 22:32:05 2025 +0100 Added basic support for use of ModelBatch class to define rough structure and interface.
commit 848880b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 20:06:41 2025 +0100 Renaming and minor clean up.
commit 6d685c0 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:57:46 2025 +0100 Moved _get_student_teacher_masks() so that masks are generated for all streams first.
commit ed26c02 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:57:23 2025 +0100 Changes to have spoofing on a per data reader sample
commit 9fe94f5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:30:48 2025 +0100 Changes necessary for spoofing flag per IOReaderData
commit 4613f7a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:58:10 2025 +0100 Cleaned up parametrization
commit 1235aab Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:47:40 2025 +0100 More refactoring. Code working again.
commit 1e70f5c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:09:20 2025 +0100 More refactoring and cleanup
commit 46147d4 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:01:29 2025 +0100 More refactoring
commit 81cf929 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 15:58:57 2025 +0100 Changes for better student teacher structure
commit dfc03f2 Merge: a824bfc 31dc658 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 15:58:37 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit a824bfc Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 12:23:47 2025 +0100 Not working draft for restructuring
commit 31dc658 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Wed Nov 19 11:04:29 2025 +0000 created function for _get_student_teacher_sample_data which returns the streams_data of the teacher and multiple streams_datas for the student views.
commit 2536cec Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:40:26 2025 +0000 correct imports with new batch.py
commit b3dfa2f Merge: 11ad4e6 c1580c4 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:36:15 2025 +0000 merge changes
commit 11ad4e6 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:34:19 2025 +0000 basic if statement to yield the student and teacher views
commit 36ea287 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:33:53 2025 +0000 slight restructure of ViewMetadata
commit 66cf9cd Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:33:08 2025 +0000 added stream id to era5 config
commit 3c26ddc Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:32:00 2025 +0000 updated default config training_config to allow student-teacher
commit c1580c4 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 16:30:44 2025 +0100 Renaming
commit 85fa139 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 16:28:46 2025 +0100 Comments
commit dd6f85a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 15:30:22 2025 +0100 Added mode and refactored get_sample_data into separate function.
commit 668912d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 13:47:40 2025 +0100 Partially enabled correct handling of multiple input steps.
commit c3b5c3b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 12:02:17 2025 +0100 Added basic support for multi-step sources.
commit ab9eecc Merge: a934f97 c733280 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 10:00:37 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local
commit a934f97 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 09:58:19 2025 +0100 NOT WORKING: updating class to handle multiple input steps and improving overall structure
commit c733280 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:32:40 2025 +0000 change view_metadata to dict in ModelInput
commit 7d5c300 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:22:33 2025 +0000 draft of training_config in default_config
commit 047b299 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:19:56 2025 +0000 draft changes to allow global local view generation in masker and tokenizer_masking. generate the mask, otherwise using batchify_source and batchify_target as before, with the capacity to remember what mask we have now when it comes to generating the targets. Update to inputs_metadata structure but not put in to practice
commit 761e263 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:13:57 2025 +0000 update ViewMetadata spec
commit 7f3c718 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Mon Nov 17 14:51:01 2025 +0100 Updating config to working version
commit ae5a2e6 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 11:54:18 2025 +0000 added file with ModelBatch and SampleMetadata dataclasses
commit debbb8f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Mon Nov 17 12:28:07 2025 +0100 Changes to prepare_logging to apply index inversion
commit 5d127bf Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Sun Nov 16 17:01:08 2025 +0100 Inversion of target output ordering to match input one in forecast mode. Unclear how to deal with it with MTM
commit 8fa544d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 20:43:57 2025 +0100 Removed unused parameters
commit ce6c735 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 16:56:51 2025 +0100 Removing centroids options for embedding that was unused and should not be used.
commit 0634105 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 09:59:13 2025 +0100 Enabled support for forecast. Cleaned up some bits and pieces.
commit ec38123 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 08:27:21 2025 +0100 Fixed remaining problems that occurred for NPP-ATMS and SYNOP. TODO: - Forecast still needs to be adapted - Some more cleanup of variable naming, return values etc
commit db6f285 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:26:31 2025 +0100 Fixed linting
commit 9229e48 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:19:21 2025 +0100 Minor cleanup
commit a581405 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:17:29 2025 +0100 Working version for ERA5, NPP-ATMS. Problems with SYNOP with empty cell handling
commit e4a9cc0 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 18:58:28 2025 +0100 Masking target is working in principle but errors when feeding data to the model.
commit 51f437f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 07:04:23 2025 +0100 NOT WORKING: Finished src, target still to be done.
commit 81bd6eb Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 12 09:38:53 2025 +0100 NOT WORKING: initial draft for index-based masking. Implemented for random and healpix masking. Open issues with _coords_local, centroids and probably other things.
* batch
* adjusted to develop
* one line
* tiny fix
* better messaging
* incorporate requested changes
* remove extra layer norms (#1589)
* Iluise/fix lead time (#1571)
* implement reader merge
* working version of merge reader
* linter
* lint
* fix lead time
* update to develop
* [1539][infra] Adds base config flag (#1573)
* set base config (#1539)
* update help message
* longer variable name
* longer variable name
* rename config variable
* rename base_configs
---------
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
* Revised config and code quality improvements (#1541)
* Partially revised config; model is still missing but proper setup of training_config and validation_config
* Changes necessary due to changed position of time keys and of run_id
* Handling of multiple loss terms / target_aux_calculators and non-LossPhysical ones.
* Changed position of run_id in config
* Add function to extract batch size from mode_cfg
* Changed position of run_id in config
* Changes due to revised config. Also proper handling of target_aux_calculator and various other details cleaned up
* Revised config structure, in particular for losses, and related changes
* Add missing copyright and minor changes to to_device()
* Moved sanity checking from trainer here. Also learning_rate sub_part of config is passed to LRScheduler, which leads to major simplifications
* Minor cleanups
* Changes due to changed structure of losses in config
* Changes due to changed structure of losses in config
* Minor changes due to changed position of run_id in config
* Minor changes to accommodate new config, in particular target_aux_calculator config
* Support batch_size > 1. Clean up of various smaller parts
* Clean up and implementation for batch_size > 1.
* Fix to sharding problem with FSDP2
* Removed scatter offset computation which now happens on the fly in the model
* Changes for revised config, simplify overall where possible
* Fix issues with source-target sample generation and matching. Work in progress
* Linting
* Linting
* Linting
* Linting
* Type hint
* Linting
* Linting
* Linting
* Renamed loss keys for consistency
* implement reader merge
* Long list of fixes and improvements
* Enabled support for minimal configs without rate
* Fixed validation. validation_io still broken
* Fixed linting
* Fixed problem with target filtering for loss computation for SSL losses
* working version of merge reader
* linter
* lint
* fix lead time
* Re-instantiated per loss-fct source/target correspondences. Introduced idx and correspondence fields to per sample meta-data which makes correct correspondence for loss computation much easier.
* Fixed problem with undefined variable
* Revised config
* Fixed bug with forecasting
* Added sanity check for config
* Fix bug with duplicate targets
* Linting
* Fixed problem when losses is not specified in validation config
* Fix DINOv2
* Removed temporary patches; fixed properly in 10b7a28
* Linting
* Patched validation IO. Needs to be fixed properly.
* Removed unused function
* Improved variable naming
* Improved encapsulation of functionality: total_batch_size
* Fixed broken inference
* Fixed problem with test where incorrect config was used
* Fixed processing and handling of spoof flag in loss calculation
* Fixed problem with pure masking where forecast_steps were 0. Removed duplicate function introduced through merge problem
* Fixed bug when output_streams is specified explicitly
* Corrected config param for number of samples
* Fixed bug in handling of spoof weight
* Improved clarity of logging statements
* Improved logging msgs
* Fix sinkhorn knopp
* Fix sinkhorn in multi-GPU mode
* Removed some old comments
* Fixed inference overwrites
* Fixing empty output when masking
* Intermediate stage to re-enable integration test
* Adjusted thresholds
* Renaming
* Removing old config files
* Adding copyright
* Revised default_config. This is a minimal example config for simple training towards forecasting
* Changed multiprocessing param
* Adaptation for new position of multiprocessing param
* Adding example config that combines an SSL and physical loss term
* More cleanup
* Restoring some default values
* Restoring default for decoder_type
* update to develop
* Fixed problem where parameter was expected in old config place
* Fixed linting
* Simplified interface
* Re-enabled forecast step and location weighting
* Linting
* Using new option to have validate_before_training as an int arg that allows specifying the number of samples; Added copyright statement
* Added option to have validate_before_training as int argument (specifying the number of samples). Fixed some minor subtle problems in validate() to fully distinguish validation and testing.
* Refactored correspondence parsing
* Sophiex/dev/teacher overrides (#1557)
* Add option to modify teacher TODO fix ema update
* Fix EMA under teacher and student model differences
* Attempt to revert newline
* Raise error if teacher has weights not in student
* Clessig/sophiex/dev/teacher overrides (#1585)
* Simplified error message
* Added support for target_and_aux configs
* Fix bug that validation EMA params are not used
* Removing unused/superfluous function
* Removed debug statement
* Changed config so that target_aux params are specified as dict at the appropriate place
---------
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
* Fixed missing default value
* Bilinear decoder: adapt code for batchsize > 1 (#1592)
* Adapt code for batchsize > 1
* Fixed comment
---------
Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>
* Changed defaults
* Linting
* Fixed linting issue
* Reverting to ERA5-only as default
* Fixed problem with train_continue
---------
Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com>
* [1601] Remove hardcoded optimizer variable eps (#1602)
* rm hardcoded optimizer variable eps
* set default for eps in optimizer
* WeatherGenerator JSON reader (#1461)
* split WeatherGenReader functionality to allow reading only JSON adding weathergen JSON reader to develop
* informative error when metrics are not there
* restore JSONreader after rebase
* JSONreader mostly restored
* MLFlow logging independent of JSON/zarr
* linting, properly checking fsteps, ens, samples in JSONreader
* tiny change to restore the MergeReader
* lint
---------
Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch>
Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch>
Co-authored-by: iluise <72020169+iluise@users.noreply.github.com>
Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com>
* Filter configs using enabled flag (#1604) * Adding filtering of config based on enabled/disabled --------- Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> 
Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> * Fixed bug with frequency parameter (#1611) * Fix problem with str indices in source/target config (#1619) * Move register & class tokens to be added earlier * Fix problem with str indices in source/target config * Fixed comment --------- Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> * Sorcha/dev/zarr3 compaction (#1450) * update dependencies to zarr3/experimental anemoi (#1253) * upper-bounding eccodes * zarr3 changes * linting * porblem with new evaluate dependencies (removed temporarily for testing common.io) * revert pyproject * first draft * commit to merge * commit to change branch * trying to remove metadata (too many zarr.json files) * zipstore * working (lot of debug prints to remove) * adding flag * WIP: adding flag * neaten up * wrapping zarruserwarning + linting * changes * fixing warnings * fixes * groups * change writer * switch group * reverting, issue is more complex than thought * post review changes * linting * fixing zarrio * linting * fixing create default arg * small change to fix export * linting * Simon/zarr3 compaction/refactoring (#1553) * make zarrio subclasses * store string literals for output storage in enum. * debugging * small fix for export * removing stream_dict lines in run_evaluation * removing timeit * pyproject.toml removing change * adding comment * removing zarrio writer in trainer.py * Set output dataset metadata in creation of zarr group to avoid incremental metadata writes. 
(#1593) * WIP:removing zarr_store flag * fixing duplication error * need mode="a" to avoid overwriting * adding comments for mode = "a" * debug w/prints * renaming reader to avoid conflict * debugging * renaming zarrio writer/reader to avoid conflicts * lint-check fix * type-check fixes * lint fix * type-check errors * revert lead_time fix * tidying --------- Co-authored-by: Simon Grasse <161459968+grassesi@users.noreply.github.com> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * Removed unused mask_params return value (#1626) * remove unused config parameter (#1632) * Update eval_config.yml (#1636) Add some supports that are missing as comments. * Jk/develop/1639 fix shard val forward (#1642) * rm model_forward assignment in val * rm clutter from diffusion branch * reverse if order * Clessig/develop/fix finetuning 1640 (#1641) * Fix bug with diagnostic streams * Avoid that empty decoders are allocated * Sophiex/dev/synop nppatms finetuning configs (#1644) * Doing something wrong * Make fine-tuning work * Rename sensibly * Enable multiple student views for one target for JEPA (#1617) * Enable multiple student views for one target * Improved readability * Fix test for empty targets in decoder creation (#1646) * add regions to integration tests (#1648) * Memory pinning (#1615) * add pin mem to IOReaderData * add pin mem to sample & modelbatch class * add pin mem to stream data * add pin mem to training loop * run /scripts/actions.sh lint * run ./scripts/actions.sh unit-test * ignore check torch import in package * move pinning to MultiStreamDataSampler * add _pin_tensor & _pin_tensor_list helper func * ruff the code * move back pin mem. 
to train loop * Remove the ignore-import-error rule and revert to the state before the change * create protocol for pinnable obj * remove pin_mem from IOReaderData class * add pin_memory to Trainer.validate * remove pin_memory from loader_params * Revert export/export_inference.py to state before c3fc9a7 * change name * revise Pinnable class description * add memory_pinning in config, train & val loop * use getattr to avoid CICD warning * use setattr to avoid CICD warning * disable pylint for self.source_tokens_lens * Fixed issues with memory pinning due to rebasing and also adjusted config position of flag * Reverting inadvertent changes --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Javad Kasravi <jkasravi@santis-ln002.cscs.ch> Co-authored-by: Javad kasravi <kasravi66@gmail.com> * Allows for writing normalized samples; fixed config to keep it well-structured (#1653) * Skipping missing scores in JSONreader (#1655) * split WeatherGenReader functionality to allow reading only JSON adding weathergen JSON reader to develop * informative error when metrics are not there * restore JSONreader after rebase * JSONreader mostly restored * MLFlow logging independent of JSON/zarr * linting, properly checking fsteps, ens, samples in JSONreader * tiny change to restore the MergeReader * lint * enabling JSONreader to skip plots and missing scores gracefully * required reformatting * move skipping of metrics to the reader class * slightly more explicit formulations --------- Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> * Remove mini_epoch backward compatibility v2 --------- Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Moritz Hauschulz <60788263+moritzhauschulz@users.noreply.github.com> Co-authored-by: kctezcan 
<kctezcan@gmail.com> Co-authored-by: Michael Tarnawa <18899420+mtar@users.noreply.github.com> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com> Co-authored-by: s6sebusc <49226935+s6sebusc@users.noreply.github.com> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch> Co-authored-by: Sorcha Owens <73587207+enssow@users.noreply.github.com> Co-authored-by: Simon Grasse <161459968+grassesi@users.noreply.github.com> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Javad Kasravi <jkasravi@santis-ln002.cscs.ch> Co-authored-by: Javad kasravi <kasravi66@gmail.com>

* Remove mini_epoch backward compatibility * Update eval_config.yml (ecmwf#1584) * Repeat flag on develop (ecmwf#1562) * Squashed commit of the following: commit 9336fe1 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Fri Dec 12 20:10:50 2025 +0100 requested changes commit dadde23 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 18:54:44 2025 +0100 remove 1 line commit c871f9c Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 18:16:50 2025 +0100 remove unnecessary statement commit e3e46eb Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 12:49:03 2025 +0100 lint commit 559add7 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Mon Dec 8 12:47:35 2025 +0100 rename flag and simplify cases commit f6e1c39 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Thu Dec 4 21:07:42 2025 +0100 reset config and lint commit 27cb0c8 Author: moritzhauschulz <moritz.hauschulz@gmail.com> Date: Thu Dec 4 20:57:14 2025 +0100 repeat flag commit bf17bfe Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 16:53:51 2025 +0100 Updated config commit 7745e47 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 16:35:19 2025 +0100 Switched to lists of model / target stratgies commit 12bae15 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 15:01:07 2025 +0100 Fixes for diffusion commit 9065219 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:33:42 2025 +0100 Changed that model takes sample as input commit 3f52a8d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:32:14 2025 +0100 Changed core functions to take sample as arg commit d36367a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:31:55 2025 +0100 Changed args to embedding commit b69b743 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 13:30:41 2025 +0100 Cleaned up comments and return values a bit commit 
59510dd Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:01:50 2025 +0100 Fixed problem with non_blocking=True commit 69b53a6 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:00:42 2025 +0100 Removed old comments commit 51754fa Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Dec 4 00:00:20 2025 +0100 Fixed missing non_blocking=True in to_device() commit 2cd3971 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 23:56:41 2025 +0100 Completed migration to new batch class by removing reference to old list of lists commit 402b8de Author: Julian Kuehnert <Jubeku@users.noreply.github.com> Date: Wed Dec 3 17:11:15 2025 +0100 1390 - Adapt forward pass of new batch object (ecmwf#1391) * Add to device to ModelBatch, etc & adapt model TODO adapt validate and inference TODO test forecasting and multiple stream because predict changed substantially * Rename view to sample and fix validate * Revert predict function and fix inference * Fix invalid access with mask * Linting * Fixed handling of target_idxs and other minor issues --------- Co-authored-by: sophiex <24638638+sophie-xhonneux@users.noreply.github.com> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> commit 9a1a6a9 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 13:12:52 2025 +0100 Re-enabled multi-source training commit 3641e1f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:20:42 2025 +0100 Fix for integration test commit 9f5e49c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:20:25 2025 +0100 Fixed uv.lock commit 33d9d8d Merge: 23e0267 c8a2aad Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:13:05 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit 23e0267 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:11:48 2025 
+0100 Update commit c8a26d7 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:11:37 2025 +0100 Commit commit 2599ec2 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Dec 3 00:10:13 2025 +0100 Restructured code so that mask generation and application is cleanly separated commit c8a2aad Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 17:06:56 2025 +0100 commenting tests commit 2b2c977 Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 17:03:41 2025 +0100 linter warnings commit dc736e5 Merge: 6fe8561 7ff6e0b Author: Tim Hunter <tim.hunter@ecmwf.int> Date: Tue Dec 2 16:48:24 2025 +0100 merge with dev commit 6fe8561 Merge: 15b46e9 f136d60 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 14:16:41 2025 +0100 Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit 15b46e9 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 13:30:54 2025 +0100 fix indentation of else: assert False in _get_sample msds commit 4281aff Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 12:40:24 2025 +0100 restore loader_num_workers to 8 commit 6ea07e7 Author: Seb Hickman <56727418+shmh40@users.noreply.github.com> Date: Fri Nov 28 11:34:41 2025 +0000 restore masking_strategy to random Had placeholder for testing, now back to "random" for masking strategy in the base level of default_config commit 1a37dd1 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Fri Nov 28 10:31:43 2025 +0100 remove unused mask generation in diffusion_forecast commit 657094a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:59:39 2025 +0100 Fixed problem in engines introduced in recent commits merging develop. This fixes masking training commit d526dfc Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:37:02 2025 +0100 Restored masking as training mode. 
Not working due to NaN in prediction commit 6289959 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:36:38 2025 +0100 Removed duplicate lines due to mergeing commit bc8d23e Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:18:01 2025 +0100 More linting commit 47750a5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:10:09 2025 +0100 Restoring masking as training_mode in default_config commit 0db8b62 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:41 2025 +0100 Linting commit e41a575 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:28 2025 +0100 Linting commit 03166a2 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:09:10 2025 +0100 Linting commit 652500a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:53 2025 +0100 Linting commit d8998a9 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:38 2025 +0100 Linting commit 8ef3a4c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:08:04 2025 +0100 Simplified and clarified handling of default target_aux_calcualtor commit 3e4de7a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:07:51 2025 +0100 Linting commit 5f803e5 Merge: b47b0fa 0e2801b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 08:03:02 2025 +0100 Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit b47b0fa Merge: 9b702c5 26f7b5b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 28 07:09:19 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit 26f7b5b Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Thu Nov 27 15:33:22 2025 +0100 add diffusion forecast option for the data sampling, and with noise_level_rn in the metadata. 
The Trainer needs to be copied from Sophies branch, currently we only get so far commit 6d909d6 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Thu Nov 27 11:32:32 2025 +0100 add mask to SampleMetaData and add forecast_dt to Sample so it is accessible. Can specify the loss in the default config with student-teacher views commit e0d7346 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Wed Nov 26 14:31:52 2025 +0100 remove prints, pdb commit c27156c Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Wed Nov 26 12:35:03 2025 +0100 add SampleMetaData integration and functionality, and update masker to use SampleMetadata. Pass through source_cell_lens and target_coords_idx to student_teacher_batch in iter, and hence pass through to trainer. source_cell_lens and target_coords_idx are now part of Sample, which is itself the components of ModelBatch. To tidy commit 4f8f62b Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Tue Nov 25 18:56:56 2025 +0100 instructions for sophie commit fa24fc1 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Tue Nov 25 16:36:52 2025 +0100 very hacky first pass of full masking_strategy_config for the student and teacher views. Much to fix up commit b193a50 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Mon Nov 24 17:13:37 2025 +0100 updated configs so code runs. Note default config to be overhauled still commit af9a3c1 Merge: 2905cb0 b452bd2 Author: Sebastian Hickman <seb.hickman@gmail.com> Date: Mon Nov 24 16:37:55 2025 +0100 merge with develop, include trainer idx_inv_rt, merged default_config, rm tokenizer_forecast commit 2905cb0 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Sat Nov 22 13:59:37 2025 +0000 fix masking for NPP-ATMS by correctly selecting final timestep mask and aligning between source and target. 
working for num_input_steps = 1, broken for > 1, compute_offsets_scatter_embed not working commit b9a60f3 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 18:38:40 2025 +0000 tidy up, remove unused arguments, types commit ece1dd0 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 16:22:27 2025 +0000 move build_views_for_stream into masker commit 1a418bf Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 12:54:33 2025 +0000 add max_num_samples functionality to tokenizer_masking and pass through in multi_stream_data_sampler. coords_per_cell is a bit nasty commit 91c3d7a Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Fri Nov 21 12:53:31 2025 +0000 add max_num_targets to era5 commit 647e4b2 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 18:31:45 2025 +0000 multiple idxs for each teacher, need to confirm for not student case, and updated ModelBatch for this commit 1806ae5 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 16:28:30 2025 +0000 tidy up, remove unused build_stream_views in tokenizer_masking commit 9b702c5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 14:34:34 2025 +0100 Re-enabling inversion of target ordering. 
commit 87ad45f Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:10:34 2025 +0000 add teacher num_views parameter to config commit b34b6da Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:09:19 2025 +0000 collect num_source_samples and num_target_samples, add loop over teacher masks hence allowing multiple teacher views, and add source_target_idx to keep track of which student belongs to which teacher commit b2be982 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Thu Nov 20 13:07:47 2025 +0000 fix typo in ModelBatch commit d18cf86 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:26:40 2025 +0100 Added todo commit e8ccb8d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:22:26 2025 +0100 Added required reflexivity between source and target samples to Batch commit 5d5e999 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:21:31 2025 +0100 Linting problems but removed unused ViewMetaData dependence commit 3bca490 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:21:13 2025 +0100 linting commit 6a96065 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:20:42 2025 +0100 Linting commit c1d32fb Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 20 08:20:21 2025 +0100 linting commit 1b1654c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 22:32:05 2025 +0100 Added basic support for use of ModelBatch class to define rough structure and interface. commit 848880b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 20:06:41 2025 +0100 Renaming and minor clean up. commit 6d685c0 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:57:46 2025 +0100 Moved _get_student_teacher_masks() so that masks are generated for all streams first. 
commit ed26c02 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:57:23 2025 +0100 Changes to have spoofing on a per data reader sample commit 9fe94f5 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 19:30:48 2025 +0100 Changes necessary for spoofing flag per IOReaderData commit 4613f7a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:58:10 2025 +0100 Cleaned up parametrization commit 1235aab Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:47:40 2025 +0100 More refactoring. Code working again. commit 1e70f5c Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:09:20 2025 +0100 More refactoring and cleanup commit 46147d4 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 17:01:29 2025 +0100 More refactoring commit 81cf929 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 15:58:57 2025 +0100 Changes for better student teacher structure commit dfc03f2 Merge: a824bfc 31dc658 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 15:58:37 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit a824bfc Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 19 12:23:47 2025 +0100 Not working draft for restructuring commit 31dc658 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Wed Nov 19 11:04:29 2025 +0000 created function for _get_student_teacher_sample_data which returns the streams_data of the teacher and multiple streams_datas for the student views. 
commit 2536cec Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:40:26 2025 +0000 correct imports with new batch.py commit b3dfa2f Merge: 11ad4e6 c1580c4 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:36:15 2025 +0000 merge changes commit 11ad4e6 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:34:19 2025 +0000 basic if statement to yield the student and teacher views commit 36ea287 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:33:53 2025 +0000 slight restructure of ViewMetadata commit 66cf9cd Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:33:08 2025 +0000 added stream id to era5 config commit 3c26ddc Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Tue Nov 18 17:32:00 2025 +0000 updated default config training_config to allow student-teacher commit c1580c4 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 16:30:44 2025 +0100 Renaming commit 85fa139 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 16:28:46 2025 +0100 Comments commit dd6f85a Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 15:30:22 2025 +0100 Added mode and refactored get_sample_data into separate function. commit 668912d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 13:47:40 2025 +0100 Partially enabled correct handling of multiple input steps. commit c3b5c3b Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 12:02:17 2025 +0100 Added basic support for multi-step sources. 
commit ab9eecc Merge: a934f97 c733280 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 10:00:37 2025 +0100 Merge branch 'shmh40/dev/1270-idx-global-local' of github.com:ecmwf/WeatherGenerator into shmh40/dev/1270-idx-global-local commit a934f97 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Tue Nov 18 09:58:19 2025 +0100 NOT WORKING: updating class to handle multiple input steps and improving overall structure commit c733280 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:32:40 2025 +0000 change view_metadata to dict in ModelInput commit 7d5c300 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:22:33 2025 +0000 draft of training_config in default_config commit 047b299 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:19:56 2025 +0000 draft changes to allow global local view generation in masker and tokenizer_masking. generate the mask, otherwise using batchify_source and batchify_target as before, with the capacity to remember what mask we have now when it comes to generating the targets. Update to inputs_metadata structure but not put in to practice commit 761e263 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 18:13:57 2025 +0000 update ViewMetadata spec commit 7f3c718 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Mon Nov 17 14:51:01 2025 +0100 Updating config to working version commit ae5a2e6 Author: Sebastian Hickman <seb.hickman@ecmwf.int> Date: Mon Nov 17 11:54:18 2025 +0000 added file with ModelBatch and SampleMetadata dataclasses commit debbb8f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Mon Nov 17 12:28:07 2025 +0100 Changes to prepare_logging to apply index inversion commit 5d127bf Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Sun Nov 16 17:01:08 2025 +0100 Inversion of target output ordering to match input one in forcast mode. 
Unclear how to deal with it with MTM commit 8fa544d Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 20:43:57 2025 +0100 Removed unused parameters commit ce6c735 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 16:56:51 2025 +0100 Removing centroids options for embedding that was unused and should not be used. commit 0634105 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 09:59:13 2025 +0100 Enabled support for forecast. Cleaned up some bits and pieces. commit ec38123 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Fri Nov 14 08:27:21 2025 +0100 Fixed remaining problems that occurred for NPP-ATMS and SYNOP. TODO: - Forecast still needs to be adapted - Some more cleanup of variable naming, return values etc commit db6f285 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:26:31 2025 +0100 Fixed linting commit 9229e48 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:19:21 2025 +0100 Minor cleanup commit a581405 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 23:17:29 2025 +0100 Working version for ERA5, NPP-ATMS. Problems with SYNOP with empty cell handling commit e4a9cc0 Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 18:58:28 2025 +0100 Masking target is working in principle but errors when feeding data to the model. commit 51f437f Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Thu Nov 13 07:04:23 2025 +0100 NOT WORKING: Finished src, target still to be done. commit 81bd6eb Author: Christian Lessig <christian.lessig@ecmwf.int> Date: Wed Nov 12 09:38:53 2025 +0100 NOT WORKING: initial draft for index-based masking. Implemented for random and healpix masking. Open issues with _coords_local, centroids and probably other things. 
* batch * adjusted to develop * one line * tiny fix * better messaging * incorporate requested changes * remove extra layer norms (ecmwf#1589) * Iluise/fix lead time (ecmwf#1571) * implement reader merge * working version of merge reader * linter * lint * fix lead time * update to develop * [1539][infra] Adds base config flag (ecmwf#1573) * set base config (ecmwf#1539) * update help message * longer variable name * longer variable name * rename config variable * rename base_configs --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Revised config and code quality improvements (ecmwf#1541) * Partially revised config; model is still missing but proper setup of training_config and validation_config * Changes necessary due to changed position of time keys and of run_id * Handling of multiple loss terms / target_aux_calculators and non-LossPhysical ones. * Changed position of run_id in config * Add function to extract batch size from mode_cfg * Changed position of run_id in config * Changes due to revised config. Also proper handling of target_aux_calculator and various other details cleaned up * Revised config structure, in particular for losses, and related changes * Add missing copyright and minor changed to to_device() * Moved sanity checking from trainer here. Also learning_rate sub_part of config is passed to LRScheduler, which leads to major simplifications * Minor cleanups * Changes due to changed structure of losses in config * Changes due to changed structure of losses in config * Minor changes due to changed position of run_id in config * Minor changes to accomodate new config, in particular target_aux_calculator config * Support batch_size > 1. Clean up of various smaller parts * Clean up and implementation for batch_size > 1. 
* Fix to sharding problem with FSDP2 * Removed scatter offset computation which now happens on the fly in the model * Changes for revised config, simplify overall where possible * Fix issues with source-target sample generation and matching. Work in progress * Linting * Linting * Linting * Linting * Type hint * Linting * Linting * Linting * Renamed loss keys for consistency * implement reader merge * Long list of fixes and improvements * Enabled support for minimal configs without rate * Fixed validation. validation_io still broken * Fixed linting * Fixed problem with target filtering for loss computation for SSL losses * working version of merge reader * linter * lint * fix lead time * Re-instantiated per loss-fct source/target correspondences. Introduced idx and correspondence fields to per sample meta-data which makes correct correspondence for loss computation much easier. * Fixed problem with undefined variable * Revised config * Fixed bug with forecasting * Added sanity check for config * Fix bug with duplicate targets * Linting * Fixed problem when losses is not specified in validation config * Fix DINOv2 * Removed temporary patches; fixed properly in 10b7a28 * Linting * Patched validation IO. Needs to be fixed properly. * Removed unused function * Improved variable naming * Improved encapsulation of functionality: total_batch_size * Fixed broken inference * Fixed problem with test where incorrect config was used * Fixed processing and handling of spoof flag in loss calculation * Fixed problem with pure masking where forecast_steps were 0. 
Removed duplicate function introduced through merge problem * Fixed bug when output_streams is specified explicitly * Corrected config param for number of samples * Fixed bug in handling of spoof weight * Improved clarity of logging statements * Improved logging msgs * Fix sinkhorn knopp * Fix sinkhorn in multi-GPU mode * Removed some old comments * Fixed inference overwrites * Fixing empty output when masking * Intermediate stage to re-enable integration test * Adjusted thresholds * Renaming * Removing old config files * Adding copyright * Revised default_config. This is a minimal example config for simple training towards forecasting * Changed multiprocessing param * Adapation for new position of multiprocessing param * Adding example config that combines an SSL and physical loss term * More cleanup * Restoring some default values * Restoring default for decoder_type * update to develop * Fixed problem where parameter was expected in old config place * Fixed linting * Simplified interface * Re-enabled forecast step and location weighting * Linting * Using new option to have validate_before_training as an int arg that allows to specify number of samples; Added copyright statement * Added option to have validate_before_training as int argument (specifyiung the number of samples). Fixed some minor subtle problems in validate() to fully distinguish validation and testing. 
* Refactored correspondence parsing * Sophiex/dev/teacher overrides (ecmwf#1557) * Add option to modify teacher TODO fix ema update * Fix EMA under teacher and student model differences * Attempt to revert newline * Raise error if teacher has weights not in student * Clessig/sophiex/dev/teacher overrides (ecmwf#1585) * Simplified error message * Added support for target_and_aux configs * Fix bug that validation EMA params are not used * Removing unused/superfluous function * Removed debug statement * Changed config so that target_aux params are specified as dict at the appropriate place --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Fixed missing default value * Bilinear decoder: adapt code for batchsize > 1 (ecmwf#1592) * Adapt code for batchsize > 1 * Fixed comment --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Changed defaults * Linting * Fixed linting issue * Reverting to ERA5-only as default * Fixed problem with train_continue --------- Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> * [1601] Remove hardcoded optimizer variable eps (ecmwf#1602) * rm hardcoded optimizer variable eps * set default for eps in optimizer * WeatherGenerator JSON reader (ecmwf#1461) * split WeatherGenReader functionality to allow reading only JSON adding weathergen JSON reader to develop * informative error when metrics are not there * restore JSONreader after rebase * JSONreader mostly restored * MLFlow logging independent of JSON/zarr * linting, properly cheking fsteps, ens, samples in JSONreader * tiny change to restore the MergeReader * lint --------- Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Ilaria Luise 
<luise.ilaria@gmail.com> * Filter configs using enabled flag (ecmwf#1604) * Partially revised config; model is still missing but proper setup of training_config and validation_config * Changes necessary due to changed position of time keys and of run_id * Handling of multiple loss terms / target_aux_calculators and non-LossPhysical ones. * Changed position of run_id in config * Add function to extract batch size from mode_cfg * Changed position of run_id in config * Changes due to revised config. Also proper handling of target_aux_calculator and various other details cleaned up * Revised config structure, in particular for losses, and related changes * Add missing copyright and minor changed to to_device() * Moved sanity checking from trainer here. Also learning_rate sub_part of config is passed to LRScheduler, which leads to major simplifications * Minor cleanups * Changes due to changed structure of losses in config * Changes due to changed structure of losses in config * Minor changes due to changed position of run_id in config * Minor changes to accomodate new config, in particular target_aux_calculator config * Support batch_size > 1. Clean up of various smaller parts * Clean up and implementation for batch_size > 1. * Fix to sharding problem with FSDP2 * Removed scatter offset computation which now happens on the fly in the model * Changes for revised config, simplify overall where possible * Fix issues with source-target sample generation and matching. Work in progress * Linting * Linting * Linting * Linting * Type hint * Linting * Linting * Linting * Renamed loss keys for consistency * implement reader merge * Long list of fixes and improvements * Enabled support for minimal configs without rate * Fixed validation. validation_io still broken * Fixed linting * Fixed problem with target filtering for loss computation for SSL losses * working version of merge reader * linter * lint * fix lead time * Re-instantiated per loss-fct source/target correspondences. 
Introduced idx and correspondence fields to per sample meta-data which makes correct correspondence for loss computation much easier. * Fixed problem with undefined variable * Revised config * Fixed bug with forecasting * Added sanity check for config * Fix bug with duplicate targets * Linting * Fixed problem when losses is not specified in validation config * Fix DINOv2 * Removed temporary patches; fixed properly in 10b7a28 * Linting * Patched validation IO. Needs to be fixed properly. * Removed unused function * Improved variable naming * Improved encapsulation of functionality: total_batch_size * Fixed broken inference * Fixed problem with test where incorrect config was used * Fixed processing and handling of spoof flag in loss calculation * Fixed problem with pure masking where forecast_steps were 0. Removed duplicate function introduced through merge problem * Fixed bug when output_streams is specified explicitly * Corrected config param for number of samples * Fixed bug in handling of spoof weight * Improved clarity of logging statements * Improved logging msgs * Fix sinkhorn knopp * Fix sinkhorn in multi-GPU mode * Removed some old comments * Fixed inference overwrites * Fixing empty output when masking * Intermediate stage to re-enable integration test * Adjusted thresholds * Renaming * Removing old config files * Adding copyright * Revised default_config. 
This is a minimal example config for simple training towards forecasting * Changed multiprocessing param * Adapation for new position of multiprocessing param * Adding example config that combines an SSL and physical loss term * More cleanup * Restoring some default values * Restoring default for decoder_type * update to develop * Fixed problem where parameter was expected in old config place * Fixed linting * Simplified interface * Re-enabled forecast step and location weighting * Linting * Using new option to have validate_before_training as an int arg that allows to specify number of samples; Added copyright statement * Added option to have validate_before_training as int argument (specifyiung the number of samples). Fixed some minor subtle problems in validate() to fully distinguish validation and testing. * Refactored correspondence parsing * Sophiex/dev/teacher overrides (ecmwf#1557) * Add option to modify teacher TODO fix ema update * Fix EMA under teacher and student model differences * Attempt to revert newline * Raise error if teacher has weights not in student * Clessig/sophiex/dev/teacher overrides (ecmwf#1585) * Simplified error message * Added support for target_and_aux configs * Fix bug that validation EMA params are not used * Removing unused/superfluous function * Removed debug statement * Changed config so that target_aux params are specified as dict at the appropriate place --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Fixed missing default value * Bilinear decoder: adapt code for batchsize > 1 (ecmwf#1592) * Adapt code for batchsize > 1 * Fixed comment --------- Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> * Changed defaults * Linting * Fixed linting issue * Reverting to ERA5-only as default * Fixed problem with train_continue * Adding filtering of config based on enabled/disabled --------- Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> Co-authored-by: iluise 
<72020169+iluise@users.noreply.github.com> Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> * Fixed bug with frequency parameter (ecmwf#1611) * Fix problem with str indices in source/target config (ecmwf#1619) * Move register & class tokens to be added earlier * Fix problem with str indices in source/target config * Fixed comment --------- Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> * Sorcha/dev/zarr3 compaction (ecmwf#1450) * update dependencies to zarr3/experimental anemoi (ecmwf#1253) * upper-bounding eccodes * zarr3 changes * linting * porblem with new evaluate dependencies (removed temporarily for testing common.io) * revert pyproject * first draft * commit to merge * commit to change branch * trying to remove metadata (too many zarr.json files) * zipstore * working (lot of debug prints to remove) * adding flag * WIP: adding flag * neaten up * wrapping zarruserwarning + linting * changes * fixing warnings * fixes * groups * change writer * switch group * reverting, issue is more complex than thought * post review changes * linting * fixing zarrio * linting * fixing create default arg * small change to fix export * linting * Simon/zarr3 compaction/refactoring (ecmwf#1553) * make zarrio subclasses * store string literals for output storage in enum. * debugging * small fix for export * removing stream_dict lines in run_evaluation * removing timeit * pyproject.toml removing change * adding comment * removing zarrio writer in trainer.py * Set output dataset metadata in creation of zarr group to avoid incremental metadata writes. 
(ecmwf#1593) * WIP:removing zarr_store flag * fixing duplication error * need mode="a" to avoid overwriting * adding comments for mode = "a" * debug w/prints * renaming reader to avoid conflict * debugging * renaming zarrio writer/reader to avoid conflicts * lint-check fix * type-check fixes * lint fix * type-check errors * revert lead_time fix * tidying --------- Co-authored-by: Simon Grasse <161459968+grassesi@users.noreply.github.com> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> * Removed unused mask_params return value (ecmwf#1626) * remove unused config parameter (ecmwf#1632) * Update eval_config.yml (ecmwf#1636) Add some supports that are missing as comments. * Jk/develop/1639 fix shard val forward (ecmwf#1642) * rm model_forward assignment in val * rm clutter from diffusion branch * reverse if order * Clessig/develop/fix finetuning 1640 (ecmwf#1641) * Fix bug with diagnostic streams * Avoid that empty decoders are allocated * Sophiex/dev/synop nppatms finetuning configs (ecmwf#1644) * Doing something wrong * Make fine-tuning work * Rename sensibly * Enable multiple student views for one target for JEPA (ecmwf#1617) * Enable multiple student views for one target * Improved readability * Fix test for empty targets in decoder creation (ecmwf#1646) * add regions to integration tests (ecmwf#1648) * Memory pinning (ecmwf#1615) * add pin mem to IOReaderData * add pin mem to sample & modelbatch class * add pin mem to stream data * add pin mem to training loop * run /scripts/actions.sh lint * run ./scripts/actions.sh unit-test * ignore check torch import in package * move pinning to MultiStreamDataSampler * add _pin_tensor & _pin_tensor_list helper func * ruff the code * move back pin mem. 
to train loop * Remove the ignore-import-error rule and revert to the state before the change * create protocol for pinnable obj * remove pin_mem from IOReaderData class * add pin_memory to Trainer.validate * remove pin_memory from loader_params * Rever export/export_inference.py to state before c3fc9a7 * change name * revise Pinnable class description * add memory_pinning in config, train & va loop * use getattr to avoid CICD warning * use setattr to avoid CICD warning * disable pylint for self.source_tokens_lens * Fixed issues with memory pinning due to rebasing and also adjusted config position of flag * Reverting unadvert changes --------- Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Javad Kasravi <jkasravi@santis-ln002.cscs.ch> Co-authored-by: Javad kasravi <kasravi66@gmail.com> * Allows for writing normalized samples; fixed config to keep it well-structured (ecmwf#1653) * Skipping missing scores in JSONreader (ecmwf#1655) * split WeatherGenReader functionality to allow reading only JSON adding weathergen JSON reader to develop * informative error when metrics are not there * restore JSONreader after rebase * JSONreader mostly restored * MLFlow logging independent of JSON/zarr * linting, properly cheking fsteps, ens, samples in JSONreader * tiny change to restore the MergeReader * lint * enabling JSONreader to skip plots and missing scores gracefully * required reformatting * move skipping of metrics to the reader class * slighly more explicit formulations --------- Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch> Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> * Remove mini_epoch backward compatibility v2 --------- Co-authored-by: iluise <72020169+iluise@users.noreply.github.com> Co-authored-by: Moritz Hauschulz <60788263+moritzhauschulz@users.noreply.github.com> Co-authored-by: 
kctezcan <kctezcan@gmail.com> Co-authored-by: Michael Tarnawa <18899420+mtar@users.noreply.github.com> Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int> Co-authored-by: Ilaria Luise <luise.ilaria@gmail.com> Co-authored-by: Sophie Xhonneux <24638638+sophie-xhonneux@users.noreply.github.com> Co-authored-by: Julian Kuehnert <Jubeku@users.noreply.github.com> Co-authored-by: s6sebusc <49226935+s6sebusc@users.noreply.github.com> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln001.cscs.ch> Co-authored-by: Sebastian Buschow <sbuschow@santis-ln002.cscs.ch> Co-authored-by: Sorcha Owens <73587207+enssow@users.noreply.github.com> Co-authored-by: Simon Grasse <161459968+grassesi@users.noreply.github.com> Co-authored-by: Tim Hunter <tim.hunter@ecmwf.int> Co-authored-by: Savvas Melidonis <79579567+SavvasMel@users.noreply.github.com> Co-authored-by: Javad Kasravi <j.kasravi@fz-juelich.de> Co-authored-by: Javad Kasravi <jkasravi@santis-ln002.cscs.ch> Co-authored-by: Javad kasravi <kasravi66@gmail.com>
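One of the squashed entries above, "Filter configs using enabled flag (ecmwf#1604)", adds filtering of the config based on enabled/disabled. The log does not show how this is done, so the following is only a minimal sketch of how such filtering could work on a nested dict config. The function name `filter_enabled`, the rule that a missing `enabled` key means "enabled", and the example keys are assumptions made for illustration, not the project's actual API.

```python
from typing import Any


def _is_disabled(node: Any) -> bool:
    """Return True for dict nodes that explicitly set enabled to False."""
    return isinstance(node, dict) and node.get("enabled", True) is False


def filter_enabled(cfg: Any) -> Any:
    """Recursively drop sub-configs whose (hypothetical) 'enabled' flag is False.

    Dicts and lists are rebuilt; all other values are returned unchanged.
    """
    if isinstance(cfg, dict):
        return {
            key: filter_enabled(value)
            for key, value in cfg.items()
            if not _is_disabled(value)
        }
    if isinstance(cfg, list):
        return [filter_enabled(item) for item in cfg if not _is_disabled(item)]
    return cfg


# Hypothetical nested config with one enabled and one disabled loss term.
example_cfg = {
    "training_config": {
        "losses": {
            "physical": {"enabled": True, "weight": 1.0},
            "latent": {"enabled": False, "weight": 0.5},
        }
    }
}
filtered_cfg = filter_enabled(example_cfg)
```

In this sketch the disabled `latent` entry is removed before any downstream code sees the config, so consumers do not need to check the flag themselves.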

clessig added 21 commits December 30, 2025 23:14

Partially revised config; model is still missing but proper setup of training_config and validation_config

b98d074

Changes necessary due to changed position of time keys and of run_id

bb12d5a

Handling of multiple loss terms / target_aux_calculators and non-LossPhysical ones.

d510956

Changed position of run_id in config

793578e

Add function to extract batch size from mode_cfg

6eabd27

Changed position of run_id in config

d350b38

Changes due to revised config. Also proper handling of target_aux_calculator and various other details cleaned up

78b17af

Revised config structure, in particular for losses, and related changes

d8a1291

Add missing copyright and minor changes to to_device()

0d4e471

Moved sanity checking from trainer here. Also learning_rate sub_part of config is passed to LRScheduler, which leads to major simplifications

b99b5c9
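The commit above passes only the learning_rate sub-part of the nested config to the scheduler. A minimal sketch of that pattern follows, assuming a hypothetical layout in which `learning_rate` sits under `training_config`. The key names (`lr_start`, `lr_max`, `warmup_steps`), the `LRSettings` class, and the linear warm-up schedule are all illustrative assumptions and do not reflect the actual LRScheduler interface in WeatherGenerator.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class LRSettings:
    """Hypothetical container for the learning-rate part of the config."""

    lr_start: float
    lr_max: float
    warmup_steps: int


def lr_settings_from_config(cfg: Mapping[str, Any]) -> LRSettings:
    """Extract only the learning_rate sub-dict, so the scheduler never sees the rest."""
    lr_cfg = cfg["training_config"]["learning_rate"]  # assumed location
    return LRSettings(
        lr_start=float(lr_cfg["lr_start"]),
        lr_max=float(lr_cfg["lr_max"]),
        warmup_steps=int(lr_cfg["warmup_steps"]),
    )


def lr_at(step: int, settings: LRSettings) -> float:
    """Linear warm-up from lr_start to lr_max, then constant (illustrative schedule)."""
    if step >= settings.warmup_steps:
        return settings.lr_max
    fraction = step / settings.warmup_steps
    return settings.lr_start + (settings.lr_max - settings.lr_start) * fraction


# Hypothetical nested config; only the learning_rate block is consumed.
cfg = {
    "training_config": {
        "batch_size": 4,
        "learning_rate": {"lr_start": 0.0, "lr_max": 1.0, "warmup_steps": 10},
    }
}
settings = lr_settings_from_config(cfg)
```

Handing the scheduler one self-contained sub-dict keeps its constructor independent of where other training options live in the config.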

Minor cleanups

9ef940b

Changes due to changed structure of losses in config

868e595

Changes due to changed structure of losses in config

f005ef0

Minor changes due to changed position of run_id in config

7b1d189

Minor changes to accommodate new config, in particular target_aux_calculator config

53eb0d0

Support batch_size > 1. Clean up of various smaller parts

cdbb696

Clean up and implementation for batch_size > 1.

0b99f3e

Fix to sharding problem with FSDP2

7d1226f

Removed scatter offset computation which now happens on the fly in the model

0ca381d

Changes for revised config, simplify overall where possible

4d67ad2

Fix issues with source-target sample generation and matching. Work in progress

66c83a2

github-project-automation Bot added this to WeatherGen-dev Dec 30, 2025

github-actions Bot added the labels infra (Issues related to infrastructure) and model (Related to model training or definition, not generic infra) Dec 30, 2025

clessig added 6 commits December 30, 2025 23:29

Linting

fff5749

Linting

f28874b

Linting

192930a

Linting

32243f3

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into clessig/develop/fix_config_1534

0d11f87

Type hint

0b900d3

clessig changed the title from "Clessig/develop/fix config 1534" to "Revised config and code quality improvements" Jan 12, 2026

clessig and others added 16 commits January 12, 2026 21:45

Fixed problem where parameter was expected in old config place

da42ad6

Fixed linting

e4e3922

Simplified interface

d741e2f

Re-enabled forecast step and location weighting

bcd561f

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into clessig/develop/fix_config_1534

8b0bb12

Merge branches 'develop' and 'clessig/develop/fix_config_1534' of github.com:ecmwf/WeatherGenerator into clessig/develop/fix_config_1534

72d92f9

Merge branch 'develop' of github.com:ecmwf/WeatherGenerator into clessig/develop/fix_config_1534

41c85a3

Linting

a4f3eed

Using new option to have validate_before_training as an int arg that allows specifying the number of samples; Added copyright statement

78a7cd4

Added option to have validate_before_training as int argument (specifying the number of samples). Fixed some minor subtle problems in validate() to fully distinguish validation and testing.

4bfea4b
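The commit above lets validate_before_training be given as an int that specifies the number of samples. A minimal sketch of how such a flag might be resolved is shown below. The helper name `resolve_pre_training_validation_samples`, the `default_samples` parameter, and the assumption that a boolean value is still accepted are all hypothetical and not taken from the PR.

```python
from __future__ import annotations


def resolve_pre_training_validation_samples(
    validate_before_training: bool | int, default_samples: int
) -> int:
    """Return how many samples to validate on before training starts.

    Assumed semantics (illustrative only):
      - False   -> 0 samples (skip validation)
      - True    -> default_samples
      - int n   -> n samples (n must be non-negative)
    """
    # bool is a subclass of int in Python, so it must be checked first.
    if isinstance(validate_before_training, bool):
        return default_samples if validate_before_training else 0
    if isinstance(validate_before_training, int):
        if validate_before_training < 0:
            raise ValueError("validate_before_training must be non-negative")
        return validate_before_training
    raise TypeError("validate_before_training must be a bool or an int")


# Example: an int value caps the pre-training validation at 8 samples.
num_samples = resolve_pre_training_validation_samples(8, default_samples=64)
```

Checking for `bool` before `int` matters here: otherwise `True` would be read as a request for exactly one sample.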

Refactored correspondence parsing

9123bae

Fixed missing default value

841e027

Bilinear decoder: adapt code for batchsize > 1 (#1592)

5f2cb75

* Adapt code for batchsize > 1
* Fixed comment

Co-authored-by: Christian Lessig <christian.lessig@ecmwf.int>

Changed defaults

71d43df

Linting

80e9181

sophie-xhonneux approved these changes Jan 13, 2026

clessig added 3 commits January 13, 2026 18:29

Fixed linting issue

c4bd337

Reverting to ERA5-only as default

aae83c0

Fixed problem with train_continue

eb235de

clessig merged commit 0d502ed into develop Jan 13, 2026
5 checks passed

github-project-automation Bot moved this from In Progress to Done in WeatherGen-dev Jan 13, 2026

clessig deleted the clessig/develop/fix_config_1534 branch January 13, 2026 18:12

grassesi mentioned this pull request Jan 27, 2026

Reenable evaluation in integration tests #1712

Closed

clessig merged 109 commits into develop from clessig/develop/fix_config_1534

3 participants