Sgh ensemble generator template#942
Draft
alexolinhager wants to merge 43 commits into MPAS-Dev:main from
Generalize ensemble generator to support multiple model configurations used for different studies.
* Introduced a new configuration module to handle model configurations for ensemble generation.
* Updated `BranchRun` and `EnsembleMember` classes to accept a `resource_module` parameter for dynamic configuration loading.
* Created default configuration files for branch and spinup ensembles, including necessary namelists and stream definitions.
* Modified the main ensemble generator configuration to streamline the setup process and improve clarity.
* Enhanced error handling for missing configuration sections and options.
* Updated the `SpinupEnsemble` class to utilize the new configuration methods for improved modularity and maintainability.
Variations on the word configuration are already too widespread so this change should reduce confusion.
The primary new functionality is the ability to support any namelist option as a parameter rather than only pre-defined parameters. The refactor also simplifies how parameter values are specified and puts all parameters in a dedicated cfg section. More details on the format are included in the updated docs.
Remove unnecessary extra section level. Separate spinup_ensemble options from options general to the whole test group.
This function was flagged as too complex, so Copilot helped me break it into smaller functions.
Update ensemble generator parsing and docs based on outdated but still relevant Copilot review feedback from PR MPAS-Dev#940.
- Sanitize multiline option parsing in spinup_ensemble._split_entries to remove continuation backslashes before tokenization.
- Use importlib.resources.as_file() when handling optional albany_input.yaml in ensemble_member setup.
- Clarify in developer docs that albany_input.yaml is copied only when present for Albany-based configurations.
- Fix users-guide cfg block indentation and multiline .option_name examples to match ConfigParser behavior.
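A minimal sketch of the backslash sanitization described in the first bullet above. The function name `split_entries` and the choice of commas and whitespace as separators are assumptions for illustration; the actual `_split_entries` in the PR may tokenize differently.

```python
import re


def split_entries(raw_value):
    """Split a multiline option value into tokens.

    Hypothetical stand-in for ``_split_entries``; separators are assumed
    to be commas and whitespace.
    """
    # Drop a trailing continuation backslash from each line before tokenizing
    lines = [re.sub(r"\\\s*$", "", line) for line in raw_value.splitlines()]
    # Split on commas and whitespace, then discard empty tokens
    tokens = re.split(r"[,\s]+", " ".join(lines))
    return [token for token in tokens if token]


entries = split_entries("0.1, 0.5 \\\n    1.0, 2.0")
print(entries)
```

Stripping the backslash per line, before joining, keeps a stray `\` from ending up inside a token.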
Still need to set up branch ensemble. This commit also makes basal melt and TF input files optional to be more compatible with SGH configurations.
Introduces new test cases within the ensemble generator:
sgh_analysis - Tests for steady state and validates runs against specularity content. Creates JSON files with a steady-state/validation metric for each run.
sgh_results - Identifies runs that have not yet reached steady state and creates a new ensemble command to restart these runs. Functionality still needs testing.
Eventually both of these test cases could be moved within the ensemble manager to run automatically.
71bb665 to cefe523
cefe523 to 291d679
…ager + sbatch) Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/997f5081-01e5-4bbd-b749-6c75d556af70 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…er-class Replace RestartMember with InPlaceRestartMember (EnsembleManager + sbatch)
…nd validate_mali_with_spec.py Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/dc37c705-1cb3-4115-998d-a70bb29f03a5 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…t-to-scripts Add `--plot_dir` argument to subglacial analysis scripts
6df4141 to b644f58
Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/9182fc24-d2dd-4907-9289-b21d7182f502 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…n-utility [WIP] Add module-level helper to sanitize numpy types for JSON
Bug #1: config.get('restart_ensemble', {}) crashes because MpasConfigParser.get() expects (section, option) positional args, not a dict fallback. Fixed: config['restart_ensemble'] returns a SectionProxy with proper .get()/.getint()/.getfloat()/.getboolean() methods.

Bug #2: _should_restart_run() looked for per-run analysis_results.json files that are never written. AnalysisStep writes analysis_summary.json to its own work dir containing an individual_results dict for all runs. Fixed: add analysis_summary_file config option; configure() loads the file and passes per-run dicts to _should_restart_run() via a new run_results param. RestartScheduler.create_config_file() now includes analysis_summary_file in generated configs.

Bug #3: restart_attempt_N/ tracking dirs were never created by InPlaceRestartMember.setup(), so max_consecutive_restarts was effectively disabled and all attempt counters read 0. Fixed: setup() now creates restart_attempt_N/ dirs using a single os.listdir() call to find the highest existing attempt number.

Bug #5: restart_scheduler.py docstring Examples section referenced a non-existent module path. Fixed to the correct path.

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/ca2d29bf-1246-415c-bf2c-9de7521fa55f
Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
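The section-proxy access pattern from Bug #1 can be illustrated with the standard-library `configparser.ConfigParser`, used here as a stand-in because `MpasConfigParser` is not available outside compass. The option names and values come from this PR's `[restart_ensemble]` section; whether `MpasConfigParser` behaves identically in every respect is an assumption.

```python
import configparser

# Standard-library parser as a stand-in for compass's MpasConfigParser
config = configparser.ConfigParser()
config.read_string("""
[restart_ensemble]
max_consecutive_restarts = 3
min_simulation_years_before_restart = 50.0
auto_restart_incomplete = True
""")

# Indexing by section name returns a SectionProxy with typed getters
section = config["restart_ensemble"]
max_restarts = section.getint("max_consecutive_restarts")
min_years = section.getfloat("min_simulation_years_before_restart")
auto_restart = section.getboolean("auto_restart_incomplete")

print(max_restarts, min_years, auto_restart)
```

Indexing the section first avoids passing a dict where an option name is expected, which was the source of the crash.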
compass's MpasConfigParser does not support the fallback= keyword argument on getint/getfloat/getboolean/get. The correct compass pattern is to load the shipped .cfg defaults via config.add_from_package() at the start of configure() so that all options are resolvable without fallback= kwargs.

Changes:
- test_case.py: call self.config.add_from_package() at the top of configure() to register ensemble_generator.cfg as the source of defaults, then remove all fallback= kwargs from getint/getfloat/getboolean/get calls
- ensemble_generator.cfg: change analysis_summary_file from the REPLACE_WITH_... sentinel to `None` (the shipped default) so the existing `!= 'none'` guard correctly treats it as "not provided"

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/219c1314-48f3-4507-852a-abf8e87511e4
Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…le-generator-structure Fix critical bugs in sgh_restart_ensemble test case
…artMember
- Remove write_job_script() call that overwrote the original job_script.sh
- Remove add_model_as_input() call (unnecessary and potentially destructive)
- Remove ntasks/min_tasks/config.set/machine/write_job_script config block
- Remove symlink for load_compass_env.sh
- Remove unused imports: configparser, compass.io.symlink, compass.job.write_job_script
- Update docstrings to accurately reflect the simplified setup() behavior

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0a459658-101e-4f84-8914-c88d0b8e7385
Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…write Fix: InPlaceRestartMember no longer overwrites original job_script.sh
Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/7a970c77-daa6-4f99-8bd8-d5be6f7cd833 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…cript Fix InPlaceRestartMember overwriting spinup job_script.sh
… steps Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/6078fd58-c3e9-490f-9c8d-ec2b66611615 Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…in-place-restart Fix EnsembleManager to rewrite job_script.sh for InPlaceRestartMember steps
This PR introduces a 3-stage workflow for running ensembles of MALI subglacial hydrology simulations. The workflow includes a new ensemble template for SGH spinup runs, an SGH analysis test case that determines which runs are at steady state, data-compatible, or should be flagged for restart, and an SGH restart test case that restarts runs not yet at steady state. The general workflow is:
1. Create and run an SGH spinup ensemble with landice/ensemble_generator/spinup_ensemble, using the sgh_ensemble_template.
2. Use landice/ensemble_generator/sgh_ensemble_analysis to determine which runs are at steady state and are data-compatible with radar specularity content (using a balanced accuracy). Runs not at steady state are flagged for restart.
3. Use landice/ensemble_generator/sgh_restart_ensemble to restart flagged runs.
Steps 1 and 2 have been tested successfully for multiple ensembles, but the restart capabilities have yet to be tested. Everything should be in place for the restart test case to work, but earlier testing corrupted necessary `step.pickle` files in the test ensemble. It may be necessary to redo a small ensemble from scratch to test the full workflow.

Thresholds for calculating steady state have not been evaluated, so a review of the analysis .png files will be necessary to make sure they are tuned appropriately. The threshold value of 0.65 for the Balanced Accuracy validation does seem to work well for AIS.
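For reference, balanced accuracy is the mean of sensitivity (true-positive rate) and specificity (true-negative rate). The sketch below applies the 0.65 threshold to hypothetical boolean data; how the analysis scripts actually binarize model output and radar specularity, and whether the comparison is `>=` or `>`, are assumptions not stated in this PR.

```python
def balanced_accuracy(predicted, observed):
    """Mean of sensitivity and specificity for two equal-length boolean sequences."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    tn = sum(1 for p, o in pairs if not p and not o)
    fp = sum(1 for p, o in pairs if p and not o)
    fn = sum(1 for p, o in pairs if not p and o)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the observations")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)


THRESHOLD = 0.65  # value quoted in this PR for AIS

# Hypothetical per-cell flags: modeled wet bed vs. high radar specularity
modeled = [True, True, False, False, True, False]
observed = [True, False, False, False, True, True]

score = balanced_accuracy(modeled, observed)
is_data_compatible = score >= THRESHOLD  # assumed comparison direction
print(round(score, 3), is_data_compatible)
```

Unlike plain accuracy, this metric is not inflated when one class (for example, dry bed) dominates the domain.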
As of 4/2/26, the landice/ensemble_generator/branch_ensemble capabilities are not supported for the sgh_ensemble_template.
Much of this PR was written through iterations with Copilot, and the commit history is admittedly somewhat of a mess. The end result should all be functional, but individual commits may be convoluted. Below is the Copilot-generated User Guide:
SGH Ensemble Workflow — User's Guide
This guide covers the complete three-stage workflow for running a subglacial hydrology parameter ensemble with MALI using compass:
Overview
The SGH ensemble workflow generates an ensemble of MALI ice-sheet simulations
in which key physical parameters (e.g. basal friction exponent, geothermal
heat flux, basal melt parameters) are sampled across prescribed ranges using
Latin Hypercube or Sobol sampling. Runs that do not reach the prescribed stop
time or steady state in the initial spinup can be continued automatically
by the restart ensemble stage.
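As an illustration of the sampling idea, here is a minimal pure-Python Latin Hypercube sketch. It is not the ensemble generator's implementation; the function name, the parameter bounds, and the use of the standard `random` module are all assumptions made for the example.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Return n_samples points, one per stratum in every dimension.

    ``bounds`` is a list of (low, high) pairs, one per parameter.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One random point inside each equal-width stratum of [0, 1)
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired randomly across dimensions
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose so that each row is one ensemble member
    return [list(row) for row in zip(*columns)]


# Hypothetical ranges for two parameters
samples = latin_hypercube(8, [(0.0, 1.0), (10.0, 20.0)])
for member in samples:
    print(member)
```

Each parameter range is covered evenly even for small ensembles, which is the main reason to prefer this over independent random draws.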
Prerequisites
parameter file, SMB file (as applicable)
`perlmutter`, `chrysalis`)
Stage 1: spinup_ensemble
Purpose
Runs an ensemble of MALI simulations in parallel via SLURM, each with a
unique combination of parameter values drawn from the configured sampling
strategy.
Configuration
Create a config file (e.g. `spinup_ensemble.cfg`):
Setup and run

    compass setup \
        -t landice/ensemble_generator/spinup_ensemble \
        -w /path/to/spinup_work_dir \
        -f spinup_ensemble.cfg \
        -m <machine>
    cd /path/to/spinup_work_dir
    compass run

`compass run` launches `EnsembleManager`, which submits each run as an
independent SLURM job via `sbatch job_script.sh`. The runs execute in
parallel. Each run directory (`run000/`, `run001/`, ...) contains:

- `namelist.landice`: MALI namelist with parameter values applied
- `streams.landice`: MALI streams file
- `job_script.sh`: SLURM batch script that calls `compass run`
- `run_info.cfg`: record of parameter values for that run
- `restart_timestamp`: written by MALI when a checkpoint is saved
- `output/globalStats.nc`: MALI output diagnostics

Sampling methods

- `sobol`
- `uniform`
- `log-uniform`

Stage 2: sgh_ensemble_analysis
Purpose
Reads the completed (or partially completed) spinup ensemble directory,
analyzes each run for steady-state and data compatibility, and writes an
`analysis_summary.json` that Stage 3 uses to identify which runs need
continuation.
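A minimal sketch of how a downstream script might read the summary, assuming the keys shown in the Output section of this guide (`restart_needed_runs`, `individual_results`). The helper name and the temporary example file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def runs_needing_restart(summary_path):
    """Return the sorted run indices listed under 'restart_needed_runs'."""
    with open(summary_path) as summary_file:
        summary = json.load(summary_file)
    return sorted(int(run) for run in summary.get("restart_needed_runs", []))


# Write a small example file so the sketch is self-contained
example = {
    "ensemble_dir": "/path/to/spinup_ensemble",
    "steady_state_runs": [2],
    "restart_needed_runs": [3, 0, 1],
    "individual_results": {
        "0": {"steady_state": {"is_steady_state": False}},
    },
}

with tempfile.TemporaryDirectory() as work_dir:
    path = Path(work_dir) / "analysis_summary.json"
    path.write_text(json.dumps(example))
    needed = runs_needing_restart(path)

print(needed)
```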
What it checks
a rolling window (configurable via `steady_state_window_years` and `steady_state_imbalance_threshold`)
observations above a configurable accuracy threshold
`config_stop_time`
Configuration
Create a config file (e.g. `analysis_ensemble.cfg`):
Setup and run

    compass setup \
        -t landice/ensemble_generator/sgh_ensemble_analysis \
        -w /path/to/analysis_work_dir \
        -f analysis_ensemble.cfg \
        -m <machine>
    cd /path/to/analysis_work_dir
    compass run

Output
`analysis_summary.json` is written to:
It contains:

    {
      "ensemble_dir": "/path/to/spinup_ensemble",
      "steady_state_runs": [2, 5, 11, ...],
      "data_compatible_runs": [2, 5, 7, ...],
      "both_criteria_runs": [2, 5, ...],
      "restart_needed_runs": [0, 1, 3, 4, ...],
      "individual_results": {
        "0": {
          "steady_state": {
            "is_steady_state": false,
            "metrics": { "final_year": 87.3, ... }
          },
          ...
        },
        ...
      }
    }

Optional: use RestartScheduler to pre-screen candidates
This generates a ready-to-use `restart_ensemble.cfg` in `new_work_dir`.
Stage 3: sgh_restart_ensemble
Purpose
Continues incomplete ensemble runs in place (in their original spinup
run directories) without copying any files. Only `namelist.landice` is
modified (setting `config_do_restart = .true.`). The existing MALI restart
file is picked up automatically.
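The namelist edit can be pictured as a one-line text substitution. The helper below is a hypothetical sketch, not the code used by `InPlaceRestartMember`; the namelist group name and stop-time value in the example are illustrative only.

```python
import re


def enable_restart(namelist_text):
    """Return namelist text with config_do_restart set to .true."""
    pattern = re.compile(r"^([ \t]*config_do_restart[ \t]*=[ \t]*)\S+", re.MULTILINE)
    new_text, count = pattern.subn(r"\g<1>.true.", namelist_text)
    if count == 0:
        # Fail loudly rather than silently leaving the run in cold-start mode
        raise ValueError("config_do_restart not found in namelist")
    return new_text


# Illustrative namelist fragment (group name and values are assumptions)
example = (
    "&time_management\n"
    "    config_do_restart = .false.\n"
    "    config_stop_time = '0100-01-01_00:00:00'\n"
    "/\n"
)

updated = enable_restart(example)
print(updated)
```

Every other line of the file is left untouched, matching the "only `namelist.landice` is modified" behavior described above.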
Run selection logic
During `compass setup`, a run from the spinup ensemble is scheduled for
restart if all of the following are true:

- `output/globalStats.nc` exists
- `restart_timestamp` exists
- `restart_timestamp` ≠ `config_stop_time`
- `min_simulation_years_before_restart`
- `analysis_summary.json`)
- `max_consecutive_restarts` previous attempts
- `auto_restart_incomplete = True`

Configuration
Create a config file (e.g. `restart_ensemble.cfg`), or use the one generated
by `schedule_restarts()` above:
Setup and run

    compass setup \
        -t landice/ensemble_generator/sgh_restart_ensemble \
        -w /path/to/restart_work_dir \
        -f restart_ensemble.cfg \
        -m <machine>
    cd /path/to/restart_work_dir
    compass run

`compass run` launches `EnsembleManager`, which:

- Rewrites `job_script.sh` in the spinup run dir to `cd` to the compass
  step directory before calling `compass run` (so `step.pickle` and the
  config file can be found on the compute node)
- `sbatch`

MALI resumes from the last restart file in the original spinup run directory.
What changes in the spinup run directory
- `namelist.landice` (at `compass setup`): `config_do_restart` set to `.true.`
- `restart_attempt_N/` (at `compass setup`)
- `job_script.sh` (at `compass run`, at submit time): `cd` to compass step dir before `compass run`

No data files, restart files, or input files are copied or moved.
Tracking restart attempts
Each time a run is scheduled for restart, a subdirectory
`restart_attempt_1/`, `restart_attempt_2/`, etc. is created in the spinup
run directory. These are counted by `configure()` to enforce
`max_consecutive_restarts` across multiple `compass setup` / `compass run`
invocations.
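The attempt counting can be sketched with a single `os.listdir()` pass, as the Bug #3 commit message describes. The helper name and the exact comparison against `max_consecutive_restarts` are assumptions; the real logic lives in `configure()` and `InPlaceRestartMember.setup()`.

```python
import os
import re
import tempfile


def highest_restart_attempt(run_dir):
    """Return the largest N among restart_attempt_N/ entries, or 0 if none."""
    pattern = re.compile(r"^restart_attempt_(\d+)$")
    numbers = [
        int(match.group(1))
        for name in os.listdir(run_dir)  # single directory scan
        if (match := pattern.match(name))
    ]
    return max(numbers, default=0)


max_consecutive_restarts = 3  # default value listed in this guide

with tempfile.TemporaryDirectory() as run_dir:
    # Simulate a run that has already been restarted twice
    os.mkdir(os.path.join(run_dir, "restart_attempt_1"))
    os.mkdir(os.path.join(run_dir, "restart_attempt_2"))

    highest = highest_restart_attempt(run_dir)
    # Assumed rule: allow another restart while below the configured limit
    may_restart = highest < max_consecutive_restarts
    next_dir_name = f"restart_attempt_{highest + 1}"

print(highest, may_restart, next_dir_name)
```

Because the counter is derived from directories on disk, it survives separate `compass setup` invocations without any extra state file.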
Re-running analysis after restarts
After the restarted runs complete, run `sgh_ensemble_analysis` again pointing
at the same spinup ensemble directory. It will pick up the new output and
update `analysis_summary.json`. If further restarts are needed, repeat
Stage 3.
Full example workflow
Configuration reference
`[ensemble_generator]`

- `start_run`
- `end_run`
- `max_samples`
- `sampling_method`: `sobol`, `uniform`, or `log-uniform`
- `ntasks`
- `cfl_fraction`
- `ensemble_template` (`sgh_ensemble`)

`[spinup_ensemble]`

- `input_file_path`

`[analysis_ensemble]`

- `ensemble_work_dir`
- `spec_tiff_file`

`[restart_ensemble]`

- `spinup_work_dir`
- `analysis_summary_file` (default `None`; `analysis_summary.json`)
- `max_consecutive_restarts` (default `3`)
- `min_simulation_years_before_restart` (default `50.0`)
- `auto_restart_incomplete` (default `True`)

Checklist
- `api.rst`) has any new or modified class, method and/or functions listed
- `E3SM-Project` submodule has been updated with relevant E3SM changes
- `MALI-Dev` submodule has been updated with relevant MALI changes
- `Testing` in this PR) any testing that was used to verify the changes