Sgh ensemble generator template #942

Draft
alexolinhager wants to merge 43 commits into MPAS-Dev:main from alexolinhager:sgh_ensemble_generator_template

@alexolinhager alexolinhager commented Mar 16, 2026

This PR introduces a three-stage workflow for running ensembles of MALI subglacial hydrology (SGH) simulations. It adds a new ensemble template for SGH spinup runs, an SGH analysis test case that determines which runs are at steady state, data-compatible, or should be flagged for restart, and an SGH restart test case that restarts runs not yet at steady state. The general workflow is:

  1. Create and run an SGH spinup ensemble with landice/ensemble_generator/spinup_ensemble, using the sgh_ensemble_template.

  2. Use landice/ensemble_generator/sgh_ensemble_analysis to determine which runs are at steady state and are data-compatible with radar specularity content (using a balanced accuracy). Runs not at steady state are flagged for restart.

  3. Use landice/ensemble_generator/sgh_restart_ensemble to restart flagged runs.

Steps 1 and 2 have been tested successfully for multiple ensembles, but the restart capabilities have yet to be tested. Everything should be in place for the restart test case to work, but earlier testing corrupted the necessary step.pickle files in the test ensemble. It may be necessary to redo a small ensemble from scratch to test the full workflow.

Thresholds for calculating steady-state have not been evaluated, so a review of the analysis .png files will be necessary to make sure they are tuned appropriately. The threshold value of 0.65 for the Balanced Accuracy validation does seem to work well for AIS.

As of 4/2/26, the landice/ensemble_generator/branch_ensemble capabilities are not supported for the sgh_ensemble_template.

Much of this PR was written through iterations with Copilot, and the commit history is admittedly somewhat messy. The end result should be functional, but individual commits may be convoluted. Below is the Copilot-generated User Guide:

SGH Ensemble Workflow — User's Guide

This guide covers the complete three-stage workflow for running a subglacial hydrology parameter ensemble with MALI using compass:

  1. Stage 1: spinup_ensemble — Run the initial ensemble
  2. Stage 2: sgh_ensemble_analysis — Analyze which runs completed or reached steady state
  3. Stage 3: sgh_restart_ensemble — Continue incomplete runs in-place

Overview

The SGH ensemble workflow generates an ensemble of MALI ice-sheet simulations
in which key physical parameters (e.g. basal friction exponent, geothermal
heat flux, basal melt parameters) are sampled across prescribed ranges using
Sobol, uniform, or log-uniform sampling. Runs that do not reach the prescribed
stop time or steady state in the initial spinup can be continued automatically
by the restart ensemble stage.

spinup_ensemble  ──►  sgh_ensemble_analysis  ──►  sgh_restart_ensemble
  (run MALI)           (check steady state)         (continue in-place)
       │                       │                            │
  run000..runN          analysis_summary.json       namelist.landice edited
  job_script.sh         individual_results          restart_attempt_N/ created
  restart_timestamp     steady_state_runs           job submitted via sbatch

Prerequisites

  • A compiled MALI executable
  • A configured compass environment (see compass installation docs)
  • Input files: initial condition NetCDF, thermal forcing file, basal melt
    parameter file, SMB file (as applicable)
  • A machine config recognized by compass (e.g. perlmutter, chrysalis)

Stage 1: spinup_ensemble

Purpose

Runs an ensemble of MALI simulations in parallel via SLURM, each with a
unique combination of parameter values drawn from the configured sampling
strategy.

Configuration

Create a config file (e.g. spinup_ensemble.cfg):

[ensemble_generator]
# Run numbers to set up (inclusive)
start_run = 0
end_run = 31

# Total number of samples in the parameter vector
max_samples = 32

# Sampling strategy: sobol, uniform, or log-uniform
sampling_method = sobol

# Number of MPI tasks per run
ntasks = 128

# CFL fraction for adaptive timestepping
cfl_fraction = 0.7

# Template to use for namelist/streams files
ensemble_template = sgh_ensemble

[spinup_ensemble]
# Path to initial condition file
input_file_path = /path/to/input.nc

Setup and run

compass setup \
    -t landice/ensemble_generator/spinup_ensemble \
    -w /path/to/spinup_work_dir \
    -f spinup_ensemble.cfg \
    -m <machine>

cd /path/to/spinup_work_dir
compass run

compass run launches EnsembleManager, which submits each run as an
independent SLURM job via sbatch job_script.sh. The runs execute in
parallel. Each run directory (run000/, run001/, ...) contains:

  • namelist.landice — MALI namelist with parameter values applied
  • streams.landice — MALI streams file
  • job_script.sh — SLURM batch script that calls compass run
  • run_info.cfg — record of parameter values for that run
  • restart_timestamp — written by MALI when a checkpoint is saved
  • output/globalStats.nc — MALI output diagnostics

Sampling methods

| Method | Description |
|---|---|
| sobol | Sobol quasi-random sequence; best space-filling coverage for multi-parameter studies |
| uniform | Linearly spaced from min to max; useful for single-parameter sweeps |
| log-uniform | Linearly spaced in log10 space; useful for parameters spanning orders of magnitude |
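The two linearly spaced methods can be sketched in a few lines of standard-library Python. This is an illustration of the descriptions above, not the code in this PR: the function names are hypothetical, and including both endpoints is an assumption. Sobol sampling is left out because it needs a quasi-random generator (for example scipy.stats.qmc.Sobol); which generator the test case actually uses is not stated here.

```python
import math


def uniform_samples(low, high, n):
    """Return n values linearly spaced from low to high (both endpoints included)."""
    if n == 1:
        return [low]
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def log_uniform_samples(low, high, n):
    """Return n values linearly spaced in log10 space; low and high must be > 0."""
    exponents = uniform_samples(math.log10(low), math.log10(high), n)
    return [10.0 ** e for e in exponents]


print(uniform_samples(0.0, 1.0, 5))
print(log_uniform_samples(1e-3, 1e3, 3))
```

The log-uniform variant places as many samples in each decade as in any other, which is why it suits parameters spanning orders of magnitude.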

Stage 2: sgh_ensemble_analysis

Purpose

Reads the completed (or partially completed) spinup ensemble directory,
analyzes each run for steady-state and data compatibility, and writes an
analysis_summary.json that Stage 3 uses to identify which runs need
continuation.

What it checks

  • Steady state: whether the water mass balance has stabilized within
    a rolling window (configurable via steady_state_window_years and
    steady_state_imbalance_threshold)
  • Data compatibility: whether simulated specularity content matches
    observations above a configurable accuracy threshold
  • Completion: whether the run reached config_stop_time

Configuration

Create a config file (e.g. analysis_ensemble.cfg):

[analysis_ensemble]
# Path to the spinup ensemble work directory (contains run000/, run001/, ...)
ensemble_work_dir = /path/to/spinup_work_dir/landice/ensemble_generator/spinup_ensemble

# (Optional) path to observed specularity content TIFF for data compatibility
# Leave unset or set to None if not available
spec_tiff_file = /path/to/specularity.tif

[analysis]
# Rolling window length (years) for steady-state detection
steady_state_window_years = 10.0

# Maximum allowed fractional imbalance to declare steady state
steady_state_imbalance_threshold = 0.05

# Minimum balanced accuracy to declare data compatibility
balanced_accuracy_threshold = 0.65
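For reference, balanced accuracy is the mean of the true-positive rate and the true-negative rate, which keeps a dominant class from inflating the score. The sketch below computes it for binary labels; how the analysis step turns specularity content and model output into binary labels is not described here, so the inputs are purely illustrative and the function name is hypothetical.

```python
def balanced_accuracy(observed, predicted):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    tp = sum(1 for o, p in zip(observed, predicted) if o and p)
    tn = sum(1 for o, p in zip(observed, predicted) if not o and not p)
    positives = sum(1 for o in observed if o)
    negatives = len(observed) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in the observations")
    return 0.5 * (tp / positives + tn / negatives)


observed = [1, 1, 1, 1, 0, 0]    # made-up labels
predicted = [1, 1, 1, 0, 0, 1]   # made-up labels
score = balanced_accuracy(observed, predicted)
print(score, score >= 0.65)
```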

Setup and run

compass setup \
    -t landice/ensemble_generator/sgh_ensemble_analysis \
    -w /path/to/analysis_work_dir \
    -f analysis_ensemble.cfg \
    -m <machine>

cd /path/to/analysis_work_dir
compass run

Output

analysis_summary.json is written to:

/path/to/analysis_work_dir/landice/ensemble_generator/sgh_ensemble_analysis/analyze_ensemble/analysis_summary.json

It contains:

{
  "ensemble_dir": "/path/to/spinup_ensemble",
  "steady_state_runs": [2, 5, 11, ...],
  "data_compatible_runs": [2, 5, 7, ...],
  "both_criteria_runs": [2, 5, ...],
  "restart_needed_runs": [0, 1, 3, 4, ...],
  "individual_results": {
    "0": {
      "steady_state": {
        "is_steady_state": false,
        "metrics": { "final_year": 87.3, ... }
      },
      ...
    },
    ...
  }
}
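A short sketch of reading the summary from Python follows. The key names come from the structure shown above; the helper names and the sample values are made up, and treating both_criteria_runs as the intersection of the two run lists is an assumption based on its name. In practice the dictionary would come from json.load() on the real file.

```python
import json

# Illustrative stand-in for the contents of analysis_summary.json; values are made up.
EXAMPLE_SUMMARY = json.loads("""
{
  "ensemble_dir": "/path/to/spinup_ensemble",
  "steady_state_runs": [2, 5],
  "data_compatible_runs": [2, 7],
  "both_criteria_runs": [2],
  "restart_needed_runs": [3, 0, 1],
  "individual_results": {
    "0": {"steady_state": {"is_steady_state": false, "metrics": {"final_year": 87.3}}},
    "2": {"steady_state": {"is_steady_state": true, "metrics": {"final_year": 120.0}}}
  }
}
""")


def runs_needing_restart(summary):
    """Return the sorted run numbers listed under restart_needed_runs."""
    return sorted(int(r) for r in summary.get("restart_needed_runs", []))


def final_year(summary, run):
    """Look up the final simulated year for one run; individual_results keys are strings."""
    result = summary["individual_results"][str(run)]
    return result["steady_state"]["metrics"]["final_year"]


def both_criteria(summary):
    """Runs that are both steady-state and data-compatible (assumed intersection)."""
    steady = set(summary["steady_state_runs"])
    compatible = set(summary["data_compatible_runs"])
    return sorted(steady & compatible)


print(runs_needing_restart(EXAMPLE_SUMMARY), final_year(EXAMPLE_SUMMARY, 0))
```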

Optional: use RestartScheduler to pre-screen candidates

from compass.landice.tests.ensemble_generator.sgh_restart_ensemble \
    .restart_scheduler import schedule_restarts

config_file, restart_runs = schedule_restarts(
    summary_file='/path/to/analysis_summary.json',
    new_work_dir='/path/to/restart_work_dir',
    min_years=50.0,
    max_attempts=3
)
print(f"Config written to: {config_file}")
print(f"Runs to restart: {restart_runs}")

This generates a ready-to-use restart_ensemble.cfg in new_work_dir.


Stage 3: sgh_restart_ensemble

Purpose

Continues incomplete ensemble runs in-place — in their original spinup
run directories — without copying any files. Only namelist.landice is
modified (setting config_do_restart = .true.). The existing MALI restart
file is picked up automatically.

Run selection logic

During compass setup, a run from the spinup ensemble is scheduled for
restart if all of the following are true:

| Check | Description |
|---|---|
| output/globalStats.nc exists | The run produced output |
| restart_timestamp exists | The run started and checkpointed |
| restart_timestamp < config_stop_time | The run did not already complete |
| simulation length ≥ min_simulation_years_before_restart | Enough progress was made to be worth restarting |
| Not at steady state | The run still needs continuation (from analysis_summary.json) |
| Fewer than max_consecutive_restarts previous attempts | The restart limit has not been reached |
| auto_restart_incomplete = True | Auto-restart is enabled |
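The checks above amount to a logical AND over a handful of per-run facts. The sketch below expresses that as a single predicate; RunStatus and should_restart are hypothetical names (not the test case's own functions), and model times are simplified to floating-point years rather than MALI timestamp strings.

```python
from dataclasses import dataclass, replace


@dataclass
class RunStatus:
    """Hypothetical per-run facts gathered before deciding on a restart."""
    has_global_stats: bool        # output/globalStats.nc exists
    has_restart_timestamp: bool   # restart_timestamp exists
    restart_year: float           # model year recorded in restart_timestamp
    stop_year: float              # model year of config_stop_time
    simulated_years: float        # simulation length so far
    is_steady_state: bool         # from analysis_summary.json
    previous_attempts: int        # number of restart_attempt_N/ directories


def should_restart(status, min_years=50.0, max_attempts=3, auto_restart=True):
    """Return True only if every check passes."""
    checks = (
        status.has_global_stats,
        status.has_restart_timestamp,
        status.restart_year < status.stop_year,   # run has not reached its stop time
        status.simulated_years >= min_years,
        not status.is_steady_state,
        status.previous_attempts < max_attempts,
        auto_restart,
    )
    return all(checks)


candidate = RunStatus(True, True, 87.3, 200.0, 87.3, False, 1)
print(should_restart(candidate))
print(should_restart(replace(candidate, previous_attempts=3)))
```

Keeping the checks in one tuple makes it easy to log which condition excluded a run if that turns out to be useful during testing.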

Configuration

Create a config file (e.g. restart_ensemble.cfg), or use the one generated
by schedule_restarts() above:

[restart_ensemble]

# REQUIRED: path to the spinup ensemble directory
spinup_work_dir = /path/to/spinup_work_dir/landice/ensemble_generator/spinup_ensemble

# RECOMMENDED: path to analysis_summary.json
analysis_summary_file = /path/to/analysis_work_dir/landice/ensemble_generator/sgh_ensemble_analysis/analyze_ensemble/analysis_summary.json

# Maximum restart attempts per run before giving up (default: 3)
max_consecutive_restarts = 3

# Minimum simulation years before a run is eligible for restart (default: 50.0)
min_simulation_years_before_restart = 50.0

# Automatically restart all eligible runs (default: True)
auto_restart_incomplete = True

[ensemble]
ntasks = 128
cfl_fraction = 0.7

Setup and run

compass setup \
    -t landice/ensemble_generator/sgh_restart_ensemble \
    -w /path/to/restart_work_dir \
    -f restart_ensemble.cfg \
    -m <machine>

cd /path/to/restart_work_dir
compass run

compass run launches EnsembleManager, which:

  1. Loops over all registered restart steps
  2. Checks each spinup run dir for error logs or completion
  3. Updates job_script.sh in the spinup run dir to cd to the compass
    step directory before calling compass run (so step.pickle and the
    config file can be found on the compute node)
  4. Submits the job via sbatch

MALI resumes from the last restart file in the original spinup run directory.

What changes in the spinup run directory

| File/directory | When | What |
|---|---|---|
| namelist.landice | compass setup | config_do_restart set to .true. |
| restart_attempt_N/ | compass setup | Empty tracking directory created |
| job_script.sh | compass run (at submit time) | Rewritten to cd to compass step dir before compass run |

No data files, restart files, or input files are copied or moved.

Tracking restart attempts

Each time a run is scheduled for restart, a subdirectory
restart_attempt_1/, restart_attempt_2/, etc. is created in the spinup
run directory. These are counted by configure() to enforce
max_consecutive_restarts across multiple compass setup / compass run
invocations.
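Counting attempts from directory names can be sketched as follows. This is not the code in the PR: the helper names are hypothetical, and the only detail taken from the source is that a single os.listdir() pass over the run directory finds the highest existing restart_attempt_N. The demo uses a temporary directory in place of a real spinup run directory.

```python
import os
import re
import tempfile

_ATTEMPT_RE = re.compile(r"^restart_attempt_(\d+)$")


def highest_restart_attempt(run_dir):
    """Return the largest N among restart_attempt_N/ directories, or 0 if none exist."""
    attempts = [
        int(m.group(1))
        for name in os.listdir(run_dir)
        if (m := _ATTEMPT_RE.match(name))
        and os.path.isdir(os.path.join(run_dir, name))
    ]
    return max(attempts, default=0)


def create_next_attempt_dir(run_dir):
    """Create the next empty restart_attempt_N/ tracking directory and return its path."""
    n = highest_restart_attempt(run_dir) + 1
    path = os.path.join(run_dir, f"restart_attempt_{n}")
    os.mkdir(path)
    return path


# Demo in a throwaway directory standing in for a spinup run directory.
with tempfile.TemporaryDirectory() as run_dir:
    first = os.path.basename(create_next_attempt_dir(run_dir))
    second = os.path.basename(create_next_attempt_dir(run_dir))
    count = highest_restart_attempt(run_dir)
print(first, second, count)
```

Because the count lives on disk rather than in memory, it survives separate compass setup and compass run invocations, which is what allows the restart limit to be enforced across them.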

Re-running analysis after restarts

After the restarted runs complete, run sgh_ensemble_analysis again pointing
at the same spinup ensemble directory. It will pick up the new output and
update analysis_summary.json. If further restarts are needed, repeat
Stage 3.


Full example workflow

# Stage 1 — spinup
compass setup -t landice/ensemble_generator/spinup_ensemble \
    -w /scratch/my_project/spinup -f spinup.cfg -m perlmutter
cd /scratch/my_project/spinup && compass run

# (wait for SLURM jobs to finish)

# Stage 2 — analysis
compass setup -t landice/ensemble_generator/sgh_ensemble_analysis \
    -w /scratch/my_project/analysis -f analysis.cfg -m perlmutter
cd /scratch/my_project/analysis && compass run

# Optional: use RestartScheduler to preview and generate restart config
python3 -c "
from compass.landice.tests.ensemble_generator.sgh_restart_ensemble \
    .restart_scheduler import schedule_restarts
schedule_restarts(
    '/scratch/my_project/analysis/landice/ensemble_generator/'
    'sgh_ensemble_analysis/analyze_ensemble/analysis_summary.json',
    '/scratch/my_project/restart'
)
"

# Stage 3 — restart
compass setup -t landice/ensemble_generator/sgh_restart_ensemble \
    -w /scratch/my_project/restart -f restart_ensemble.cfg -m perlmutter
cd /scratch/my_project/restart && compass run

# (wait for SLURM jobs to finish, then re-run analysis if needed)

Configuration reference

[ensemble_generator]

| Option | Description |
|---|---|
| start_run | First run number to set up |
| end_run | Last run number to set up (inclusive) |
| max_samples | Total size of the parameter vector |
| sampling_method | sobol, uniform, or log-uniform |
| ntasks | MPI tasks per run |
| cfl_fraction | CFL fraction for adaptive timestepper |
| ensemble_template | Template package name (e.g. sgh_ensemble) |

[spinup_ensemble]

| Option | Description |
|---|---|
| input_file_path | Path to MALI initial condition file |

[analysis_ensemble]

| Option | Description |
|---|---|
| ensemble_work_dir | Path to the spinup ensemble directory |
| spec_tiff_file | (Optional) observed specularity TIFF for data compatibility |

[restart_ensemble]

| Option | Default | Description |
|---|---|---|
| spinup_work_dir | (required) | Path to the spinup ensemble directory |
| analysis_summary_file | None | Path to analysis_summary.json |
| max_consecutive_restarts | 3 | Maximum restart attempts per run |
| min_simulation_years_before_restart | 50.0 | Minimum years simulated before restart |
| auto_restart_incomplete | True | Automatically restart eligible runs |

Checklist

  • User's Guide has been updated
  • Developer's Guide has been updated
  • API documentation in the Developer's Guide (api.rst) has any new or modified class, method and/or functions listed
  • Documentation has been built locally and changes look as expected
  • The E3SM-Project submodule has been updated with relevant E3SM changes
  • The MALI-Dev submodule has been updated with relevant MALI changes
  • Document (in a comment titled Testing in this PR) any testing that was used to verify the changes
  • New tests have been added to a test suite

matthewhoffman and others added 15 commits March 7, 2026 15:18
Generalize ensemble generator to support multiple model configurations used
for different studies.
* Introduced a new configuration module to handle model configurations
  for ensemble generation.
* Updated `BranchRun` and `EnsembleMember` classes to accept a
  `resource_module` parameter for dynamic configuration loading.
* Created default configuration files for branch and spinup ensembles,
  including necessary namelists and stream definitions.
* Modified the main ensemble generator configuration to streamline the
  setup process and improve clarity.
* Enhanced error handling for missing configuration sections and
  options.
* Updated the `SpinupEnsemble` class to utilize the new configuration
  methods for improved modularity and maintainability.
Variations on the word configuration are already too widespread
so this change should reduce confusion.
The primary new functionality is the ability to support any
namelist option as a parameter rather than only pre-defined
parameters.  The refactor also simplifies how parameter values
are specified and puts all parameters in a dedicated cfg section.
More details on the format are included in the updated docs.
Remove unnecessary extra section level,
Separate spinup_ensemble options from options
general to the whole test group.
This function was flagged as too complex, so Copilot helped me break it
into smaller functions.
Update ensemble generator parsing and docs based on outdated but still
relevant Copilot review feedback from PR MPAS-Dev#940.

- Sanitize multiline option parsing in spinup_ensemble._split_entries
  to remove continuation backslashes before tokenization.
- Use importlib.resources.as_file() when handling optional
  albany_input.yaml in ensemble_member setup.
- Clarify in developer docs that albany_input.yaml is copied only
  when present for Albany-based configurations.
- Fix users-guide cfg block indentation and multiline .option_name
  examples to match ConfigParser behavior.
Still need to set up branch ensemble

This commit also makes basal melt and TF input files optional to be more
compatible with SGH configurations.
Introduces new test cases within the ensemble generator:

sgh_analysis - Tests for steady state and validates runs against
specularity content. Creates json files with steady-state/validation
metric for each run

sgh_results - Identifies runs that have not yet reached steady state and
creates a new ensemble command to restart these runs.

Functionality still needs testing. Eventually both of these test cases
could be moved within the ensemble manager to perform automatically.
@alexolinhager alexolinhager added in progress This PR is not ready for review or merging land ice labels Mar 20, 2026
@alexolinhager alexolinhager force-pushed the sgh_ensemble_generator_template branch 2 times, most recently from 71bb665 to cefe523 Compare March 25, 2026 17:51
@matthewhoffman matthewhoffman marked this pull request as draft March 29, 2026 16:14
@alexolinhager alexolinhager force-pushed the sgh_ensemble_generator_template branch from cefe523 to 291d679 Compare March 31, 2026 20:11
Copilot AI and others added 3 commits April 1, 2026 18:36
…nd validate_mali_with_spec.py

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/dc37c705-1cb3-4115-998d-a70bb29f03a5

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…t-to-scripts

Add `--plot_dir` argument to subglacial analysis scripts
@alexolinhager alexolinhager force-pushed the sgh_ensemble_generator_template branch from 6df4141 to b644f58 Compare April 1, 2026 19:20
Copilot AI and others added 16 commits April 1, 2026 19:30
Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/9182fc24-d2dd-4907-9289-b21d7182f502

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…n-utility

[WIP] Add module-level helper to sanitize numpy types for JSON
Bug #1: config.get('restart_ensemble', {}) crashes because MpasConfigParser.get()
expects (section, option) positional args, not a dict fallback.
Fixed: config['restart_ensemble'] returns a SectionProxy with proper
.get()/.getint()/.getfloat()/.getboolean() methods.

Bug #2: _should_restart_run() looked for per-run analysis_results.json files
that are never written.  AnalysisStep writes analysis_summary.json to its
own work dir containing an individual_results dict for all runs.
Fixed: add analysis_summary_file config option; configure() loads the file
and passes per-run dicts to _should_restart_run() via a new run_results param.
RestartScheduler.create_config_file() now includes analysis_summary_file in
generated configs.

Bug #3: restart_attempt_N/ tracking dirs were never created by
InPlaceRestartMember.setup(), so max_consecutive_restarts was effectively
disabled and all attempt counters read 0.
Fixed: setup() now creates restart_attempt_N/ dirs using a single os.listdir()
call to find the highest existing attempt number.

Bug #5: restart_scheduler.py docstring Examples section referenced a
non-existent module path. Fixed to the correct path.

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/ca2d29bf-1246-415c-bf2c-9de7521fa55f

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
compass's MpasConfigParser does not support the fallback= keyword argument
on getint/getfloat/getboolean/get.  The correct compass pattern is to load
the shipped .cfg defaults via config.add_from_package() at the start of
configure() so that all options are resolvable without fallback= kwargs.

Changes:
- test_case.py: call self.config.add_from_package() at the top of configure()
  to register ensemble_generator.cfg as the source of defaults, then remove
  all fallback= kwargs from getint/getfloat/getboolean/get calls
- ensemble_generator.cfg: change analysis_summary_file from the
  REPLACE_WITH_... sentinel to `None` (the shipped default) so the existing
  `!= 'none'` guard correctly treats it as "not provided"

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/219c1314-48f3-4507-852a-abf8e87511e4

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…le-generator-structure

Fix critical bugs in sgh_restart_ensemble test case
…artMember

- Remove write_job_script() call that overwrote the original job_script.sh
- Remove add_model_as_input() call (unnecessary and potentially destructive)
- Remove ntasks/min_tasks/config.set/machine/write_job_script config block
- Remove symlink for load_compass_env.sh
- Remove unused imports: configparser, compass.io.symlink, compass.job.write_job_script
- Update docstrings to accurately reflect the simplified setup() behavior

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/0a459658-101e-4f84-8914-c88d0b8e7385

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…write

Fix: InPlaceRestartMember no longer overwrites original job_script.sh
Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/7a970c77-daa6-4f99-8bd8-d5be6f7cd833

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…cript

Fix InPlaceRestartMember overwriting spinup job_script.sh
… steps

Agent-Logs-Url: https://github.com/alexolinhager/compass/sessions/6078fd58-c3e9-490f-9c8d-ec2b66611615

Co-authored-by: alexolinhager <131483939+alexolinhager@users.noreply.github.com>
…in-place-restart

Fix EnsembleManager to rewrite job_script.sh for InPlaceRestartMember steps