O1.3.4 Add/verify checkpointing/restarting capabilities in ClimaAtmos, ClimaLand, ClimaCoupler #1115

@juliasloan25 opened this issue on Dec 10, 2024

The Climate Modeling Alliance

Software Design Issue 📜

Purpose

We need reliable, consistent restarting capabilities for our simulations, most urgently for calibration runs. We plan to run these on NCAR's Derecho, which has a 12-hour wall time limit for jobs. The runs we need for calibration exceed this time limit, so they need to be run in shorter time segments and restarted.

We are currently not able to perform consistent restarts of coupled simulations. That is, a simulation that is run without restarting will produce different results than one that is run with the same setup but with restarting. We need to fix this to be able to have reliable runs for coupled model calibration, as mentioned above.

The restart inconsistency for coupled simulations comes from how component models are initialized and how their initial conditions are set. Currently, ClimaCoupler initializes each component independently, then performs an "initial component model exchange" (here), intended to pass between components the information needed to compute each model's cache. However, it is not guaranteed that this exchange does what it is meant to, because we don't know for sure that the cache variables are updated in the correct order. The reinitialization step introduces another inconsistency: it resets the states to the start time, but not the caches.
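To make the cache-reset problem concrete, here is a deliberately tiny sketch. It is not ClimaCoupler code: Python stands in for Julia, and `ToyModel`, `update_cache`, and `run` are hypothetical names. A single state variable `y` has a cached tendency `p` that should always equal `-k * y`. If the model is advanced, its state is reset to the start value, and its cache is not recomputed, the first step after the reset reads a stale cache and the trajectory diverges from a cleanly started run.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical one-variable model: a state `y` plus a cached tendency `p`."""
    y: float
    p: float = 0.0
    k: float = 0.5

    def update_cache(self):
        # The cache is a pure function of the current state.
        self.p = -self.k * self.y

    def step(self, dt):
        # Forward Euler step that reads the cache, then refreshes it.
        self.y += dt * self.p
        self.update_cache()


def run(model, nsteps, dt=0.1):
    for _ in range(nsteps):
        model.step(dt)
    return model.y


y0 = 1.0

# Reference run: the cache is computed from the initial state.
ref = ToyModel(y0)
ref.update_cache()
y_ref = run(ref, 10)

# Mimic the problem: the model is advanced (e.g. during an initial exchange),
# then only its state is reset to the start value; the cache is left stale.
stale = ToyModel(y0)
stale.update_cache()
run(stale, 3)
stale.y = y0
y_stale = run(stale, 10)

# Resetting the state AND recomputing the cache restores agreement.
fixed = ToyModel(y0)
fixed.update_cache()
run(fixed, 3)
fixed.y = y0
fixed.update_cache()
y_fixed = run(fixed, 10)

print(y_ref == y_stale, y_ref == y_fixed)  # False True
```

In the real models the cache holds many interdependent fields rather than one number, which is why the order of cache updates matters; the toy only shows the basic failure mode.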

We need to come up with a solution to correctly set initial conditions of component models in coupled simulations, and use this to perform restarts that are verified consistent with non-restarted runs.

Cost/Benefits/Risks

Costs: Understanding the cache interdependencies will require a significant investment of time and effort.
Benefits: The ability to run calibration experiments on Derecho; restarting may facilitate debugging coupled simulations that fail at a specific time; a better understanding of the model caches and how they interact will be valuable going forward.
Risks: The solution is not yet clear, so we may try multiple approaches before finding one that works, which may take some time.

People and Personnel

Components

Inputs

We want to be able to restart coupled simulations starting from the state and cache of each component model, as well as the stored coupler exchange fields (which can be thought of as the coupler cache).
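One way to picture what a coupled checkpoint must contain is the sketch below. It assumes a hypothetical layout (one state and one cache per component, plus the coupler's stored exchange fields) and uses Python's `pickle` as a stand-in for whatever file format the models actually use. The `deepcopy` calls make the checkpoint a snapshot that later time stepping cannot mutate.

```python
import copy
import io
import pickle


def make_checkpoint(t, components, exchange_fields):
    """Collect everything a restart needs (hypothetical layout).

    `components` maps a component name to a dict with "state" and "cache";
    `exchange_fields` holds the coupler's stored fields (the coupler "cache").
    """
    return {
        "t": t,
        "components": {
            name: {"state": copy.deepcopy(c["state"]), "cache": copy.deepcopy(c["cache"])}
            for name, c in components.items()
        },
        "exchange_fields": copy.deepcopy(exchange_fields),
    }


def write_checkpoint(checkpoint, f):
    pickle.dump(checkpoint, f)


def read_checkpoint(f):
    return pickle.load(f)


# Usage with made-up values: round-trip through an in-memory file.
components = {
    "atmos": {"state": {"T": 280.0}, "cache": {"T_sfc_air": 280.0}},
    "land": {"state": {"T_soil": 290.0}, "cache": {"T_sfc": 290.0}},
}
exchange_fields = {"F_turb_energy": 3.0}

buf = io.BytesIO()
write_checkpoint(make_checkpoint(3600.0, components, exchange_fields), buf)
buf.seek(0)
restored = read_checkpoint(buf)
print(restored["components"]["land"]["cache"])  # {'T_sfc': 290.0}
```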

Results and Deliverables

  • Run a non-restarted simulation for time n; run a restarted simulation for time n broken into time segments; verify results are analytically equivalent
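The deliverable above can be sketched on a toy coupled system. In this hypothetical example (not ClimaCoupler code; `init`, `step`, `restart`, and the bulk flux are illustrative assumptions), two temperatures exchange heat through a flux the coupler stores between steps. A run split into segments, each ending in a checkpoint write and read, is compared against an uninterrupted run. The check uses exact (bitwise) equality, the strictest form of the comparison; a tolerance could be substituted. Dropping the stored exchange field on restart breaks the agreement.

```python
import pickle

DT = 0.1


def init(T_atmos=280.0, T_land=290.0, c=0.3):
    sim = {"step": 0, "c": c, "state": {"atmos": T_atmos, "land": T_land}}
    update_exchange(sim)
    return sim


def update_exchange(sim):
    # Coupler exchange field (hypothetical bulk flux) from both component states.
    s = sim["state"]
    sim["flux"] = sim["c"] * (s["land"] - s["atmos"])


def step(sim):
    # Each component reads the stored exchange field, then the coupler refreshes it.
    s = sim["state"]
    s["atmos"] += DT * sim["flux"]
    s["land"] -= DT * sim["flux"]
    update_exchange(sim)
    sim["step"] += 1


def run(sim, nsteps):
    for _ in range(nsteps):
        step(sim)
    return sim


def restart(sim, keep_exchange=True):
    # Write and read a checkpoint; optionally "forget" the exchange field.
    restored = pickle.loads(pickle.dumps(sim))
    if not keep_exchange:
        restored["flux"] = 0.0
    return restored


def run_segmented(segments, keep_exchange=True):
    sim = init()
    for nsteps in segments:
        sim = restart(run(sim, nsteps), keep_exchange)
    return sim


reference = run(init(), 100)
good = run_segmented([30, 30, 40])
bad = run_segmented([30, 30, 40], keep_exchange=False)

print(good == reference, bad == reference)  # True False
```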

Proposed Change to ClimaCoupler Initial Component Model Exchange

Using the coupled atmos/land case as an example:

  • Begin with the atmos and land states set from initial conditions (from a restart file, analytic values, etc.): $Y_0^A$, $Y_0^L$
  • From these, compute a cache for each model: $p_0^{A'} = f_A(Y_0^A)$, $p_0^{L'} = f_L(Y_0^L)$. Note that the caches at this stage will be inconsistent and will need to be updated later on. However, all variables required to compute turbulent fluxes must be computed correctly in this step.
  • Compute turbulent fluxes from $p_0^{A'}$ and $p_0^{L'}$
  • Update the caches by recalculating all terms that depend on the other model's cache: $p_0^A = g_A(p_0^{A'}, p_0^{L'})$, $p_0^L = g_L(p_0^{L'}, p_0^{A'})$
  • Perform callbacks (including radiation)

At this point, the atmosphere and land caches should be consistent with each other and with the model states.
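The proposed sequence can be sketched as follows. Everything here is hypothetical: the field names, the bulk flux formula, the 0.8 wetness factor, and the radiation callback are illustrative, and the real implementation would be in Julia inside ClimaCoupler. `f_atmos`/`f_land` play the roles of $f_A$/$f_L$ and `g_atmos`/`g_land` the roles of $g_A$/$g_L$. The fluxes are computed once from the partial caches and passed to both update functions for reuse; they are still functions of $p_0^{A'}$ and $p_0^{L'}$ only, as in the list above.

```python
from dataclasses import dataclass, field

CH = 0.01  # hypothetical bulk transfer coefficient


@dataclass
class Component:
    state: dict
    cache: dict = field(default_factory=dict)


def f_atmos(Y):
    # p^{A'}: computed from the atmos state alone; the flux inputs must be exact.
    return {"T_air": Y["T"], "q_air": Y["q"]}


def f_land(Y):
    # p^{L'}: computed from the land state alone (0.8 is a made-up wetness factor).
    return {"T_sfc": Y["T_soil"], "q_sfc": 0.8 * Y["q_sat"]}


def turbulent_fluxes(pA, pL):
    # Hypothetical bulk formulae: fluxes proportional to surface-air differences.
    return {
        "sensible": CH * (pL["T_sfc"] - pA["T_air"]),
        "latent": CH * (pL["q_sfc"] - pA["q_air"]),
    }


def g_atmos(pA, pL, fluxes):
    # p^A = g_A(p^{A'}, p^{L'}): add terms that depend on the land side.
    return {**pA, "T_sfc": pL["T_sfc"], "fluxes": fluxes}


def g_land(pL, pA, fluxes):
    # p^L = g_L(p^{L'}, p^{A'}): add terms that depend on the atmos side.
    return {**pL, "T_air": pA["T_air"], "fluxes": fluxes}


def initial_exchange(atmos, land, callbacks=()):
    # Partial caches, each from its own model's state only.
    pA_ = f_atmos(atmos.state)
    pL_ = f_land(land.state)
    # Turbulent fluxes from the partial caches.
    fluxes = turbulent_fluxes(pA_, pL_)
    # Recompute the cross-dependent terms in each cache.
    atmos.cache = g_atmos(pA_, pL_, fluxes)
    land.cache = g_land(pL_, pA_, fluxes)
    # Callbacks (e.g. radiation) now see consistent caches.
    for callback in callbacks:
        callback(atmos, land)


def toy_radiation(atmos, land):
    # Hypothetical callback that reads the now-consistent atmos cache.
    atmos.cache["radiation_input_T_sfc"] = atmos.cache["T_sfc"]


atmos = Component(state={"T": 280.0, "q": 0.005})
land = Component(state={"T_soil": 290.0, "q_sat": 0.01})
initial_exchange(atmos, land, callbacks=[toy_radiation])
print(atmos.cache["fluxes"] == land.cache["fluxes"])  # True
```

Note that the states are only read, never modified, so no reinitialization of the states is needed after the exchange.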

SDI Revision Log

SDI opened 10 Dec 2024 by @juliasloan25

CC

@tapios @sriharshakandala @charleskawczynski @cmbengue

Scope of Work

Understanding the problem

Solving the problem
