Purpose
We need reliable, consistent restarting capabilities for our simulations, most urgently for calibration runs. We plan to run these on NCAR's Derecho, which has a 12-hour wall time limit for jobs. The runs we need for calibration exceed this time limit, so they need to be run in shorter time segments and restarted.
We are currently not able to perform consistent restarts of coupled simulations. That is, a simulation that is run without restarting will produce different results than one that is run with the same setup but with restarting. We need to fix this to be able to have reliable runs for coupled model calibration, as mentioned above.
The restart inconsistency for coupled simulations comes from component model initialization and setting of initial conditions. Currently, ClimaCoupler initializes each component independently, then performs an "initial component model exchange" (here), with the intention of exchanging between components the information needed to compute each model's cache. However, it is not guaranteed that this exchange does what it is meant to because we don't know for sure that the order of cache variable updates is correct. The reinitialization step introduces another inconsistency because it resets states to be at the start time, but not caches.
We need to come up with a solution to correctly set initial conditions of component models in coupled simulations, and use this to perform restarts that are verified consistent with non-restarted runs.
Cost/Benefits/Risks
Costs: Understanding cache interdependencies will require a significant investment of time and effort
Benefits: Ability to run calibration experiments on Derecho; restarting may facilitate debugging coupled simulations that fail at a specific time; a better understanding of the model caches and how they interact will be valuable going forward
Risks: The solution to this isn't clear, so we may try multiple approaches before finding one that works, which may take some time
Simplify the initial component model exchange to perform only the required operations (e.g. no extraneous step! calls)
Inputs
We want to be able to restart coupled simulations starting from the state and cache of each component model, as well as the stored coupler exchange fields (which can be thought of as the coupler cache).
Results and Deliverables
Run a non-restarted simulation for time n; run a restarted simulation for time n broken into time segments; verify results are analytically equivalent
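The check described above can be illustrated with a toy model. This is a minimal sketch, not ClimaCoupler code: the `Checkpoint` class, the `advance` function, and the scalar state and cache are all hypothetical stand-ins. It shows that a restart that restores both state and cache reproduces the uninterrupted run exactly, while a restart that restores the state but resets the cache does not.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Checkpoint:
    """Toy restart data: a scalar state, a scalar cache, and a step counter."""
    state: float  # stand-in for a component model state Y
    cache: float  # stand-in for a cached flux p carried between steps
    step: int


def advance(cp, n_steps, dt=0.1):
    """Advance the toy model; each step consumes the flux cached by the previous step."""
    state, cache = cp.state, cp.cache
    for _ in range(n_steps):
        state = state + dt * cache  # update the state using the cached flux
        cache = -0.5 * state        # refresh the cache from the new state
    return Checkpoint(state, cache, cp.step + n_steps)


initial = Checkpoint(state=1.0, cache=0.0, step=0)

# Reference: one uninterrupted run of 10 steps.
continuous = advance(initial, 10)

# Restarted run: 4 steps, then 6 more from the saved state and cache.
midpoint = advance(initial, 4)
restarted = advance(midpoint, 6)

# Inconsistent restart: the state is restored but the cache is reset.
stale = advance(replace(midpoint, cache=0.0), 6)

assert restarted == continuous          # identical state, cache, and step count
assert stale.state != continuous.state  # resetting the cache changes the result
```

The comparison uses exact equality because the restarted run performs the same floating-point operations in the same order as the uninterrupted one. Whether the real coupled model can meet that standard, or needs a tolerance, is a separate question this sketch does not answer.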
Proposed Change to ClimaCoupler Initial Component Model Exchange
Using the coupled atmos/land case as an example
Begin with atmos and land states set from ICs (from restart file, analytic values, etc): Y0^A, Y0^L
From these, compute a cache for each model. Note that the cache at this stage will be inconsistent and need to be updated later on. However, all variables required to compute turbulent fluxes must be computed correctly in this step: p0^A' = fA(Y0^A), p0^L' = fL(Y0^L)
Compute turbulent fluxes from p0^A' and p0^L'
Update the caches by re-calculating all terms that depend on the other's cache: p0^A = gA(p0^A', p0^L'), p0^L = gL(p0^L', p0^A')
Perform callbacks (including radiation)
At this point, the atmosphere and land caches should be consistent with each other and with the model states.
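The sequence above can be sketched as follows. This is a toy illustration under stated assumptions, not the ClimaCoupler implementation: the function names, the dictionary-based caches, the bulk flux formula, and the radiation callback are all hypothetical. In particular, the sketch passes the computed flux explicitly into the update functions, which the proposal does not specify.

```python
def f_atmos(Y_A):
    """Provisional atmosphere cache from the atmosphere state alone (hypothetical fA)."""
    return {"T_air": Y_A["T"], "sfc_flux": None}


def f_land(Y_L):
    """Provisional land cache from the land state alone (hypothetical fL)."""
    return {"T_sfc": Y_L["T"], "sfc_flux": None}


def turbulent_flux(p_A, p_L, coeff=10.0):
    """Toy bulk formula: flux proportional to the surface-air temperature difference."""
    return coeff * (p_L["T_sfc"] - p_A["T_air"])


def g_atmos(p_A, p_L, flux):
    """Update atmosphere cache terms that depend on the land cache (hypothetical gA)."""
    return {**p_A, "sfc_flux": flux}


def g_land(p_L, p_A, flux):
    """Update land cache terms that depend on the atmosphere cache (hypothetical gL)."""
    return {**p_L, "sfc_flux": -flux}


def initialize_caches(Y_A, Y_L, callbacks=()):
    p_A_prov, p_L_prov = f_atmos(Y_A), f_land(Y_L)  # provisional caches from states
    flux = turbulent_flux(p_A_prov, p_L_prov)       # fluxes from provisional caches
    p_A = g_atmos(p_A_prov, p_L_prov, flux)         # update cross-dependent terms
    p_L = g_land(p_L_prov, p_A_prov, flux)
    for callback in callbacks:                      # callbacks, including radiation
        callback(p_A, p_L)
    return p_A, p_L


def toy_radiation(p_A, p_L):
    """Placeholder radiation callback that writes a fixed value into the atmosphere cache."""
    p_A["sw_down"] = 200.0


p_A, p_L = initialize_caches({"T": 280.0}, {"T": 285.0}, callbacks=(toy_radiation,))
```

Both update functions read only the provisional caches, so the result does not depend on which component is updated first. That property is what the proposed ordering is meant to guarantee.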
The Climate Modeling Alliance
Software Design Issue 📜
People and Personnel
Components
SDI Revision Log
SDI opened 10 Dec 2024 by @juliasloan25
CC
@tapios @sriharshakandala @charleskawczynski @cmbengue
Scope of Work
Understanding the problem
Solving the problem