
# Learning the update rather than the solution

## Base idea

Consider the generic non-linear autonomous ODE system:

$\displaystyle \frac{dU}{dt} = f(U)$

where $U$ is a vector representing an $n$-dimensional problem state (i.e., the solution). A solution-predicting ML model is built to take an input solution $U_0 = U(t_0)$ and give as output the solution $U_1$ such that

$U_1 \simeq U(t_0 + \Delta t)$

For the training, the dataset is then built from enough pairs $[U(t_0), U(t_0 + \Delta t)]$ to be representative of the general solutions of the problem. The time-step $\Delta t$ of the ML model is fixed, and determines how the dataset is built from data accumulated over many simulations.
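For illustration, a minimal sketch of such a dataset construction could look like this (names such as `trajectories` and `dt_sim` are hypothetical, not taken from the actual codebase):

```python
import numpy as np

def build_solution_dataset(trajectories, dt_sim, dt_model):
    """Collect [U(t0), U(t0 + dt_model)] pairs from stored simulation data.

    trajectories : array of shape (n_runs, n_steps, ...), sampled every dt_sim.
    dt_model     : fixed model time-step, assumed to be a multiple of dt_sim.
    """
    stride = int(round(dt_model / dt_sim))  # stored steps per model step
    inputs, outputs = [], []
    for traj in trajectories:
        for i in range(len(traj) - stride):
            inputs.append(traj[i])            # U(t0)
            outputs.append(traj[i + stride])  # U(t0 + dt_model)
    return np.array(inputs), np.array(outputs)
```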

💬 Problem: the smaller the time-step, the more similar $U(t_0)$ and $U(t_0 + \Delta t)$ will be. Hence, for a small time-step (required by SDC to be stable), the ML model has to approximate something close to an identity operator, which is known to be a problem for ML models ...

Hence, we consider an update-predicting ML model with scaling factor $\alpha$, built to take an input solution $U_0 = U(t_0)$ and give as output the update $\Delta_U$ such that

$\Delta_U \simeq \alpha\left(U(t_0 + \Delta t) - U_0\right).$

That way, the next solution can be computed as:

$U_1 = U_0 + \alpha^{-1} \Delta_U.$

A natural choice for the scaling is $\alpha=\Delta t^{-1}$, which can be related to the Taylor expansion of $U$ around $t_0$:

$U_1 = U_0 + \Delta t \left[ f(U_0) + \text{high order terms} \right],$

so $\Delta_U \simeq f(U_0) + \text{high order terms}$. In this case, we build the training dataset from many representative pairs $[U(t_0), \Delta_U]$, considering a fixed time-step $\Delta t$.
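Reusing the hypothetical helper sketched above, the only change for the update-predicting dataset is a rescaling of the targets (again just a sketch, with $\alpha = \Delta t^{-1}$):

```python
def build_update_dataset(trajectories, dt_sim, dt_model):
    """Same simulation data, but targets are the scaled updates."""
    U0, U1 = build_solution_dataset(trajectories, dt_sim, dt_model)
    dU = (U1 - U0) / dt_model  # alpha = 1/dt scaling of the update
    return U0, dU
```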

💬 Note that going from a solution-predicting to an update-predicting ML model does not take a lot of work: a different dataset is built to avoid unnecessary computations during training, but the simulation data are the same and the ML architecture does not change. It is only when using the model for inference that the linear transformation has to be applied, after model evaluation, to retrieve the solution from the update.
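A possible inference wrapper (again hypothetical; `model` stands for any trained update-predicting network) simply inverts the scaling:

```python
def predict_next_solution(model, u0, dt_model):
    """Recover U1 = U0 + dt * Delta_U from the scaled update prediction."""
    dU = model(u0)             # model outputs Delta_U ~ alpha * (U1 - U0)
    return u0 + dt_model * dU  # invert the alpha = 1/dt scaling
```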

## Illustration through numerical experiments

We consider a base design for the FNO model, using 2 Fourier layers with 12 modes in both mesh directions, and an up-scaling layer transforming the input field with 4 variables ($u_x, u_z, b, p$) into 16 variables before the Fourier layers (plus, of course, the complementary down-scaling layer before returning the output). A standard averaged L2 norm is used for the loss.
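As an indication only, a comparable configuration could be instantiated with the `neuraloperator` package (this is an assumption about the tooling; the model used here may be implemented differently):

```python
from neuralop.models import FNO

# 2 Fourier layers, 12 modes per mesh direction, and a channel
# up-scaling from the 4 input variables (u_x, u_z, b, p) to 16.
model = FNO(
    n_modes=(12, 12),    # Fourier modes kept in each mesh direction
    n_layers=2,          # number of Fourier layers
    in_channels=4,       # u_x, u_z, b, p
    out_channels=4,      # same variables predicted as output
    hidden_channels=16,  # up-scaled channel width
)
```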

We select 4 time-steps $\Delta t \in \{1, 0.1, 0.01, 0.001\}$ and first look at the training and validation loss during training, considering solution-predicting models (decreasing time-step size from left to right):

*(figure: training and validation losses of the solution-predicting models, one panel per time-step size)*

The dashed lines represent the identity loss, that is the loss that would be obtained if, instead of the ML output, we simply used the initial solution given as input, hence applying an identity operator instead of the model. For large time-step sizes, the model quickly becomes better than the identity, but when the time-step decreases, the training clearly struggles to bring the model loss below the identity loss. In particular, for the smallest time-step sizes the identity loss is already very small, and the training cannot get the model loss below it.
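The dashed baseline itself is cheap to compute: it is just the training loss evaluated with the input playing the role of the prediction (a sketch assuming PyTorch tensors and some loss function `loss_fn`):

```python
import torch

def identity_baseline(loss_fn, u0, u1):
    """Loss of the identity operator: the 'prediction' is the input U0 itself.

    A solution-predicting model is only useful once its validation
    loss drops below this value.
    """
    with torch.no_grad():
        return loss_fn(u0, u1)
```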

Now looking at the training of the update-predicting models, we observe something different for the loss (decreasing time-step size from left to right):

*(figure: training and validation losses of the update-predicting models, one panel per time-step size)*

Here the identity loss is always one, since the identity update is always zero, independently of the time-step size. And for all time-step sizes, the trainings achieve a model loss below the identity loss. This is good, as it satisfies one of the main requirements for the trained model: being better than a simple copy of the initial input.

Also, the results are better for the update-predicting model for all time-steps, as we can observe in the following table summarizing the averaged L2 prediction errors on simulation data not used in the training dataset:

| time-step | 1 | 0.1 | 0.01 | 0.001 |
|---|---|---|---|---|
| solution-predicting | 5.3e-02 | 9.5e-03 | 1.9e-03 | 2.1e-03 |
| update-predicting | 3.6e-02 | 5.0e-03 | 5.0e-04 | 4.5e-05 |

💬 Note that for $\Delta t = 1$, the training was stopped a bit earlier for technical reasons, but could have been continued to improve the quality of the model; in particular, the prediction errors for $\Delta t = 1$ could go way lower than those indicated in the table. The results are given mostly to illustrate the main tendency.
