
Inquiry About "Aurora" Model Details #47

Open

qqydss opened this issue Oct 24, 2024 · 6 comments

qqydss commented Oct 24, 2024

Introduction:
I have been following the work on Microsoft's weather model Aurora and have carefully read through the paper and code. I am writing to seek clarification on some details of the experimental setup and model architecture. I would greatly appreciate your insights on the following questions:

1. When using dataset configuration C4 for pretraining, if the inputs come from different data sources, must the corresponding predicted future ground truth always come from the ERA5 dataset? In other words, could there be inputs with the same time label but slightly different content that correspond to the same ground truth? If so, could this be considered a form of data augmentation, similar to distorting images in CV classification?

2. In the "Comparison with AI models at 0.25° resolution" section, Figure 4 shows token_num on the x-axis. Could you please explain how this number is calculated?

3. For the dataset labeled C3, which has only 3 pressure levels in its ensemble data: when a batch retrieves ensemble data, does the corresponding predicted future ground truth also have only 3 levels? If so, does the model use the same weights for the latent-level query and the atmospheric keys & values (as shown in Figure 6 of the article) as when the input data has 13 pressure levels?

4. In Figure 4b, is the input for Aurora the "HRES analysis" from HRES-T0 in 2022, and is the ground truth ERA5?

5. In the fine-tuning settings of Aurora 0.1°, is the ground truth ERA5?

6. In Figure 3b, is the input for Aurora the "HRES analysis" from HRES-T0? As I understand it, HRES starts every 12 hours, so there are only two zero-lead-time fields per day (00/12). Is the evaluation in Figure 3b conducted every 12 hours?

7. In supplement B.7, formula (9), is x the raw data or the normalized data? Additionally, I plotted x_transformed against x and found that they are not related by a monotonic bijection, so multiple values of x can map to the same x_transformed, causing information loss. Has the impact of this on model performance been considered?

[Image: plot of x_transformed against x]

8. Could you please elaborate on the process of "embedding dependent on the pressure level" in supplement B.7? For example, how does the tensor shape change? Is this operation only for the pollution variables, or also for U, V, T, Q, Z? Are the embeddings for U, V, T, Q, Z initialized with the weights of a 12-hour pretrained model, while the pollution variables are initialized from scratch?

9. In D.3 (CAMS 0.4° Analysis), how are the learning rates for the backbone and the Perceiver decoder set?

10. In B.7, "Additional static variables" introduces two constant masks for the timestamp. However, in the code, both the encoder and swin3d_backbone (AdaptiveLayerNorm) already use a Fourier encoding of the timestamp. Why reintroduce a timestamp mask in the input for pollution forecasting?

11. In model/film.py, AdaptiveLayerNorm initializes self.ln_modulation's weights and bias to 0, meaning that shift and scale are 0 at the start of training, which makes the backbone almost equivalent to an identity mapping at the beginning. What is the rationale or empirical support behind this initialization?

12. In the pollution forecasting experiments, what benefit comes from concatenating the static variables (z, slt, lsm) with the atmospheric variables rather than with the surface variables? Is it a performance improvement or computational efficiency?

13. In the fine-tuning of Aurora 0.1°, when the patch size is increased from 4 to 10: is my understanding correct that 10×10 patches are interpolated into 4×4 patches before entering the embedding module, and then, in the Perceiver-decoder stage, interpolated back from 4×4 to 10×10 before being unpatchified into the forecast field? If my understanding is incorrect, could you describe the correct procedure?

14. In Table 4, the HRES-0.1 and HRES-0.25 datasets cover almost the same time span and contain exactly the same variables. Why does HRES-0.1 have far fewer "Num frames" than HRES-0.25?

Thank you very much for your time and consideration. I am eager to learn from your insights!

wesselb (Contributor) commented Oct 29, 2024

Hey @qqydss! Thank you for your very thorough questions. Just a quick message to let you know that we've seen this. :) We will get back to you shortly!

qqydss (Author) commented Oct 30, 2024

Great to hear that you've received my questions and will get back to me soon. Looking forward to your response. Thanks!

wesselb (Contributor) commented Dec 3, 2024

Hey @qqydss! Apologies for the delay in getting back to you. We made a big push to get a new version of the paper out on arXiv, which we're pretty thrilled about. Let me answer your questions in order.

  1. In dataset configuration C4, for the same timestamp, the model indeed sees different inputs and targets from different sources. For example, for one batch, the model takes in ERA5 and predicts ERA5; and for another batch the model takes in HRES forecasts and predicts HRES forecasts.
  2. The encoder converts the batch to a token-based representation in the following way: the model first performs a patch encoding of size 4x4 and then aggregates the real pressure levels into 3 "latent" pressure levels. For example, for a batch at 0.25° resolution, this results in 720 / 4 * 1440 / 4 * (3 + 1) ≈ 260k tokens per batch (see the short sketch after this list). The additional 1 comes from the surface-level variables.
  3. Yes, if the dataset has only three pressure levels, the model also predicts only three pressure levels. There should not be any pressure-level-specific weights. Instead, taking in and predicting pressure levels is done with a "positional encoding" which encodes the hPa of the pressure level, which allows the model to take in and predict any number of pressure levels.
  4. In all cases, the source for the input and target are the same. Hence, in the old Figure 4b, the input and output are both ERA5.
  5. When fine-tuning Aurora to 0.1 degrees resolution, the input and target are both IFS analysis 0.1 degree.
  6. In the old Figure 3b, the input and target are both IFS analysis 0.1 degrees. The high-quality IFS analysis is indeed available only every 12 hours, but an analysis product with a slightly smaller assimilation window should be available at hours 06 and 18.
  7. Good catch! The denominator should be -log(1e-4) instead of log(1e-4). That should make the function monotonic. We'll fix this in the writing. Thank you. :)
  8. The embedding uses a different set of parameters per pressure level. This makes the patch embedding a little more expressive. This is done for both the old meteorological variables and the new air pollution variables.
  9. In the description, "the rest of the network" includes both the backbone and the decoder, so the learning rate should be 1e-4.
  10. You're right that this is not strictly necessary. We added this to help the model a little, since the air pollution variables show much stronger diurnal behaviour than the meteorological variables seen during pretraining.
  11. Initialising the model to the identity mapping means that the model predicts its inputs at initialisation, which is usually called the persistence prediction and should be a pretty good starting point. Generally, residual models like this tend to be more stable and a little easier to train. (There is a minimal sketch of this zero initialisation after this list.)
  12. Concatenating the static variables with the atmospheric variables as well means that the static variables can also influence the atmospheric token embeddings. While this is not strictly necessary, we made this change to deal with the increased difficulty of predicting atmospheric chemistry.
  13. Both the patches in the encoder and decoder are changed to shape 10x10, and these new embeddings are initialised by interpolating the original 4x4 patches learned during pretraining to 10x10 (see the interpolation sketch after this list).
  14. Also a very good catch! There was an error in this table. Please see the revised version of the arXiv paper, where the error should be fixed.
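
For reference, here is the arithmetic in answer 2 as a minimal sketch (the grid size, patch size, and level counts come from the answer; the helper name is hypothetical):

```python
def tokens_per_batch(n_lat=720, n_lon=1440, patch=4, latent_levels=3):
    """Hypothetical helper: token count after 4x4 patching, with the real
    pressure levels aggregated into 3 latent levels plus 1 surface level."""
    return (n_lat // patch) * (n_lon // patch) * (latent_levels + 1)

print(tokens_per_batch())  # 259200, i.e. roughly 260k tokens per batch
```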
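
And a minimal sketch of the zero initialisation discussed in answer 11, written from scratch here (the class and attribute names mirror the question's description of model/film.py, not the actual repository code):

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """With the modulation initialised to zero, shift = scale = 0, so this
    module reduces to a plain LayerNorm at the start of training; inside a
    residual block, the backbone then starts out close to the identity."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.ln_modulation = nn.Linear(cond_dim, 2 * dim)
        nn.init.zeros_(self.ln_modulation.weight)  # the zero init in question
        nn.init.zeros_(self.ln_modulation.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond must broadcast against x, e.g. x: (b, n, dim), cond: (b, 1, cond_dim).
        shift, scale = self.ln_modulation(cond).chunk(2, dim=-1)
        return self.ln(x) * (1 + scale) + shift
```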
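
Finally, a sketch of the patch-size change in answer 13, under the assumption that the patch embedding is stored as a conv-style weight (shapes and the bicubic mode are illustrative, not taken from the Aurora code):

```python
import torch
import torch.nn.functional as F

def resize_patch_embedding(weight: torch.Tensor, new_patch: int = 10) -> torch.Tensor:
    """Interpolate an (embed_dim, in_chans, 4, 4) patch-embedding weight to
    (embed_dim, in_chans, 10, 10), as is commonly done when changing a
    ViT's patch size."""
    return F.interpolate(weight, size=(new_patch, new_patch),
                         mode="bicubic", align_corners=False)

w4 = torch.randn(512, 3, 4, 4)    # pretrained 4x4 kernels (illustrative shapes)
w10 = resize_patch_embedding(w4)  # -> (512, 3, 10, 10)
```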

I hope this answers all your questions. Please let me know if anything remains unclear. :)

qqydss (Author) commented Feb 27, 2025

Further Questions on the Aurora Model

Dear Aurora Team,
A few months ago, I reached out with several questions about the Aurora model, and I was incredibly impressed by your patient and detailed responses, as well as the subsequent updates to the paper.

After carefully reviewing the latest version of the paper, I have a new set of questions that have arisen from my continued study of the model. These questions pertain to various aspects of the model's implementation, training strategies, and experimental design. I believe discussing these points will not only enhance my understanding of the Aurora model but also benefit other researchers who may have similar inquiries. Below is a list of the questions I have compiled:

1. Clarification on Token Count
In our previous discussion, you mentioned that the token count is approximately 260,000, which is far from the million level. However, in the latest version of the paper, Figures G4(c) and G7 indicate a token count at the T level. Does this T stand for trillion? If so, this represents a significant difference from the previously mentioned 260,000. I may be misunderstanding something here and would appreciate your clarification on this matter.

2. Integration of Bathymetry and Geopotential
In Section B8 of the Supplementary Materials, it is mentioned that bathymetry is used in wave forecasting. I am curious to know if this implies that bathymetry is being integrated with the geopotential at the surface (Z) to form a unified elevation map that includes both seabed and land. Could you please elaborate on this?

3. Layer-wise Learning Rate Strategy
In Section D.3 (Short lead-time fine-tuning) of the Supplementary Materials, the learning rates for air pollutants and wave forecasting are set in a very intricate manner, employing a Layer-wise Learning Rate strategy. Could you explain how these different learning rate configurations were determined? Was extensive testing required to arrive at these settings? Additionally, I am curious about whether such a complex training strategy might impact the downstream task transfer potential of Aurora.

4. Clipping Predictions and Normalization
In Section B7, it is mentioned that "clip the predictions only for SO2 at 850 hPa and above at 1 before unnormalisation." I am concerned about whether this discontinuous operation of clipping might adversely affect the backpropagation process. Furthermore, when SO2 is used as an input, is it also clipped to 1 after normalization but before embedding?

5. Initialization of Pressure-level-dependent Patches
Section B7 states, "The pressure-level-dependent patches are initialised with existing pressure-level-independent patches wherever possible and otherwise with uniform random values on the encoder side and with zeros on the decoder side." During the first step of fine-tuning, are the embedding weights for meteorological variables interpolated from the pre-trained model, while air pollution variables are initialized from scratch? Additionally, why is there an inconsistency in the initialization strategies between the encoder and decoder sides? Was this determined through experimentation, or is there another rationale behind this choice?

6. Pre-trained Model for Air Pollution
Is the pre-trained model for air pollution also based on the C4 dataset, simply interpolated to a 0.4° resolution?

7. Loss Function for Density Channel
In Section D1, the loss for the density channel is calculated using MAE. However, density appears to be a binary logistic probability (after a sigmoid). Why was MAE chosen over cross-entropy, which is commonly used in classification tasks?

8. Additional Layer Normalization
Section B8 mentions keys and queries. Does this imply that latent learnable query vectors are used for channel compression? If so, why not use a simple residual MLP for surface-level embedding as in pre-training?

9. Latitude Weighting in Training vs. Evaluation
Why is latitude weighting not used during training but is applied during evaluation?

10. Replay Buffer Refresh Strategy
Figure D2 states, "Every K steps, the replay buffer is refreshed with a new training sample from the dataset." Before refreshing, is the buffer cleared of its existing samples?

11. Sample Management in Replay Buffer
In Section D.4 (HRES 0.1° analysis), it is mentioned that the buffer holds 20 samples per GPU with a sampling period of 10. According to the caption of Figure D2, "performs a training step, and then it adds this new prediction (together with its next step target from the dataset) to the replay buffer," after 10 steps, the buffer would have 20 samples. If the refresh operation does not clear the buffer, how are samples exceeding the quota removed—on a First In, First Out (FIFO) or Last In, First Out (LIFO) basis?

12. Consistency of Inputs and Ground Truth in Figures G5 and G6
Are the inputs and ground truth the same in Figures G5 and G6, aside from the difference in years?

13. Correction in Equation B9
In the updated version of the paper, Equation B9 still shows log(1e-4) instead of -log(1e-4). Is this an oversight that needs to be corrected?

I sincerely hope these questions do not come across as overly critical. My intention is purely to gain a deeper understanding of the intricacies of the Aurora model and to contribute to the ongoing dialogue within the research community. I truly appreciate your time and effort in addressing these queries.
Thank you once again for your outstanding work and for fostering such an engaging and informative environment.
Best regards,

wesselb (Contributor) commented Mar 6, 2025

Hey @qqydss,

Thanks for the additional questions. :) Let me again go through them in order.

  1. T indeed stands for trillion. The token count here is the total number of tokens processed by the backbone over the course of training, at roughly 260k per batch. This is why the number is much larger.

  2. The bathymetry is included as a static variable without further processing. In particular, we did not manually reconcile the bathymetry with the geopotential at the surface to form a unified elevation map; we just included it as another static variable. (Note that the model may internally attempt to reconcile the bathymetry and the geopotential at the surface in some way, although you'd have to inspect the neural network to see whether this is happening.)

  3. The learning rates were chosen based on the following intuition: (a) 1e-4 trains "slowly" and 1e-3 trains "quickly" and (b) the patch embeddings in the encoder and decoder for the new variables likely need a lot of training. We played around with the learning rates a little, but eventually decided on rates and schedules that just seemed sensible and committed to that. Likely the training procedure could have been simpler, but we wanted to accurately report what we actually used.

  4. You're right that it could adversely affect the backpropagation process, but we did not observe this happening, likely because we only clip during roll-out fine-tuning, at which point we use LoRA instead of full-architecture fine-tuning. When the model is applied to its own predictions, it does indeed take the clipped values as inputs.

  5. You are right that the embedding weights for the meteorological variables are interpolated and the air pollution ones are initialised from scratch. We initialise the new encoder patch embeddings with zeros to not perturb the model too much: setting the weights for the air pollution embeddings to zero means that only meteorological variables contribute to the embedding (at initialisation), like during pretraining. On the decoder side, we believe the choice is less important, and we decided on uniform random values to encourage learning something sensible as quickly as possible.

  6. The base model for air pollution was pretrained like the standard pretrained model but with a 12-hour time step. This means that it was also pretrained at 0.25 degrees resolution, which is different from the resolution of CAMS.

  7. We chose the MAE loss mostly because we had been using it up to that point. We could have considered the cross-entropy loss too, but this seemed to work alright.

  8. Latent learnable query vectors are used to compress the given pressure levels to a fixed number of "latent" pressure levels. It is in this process that extra layer normalisation layers are added. How would you envision using a simple residual MLP instead?

  9. Latitude weighting is used during evaluation because it is the most relevant metric. We could have used latitude weighting during training too, but we found that it decreases performance around the poles (because the poles are down-weighted). In principle this is no problem (in fact, it is "optimal" for the latitude-weighted RMSE metric), but it becomes a problem for roll-outs, when the predictions are fed back into the model. Hence, although latitude weighting would better align with the evaluation metrics, it also harms roll-outs, and we concluded that an unweighted MAE makes a better trade-off overall.

  10. The buffer is not cleared of all its existing samples. Refreshing here means that (1) the one sample that has been in the buffer the longest is ejected and (2) a new sample is added. This way the size of the buffer remains constant.

  11. The buffer is refreshed in a FIFO manner (if I'm not mistaken): the sample that has been in the buffer the longest is ejected, and the new one is added in the freed-up spot. (See the sketch after this list.)

  12. No, the initial conditions and ground truths are different. G5 uses HRES T0 2022 as initial condition and ground truth, and G6 uses ERA5 as initial condition and ground truth.

  13. Unfortunately, yes. :( The edit did not make it in time. We will post a new revision not too long from now that will incorporate the fix. Thanks again for pointing this out.
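
A minimal sketch of the FIFO refresh described in answers 10 and 11, assuming a fixed-capacity buffer (all names are illustrative):

```python
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer: a refresh ejects the longest-resident sample
    and appends the new one, so the buffer size stays constant."""

    def __init__(self, capacity: int = 20):
        self.samples = deque(maxlen=capacity)  # the oldest entry drops out automatically

    def refresh(self, new_sample):
        self.samples.append(new_sample)  # when full, this evicts the oldest sample
```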

qqydss (Author) commented Mar 10, 2025

Thank you for your comprehensive and insightful response. Your answers have provided me with valuable insights into your work, and I'm now developing a local meteorological model with multi-modal inputs and multi-task outputs based on your architecture. Here are a few follow-up questions I have:

For question 1, can tokens be considered as 260,000 multiplied by the number of iteration steps? If so, from Figure G7, it appears that the 290M, 660M, and 1.3B versions of Aurora haven't yet reached the plateau of validation performance, indicating potential for further improvement. This might be due to computational budget constraints leading to early training termination, suggesting untapped model potential.

For question 8, thank you for clarifying that it's about learnable queries in atmospheric variable pressure compression. Initially, I thought it was about compressing surface and static variables in wave forecasting across variable types. Regarding your question about using a simple residual MLP instead, would it be feasible to transform tokens from shape [b, long, d] to [b, short, d] via matrix multiplication with a [short, long] matrix?
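
For concreteness, I have in mind something like the following sketch (shapes as in my question; illustrative only, not from the Aurora code):

```python
import torch

b, n_long, n_short, d = 2, 13, 3, 512
x = torch.randn(b, n_long, d)           # tokens at 13 pressure levels
W = torch.randn(n_short, n_long)        # learnable [short, long] mixing matrix
y = torch.einsum("sl,bld->bsd", W, x)   # compressed tokens, shape (b, n_short, d)
```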

During pretraining, how many steps does the model roll out? If it's not single-step, does backpropagation occur across all steps or only a subset, similar to roll-out fine-tuning?

During inference, if new observations become available online, could we train a lightweight controller to adjust the model's output in real-time? This idea draws from data assimilation and control theory concepts and is presented for discussion.

Thank you again for your guidance. I look forward to your insights on these points.
