Inquiry About "Aurora" Model Details #47
Hey @qqydss! Thank you for your very thorough questions. Just a quick message to let you know that we've seen this. :) We will get back to you shortly!
Great to hear that you've received my questions and will get back to me soon. Looking forward to your response. Thanks!
Hey @qqydss! Apologies for the delay in getting back to you. We made a big push to get a new version of the paper out on arXiv, which we're pretty thrilled about. Let me answer your questions in order.
I hope this answers all your questions. Please let me know if anything remains unclear. :)
Further Questions on the Aurora Model

Dear Aurora Team,

After carefully reviewing the latest version of the paper, I have a new set of questions arising from my continued study of the model. They concern various aspects of the model's implementation, training strategies, and experimental design. I believe discussing these points will not only deepen my own understanding of the Aurora model but also benefit other researchers with similar questions. Below is the list of questions I have compiled:

1. Clarification on Token Count
2. Integration of Bathymetry and Geopotential
3. Layer-wise Learning Rate Strategy
4. Clipping Predictions and Normalization
5. Initialization of Pressure-level-dependent Patches
6. Pre-trained Model for Air Pollution
8. Additional Layer Normalization
9. Latitude Weighting in Training vs. Evaluation
10. Replay Buffer Refresh Strategy
11. Sample Management in Replay Buffer
12. Consistency of Inputs and Ground Truth in Figures G5 and G6
13. Correction in Equation B9

I sincerely hope these questions do not come across as overly critical; my intention is purely to gain a deeper understanding of the intricacies of the Aurora model and to contribute to the ongoing dialogue within the research community. I truly appreciate your time and effort in addressing them.
Hey @qqydss, Thanks for the additional questions. :) Let me again go through them in order.
Thank you for your comprehensive and insightful response. Your answers have given me valuable insight into your work, and I am now developing a regional meteorological model with multi-modal inputs and multi-task outputs based on your architecture. A few follow-up questions:

For question 1: can the token count be read as 260,000 multiplied by the number of iteration steps? If so, Figure G7 suggests that the 290M, 660M, and 1.3B versions of Aurora have not yet reached a plateau in validation performance, indicating potential for further improvement. Training may have been stopped early under computational budget constraints, leaving some of the models' potential untapped.

For question 8: thank you for clarifying that it concerns the learnable queries used to compress atmospheric variables across pressure levels; I initially thought it was about compressing surface and static variables across variable types in wave forecasting. Regarding your question about using a simple residual MLP instead: would it be feasible to transform tokens from shape [b, long, d] to [b, short, d] via matrix multiplication with a [short, long] matrix? (A rough sketch of what I mean is appended at the end of this comment.)

During pretraining, how many steps does the model roll out? If it is not single-step, does backpropagation occur across all steps or only a subset, as in roll-out fine-tuning?

During inference, if new observations become available online, could a lightweight controller be trained to adjust the model's output in real time? This idea draws on concepts from data assimilation and control theory, and I offer it for discussion.

Thank you again for your guidance. I look forward to your insights on these points.
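P.S. To make the compression question concrete, here is a minimal sketch of the matrix-multiplication variant I have in mind (all names are mine, not from the Aurora codebase):

```python
import torch
import torch.nn as nn

class MatmulLevelCompression(nn.Module):
    """Compress [b, long, d] tokens to [b, short, d] with a single learned
    [short, long] mixing matrix, as an alternative to learnable queries."""

    def __init__(self, long: int, short: int) -> None:
        super().__init__()
        # Each output token is a learned linear combination of all inputs.
        self.mix = nn.Parameter(torch.randn(short, long) / long**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, long, d] -> [b, short, d]
        return torch.einsum("sl,bld->bsd", self.mix, x)

x = torch.randn(2, 13, 512)                    # e.g. 13 pressure levels
print(MatmulLevelCompression(13, 3)(x).shape)  # torch.Size([2, 3, 512])
```

Unlike attention with learnable queries, the mixing weights here are fixed after training and do not depend on the token contents, which is why I wonder whether it would be competitive.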
Introduction:
I have been following the work on Microsoft's weather model Aurora and have carefully read through the paper and code. I am writing to seek clarification on some details of the experimental setup and model architecture. I would greatly appreciate your insights on the following questions:
1. When using dataset configuration C4 for pretraining, if the inputs come from different data sources, is it required that the corresponding predicted future ground truth always comes from the ERA5 dataset? In other words, could there be inputs with the same time label but slightly different values that correspond to the same ground truth? If so, could this be considered a form of data augmentation, similar to distorting images in CV classification?

2. In the "Comparison with AI models at 0.25° resolution" section, Figure 4 shows token_num on the x-axis. Could you explain how this number is calculated? (My own back-of-the-envelope guess is sketched after this list.)

3. For the dataset labeled C3, whose ensemble-mode data has only 3 pressure levels, when a batch retrieves ensemble-mode data, does the corresponding predicted future ground truth also have only 3 levels? If so, does it use the same weights for the latent level query and the atmospheric keys & values (Figure 6 of the paper) as when the input data has 13 pressure levels? (See the second sketch after this list for my mental model.)

4. In Figure 4b, is the input to Aurora the "HRES Analysis" from HRES-T0 in 2022, and is the ground truth ERA5?

5. In the fine-tuning settings of Aurora-0.1°, is the ground truth ERA5?

6. In Figure 3b, is the input to Aurora the "HRES Analysis" from HRES-T0? As I understand it, HRES is initialized every 12 hours, so there are only two zero-lead-time fields per day (00/12). Is the evaluation in Figure 3b therefore conducted every 12 hours?

7. In Supplement B.7, formula (9), is x the raw data or the normalized data? Additionally, I plotted x_transformed against x and found that the relationship is not a monotonic bijection, so multiple values of x may map to the same x_transformed, causing information loss. Has the impact of this on model performance been considered?

8. Could you elaborate on the process of "embedding dependent on the pressure level" in Supplement B.7? For example, how does the tensor shape change? Is this operation applied only to the pollution variables, or also to U, V, T, Q, and Z? Are the embeddings for U, V, T, Q, and Z initialized from the weights of the 12-hour pretrained model, while those for the pollution variables are initialized from scratch?

9. In D.3 (CAMS 0.4° analysis), how are the learning rates for the backbone and the Perceiver decoder set?

10. In B.7, "Additional static variables" introduces two constant masks for the timestamp. However, in the code both the encoder and swin3d_backbone (AdaptiveLayerNorm) already use a Fourier encoding of the timestamp. Why reintroduce a timestamp mask in the input for pollution forecasting?

11. In model/film.py, AdaptiveLayerNorm initializes the weights and bias of self.ln_modulation to 0, so that shift and scale are 0 at the start of training, making the backbone almost an identity mapping initially. What is the rationale or empirical support for this initialization? (The third sketch after this list shows the pattern as I understand it.)

12. In the pollution forecasting experiments, what benefit comes from concatenating the static variables (z, slt, lsm) with the atmospheric variables rather than with the surface variables? Is it a performance improvement or computational efficiency?

13. In the fine-tuning of Aurora-0.1°, when the patch size is increased from 4 to 10, is my understanding correct that 10×10 patches are interpolated to 4×4 patches before entering the embedding module, and that in the Perceiver decoder stage these 4×4 patches are interpolated back to 10×10 before being unpatchified into the forecast field? If my understanding is incorrect, could you describe the correct procedure? (The last sketch after this list shows one alternative scheme I am aware of.)
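Re question 2: my own back-of-the-envelope guess at the per-step token count at 0.25° resolution. The 721 × 1440 grid is the standard ERA5 0.25° grid; the patch size and number of latent levels below are my assumptions, not figures from the paper.

```python
# Rough per-step token count at 0.25 deg resolution (721 x 1440 grid).
# Patch size and latent level count are my own assumptions.
h, w, patch = 721, 1440, 4
spatial_tokens = (h // patch) * (w // patch)  # 180 * 360 = 64,800 patches
latent_levels = 4                             # assumed levels after compression
tokens_per_step = spatial_tokens * latent_levels
print(f"{tokens_per_step:,}")                 # 259,200, i.e. roughly 2.6e5
```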
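Re question 3: my mental model of why the latent level query could share weights across 3-level and 13-level inputs is that the queries and projections never depend on the number of input levels. This is only my reconstruction, not the actual Aurora code.

```python
import torch
import torch.nn as nn

class LatentLevelQuery(nn.Module):
    """Cross-attention from a fixed set of latent-level queries to an
    arbitrary number of input pressure levels."""

    def __init__(self, n_latent: int, dim: int) -> None:
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latent, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, levels: torch.Tensor) -> torch.Tensor:
        # levels: [b, n_levels, d] with any n_levels -> [b, n_latent, d]
        q = self.queries.expand(levels.shape[0], -1, -1)
        out, _ = self.attn(q, levels, levels)
        return out

m = LatentLevelQuery(n_latent=4, dim=512)
print(m(torch.randn(2, 13, 512)).shape)  # torch.Size([2, 4, 512])
print(m(torch.randn(2, 3, 512)).shape)   # torch.Size([2, 4, 512])
```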
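Re question 11: a minimal sketch of the zero-initialization pattern as I read it; the class and call signatures are simplified, so treat this as my reconstruction of model/film.py rather than the real implementation.

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose shift and scale are predicted from a conditioning
    vector (e.g. lead time)."""

    def __init__(self, dim: int, cond_dim: int) -> None:
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.ln_modulation = nn.Linear(cond_dim, 2 * dim)
        # Zero init: shift = scale = 0 at step 0, so the modulation is
        # inactive at first and the layer reduces to a plain LayerNorm.
        nn.init.zeros_(self.ln_modulation.weight)
        nn.init.zeros_(self.ln_modulation.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [b, n, d]; cond: [b, cond_dim]
        shift, scale = self.ln_modulation(cond).unsqueeze(1).chunk(2, dim=-1)
        return self.ln(x) * (1 + scale) + shift
```

My guess is that this mirrors the zero-init convention used for adaptive-norm conditioning in diffusion-style transformers, where the conditioned branch starts near identity for training stability, but I would appreciate confirmation.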
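Re question 13: an alternative to interpolating the data patches is to interpolate the patch-embedding weights themselves when changing patch size, as is sometimes done when resizing ViT patch embeddings. I do not know whether Aurora does this; the sketch below (with toy shapes) is only to illustrate the alternative.

```python
import torch
import torch.nn.functional as F

def resize_patch_embed(weight: torch.Tensor, new_patch: int) -> torch.Tensor:
    """Resize conv-style patch-embedding kernels [d, c, p, p] to
    [d, c, new_patch, new_patch] with bicubic interpolation."""
    return F.interpolate(
        weight, size=(new_patch, new_patch), mode="bicubic", align_corners=False
    )

w4 = torch.randn(512, 3, 4, 4)    # pretrained kernels for 4x4 patches
w10 = resize_patch_embed(w4, 10)  # kernels usable with 10x10 patches
print(w10.shape)                  # torch.Size([512, 3, 10, 10])
```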
Thank you very much for your time and consideration. I am eager to learn from your insights!