When training Aurora, we noticed a recurring (huge) spike in loss values at roughly 1.8k steps. While debugging this, we tried training Aurora from random initialization on WeatherBench2 and observed the same spike at around the same number of steps. We replicate the learning rate schedule from the paper: 1k steps of linear warmup followed by half-cosine decay. We suspected the high learning rate, since the spike occurs shortly after warmup finishes. We also use 32 GPUs, so our setup should be very similar.
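For concreteness, here is a minimal sketch of the schedule we replicate (PyTorch; the peak LR matches our run, but `total_steps` is an illustrative placeholder, not a value from the paper):

```python
import math
import torch

def warmup_half_cosine(step, warmup_steps=1_000, total_steps=150_000):
    """Return a multiplier in [0, 1] applied to the peak learning rate."""
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup over the first 1k steps
    # half cosine: decays from 1 at the end of warmup to 0 at total_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # peak LR under test
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_half_cosine)
# scheduler.step() is called once per optimizer step during training
```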
The spike does not seem to occur if we reduce the peak learning rate from 5e-4 to 1e-4.
Do you recall any other details that helped promote stability in your training runs which you could share?