When training Aurora, we noticed a recurring (huge) spike in loss values at roughly 1.8k steps. While debugging this, we tried training Aurora from random initialization on WeatherBench2 and observed the same spike at around the same number of steps. We replicate the learning rate schedule from the paper: 1k steps of linear warmup followed by half-cosine decay. We suspected the high learning rate, since the spike occurs shortly after warmup finishes. We also use 32 GPUs, so our setup should be very similar.
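For concreteness, here is a minimal sketch of the schedule we replicate (PyTorch; the peak LR matches our run, but `total_steps` is an illustrative placeholder, not a value from the paper):

```python
import math
import torch

def warmup_half_cosine(step, warmup_steps=1_000, total_steps=150_000):
    """Return a multiplier in [0, 1] applied to the peak learning rate."""
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup over the first 1k steps
    # half cosine: decays from 1 at the end of warmup to 0 at total_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # peak LR under test
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_half_cosine)
# scheduler.step() is called once per optimizer step during training
```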
The spike does not seem to occur if we reduce the peak learning rate from 5e-4 to 1e-4.
Do you recall any other details that helped promote stability in your training runs which you could share?