Rollout Scheduling #145

Open
HCookie opened this issue Nov 14, 2024 · 6 comments
Assignees
Labels
ECMWF help wanted Extra attention is needed

Comments

@HCookie
Member

HCookie commented Nov 14, 2024

Our current rollout implementation is heavily focused on sequential epoch increments. It would be good to generalise this by providing schedulers to control rollout.

Work was done in aifs-mono to enable this (here).
I think this can be generalised to make it more widely applicable.

Features

Below is a list of features and requirements as I see them

  • Epoch Step rollout
  • Static Rollout - constant (already supported)
  • Upon hitting a threshold begin another strategy
  • Random selection of rollout between bounds
  • Dynamic selection of increments
    • e.g. At epoch 5, increment by 2, at epoch 13, increment by 3
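
To make the feature list above concrete, here is a minimal sketch of what such schedulers could look like. It is not the aifs-mono implementation. Every class and parameter name (`RolloutScheduler`, `Static`, `EpochStepped`, `RandomRange`, `DynamicIncrement`, `SwitchAt`) is hypothetical. It also assumes that "increment by 2 at epoch 5" means the per-epoch increment becomes 2 from epoch 5 onward, and that the threshold for switching strategy is an epoch number.

```python
import random
from dataclasses import dataclass


class RolloutScheduler:
    """Hypothetical interface: map an epoch index to a rollout length."""

    def rollout_at(self, epoch: int) -> int:
        raise NotImplementedError


@dataclass
class Static(RolloutScheduler):
    """Constant rollout (the case that is already supported)."""
    value: int

    def rollout_at(self, epoch: int) -> int:
        return self.value


@dataclass
class EpochStepped(RolloutScheduler):
    """Add `increment` every `every_n_epochs` epochs, capped at `maximum`."""
    start: int
    every_n_epochs: int
    increment: int
    maximum: int

    def rollout_at(self, epoch: int) -> int:
        grown = self.start + (epoch // self.every_n_epochs) * self.increment
        return min(grown, self.maximum)


@dataclass
class RandomRange(RolloutScheduler):
    """Draw a rollout uniformly from [low, high], reproducibly per epoch."""
    low: int
    high: int
    seed: int = 0

    def rollout_at(self, epoch: int) -> int:
        # Seeding from (seed + epoch) makes the draw deterministic for a given epoch.
        return random.Random(self.seed + epoch).randint(self.low, self.high)


@dataclass
class DynamicIncrement(RolloutScheduler):
    """Per-epoch increment that changes at given epochs (assumed semantics)."""
    start: int
    increments: dict  # epoch -> increment applied at every epoch from then on
    maximum: int

    def rollout_at(self, epoch: int) -> int:
        rollout, current = self.start, 0
        for e in range(1, epoch + 1):
            current = self.increments.get(e, current)
            rollout += current
        return min(rollout, self.maximum)


@dataclass
class SwitchAt(RolloutScheduler):
    """Use `before` until `threshold_epoch`, then `after` (restarted from 0)."""
    before: RolloutScheduler
    after: RolloutScheduler
    threshold_epoch: int

    def rollout_at(self, epoch: int) -> int:
        if epoch < self.threshold_epoch:
            return self.before.rollout_at(epoch)
        return self.after.rollout_at(epoch - self.threshold_epoch)


# Example: the "at epoch 5, increment by 2; at epoch 13, increment by 3" case.
dynamic = DynamicIncrement(start=1, increments={5: 2, 13: 3}, maximum=12)
print([dynamic.rollout_at(e) for e in range(16)])
```

Keeping every strategy behind one `rollout_at` method would let the training loop stay unaware of which strategy is configured.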

Improvements

Set up the config at the beginning of training with rollout increment be

Questions

  • Will the rollout only change between epochs? Could it also change within an epoch?

What other features may be needed?

@HCookie HCookie added help wanted Extra attention is needed ECMWF labels Nov 14, 2024
@HCookie HCookie self-assigned this Nov 14, 2024
@mchantry
Member

What does static mean? Constant at, say 2? This is already supported.
What does dynamic selection mean?

@HCookie
Member Author

HCookie commented Nov 15, 2024

@mchantry Updated the description

@mc4117
Member

mc4117 commented Nov 15, 2024

I like the idea of dynamic selection of increments, and I was wondering whether this could be done by steps as well as by epochs. For example, at step 1000 use rollout 2, and at step 10000 use rollout 10.
I think this would also cover the case where you want the rollout to change within an epoch, since you could define the schedule by steps instead.
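
A step-based schedule like this could be sketched as a lookup over sorted milestones. This is only an illustration: the function name `rollout_for_step`, the `milestones` mapping, and the default rollout of 1 before the first milestone are all assumptions.

```python
from bisect import bisect_right


def rollout_for_step(step: int, milestones: dict, default: int = 1) -> int:
    """Return the rollout of the latest milestone at or before `step`.

    `milestones` maps a global training step to the rollout used from that
    step onward; `default` applies before the first milestone (assumed).
    """
    steps = sorted(milestones)
    idx = bisect_right(steps, step)
    return default if idx == 0 else milestones[steps[idx - 1]]


# The example from the comment: rollout 2 from step 1000, rollout 10 from step 10000.
milestones = {1000: 2, 10000: 10}
for step in (0, 999, 1000, 9999, 10000):
    print(step, rollout_for_step(step, milestones))
```

Because the lookup only depends on the global step, the rollout can change part-way through an epoch without any special handling.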

@jakob-schloer
Collaborator

I agree with @mc4117. Some models perform better when trained for longer on 2-step rollout and for only a few iterations on longer rollouts.
I wonder, however, whether that could be solved by limiting the number of batches per epoch and providing a list of rollout lengths, e.g. [2,2,2,2,2,2,2,2,3,4,5,6,...].
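
As a rough sketch of the list-based idea, the rollout could simply be indexed by epoch. The helper name `rollout_for_epoch` is hypothetical, and holding the last value once the list runs out is an assumption, not something decided in this thread.

```python
def rollout_for_epoch(epoch: int, schedule: list) -> int:
    """Pick the rollout for `epoch` from an explicit per-epoch list.

    Epochs past the end of the list reuse the final entry (assumed behaviour).
    """
    if not schedule:
        raise ValueError("schedule must contain at least one rollout length")
    return schedule[min(epoch, len(schedule) - 1)]


# Hypothetical schedule: eight short epochs at rollout 2, then a ramp.
schedule = [2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6]
print([rollout_for_epoch(e, schedule) for e in range(14)])
```

With a small, fixed number of batches per epoch, each list entry effectively covers a fixed number of training steps, which approximates a step-based schedule.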

@anaprietonem
Contributor

I like @mc4117's suggestion of supporting rollout by steps. I think it would make things easier if, in the future, we want to automate training so that the 6-hour and the rollout stages are executed one after the other.

@HCookie
Member Author

HCookie commented Nov 18, 2024

Moving to a discussion (to try it out)
#148

5 participants