Feature request

StableAdamW is an optimizer first introduced in Stable and low-precision training for large-scale vision-language models. It is a hybrid of AdamW and AdaFactor that leads to more stable training. Most notably, it was used in the ModernBERT paper:

"StableAdamW’s learning rate clipping outperformed standard gradient clipping on downstream tasks and led to more stable training."
It would be great if this were available as an optimizer in Trainer!
Motivation
More models in the future may use StableAdamW because of its success in training ModernBERT, and having it as an option in Trainer (as optim in TrainingArguments) would be convenient.
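For concreteness, here is a minimal sketch of how this could look from the user side. Note that "stable_adamw" is just the value proposed by this feature request, not an option TrainingArguments accepts today, and the hyperparameters are arbitrary placeholders:

```python
from transformers import TrainingArguments

# Hypothetical: "stable_adamw" is the optim value proposed in this feature
# request; TrainingArguments would currently reject it as an unknown optimizer.
args = TrainingArguments(
    output_dir="stable-adamw-test",
    optim="stable_adamw",
    learning_rate=1e-3,
    weight_decay=0.01,
)
```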
Your contribution
I'm interested in contributing! The ModernBERT paper uses the implementation from optimi, which can be added as an import. I'd love to submit a PR.
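In the meantime, here is a rough sketch of what wiring optimi's StableAdamW into Trainer by hand looks like, assuming the torch-optimi package is installed and exposes StableAdamW; the requested feature would essentially fold this into the optim argument. Hyperparameters are arbitrary placeholders, not values from the paper:

```python
from optimi import StableAdamW
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Build the optimizer directly from optimi and pass it through Trainer's
# existing `optimizers=(optimizer, lr_scheduler)` argument; leaving the
# scheduler as None lets Trainer create its default one.
optimizer = StableAdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="stable-adamw-manual"),
    optimizers=(optimizer, None),
    # train_dataset / eval_dataset omitted for brevity
)
```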