This repository has been archived by the owner on Sep 7, 2023. It is now read-only.
When using a transformer, try learning-rate warm up and/or layer norm inside the residual blocks #43
Labels
Projects
https://twitter.com/sytelus/status/1411607820542218245
The text was updated successfully, but these errors were encountered: