This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

When using a transformer, try learning-rate warm-up and/or layer norm inside the residual blocks #43

Open
JackKelly opened this issue Jul 4, 2021 · 0 comments

Comments

@JackKelly
Member

https://twitter.com/sytelus/status/1411607820542218245
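A minimal sketch of the two suggestions in the issue title, assuming PyTorch (the repo's framework is not stated here): a residual block with layer norm applied *inside* the residual branches (pre-LN), and a linear learning-rate warm-up schedule. Names like `PreLNTransformerBlock` and `warmup_schedule` are illustrative, not from this repo.

```python
import torch
from torch import nn


class PreLNTransformerBlock(nn.Module):
    """Residual block with LayerNorm applied before attention / MLP (pre-LN)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The norm sits inside each residual branch, not after the addition.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


def warmup_schedule(optimizer, warmup_steps: int = 1000):
    """Linear learning-rate warm-up from ~0 to the optimizer's base LR."""
    def lr_lambda(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Either trick on its own is often enough to stabilise early transformer training; the tweet linked above argues for trying them when a plain post-LN transformer diverges or is sensitive to the initial learning rate.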

JackKelly created this issue from a note in ML research (To do) on Jul 4, 2021