
Improve your state of the art by using the best activation function and the best meta-optimizer #2

Open
LifeIsStrange opened this issue May 30, 2020 · 5 comments

Comments

@LifeIsStrange

LifeIsStrange commented May 30, 2020

You could increase GPT-3 accuracy by using Ranger, which combines state-of-the-art optimizers with gradient centralization:
https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
You seem to be using the Adam optimizer. It has been succeeded by RAdam (Rectified Adam). Ranger will bring you this improvement, plus several other synergistic ones, for free.
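For illustration, here is a minimal sketch of what the swap might look like in a PyTorch training loop, assuming Ranger is installed and importable as described in that repo's README (the model, learning rate, and loss below are placeholders, not anything from GPT-3's actual setup):

```python
import torch
import torch.nn as nn
from ranger import Ranger  # assumed import path per lessw2020/Ranger-Deep-Learning-Optimizer

model = nn.Linear(512, 512)  # placeholder model

# before: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Ranger = RAdam + Lookahead (+ gradient centralization in recent versions)
optimizer = Ranger(model.parameters(), lr=1e-4)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()  # dummy loss just to exercise one optimizer step
loss.backward()
optimizer.step()
optimizer.zero_grad()
```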

Orthogonally, you would probably benefit from Mish too, instead of the activation you currently use (ReLU?), but it should be tested after Ranger, since it could regress accuracy (even if that is unlikely):
https://github.com/digantamisra98/Mish
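As a sketch of the drop-in: Mish is defined as x * tanh(softplus(x)), so a plain PyTorch version can replace an nn.ReLU module directly (newer PyTorch releases also ship a built-in torch.nn.Mish; the block shapes here are just an example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# before: block = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = nn.Sequential(nn.Linear(512, 2048), Mish(), nn.Linear(2048, 512))
```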

@minimaxir

At the scale these models are trained at, using a specific optimizer/activation will not necessarily get you better results.

@digantamisra98

Additionally, considering GPT-3's size, I would suggest not using any optimizer heavier than SGD because of the computational cost. The same goes for Mish.

@LifeIsStrange
Author

@minimaxir It will not necessarily bring gains, but it is still low-hanging fruit that should be tried.

@LifeIsStrange
Author

LifeIsStrange commented Jun 1, 2020

@digantamisra98 RAdam (not the full Ranger package) does not increase computational cost.

I've read somewhere that Mish can be made about as efficient as ReLU, maybe with https://github.com/thomasbrandon/mish-cuda?
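If that package provides a fused CUDA kernel as its README suggests, the swap would look roughly like this. This is only a sketch: the MishCuda import path is assumed from that repo, and it requires building the CUDA extension on a GPU machine.

```python
import torch.nn as nn

# assumed import from thomasbrandon/mish-cuda (CUDA extension must be built)
from mish_cuda import MishCuda

# drop-in replacement for nn.ReLU() or a Python-level Mish, using the fused kernel
block = nn.Sequential(nn.Linear(512, 2048), MishCuda(), nn.Linear(2048, 512)).cuda()
```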

@digantamisra98

@LifeIsStrange everything above SGD is expensive.
