
CUDA 11.0 instead of 11.1 for TF 2.4 compatibility; also named model outputs for huggingface >= 3.0.0 #47

Merged · 8 commits · Jan 14, 2021

Conversation

jarednielsen (Contributor)

Change `optimizer.loss_scale()` to `optimizer.loss_scale` for TF 2.4. Fixes #44.
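For reference, a minimal sketch of the API change (the optimizer construction below is illustrative, not this repo's actual training code):

```python
import tensorflow as tf

# Wrap a Keras optimizer with dynamic loss scaling (TF 2.4 API).
opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())

# TF <= 2.3 exposed a callable LossScale object, so code read:
#   current_scale = opt.loss_scale()
# In TF 2.4, `loss_scale` is a plain property that returns the current
# scale as a tensor, so the trailing parentheses must be dropped.
current_scale = opt.loss_scale
tf.print("current loss scale:", current_scale)
```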

I ran ALBERT pretraining with XLA enabled on TF 2.4 and observed the loss drop from 11 to 7.5 over 3k steps at a global batch size of 32. I am not able to replicate #45; it is possible that fixing the named model outputs resolved that issue.
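To illustrate the named-outputs change, a hypothetical sketch (the model, tokenizer, and field names are examples for a transformers release that supports `return_dict`, not this PR's actual diff):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")
batch = tokenizer("BERT pretraining example", return_tensors="tf")

# Older code indexed the output tuple positionally, e.g. outputs[0].
# Fetching the same tensor by field name stays correct even if the
# tuple's ordering or contents change across huggingface versions.
outputs = model(batch, return_dict=True)
hidden_states = outputs.last_hidden_state
```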

jarednielsen merged commit 6cb5700 into aws-samples:master on Jan 14, 2021
Development

Successfully merging this pull request may close these issues:

BERT Pre-Training crashes with TF 2.4