Is T5 3B training properly parallelizing? #4559
Labels: donotreap (avoid automatically marking as stale)
I am trying to train a T5 model on empathetic dialogues. I am running into CUDA OOM errors when training with the command below. When training the BlenderBot 3B model, I ran into the same issue until I parallelized training across two GPUs; however, parallelizing T5 3B does not seem to resolve it. I have also reduced the batch size to 1 and the truncation length to 128 (truncating at 64 also doesn't work). Any suggestions to resolve the issue?
Command
parlai train_model -t empathetic_dialogues -m hugging_face/t5 --t5-model-arch t5-3b --t5-model-parallel True --fp16 True --optimizer adam --batchsize 1 --skip-generation True -vmt ppl -tr 64 --model-file ./chatbot_models/3B/testdebugT5/model --tstep 100
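As a sanity check on whether model parallelism is actually splitting the network across GPUs, here is a minimal sketch outside ParlAI that uses the Hugging Face `parallelize()` method on T5 and inspects which devices the parameters end up on. It uses `t5-small` purely to keep the check lightweight; the same inspection applies to `t5-3b` if it fits, and the model name and device expectations here are illustrative assumptions, not part of the original report.

```python
# Sketch: verify that a Hugging Face T5 model is split across visible GPUs.
# Assumes torch and transformers are installed and at least two CUDA devices
# are visible; t5-small is used only to keep the check cheap.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.parallelize()  # distributes encoder/decoder blocks across available GPUs

devices = {str(p.device) for p in model.parameters()}
print("Parameter devices:", devices)
# If parallelism is working, this should list multiple devices,
# e.g. {'cuda:0', 'cuda:1'}; a single device means the model was not split.
```

If the ParlAI run really is parallelizing, `nvidia-smi` during training should likewise show memory allocated on both GPUs rather than only the first.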
Error message