This repository has been archived by the owner on Oct 31, 2022. It is now read-only.

OOM on 345M with GPU #69

Open
babaraza opened this issue Feb 1, 2021 · 4 comments


babaraza commented Feb 1, 2021

Hi,
I am getting an error when trying to train the 345M model on the GPU. If I use the CPU it trains fine, albeit very slowly. I am using an Nvidia GTX 1070 and have CUDA and cuDNN installed.

The interactive_conditional_samples.py and generate_unconditional_samples.py scripts work fine on the GPU, so I know the GPU is working. I only hit the OOM when trying to train.

I tried using the "--optimizer sgd" flag with the default batch_size of 1:
python train.py --dataset data.npz --model_name 345M --optimizer sgd

Error (truncated):
2021-01-31 21:52:28.744480: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB (rounded to 4194304). Current allocation summary follows.
2021-01-31 21:52:28.744901: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 276, Chunks in use: 266. 69.0KiB allocated for chunks. 66.5KiB in use in bin. 1.0KiB client-requested in use in bin.
2021-01-31 21:52:28.745756: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-01-31 21:52:28.746117: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 3, Chunks in use: 1. 4.0KiB allocated for chunks. 1.3KiB in use in bin. 1.0KiB client-requested in use in bin.

...

2021-01-31 21:52:28.966844: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 6.64GiB
2021-01-31 21:52:28.966890: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 7135713536 memory_limit_: 7135713690 available bytes: 154 curr_region_allocation_bytes_: 8589934592
2021-01-31 21:52:28.966950: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 7135713690
InUse: 7134272000
MaxInUse: 7134274048
NumAllocs: 3078
MaxAllocSize: 268435456

2021-01-31 21:52:28.967091: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ***************************************


2021-01-31 21:52:28.967156: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1,16,1024,1024] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
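
For context, here is a rough back-of-envelope estimate of where the memory goes (a minimal sketch in Python, assuming the published GPT-2 medium hyperparameters: 24 layers, 16 heads, 1024-token context). The failing [1, 16, 1024, 1024] tensor matches one per-layer attention map at those sizes:

n_params = 345e6                    # GPT-2 medium parameter count
n_layer, n_head, n_ctx = 24, 16, 1024
fp32 = 4                            # bytes per FP32 value

weights = n_params * fp32           # ~1.29 GiB of parameters
grads   = n_params * fp32           # ~1.29 GiB, one gradient per parameter
adam    = 2 * n_params * fp32       # ~2.57 GiB more for Adam's m and v moments

# One attention-score map per layer, shape [1, n_head, n_ctx, n_ctx];
# backprop keeps all of them around until the backward pass.
attn = n_layer * n_head * n_ctx * n_ctx * fp32   # ~1.5 GiB

gib = 2 ** 30
print(f"weights + grads: {(weights + grads) / gib:.2f} GiB")
print(f"adam moments:    {adam / gib:.2f} GiB")
print(f"attn scores:     {attn / gib:.2f} GiB")

Even with --optimizer sgd dropping the Adam buffers, weights, gradients, attention maps, and the rest of the saved activations add up to roughly the 6.64 GiB the allocator reports in use, leaving an 8 GB 1070 (only ~7.1 GB of it usable) essentially no headroom.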

Any idea on how to resolve this?
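
A generic TensorFlow 1.x mitigation that sometimes helps with allocator fragmentation (though it cannot make a model fit that genuinely exceeds the card's memory) is letting the allocator grow on demand. A minimal sketch of that knob, which is plain TF1 and not necessarily a switch this repo's train.py exposes:

import tensorflow as tf  # TensorFlow 1.x API, which this codebase targets

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate VRAM incrementally, not all up front
sess = tf.Session(config=config)
# ...build the training graph and call sess.run(...) as usual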


jaimu97 commented Feb 23, 2021

On a single 1070? I don't think that's possible. I'm currently training a 10 GB dataset with the 345M model on a 3090, and it's using ~17 GB of VRAM:

[Screenshot: 345M training session showing ~17 GB of VRAM in use]

@babaraza (Author)

Thank you for your reply. I am trying to get my hands on a 3080 or 3090 for this very reason, and your screenshot and message just confirmed I need the 3090!


jaimu97 commented Feb 23, 2021

I honestly couldn't recommend getting a 3090 just for training/fine-tuning 345M GPT-2. 117M is definitely good enough for every use case (for me, anyway) if your 1070 can handle it. I only use it to train when I'm at work. ;)

@babaraza (Author)

I agree; it won’t be solely for training, I just wanted to justify it for my gaming needs ;) With that said, it’s almost impossible to find a 3080/3090 at a good price, so I’m waiting it out. I do appreciate everyone’s input. I did train the 117M model and it does seem to give good results. There are also the Google Colab notebooks everyone’s been using for training, since they give you free access to an Nvidia T4.
