
Out of Memory Error While training on large dataset #113

Open · mmaaz60 opened this issue Nov 29, 2019 · 10 comments
Labels: training (Training Related Questions)

Comments

@mmaaz60 commented Nov 29, 2019

Hi, I am training on the VOC2012 dataset with around 5,717 training and 5,824 validation examples. After a few epochs the system kills the training process. The system logs are attached. The environment details are as follows:
OS: Ubuntu 18.04
GPU: 1080 Ti
CUDA: v10.0 with cuDNN v7.3.1
GPU driver: v410.104
[Attached screenshot: out_of_memory]

@AnaRhisT94

> Hi, I am training on the VOC2012 dataset with around 5,717 training and 5,824 validation examples. After a few epochs the system kills the training process. The system logs are attached. The environment details are as follows:
> OS: Ubuntu 18.04
> GPU: 1080 Ti
> CUDA: v10.0 with cuDNN v7.3.1
> GPU driver: v410.104
> [Attached screenshot: out_of_memory]

That's strange. Are you sure the CUDA and cuDNN libraries are loaded successfully and that you are in fact using the GPU? Other than that, if you are using the GPU, you can restrict its memory usage with the following code:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 3GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3000)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
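
A related minimal sketch (assuming TensorFlow 2.x, as in the snippet above, and not specific to this repo): instead of a hard cap, you can let TensorFlow grow its GPU allocation on demand.

import tensorflow as tf

# Alternative sketch: let the GPU allocation grow as needed instead of
# reserving (nearly) all VRAM up front. This must run before any GPU op
# initializes the device.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)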

@mmaaz60 (Author) commented Dec 2, 2019

Hi @AnaRhisT94,

My PC's system RAM is also fully utilized: 15.9 GB out of 16 GB is in use after a few epochs. The bigger the dataset, the quicker it hits the out-of-memory error. What could have gone wrong? Could it be something with Ubuntu 18.04 and CUDA 10.0? Would switching back to Ubuntu 16.04 help?

Thanks

@AnaRhisT94

> Hi @AnaRhisT94,
>
> My PC's system RAM is also fully utilized: 15.9 GB out of 16 GB is in use after a few epochs. The bigger the dataset, the quicker it hits the out-of-memory error. What could have gone wrong? Could it be something with Ubuntu 18.04 and CUDA 10.0? Would switching back to Ubuntu 16.04 help?
>
> Thanks

That's not good. You need to be utilizing VRAM, not system RAM: RAM is used by the CPU, VRAM by the GPU. Use the code above to restrict your VRAM usage, and check that the CUDA and cuDNN libraries are loaded successfully so that your GPU is actually being used.
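
A minimal way to check that (assuming TensorFlow 2.x; these are standard TensorFlow calls, not code from this repo):

import tensorflow as tf

# If this prints an empty list, CUDA/cuDNN were not loaded and training
# is silently falling back to the CPU (and therefore system RAM).
print(tf.config.experimental.list_physical_devices('GPU'))

# Reports whether this TensorFlow build was compiled with CUDA support.
print(tf.test.is_built_with_cuda())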

@mmaaz60 (Author) commented Dec 3, 2019

Hi @AnaRhisT94,

Just providing more info. Please find attached screenshots of the memory usage, GPU memory usage, and terminal output. I noticed that after each epoch it refills the shuffle buffer, which may be what is consuming the system RAM.
[Attached screenshots: terminal_output, cpu_and_memory_usage, gpu_usage]

@mmaaz60 (Author) commented Dec 3, 2019

Lowering the shuffle buffer size (from 1024 to 32) works for me.
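
For context, a rough sketch of where that number lives in a tf.data input pipeline (the dataset path and variable names here are illustrative placeholders, not the exact code in train.py). The shuffle buffer keeps buffer_size decoded examples in host RAM, which is why shrinking it reduces RAM usage.

import tensorflow as tf

# Illustrative pipeline; the TFRecord path is a placeholder.
train_dataset = tf.data.TFRecordDataset('./data/voc2012_train.tfrecord')

# The shuffle buffer holds `buffer_size` examples in host RAM at all times,
# so dropping it from 1024 to 32 sharply reduces memory pressure.
train_dataset = train_dataset.shuffle(buffer_size=32)
train_dataset = train_dataset.batch(8)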

@AnaRhisT94

> Lowering the shuffle buffer size (from 1024 to 32) works for me.

Yes exactly, that makes sense. In my case I'm using 256 as the shuffle_size.

@mmaaz60 (Author) commented Dec 3, 2019

Hi @AnaRhisT94,

I can now successfully train the model and my loss goes as low as 0.2, yet I still can't detect any objects. Note that in the screenshots below the order of the fields when preparing and parsing the TFRecords is slightly different, but I don't think that matters. I also tried lowering the confidence threshold and that didn't help either. What could have gone wrong? Any ideas?
[Attached screenshots: data121, data]

zzh8829 added the training (Training Related Questions) label Dec 20, 2019
@zzh8829 (Owner) commented Dec 21, 2019

Please see this guide for transfer learning: https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md
Training from scratch without loading the Darknet weights is almost impossible.
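
For reference, the transfer-learning command in that doc looks roughly like the following (the flags are taken from the repo's VOC tutorial and should be double-checked against the linked page):

python train.py \
  --dataset ./data/voc2012_train.tfrecord \
  --val_dataset ./data/voc2012_val.tfrecord \
  --classes ./data/voc2012.names \
  --num_classes 20 \
  --mode fit --transfer darknet \
  --batch_size 16 \
  --epochs 10 \
  --weights ./checkpoints/yolov3.tf \
  --weights_num_classes 80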

@edurenye (Contributor) commented May 7, 2020

I have the same problem; I tried reducing the shuffle buffer_size to 8 without luck.
When I use the CPU it works, but when I use the GPU I get this error. Of course the CPU takes too long, so I want to use the GPU.

The exact error I get is:

tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1024,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node yolov3/yolo_darknet/conv2d_43/Conv2D (defined at train.py:188) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_37186]

@edurenye (Contributor) commented May 7, 2020

Finally I made it work: I left buffer_size as it was and set a really small batch size of 2, and then it worked.
But I think something is using too much memory; it should allow a bigger batch size. My images are 876x657, which is not that big, and my graphics card is a GeForce RTX 2060 with 6 GB of GDDR6. It should not fill 6 GB so fast.
