
Out of Memory Error While training on large dataset #113

Open · mmaaz60 opened this issue Nov 29, 2019 · 10 comments
Labels: training (Training Related Questions)

Comments

@mmaaz60 commented Nov 29, 2019

Hi, I am training on the VOC2012 dataset with around 5,717 training and 5,824 validation examples. After a few epochs the system kills the training process. The system logs are attached. The environment details are as follows:
OS: Ubuntu 18.04
GPU: 1080 Ti
CUDA: v10.0 with cuDNN v7.3.1
GPU driver: v410.104
[Attached screenshot: out_of_memory]

@AnaRhisT94

> Hi, I am training on the VOC2012 dataset with around 5,717 training and 5,824 validation examples. After a few epochs the system kills the training process. The system logs are attached. The environment details are as follows:
> OS: Ubuntu 18.04
> GPU: 1080 Ti
> CUDA: v10.0 with cuDNN v7.3.1
> GPU driver: v410.104
> [Attached screenshot: out_of_memory]

That's strange. Are you sure the CUDA and cuDNN libraries are loaded successfully and that you are in fact using the GPU? Other than that, if you are using the GPU, you can restrict its memory usage with the following code:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 3GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3000)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
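
A related minimal sketch (assuming TensorFlow 2.x, as in the snippet above, and not specific to this repo): instead of a hard cap, you can let TensorFlow grow its GPU allocation on demand.

import tensorflow as tf

# Alternative sketch: let the GPU allocation grow as needed instead of
# reserving (nearly) all VRAM up front. This must run before any GPU op
# initializes the device.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)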

@mmaaz60 (Author) commented Dec 2, 2019

Hi @AnaRhisT94,

My PC's system RAM is also fully utilized: 15.9 GB out of 16 GB is in use after a few epochs. The bigger the dataset, the quicker it hits the out-of-memory error. What could have gone wrong? Could it be something with Ubuntu 18.04 and CUDA 10.0? Would switching back to Ubuntu 16.04 help?

Thanks

@AnaRhisT94

> Hi @AnaRhisT94,
>
> My PC's system RAM is also fully utilized: 15.9 GB out of 16 GB is in use after a few epochs. The bigger the dataset, the quicker it hits the out-of-memory error. What could have gone wrong? Could it be something with Ubuntu 18.04 and CUDA 10.0? Would switching back to Ubuntu 16.04 help?
>
> Thanks

That's not good. You need to be utilizing VRAM, not system RAM: RAM is used by the CPU, VRAM by the GPU. Use the code above to restrict your VRAM usage, and check that the CUDA and cuDNN libraries are loaded successfully so that your GPU is actually being used.
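
A minimal way to check that (assuming TensorFlow 2.x; these are standard TensorFlow calls, not code from this repo):

import tensorflow as tf

# If this prints an empty list, CUDA/cuDNN were not loaded and training
# is silently falling back to the CPU (and therefore system RAM).
print(tf.config.experimental.list_physical_devices('GPU'))

# Reports whether this TensorFlow build was compiled with CUDA support.
print(tf.test.is_built_with_cuda())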

@mmaaz60 (Author) commented Dec 3, 2019

Hi @AnaRhisT94,

Just providing more info. Please find attached screenshots of the memory usage, GPU memory usage, and terminal output. I noticed that after each epoch it refills the shuffle buffer, which may be what is consuming the system RAM.
[Attached screenshots: terminal_output, cpu_and_memory_usage, gpu_usage]

@mmaaz60 (Author) commented Dec 3, 2019

Lowering the shuffle buffer size (from 1024 to 32) works for me.
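
For context, a rough sketch of where that number lives in a tf.data input pipeline (the dataset path and variable names here are illustrative placeholders, not the exact code in train.py). The shuffle buffer keeps buffer_size decoded examples in host RAM, which is why shrinking it reduces RAM usage.

import tensorflow as tf

# Illustrative pipeline; the TFRecord path is a placeholder.
train_dataset = tf.data.TFRecordDataset('./data/voc2012_train.tfrecord')

# The shuffle buffer holds `buffer_size` examples in host RAM at all times,
# so dropping it from 1024 to 32 sharply reduces memory pressure.
train_dataset = train_dataset.shuffle(buffer_size=32)
train_dataset = train_dataset.batch(8)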

@AnaRhisT94

> Lowering the shuffle buffer size (from 1024 to 32) works for me.

Yes exactly, that makes sense. In my case I'm using 256 as the shuffle_size.

@mmaaz60 (Author) commented Dec 3, 2019

Hi @AnaRhisT94,

I can now successfully train the model and my loss goes as low as 0.2, yet I still can't detect any objects. Note that in the screenshots below the order of the fields when preparing and parsing the TFRecords is slightly different, but I don't think that matters. I also tried lowering the confidence threshold and that didn't help either. What could have gone wrong? Any ideas?
[Attached screenshots: data121, data]

zzh8829 added the training (Training Related Questions) label Dec 20, 2019
@zzh8829 (Owner) commented Dec 21, 2019

Please see this guide for transfer learning: https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md
Training from scratch without loading the Darknet weights is almost impossible.
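
For reference, the transfer-learning command in that doc looks roughly like the following (the flags are taken from the repo's VOC tutorial and should be double-checked against the linked page):

python train.py \
  --dataset ./data/voc2012_train.tfrecord \
  --val_dataset ./data/voc2012_val.tfrecord \
  --classes ./data/voc2012.names \
  --num_classes 20 \
  --mode fit --transfer darknet \
  --batch_size 16 \
  --epochs 10 \
  --weights ./checkpoints/yolov3.tf \
  --weights_num_classes 80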

@edurenye (Contributor) commented May 7, 2020

I have the same problem; I tried reducing the shuffle buffer_size to 8 without luck.
When I use the CPU it works, but when I use the GPU I get this error. Of course the CPU takes too long, so I want to use the GPU.

The exact error I get is:

tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1024,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node yolov3/yolo_darknet/conv2d_43/Conv2D (defined at train.py:188) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_37186]

@edurenye (Contributor) commented May 7, 2020

Finally I made it work: I left buffer_size as it was and set a really small batch size of 2, and then it worked.
But I think something is using too much memory; it should allow a bigger batch size. My images are 876x657, which is not that big, and my graphics card is a GeForce RTX 2060 with 6 GB of GDDR6. It should not fill 6 GB so fast.
