Out of Memory Error While training on large dataset #113
Comments
Hi @AnaRhisT94, My PC's RAM is also fully utilized: 15.9 GB of 16 GB is in use after a few epochs. The bigger the dataset, the sooner the out-of-memory error occurs. What could have gone wrong? Could it be something to do with Ubuntu 18.04 and CUDA 10.0? Would switching back to Ubuntu 16.04 help? Thanks |
That's not good. Training should be filling the GPU's VRAM, not system RAM. |
Hi @AnaRhisT94, Just providing more info. Please find attached screenshots of the memory usage, GPU memory usage, and terminal output. I noticed that after each epoch it refills the shuffle buffer, which may be what is using the system RAM. |
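For readers unfamiliar with shuffle buffers, here is a minimal pure-Python sketch of the general idea: a fixed number of elements is held in memory, and each output is drawn at random from that pool. It is a simplified model, not TensorFlow's actual implementation, and `shuffle_with_buffer` is a hypothetical name used only for illustration. The point is that the pool always holds up to `buffer_size` elements, so RAM use grows with that setting.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in shuffled order, holding at most buffer_size of them at once.

    buffer_size must be at least 1.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_with_buffer(range(20), buffer_size=4))
print(shuffled)
```

With `buffer_size=1` the output order equals the input order, which is why very small buffers save memory at the cost of weaker shuffling.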
Lowering the shuffle buffer size (from 1024 to 32) works for me. |
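A rough back-of-the-envelope estimate shows why the buffer size matters. The sketch below assumes each buffered element is a decoded 416x416 RGB image stored as float32; that image size and dtype are assumptions for illustration, not values stated in this thread, and labels and other overhead are ignored. The helper name `shuffle_buffer_bytes` is hypothetical.

```python
def shuffle_buffer_bytes(buffer_size, height, width, channels=3, bytes_per_value=4):
    """Approximate RAM held by a shuffle buffer of decoded images (labels ignored)."""
    return buffer_size * height * width * channels * bytes_per_value


# Compare the two buffer sizes mentioned above under the assumed image shape.
for size in (1024, 32):
    gib = shuffle_buffer_bytes(size, 416, 416) / 2**30
    print(f"buffer_size={size}: ~{gib:.2f} GiB")
```

This only accounts for the buffer itself, so it should be read as a lower bound on the pipeline's memory use rather than an explanation of the full 16 GB.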
Yes exactly, that makes sense. In my case I'm using |
Hi @AnaRhisT94, I can now train the model successfully and my loss goes as low as 0.2, but I still can't detect any objects. Note that the TFRecord fields are in a slightly different order when preparing and when parsing them, though I don't think that matters. I also tried lowering the confidence threshold, and that didn't help either. What could have gone wrong? Any ideas? |
Please see this for transfer learning. https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md |
I have the same problem, I tried reducing the shuffle buffer_size to 8 without luck. The exact error I get is:
|
I finally got it working: I left buffer_size as it was and set a really small batch size of 2. |
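The two knobs discussed in this thread, the shuffle buffer size and the batch size, are both set on the `tf.data` input pipeline. The following is a minimal sketch using a toy in-memory dataset, not the project's actual training script; `build_pipeline` and its default values are hypothetical, chosen to mirror the numbers reported above.

```python
import tensorflow as tf


def build_pipeline(dataset, shuffle_buffer=32, batch_size=2):
    """Shuffle, batch, and prefetch a dataset with small, memory-friendly settings."""
    dataset = dataset.shuffle(buffer_size=shuffle_buffer)  # held in system RAM
    dataset = dataset.batch(batch_size)  # size of each batch fed to the model
    return dataset.prefetch(1)  # prepare one batch ahead of the training step


# Toy stand-in for real training data: 10 tiny "images" of shape 4x4x3.
toy = tf.data.Dataset.from_tensor_slices(tf.zeros([10, 4, 4, 3]))
for batch in build_pipeline(toy).take(1):
    print(batch.shape)
```

As a rule of thumb, the shuffle buffer mainly affects system RAM, while the batch size mainly affects GPU memory, so which one to lower depends on which kind of memory is running out.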
Hi, I am training on the VOC2012 dataset with around 5717 training and 5824 validation examples. After a few epochs the system kills the training process. The system logs are attached. The environment details are below:

OS: Ubuntu 18.04
GPU: 1080 Ti
CUDA: v10.0 with cuDNN v7.3.1
GPU Driver: v410.104