-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
error message while training("nan values" ) #135
Comments
Hi @miladfa7, I experienced this problem when training with 1 class too. However, I had my labelled images in separate tfrecords rather than 1 large tfrecord. This issue arises because certain data(labelled images) are "corrupted"(which causes the loss to go to NAN). In my case, since I had labeled images in separate tfrecords, certain tfrecords were "corrupted". I noticed this because some tfrecords caused nan values while others didn't. So what I would suggest is having all of your labeled images in individual tfrecords. Then create a script that would iterate through each tfrecord(in other words , run the training on 1 tfrecord at a time) to find the corrupted tfrecords. In my case, there weren't a lot of corrupted tfrecords(around 30 for 2000 images). So I would suggest first iterating over batches of tfrecords(e.g. 25 or 50 at a time) and then locate all of the batches that give an NAN loss value. After you have located all of the file batches that give NAN, iterate through all of them 1 by 1 to find the corrupted files. This worked for me, I got rid of the corrupted tfrecords, and the training went fine. I haven't figured out why certain tfrecords are "corrupted" but i just got rid of the ones that give NAN loss. I would suggest looking at #103 and #86 for more info. Also if you are training with yolo tiny can you update me on the mAP and accuracy of your algorithm. I only get around a 40 percent accuracy with 5000 images. I get a 65 percent mAP with 0.3 iou. |
can you run this script and check if your dataset is in the proper format? |
@zzh8829 @Robin2091 I training model with 1 class and Instead of sparse_categorical_crossentropy use binary_crossentropy(...) and change the num_classes FLAG on the train.py to 1 class.
The loss output was negative values when I executed this command. (my dataset include 350 drones image) What could be the reason? another question is how to improve the accuracy of the detection new image? |
first of all use |
I also encountered this problem. when the category is 1, the loss also appears Nan. later, I changed the loss function according to your method. later, the loss also appears negative. how can I solve this problem |
I execute this command
python3 train.py --dataset ./train.tfrecord --classes ./data/voc2012.names --num_classes 1 --mode fit --transfer darknet --batch_size 8 --epochs 10 --weights ./checkpoints/yolov3-tiny.tf --weights_num_classes 80 --tiny
but I get is this error message while training("nan values" ).

I only have 1 class in my training dataset.(Drone detection)
Do this error is related to convert txt to tfrecord?
I converted my dataset based on this "#41"
please help me, thanks.
The text was updated successfully, but these errors were encountered: