error message while training("nan values" ) #135

miladfa7 · 2019-12-28T19:47:23Z

I execute this command
python3 train.py --dataset ./train.tfrecord --classes ./data/voc2012.names --num_classes 1 --mode fit --transfer darknet --batch_size 8 --epochs 10 --weights ./checkpoints/yolov3-tiny.tf --weights_num_classes 80 --tiny

but I get is this error message while training("nan values" ).

I only have 1 class in my training dataset.(Drone detection)

Do this error is related to convert txt to tfrecord?
I converted my dataset based on this "#41"
please help me, thanks.

The text was updated successfully, but these errors were encountered:

Robin2091 · 2019-12-28T21:18:19Z

Hi @miladfa7,

I experienced this problem when training with 1 class too. However, I had my labelled images in separate tfrecords rather than 1 large tfrecord.

This issue arises because certain data(labelled images) are "corrupted"(which causes the loss to go to NAN). In my case, since I had labeled images in separate tfrecords, certain tfrecords were "corrupted". I noticed this because some tfrecords caused nan values while others didn't. So what I would suggest is having all of your labeled images in individual tfrecords. Then create a script that would iterate through each tfrecord(in other words , run the training on 1 tfrecord at a time) to find the corrupted tfrecords. In my case, there weren't a lot of corrupted tfrecords(around 30 for 2000 images). So I would suggest first iterating over batches of tfrecords(e.g. 25 or 50 at a time) and then locate all of the batches that give an NAN loss value. After you have located all of the file batches that give NAN, iterate through all of them 1 by 1 to find the corrupted files. This worked for me, I got rid of the corrupted tfrecords, and the training went fine. I haven't figured out why certain tfrecords are "corrupted" but i just got rid of the ones that give NAN loss.

I would suggest looking at #103 and #86 for more info.

Also if you are training with yolo tiny can you update me on the mAP and accuracy of your algorithm. I only get around a 40 percent accuracy with 5000 images. I get a 65 percent mAP with 0.3 iou.

zzh8829 · 2019-12-30T00:11:41Z

can you run this script and check if your dataset is in the proper format?
python tools/visualize_dataset.py --dataset ./train.tfrecor --classes=./data/voc2012.names

miladfa7 · 2019-12-30T15:21:27Z

can you run this script and check if your dataset is in the proper format?
python tools/visualize_dataset.py --dataset ./train.tfrecor --classes=./data/voc2012.names

I executed this command
output : Drone, 1, [0.641026 0.617857 0.517949 0.414286]

The problem is that the bounding box is not drawn correctly...
I normalized the bounding box but didn't normalize the images. Could this be the problem?

When I don't normalize the bounding box, there are no bounding boxes in the output image
output: Drone, 1, [131.5 129. 251. 224. ]

miladfa7 · 2020-01-01T08:47:39Z

@zzh8829 @Robin2091
I solved this problem #135 (comment)
output:

I training model with 1 class and Instead of sparse_categorical_crossentropy use binary_crossentropy(...) and change the num_classes FLAG on the train.py to 1 class.

python3 train.py --dataset ./train.tfrecord --classes ./data/voc2012.names --num_classes 1 --mode fit --transfer darknet --batch_size 10 --epochs 10 --weights ./checkpoints/yolov3-tiny.tf --weights_num_classes 80 --tiny

The loss output was negative values when I executed this command. (my dataset include 350 drones image)

What could be the reason?

another question is how to improve the accuracy of the detection new image?

AnaRhisT94 · 2020-01-21T09:57:28Z

first of all use train_dataset = train_dataset.repeat() and same for validation.
nan values happen if your annotations are not x_min, x_max, y_min, y_max.
If your becomes negative, there's something really wrong.
Anyways, use a lower learning rate parameter, try 1e-4, because on 1e-3 u might overshoot.

lihualilee · 2020-05-23T02:32:57Z

@zzh8829 @robin 2091
我解决了这个问题#135(评论)
产出：

1类训练模型，用二进制交叉熵代替稀疏分类交叉熵(.)并将列车.py上的num_class标志更改为1类。

python3 train.py --dataset ./train.tfrecord --classes ./data/voc2012.names --num_classes 1 --mode fit --transfer darknet --batch_size 10 --epochs 10 --weights ./checkpoints/yolov3-tiny.tf --weights_num_classes 80 --tiny

损失产出是负值时，执行此命令。(我的数据集包括350架无人机图像)

原因是什么？

另一个问题是如何提高新图像检测的准确性？

我执行以下命令
python3 train.py --dataset ./train.tfrecord --classes ./data/voc2012.names --num_classes 1 --mode fit --transfer darknet --batch_size 8 --epochs 10 --weights ./checkpoints/yolov3-tiny.tf --weights_num_classes 80 --tiny

但是我得到的是训练时的错误信息(“NaN值”)。

我的训练数据集中只有一个类别。(无人机探测)

此错误是否与将txt转换为tfRecord有关？
我在此基础上转换了我的数据集#41"
请帮帮我，谢谢。

I also encountered this problem. when the category is 1, the loss also appears Nan. later, I changed the loss function according to your method. later, the loss also appears negative. how can I solve this problem

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error message while training("nan values" ) #135

error message while training("nan values" ) #135

miladfa7 commented Dec 28, 2019 •

edited

Loading

Robin2091 commented Dec 28, 2019 •

edited

Loading

zzh8829 commented Dec 30, 2019

miladfa7 commented Dec 30, 2019 •

edited

Loading

miladfa7 commented Jan 1, 2020 •

edited

Loading

AnaRhisT94 commented Jan 21, 2020

lihualilee commented May 23, 2020

error message while training("nan values" ) #135

error message while training("nan values" ) #135

Comments

miladfa7 commented Dec 28, 2019 • edited Loading

Robin2091 commented Dec 28, 2019 • edited Loading

zzh8829 commented Dec 30, 2019

miladfa7 commented Dec 30, 2019 • edited Loading

miladfa7 commented Jan 1, 2020 • edited Loading

AnaRhisT94 commented Jan 21, 2020

lihualilee commented May 23, 2020

miladfa7 commented Dec 28, 2019 •

edited

Loading

Robin2091 commented Dec 28, 2019 •

edited

Loading

miladfa7 commented Dec 30, 2019 •

edited

Loading

miladfa7 commented Jan 1, 2020 •

edited

Loading