
Loss gets low but cannot detect anything in inference, even on training set #148

Open
ttocs167 opened this issue Jan 8, 2020 · 19 comments

Comments

@ttocs167

ttocs167 commented Jan 8, 2020

I am training on a custom dataset using the darknet transfer option, and the loss drops very low and then plateaus after a few epochs:

loss: 22.6205 - yolo_output_0_loss: 1.1167 - yolo_output_1_loss: 7.4397 - yolo_output_2_loss: 0.0091 - val_loss: 16.5518 - val_yolo_output_0_loss: 0.146 - val_yolo_output_1_loss: 2.7821 - val_yolo_output_2_loss: 0.0090

but I cannot detect anything even when using images from the training set...

I have tried all of the different transfer modes, but "fine_tune" and "no_output" both give errors on startup, so "none" and "darknet" are all I can use.

I also get the "Unresolved object in checkpoint" warning even though I am using the --weights_num_classes option correctly. I see that you force-suppress that message in detect.py using .expect_partial() when loading weights, but I am suspicious that this is still causing issues by failing to load the model correctly.
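
For context, this is roughly the loading pattern I'm talking about, a minimal sketch using the repo's default path and class count rather than my actual setup:

from yolov3_tf2.models import YoloV3

# load_weights returns a checkpoint status object; expect_partial() silences
# the "Unresolved object in checkpoint" warnings for variables that have no
# counterpart in the checkpoint (e.g. optimizer slots).
yolo = YoloV3(classes=80)
yolo.load_weights('./checkpoints/yolov3.tf').expect_partial()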

@TheClassyPenguin

TheClassyPenguin commented Jan 8, 2020

I have the same problem here, was about to post this.

I can confirm in my case that the problem isn't with the dataset. The loading weights problem seems interesting, I'll look into it. Please let me know if you find anything interesting.

Edit: This problem could be related to #126 and #20. I'll try re-training in eager mode, as I was using fit.

@ttocs167
Author

ttocs167 commented Jan 8, 2020

I have tried training with eager_fit and I get the same results; unfortunately it doesn't seem to make a difference for me.

Because the loss gets so low I'm confident the training has gone well, but I don't know how else to check that. Surely if the loss is that small, the model should at least perform on the training set once trained.

I know there is no problem with the data labels, because the "visualize_dataset.py" script in the repo shows them correctly.

Edit: It seems the warnings on loading/saving weights are not an issue, according to #108. Now I'm really unsure why I can't detect anything.

@TheClassyPenguin

TheClassyPenguin commented Jan 8, 2020

Just to be thorough:

Could you double-check that you are using the right ./checkpoints/yolov3_train_X.tf weights for inference, instead of the base YOLO model ./checkpoints/yolov3.tf?

Using the base model on the new data also returns nothing for me, as expected since I'm using a custom class, but it behaves the same way.

@ttocs167
Author

ttocs167 commented Jan 8, 2020

Yep, I'm definitely using the models that are saved during training. I've trained up to 50 epochs with quite a few different setting combinations, and I've tried loading checkpoints from different epochs in case it was overfitting and the later epochs were bad. I can't get any of them to detect anything.

I can get the pretrained model to work on the example images though.

@TheClassyPenguin

TheClassyPenguin commented Jan 8, 2020

I'm not seeing anything particularly weird when I explore the weights. The last layers change among epochs as they should. So that's probably not it.

@ttocs167
Author

ttocs167 commented Jan 8, 2020

Is there any way to tune the confidence threshold? It's possible that what I consider a low loss is not actually low and the training never converged. Maybe at inference time it's just not confident enough to pick anything out, even though it was trained on those images.

@TheClassyPenguin

TheClassyPenguin commented Jan 8, 2020

The threshold can be changed by modifying this flag definition in the model definition file. Let me know if it helps!

flags.DEFINE_float('yolo_score_threshold', 0.5, 'score threshold')
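
If you'd rather not edit the file, the same flag should be overridable at runtime since it's a regular absl flag. A rough sketch (the threshold value is just an example):

import sys
from absl import flags
from yolov3_tf2 import models  # importing models registers the yolo_* flags

FLAGS = flags.FLAGS
FLAGS(sys.argv[:1])  # mark flags as parsed when not going through absl.app.run

# Example value: keep detections scoring above 0.3 instead of the default 0.5.
FLAGS.yolo_score_threshold = 0.3

Passing --yolo_score_threshold 0.3 on the detect.py command line should also work, since absl exposes flags defined in imported modules.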

@TheClassyPenguin

TheClassyPenguin commented Jan 8, 2020

@ttocs167 Scott, it works! It looks like the max confidence is 0.5 for some reason? My classes were being predicted with a confidence of 0.47.

@Robin2091

Robin2091 commented Jan 8, 2020

@TheClassyPenguin If you are training with 1 class, check #70. It explains how to change the code so the confidence scores go up with one class.

@zzh8829
Owner

zzh8829 commented Jan 8, 2020

I just saw in the documentation of the loss function https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy?version=stable
that it says "Use this crossentropy loss function when there are two or more label classes", so I guess we need to modify the loss function for the single-class case 😢
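
Here is a quick standalone check (not this repo's loss code) of why that matters: with a single class there is only one probability column, so the loss is zero no matter what the network predicts, and the class branch gets no training signal.

import tensorflow as tf

# One-class case: the softmax over a single column is always 1.0, so the
# cross-entropy is 0 regardless of the predicted value.
y_true = tf.constant([0])      # index of the only class
y_pred = tf.constant([[0.3]])  # single-class prediction (e.g. a sigmoid output)
print(tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred).numpy())  # ~[0.]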

@zzh8829
Owner

zzh8829 commented Jan 8, 2020

It's weird, because there is no class loss when you only have one class, so I think the solution provided by @nuitvolgit

if classes > 1:
  scores = confidence * class_probs
else:
  scores = confidence

seems reasonable to me. I will add this to the output function later today.
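
Something along these lines is what I have in mind for the output function (the helper name is just for illustration; the real change will go where the scores are computed):

def combine_scores(confidence, class_probs, classes):
    """Sketch: combine objectness and class probabilities into detection scores.

    With a single class the class branch carries no training signal (its loss
    is always zero), so multiplying by class_probs only drags the score down;
    in that case just use the raw objectness confidence.
    """
    if classes > 1:
        return confidence * class_probs
    return confidence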

@ttocs167
Author

ttocs167 commented Jan 9, 2020

@TheClassyPenguin That's great! I'm glad you've figured it out. I'm going to retrain on my custom set and try lowering the confidence threshold. I'm a little worried, though, because my custom set has 5 classes, so it shouldn't be affected by this loss issue.

@ttocs167
Author

ttocs167 commented Jan 9, 2020

So I have discovered that lowering the confidence threshold certainly does make detections show up; however, they are basically nonsense and all very low confidence, despite the primary loss metric dropping to ~5 by the end of training...

I'm going to train again without the image augmentation I added to the dataset using the tf.image random functions, just in case it is messing up my training. But it makes no sense to me how the loss can get so low and yet the predictions on those same images are so poor.

@Robin2091

@zzh8829 I trained with 1 class using sparse_categorical_crossentropy. Would this make a huge difference? I thought binary is just a special case of sparse categorical, so for 1 class it would be mathematically the same.

@ttocs167
Author

ttocs167 commented Jan 10, 2020

So it turns out my training works reasonably well if I take out the image augmentation I had. I'm getting confidence values around 0.5 to 0.7 on the training set, which is not great, but it at least means it's working. I'm starting to think that the image augmentation does not adjust the labels in the same way and thus ruins the training. No idea why the loss values still drop really low if it's giving completely garbage results, though.

I was using tf.image functions for augmentation; is there a way to augment the data procedurally in the input pipeline? I had put my functions in like so:

[screenshot: tf.image augmentation calls added to the dataset pipeline]

I'm not really sure how these functions interact with the bounding box data; the contrast changes obviously shouldn't have any effect, but training was not successful with that line included. I'm assuming I implemented them wrong and they ruined the data in a way where the loss still looked good but the outputs were bad.

Edit: With contrast adjustments only, the accuracy shoots up to ~0.8. It finally seems to be working as intended. It looks like my data augmentation was somehow mismatching the image/label pairs, so the network was unable to learn anything. I'm going to keep trying to get my other augmentation steps working so that I can make it even more robust.

@zzh8829
Owner

zzh8829 commented Jan 12, 2020

@ttocs167 In your data augmentation, the transformations are only applied to the image. Any transformation that changes the image geometry would need to be applied to the labels as well. Contrast change works because it doesn't shift the bounding boxes, but others like flipping do.

I don't think TensorFlow natively provides augmentation for both images and bounding boxes. The built-in image augmenters were designed with classification tasks and generative models in mind. Defining a custom augmenter is not trivial; you can check out this documentation from imgaug: https://imgaug.readthedocs.io/en/latest/source/examples_bounding_boxes.html
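
If you want to stay with plain tf.image, a rough sketch of a map function that keeps the boxes consistent could look like this (assuming labels are normalized (xmin, ymin, xmax, ymax, class) rows, possibly zero-padded; adjust if your format differs):

import tensorflow as tf

def augment(image, labels):
    # Photometric changes: safe, the boxes are untouched.
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_brightness(image, 0.1)

    # Geometric change: flip the image and the boxes together, half the time.
    def flip(img, lab):
        img = tf.image.flip_left_right(img)
        xmin, ymin, xmax, ymax, cls = tf.split(lab, 5, axis=-1)
        flipped = tf.concat([1.0 - xmax, ymin, 1.0 - xmin, ymax, cls], axis=-1)
        # Leave zero-padded rows (no box) untouched.
        valid = tf.reduce_any(tf.not_equal(lab, 0), axis=-1, keepdims=True)
        return img, tf.where(valid, flipped, lab)

    image, labels = tf.cond(tf.random.uniform([]) < 0.5,
                            lambda: flip(image, labels),
                            lambda: (image, labels))
    return image, labels

# dataset = dataset.map(augment)  # before batching / target transformation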

@ttocs167
Author

@zzh8829 I suspected as much, but when I read up on the tf.image functions I found this Stack Overflow thread stating that the bounding boxes should be augmented in the same way. Looking at my code now, there is no way the labels are even passed into the function, so it obviously isn't working. I plan on looking into this and still using tf.image if it's as easy as passing the labels into the function as well.

@guangmingdexin

So how should this problem be solved? Is it enough to adjust the threshold? Sorry, this is very important to me.

@aashay96

aashay96 commented May 6, 2020

I am having the same issue. The loss gets low, but no detections are happening. Can anyone here help?
I have also lowered the score thresholds.
