How to execute eagerly to find nan-loss cause? #306

fredrikorn · 2021-05-01T21:29:53Z

Hi!
I've been using this repo on my own dataset and have run into the problem of the loss suddenly hitting NaN, even though it was converging nicely before (as in #198).
After printing some values in the TensorFlow graph, I'm fairly sure the error comes from odd values for box width and height, but I haven't managed to pinpoint it.

To check this, I tried running the program eagerly with tf.compat.v1.enable_eager_execution(), but that results in the error: 'get_session' is not available when TensorFlow is executing eagerly.

Is it possible to run it eagerly in some way, or has anyone figured out the reason for the sudden NaN loss?


fredrikorn · 2021-05-12T12:40:02Z

If someone else runs into this issue: I found that the NaN loss comes from the gradient of tf.sqrt diverging as its input approaches zero (see this post). I worked around it by adding a small epsilon value, 1e-7, in dummy_loss in yolo.py.
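To illustrate why the epsilon helps, here is a minimal sketch in plain Python (no TensorFlow) of the derivative of a square root, 1 / (2 * sqrt(x)), with and without an offset. The helper name sqrt_grad and the choice to add the epsilon inside the square root are assumptions made for illustration; the actual change in dummy_loss may be placed differently.

```python
import math

EPS = 1e-7  # the epsilon value mentioned in the comment above


def sqrt_grad(x, eps=0.0):
    """Analytic derivative of sqrt(x + eps) with respect to x.

    Hypothetical helper for illustration only; it is not part of the repo.
    """
    if x + eps == 0.0:
        return math.inf  # 1 / (2 * sqrt(0)) is unbounded
    return 1.0 / (2.0 * math.sqrt(x + eps))


# The plain gradient grows without bound as x approaches zero,
# while the offset version stays finite.
for x in (1.0, 1e-4, 1e-12, 0.0):
    print(f"x={x:g}  plain={sqrt_grad(x):g}  with_eps={sqrt_grad(x, EPS):g}")

# One way an unbounded gradient can turn into NaN during backpropagation:
# an infinite local gradient multiplied by a zero upstream gradient.
print(math.isnan(sqrt_grad(0.0) * 0.0))
```

With the offset, the gradient at x = 0 is bounded by 1 / (2 * sqrt(EPS)), so the chain-rule product stays finite. A larger epsilon gives a tighter bound but distorts small values more, so the value is a trade-off.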

As for the eager execution, I haven't solved that.
