reg loss became Nan when it came to 2.6k iters #47

mxmxlwlw · 2017-05-15T03:24:31Z

Hi,
It seems that the reg loss becomes NaN during training at around 2.6k iterations.
Also, how can I use the network to test my own images?
Best wishes!

kevinkit · 2017-05-15T09:32:56Z

Hello @mxmxlwlw,

There are a lot of open issues at the moment about training going to NaN or stopping after some iterations. Did you take a look at them and confirm that this is a completely new issue?

See #42, #24

Also, did you already get a snapshot of the trained weights, or does the computation stop before that point?

amirbar · 2017-05-17T13:44:33Z

Hi,

I think I'm getting the same behaviour. I get an overflow in bbox_transform.py. Right after the overflow, the reg loss jumps until it becomes NaN. I came up with a fix which seems to work. Can you please take a look and tell me whether you get the same behaviour?

If yes, I will propose a PR.

iter 267: image-id:0123208, time:0.817(sec), regular_loss: 0.214897, total-loss 1.0351(0.0118, 0.3499, 0.001303, 0.0411, 0.6309), instances: 1, batch:(20|104, 2|66, 2|2)
[ 1 640 853 3]
iter 268: image-id:0477321, time:0.727(sec), regular_loss: 0.215178, total-loss 135.3920(0.5606, 81.0835, 0.000000, 53.7479, 0.0000), instances: 2, batch:(20|96, 0|64, 0|0)
[ 1 640 853 3]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 269: image-id:0215226, time:0.715(sec), regular_loss: 0.216160, total-loss 183.3571(5.7902, 177.5669, 0.000000, 0.0000, 0.0000), instances: 2, batch:(2|34, 0|64, 0|0)
[ 1 640 1137 3]
iter 270: image-id:0477310, time:2.707(sec), regular_loss: 0.224796, total-loss 1331875328.0000(38611.8828, 1331836672.0000, 0.000000, 0.0000, 0.0000), instances: 2, batch:(2|34, 0|64, 0|0)
[ 1 640 963 3]
iter 271: image-id:0057707, time:0.770(sec), regular_loss: 486502989824.000000, total-loss nan(0.0088, 0.4111, nan, nan, 0.6301), instances: 1, batch:(15|84, 1|65, 1|1)
[[ 225.38453674 596.21038818 397.67379761 897.72412109 76. ]]
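
The fix mentioned above is not shown in this thread. For readers hitting the same `overflow encountered in exp` warning, one common guard is to clip the predicted log-scale deltas before calling `np.exp`. The sketch below assumes that approach. The names `decode_sizes` and `MAX_DELTA` and the cap value are hypothetical and are not taken from the repository.

```python
import math
import numpy as np

# Hypothetical cap (an assumption, not from the repository): limits the
# scale factor exp(dw) to at most 1000/16 = 62.5.
MAX_DELTA = math.log(1000.0 / 16.0)

def decode_sizes(widths, heights, dw, dh, max_delta=MAX_DELTA):
    """Decode predicted box widths/heights, clipping deltas before exp.

    widths, heights: shape (N,) anchor sizes.
    dw, dh: shape (N, K) predicted log-scale deltas.
    """
    # Clip only from above: large positive deltas are what overflow exp().
    dw = np.minimum(dw, max_delta)
    dh = np.minimum(dh, max_delta)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h

# Usage: a delta of 200 would overflow float32 exp() without the clip.
widths = np.array([32.0, 64.0], dtype=np.float32)
heights = np.array([16.0, 48.0], dtype=np.float32)
dw = np.array([[0.1], [200.0]], dtype=np.float32)
dh = np.array([[-0.2], [500.0]], dtype=np.float32)
pred_w, pred_h = decode_sizes(widths, heights, dw, dh)
print(np.isfinite(pred_w).all(), np.isfinite(pred_h).all())
```

Clipping keeps the decoded boxes finite, but it does not explain why the network predicts such large deltas in the first place. If the loss still diverges, the learning rate or the regression targets may also need attention.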

sheldon606 · 2017-05-21T09:45:21Z

I came across the same problem too.

amirbar · 2017-05-21T09:49:53Z

@CharlesShang can you please review/comment?

mxmxlwlw · 2017-05-26T10:47:28Z

@amirbar Hi, problem solved! But how can I use the network to test my own images and get the rects and masks?

blitu12345 · 2017-06-03T17:51:10Z

@mxmxlwlw Please share your solution. How did you modify your code to stop regular-layer from becoming NaN? Thanks!

mxmxlwlw · 2017-06-04T01:34:26Z

@blitu12345 Hi, they already changed the code on GitHub. Just download it and it should normally be OK. If you still hit the problem occasionally, just lower the learning rate. It may work.

mxmxlwlw · 2017-06-04T01:37:13Z

@blitu12345 And there may still be some bugs in the training code.

blitu12345 · 2017-06-04T05:44:13Z

I'm already using the updated code. Initially I got NaN values at 1500 iterations, but now I'm at 3000 and it's working fine. I don't know how this works. By the way, thanks mate!!!

mxmxlwlw · 2017-06-05T03:46:09Z

@blitu12345 Yeah, they committed with the comment "Change computation for numerical stability". However, there may still be some bugs... And I'm really looking forward to them providing some sample code for testing their network. Just one image would be fine.

blitu12345 · 2017-06-05T05:46:07Z

@mxmxlwlw Have you trained your model? I'm just at 120k iterations and it has already been more than 24 hrs, so it seems like it's going to take a long time to train. How much time did your model take to train? Does the source code save the trained model at regular intervals? Thanks!!

amirbar · 2017-06-05T07:19:29Z

@mxmxlwlw I wrote some short code for bounding box visualization that I can PR.
There are still bugs. I'm currently testing only the RPN component, and it seems to work with a few code fixes and a hyperparameter search. I will try to PR today.

The repository seems far from reproducing the original work.

mxmxlwlw · 2017-06-06T09:54:50Z

@blitu12345 I just used the original code for training. And yes, it took a long time to train.

mxmxlwlw · 2017-06-06T09:55:46Z

@amirbar Wow, thank you for sharing! You helped me a lot.

amirbar · 2017-06-06T10:09:23Z

@mxmxlwlw, according to the experiments I performed, training will not lead you anywhere unless you merge #50, which will at least get the RPN component working.

Anyway, since the issue in this thread is resolved, can you please close it? There are too many issues to track as it is :)

mxmxlwlw · 2017-06-06T10:55:46Z

Ok.

meetps · 2018-02-19T08:04:19Z

Check my comment here

amirbar mentioned this issue May 21, 2017

Change computation for numerical stability #58

Merged

mxmxlwlw closed this as completed Jun 6, 2017
