Reg loss became NaN at 2.6k iterations #47
Comments
Hello @mxmxlwlw, there are already several open issues about training producing NaN or stopping after some iterations. Did you look at them and confirm that this is a completely new issue? Also, did you already get a snapshot of the trained weights, or does the computation stop before that point? |
Hi, I think I'm getting the same behaviour. I get an overflow in bbox_transform.py. Right after the overflow, the reg loss jumps until it becomes NaN. I came up with a fix that seems to work. Can you please look and tell me whether you get the same behaviour? If yes, I will propose a PR.
iter 267: image-id:0123208, time:0.817(sec), regular_loss: 0.214897, total-loss 1.0351(0.0118, 0.3499, 0.001303, 0.0411, 0.6309), instances: 1, batch:(20|104, 2|66, 2|2) |
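The thread does not show the proposed fix, so the following is only a sketch of one common way to guard a box-decoding step against this kind of overflow: clamp the predicted log-scale deltas before exponentiating them. The function name `decode_box`, the `+1` width convention, and the cap `log(1000 / 16)` are assumptions for illustration, not code from this repository.

```python
import math

# Assumed cap on the log-scale deltas. log(1000 / 16) is a commonly used value;
# it is not taken from this repository.
MAX_LOG_SCALE = math.log(1000.0 / 16.0)


def decode_box(anchor, deltas, max_log_scale=MAX_LOG_SCALE):
    """Apply (dx, dy, dw, dh) deltas to an (x1, y1, x2, y2) anchor box.

    dw and dh are clamped so that exp() cannot overflow when the network
    predicts a very large value early in training.
    """
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas

    # Anchor width, height and centre (hypothetical "+1" pixel convention).
    w = x2 - x1 + 1.0
    h = y2 - y1 + 1.0
    cx = x1 + 0.5 * w
    cy = y1 + 0.5 * h

    # Clamp before exponentiating; without this, exp(dw) can overflow.
    dw = min(dw, max_log_scale)
    dh = min(dh, max_log_scale)

    pred_cx = dx * w + cx
    pred_cy = dy * h + cy
    pred_w = math.exp(dw) * w
    pred_h = math.exp(dh) * h

    return (
        pred_cx - 0.5 * pred_w,
        pred_cy - 0.5 * pred_h,
        pred_cx + 0.5 * pred_w,
        pred_cy + 0.5 * pred_h,
    )


# A wildly large scale delta stays finite instead of overflowing.
box = decode_box((0.0, 0.0, 15.0, 15.0), (0.0, 0.0, 1000.0, 1000.0))
print(box)
```

Clamping only limits the decoded box size; if the deltas themselves keep growing, the learning rate or the regression targets are still worth checking.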
I came across the same problem too. |
@CharlesShang can you please review/comment? |
@amirbar Hi, problem solved! But how can I use the network to test my own images and get the rects and masks? |
@mxmxlwlw Please share your solution. How did you modify your code to stop regular-layer from becoming NaN? Thanks! |
@blitu12345 Hi, they have already changed the code on GitHub. Just download it, and normally it will be OK. If you still hit the problem sometimes, just lower the learning rate. It may work. |
@blitu12345 And there may still be some bugs in the training code. |
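When deciding whether to lower the learning rate, it can help to know exactly which iteration and which loss component first went non-finite. A minimal sketch, assuming the per-iteration loss components have been collected into a list; the function name `first_non_finite` and the sample values are made up for illustration and are not from this repository's logs.

```python
import math


def first_non_finite(history):
    """Return (iteration, component_index) of the first NaN or infinite loss.

    `history` is an iterable of (iteration, components) pairs, where
    `components` is a tuple of floats. Returns None if every value is finite.
    """
    for iteration, components in history:
        for index, value in enumerate(components):
            if not math.isfinite(value):
                return iteration, index
    return None


# Made-up values: the second component blows up, then becomes NaN.
history = [
    (2598, (0.0118, 0.3499, 0.0013)),
    (2599, (0.0121, 7.4e12, 0.0012)),
    (2600, (0.0119, float("nan"), 0.0012)),
]
print(first_non_finite(history))  # (2600, 1)
```

A large but finite value, as at iteration 2599 in the sample, is not flagged; a threshold check would be needed to catch the jump before it turns into NaN.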
I am already using the updated code. Initially I got NaN values at 1500
iterations, but now I am at 3000 and it is working fine. I don't know why this
works. By the way, thanks mate!
On Jun 4, 2017 7:07 AM, "mxmxlwlw" wrote:
@blitu12345 And there may still be some bugs in the training code.
|
@blitu12345 Yeah, they committed with the comment "Change computation for numerical stability". However, there may still be some bugs... I am really looking forward to them providing some sample code for testing their network. Just one image would be fine. |
@mxmxlwlw Have you trained your model? I am only at 120k iterations and it has already taken more than 24 hours, so it seems it is going to take a long time to train. How much time did your model take to train? Does the source code save the trained model at regular intervals? Thanks! |
@mxmxlwlw I wrote some short code for bounding box visualization that I can submit as a PR. The repository seems far from reproducing the original work. |
@blitu12345 I just used the original code for training. And yes, it took a long time to train. |
@amirbar Wow, thank you for sharing! You helped me a lot. |
Ok. |
Check my comment here |
Hi,
It seems that the reg loss of the training process becomes NaN at around 2.6k iterations.
Besides, how can I use the network to test my own images?
Best wishes!