Nan loss in RCAN model #12

Open
Ken1256 opened this issue Mar 2, 2019 · 9 comments

Ken1256 commented Mar 2, 2019

https://github.com/wayne391/Image-Super-Resolution/blob/master/src/models/RCAN.py

Just change
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=False)
to
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

I get NaN loss in the RCAN model, but Adam works fine.

Luolc (Owner) commented Mar 3, 2019

Hi! Thanks for sharing the failure case!
I will try to reproduce the result using your code. Do you know how many resources it needs for training?

Ken1256 (Author) commented Mar 3, 2019

I found that AdaBound works fine with torch.nn.L1Loss(reduction='mean'), but gives NaN loss with torch.nn.L1Loss(reduction='sum'). (Sorry, after double-checking the code I realized I had changed reduction='mean' to reduction='sum'. Adam works fine with both. Normally 'mean' and 'sum' should behave the same.)

The resource usage depends on the image patch_size; setting args.n_resgroups = 3 and args.n_resblocks = 2 will be much faster and use less VRAM.
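For concreteness, here is a minimal sketch of the two loss configurations being compared above; the Linear layer is only a stand-in for the RCAN model, and the "works"/"NaN" labels just restate what was reported in this thread.

import torch
import adabound

model = torch.nn.Linear(10, 1)  # stand-in for the RCAN model, for illustration only
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

loss_mean = torch.nn.L1Loss(reduction='mean')  # reportedly trains fine with AdaBound
loss_sum = torch.nn.L1Loss(reduction='sum')    # reportedly produces NaN loss with AdaBound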

Luolc (Owner) commented Mar 3, 2019

Thanks for the additional details.

In this case, I guess AdaBound is a bit sensitive on the RCAN model, and a final_lr of 0.1 is too large. You may try smaller final_lr values like 0.03, 0.01, 0.003, etc. But I am not familiar with this model and cannot guarantee it will work.
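A quick sketch of that suggested sweep (the Linear layer is only a placeholder for the RCAN model, and the actual training loop is omitted):

import torch
import adabound

model = torch.nn.Linear(10, 1)  # placeholder model
for final_lr in (0.03, 0.01, 0.003):
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=final_lr)
    # ... run a short training trial here and check whether the loss turns NaN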

Ken1256 (Author) commented Mar 3, 2019

I tried optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=1e-4) and still get NaN loss.

Luolc (Owner) commented Mar 3, 2019

1e-4 might be too small ...

If I understand correctly, the only difference between mean and sum is a factor of N (the number of samples in a step). If AdaBound works with mean, then scaling the learning rate down by N should make sum work too. But I am not sure whether the scaling should be applied to lr, final_lr, or both. I just had a discussion with my schoolmates at a seminar today about which stage matters more in training, the early stage or the final stage, but we haven't reached a clear answer yet. So for now we have to test it through experiments.
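A rough illustration of that scaling argument, assuming a batch of N samples; since it is unclear whether lr, final_lr, or both should be scaled, this sketch simply divides both.

import torch
import adabound

N = 16  # hypothetical number of samples per step
model = torch.nn.Linear(10, 1)  # placeholder model

# With reduction='mean':
opt_mean = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)

# With reduction='sum' the gradient is roughly N times larger, so divide the
# step sizes by N to compensate (scaling both lr and final_lr here):
opt_sum = adabound.AdaBound(model.parameters(), lr=1e-4 / N, final_lr=0.1 / N)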

GreatGBL commented Mar 4, 2019

That's not exactly correct. Suppose dataset A has 101 samples and the batch size is set to 10. If we set the reduction to mean, there is no problem with this. Otherwise, the last batch contains only one sample, and with sum that effectively changes the learning rate for that step.
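A small illustration of the uneven last batch, assuming 101 samples and batch_size=10 as above:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(101, 4), torch.randn(101, 1))
loader = DataLoader(dataset, batch_size=10, shuffle=False)

print([x.shape[0] for x, _ in loader])  # [10, 10, ..., 10, 1] -- the last batch has a single sample
# With reduction='sum', the loss (and gradient) of that last batch is about 10x
# smaller than a full batch's; with reduction='mean', every batch is on the same scale.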

Luolc (Owner) commented Mar 4, 2019

I believe that's a very extreme case. Generally, a single step won't affect the whole training process in expectation.

In this case, we would encounter a much smaller gradient once per epoch when using sum. If that really does affect training, I think the dataset is too small and SGD would fail as well.

MitraTj commented Apr 24, 2019

Hi, I use torch version 0.3.1 and I just modified
optimizer = optim.Adam(params, weight_decay=conf.l2, lr=lr, eps=1e-3)
to
optimizer = adabound.AdaBound(params, weight_decay=conf.l2, lr=lr, final_lr=0.1, eps=1e-3)

When I ran it, I got: ImportError: "torch.utils.ffi is deprecated".

Could you help?
Thanks

Michael-J98 commented

Hi, I'm a beginner and I have a small question about this:
AdaBound was inspired by gradient clipping, but the clipping is applied to the learning rate rather than to the gradient.
So does that mean I still need to clip the gradient before feeding it to the optimizer, to prevent the gradient from becoming NaN?
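For reference, a minimal sketch of clipping gradients before the optimizer step; this is not an official recommendation from this repo, max_norm=1.0 is an arbitrary example value, and the Linear layer is only a placeholder model.

import torch
import adabound

model = torch.nn.Linear(10, 1)
optimizer = adabound.AdaBound(model.parameters(), lr=1e-4, final_lr=0.1)
criterion = torch.nn.L1Loss(reduction='sum')

x, y = torch.randn(8, 10), torch.randn(8, 1)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clips gradients, not the learning rate
optimizer.step()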
