Base learning rate decay on training loss #27
Conversation
The reduction of the learning rate should be based on validation loss, not on training loss. This is why there is a distinction between training, validation, and test sets. For a large enough network, the training loss never increases.
g
On Fri, Jun 25, 2021 at 11:29 PM Peter Eastman wrote:
This changes it to base learning rate decay on training loss rather than validation loss. That gives a much cleaner signal for whether it is still learning. The practical effect is that you can use a smaller value for lr_patience, which leads to faster training.
In general, training loss tells you whether it is learning, and the difference between training loss and validation loss tells you whether it is overfitting. If the training loss stops decreasing, that means you need to reduce the learning rate. If the training loss is still decreasing but the validation loss stops going down, that means it is overfitting and you should stop. Reducing the learning rate won't help.
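(For context, the proposed change has roughly this shape. This is a minimal sketch assuming a PyTorch Lightning-style module, not the actual torchmdnet/module.py diff; the hyperparameter names lr, lr_factor, lr_patience, lr_min and the "train_loss" metric name are assumptions for illustration.)

```python
# Sketch only: base a ReduceLROnPlateau scheduler on the training loss instead of
# the validation loss in a Lightning-style module.
import torch
from torch import nn
from torch.optim.lr_scheduler import ReduceLROnPlateau
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self, lr=1e-3, lr_factor=0.5, lr_patience=2, lr_min=1e-7):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Linear(10, 1)  # stand-in network

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        # Log the quantity the scheduler will monitor, averaged over the epoch.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
        scheduler = ReduceLROnPlateau(
            optimizer,
            factor=self.hparams.lr_factor,      # e.g. 0.5: halve the LR on each plateau
            patience=self.hparams.lr_patience,  # can be small when watching train loss
            min_lr=self.hparams.lr_min,
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "train_loss",  # the change: monitor training loss, not "val_loss"
                "interval": "epoch",
            },
        }
```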
No, it shouldn't! Like I said, validation loss is not a measure of whether it's learning. Validation loss is a measure of whether it's overfitting. Reducing the learning rate does nothing to help overfitting.
That is incorrect. If the learning rate is too high, the training loss will increase. This is precisely why training loss is a useful indicator of when you need to decrease the learning rate.
Let me elaborate a bit on the above. All SGD-derived optimizers are first order methods that provide no guarantees on convergence. In most other fields, we use more sophisticated optimizers (quasi-Newton methods, L-BFGS, etc.). They either use second derivatives to estimate where the minimum is likely to be, or perform line searches to guarantee they never overshoot the minimum, or both. But those methods are only practical if you can accurately calculate the objective function and its derivatives. If you're just estimating it from mini-batches, they don't work well.
So SGD and its descendants just estimate the derivative, then take a fixed size step in that direction without having any idea how far away the minimum actually is. If they take too large a step, they'll overshoot it and the objective function will increase. That's why setting the learning rate too high prevents it from learning. Early on when it's far away from any minimum, you can take large steps. As it gets closer to a minimum (or to a saddle point, which is also a minimum along certain directions), the steps need to be smaller. But the optimizer has no idea how close the minimum is, and hence no idea how large its steps should be. So we add learning rate control algorithms on top of it to detect when the steps are too large and the learning rate needs to be decreased.
All of this is based on the objective function being minimized, that is, the training loss. It has nothing to do with the validation loss, which is calculated for a totally different purpose (to detect overfitting). Reducing the learning rate has no effect on overfitting.
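(To make the overshoot point concrete, here is a tiny toy example, not from the thread: plain gradient descent on f(x) = x**2, where any step size above 1.0 overshoots the minimum and the loss grows.)

```python
# Gradient descent on f(x) = x**2, whose gradient is 2*x.
# The update is x <- x - lr*2*x = (1 - 2*lr)*x, so the iterates shrink only if
# |1 - 2*lr| < 1, i.e. lr < 1.0. A larger step overshoots and the loss increases.
def descend(lr, steps=5, x=1.0):
    for i in range(steps):
        x = x - lr * 2 * x
        print(f"lr={lr}  step {i + 1}: x={x: .3f}  f(x)={x * x: .3f}")

descend(0.1)  # converges: f(x) decreases every step
descend(1.5)  # diverges: f(x) grows every step, the "too large a step" failure
```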
I don't quite see your argument. In DL, the larger the initial learning rate the better, and the higher the noise in the gradients the better. NNs can learn even random noise, so reducing the lr based on the training set does not make any sense to me. Local minima are not a problem in DL at all, as they largely don't exist due to the high-dimensional space.
Indeed, even in the docs, they use the validation loss for the scheduler:
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html#torch.optim.lr_scheduler.ReduceLROnPlateau
You could try to use the training loss for the scheduler, but it is correct to use the validation loss for the scheduler and have the test loss to see overfitting, which is what we do.
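(For reference, the pattern in the linked ReduceLROnPlateau docs is roughly the following, with the scheduler stepped on the validation loss; model, train and validate are placeholders standing in for the real training code.)

```python
# Roughly the usage shown in the linked PyTorch docs: the scheduler watches the
# validation loss and lowers the learning rate when it plateaus.
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)

for epoch in range(100):
    train(model, optimizer)       # one epoch of optimization on the training set
    val_loss = validate(model)    # loss on the held-out validation set
    scheduler.step(val_loss)      # LR decay is driven by the validation loss
```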
That's very easy to disprove. Just set your learning rate to a huge value and watch the optimizer flail. It will totally fail to learn at all. This article does a nice job of describing the issue.
High noise in gradients is also a problem that inhibits learning. That's why the larger the batch size (and hence the lower the noise), the larger your learning rate can be. Here are a couple of papers that discuss this and demonstrate very good training by using large batch sizes. https://openreview.net/pdf?id=B1Yy1BxCZ
Let me turn the question around: why do you think validation loss has anything to do with choosing the learning rate?
A plateau in the training loss does not necessarily mean that the model stopped learning. With a constant training loss, the model might still move towards an area with better generalization, i.e. the validation loss is still decreasing. However, when the validation loss plateaus (it doesn't necessarily have to increase), we know that the model no longer improves generalization and it might be time to decrease the learning rate.
Regarding your comments on the initial learning rate and noise: I wouldn't say that noisy gradients are strictly bad; they can also be beneficial for avoiding local minima. There is a trade-off where a large batch size leads to fast training but smaller batches usually lead to better convergence. Ideally, we would choose a large batch size and keep a certain level of noise in the gradients to avoid getting stuck. This is also the message from the papers you linked. They are trying to use large batches (for fast training) while maintaining noisy gradients.
That isn't how optimizers work! An optimizer tries to minimize an objective function. That's the only thing it does. It moves in a direction that reduces its objective function. It has no concept of generalization, because the objective function provides no information about generalization. In practice, generalization (that is, the gap between performance on training data and performance on other, unseen data) never decreases with training. Before you start training, the model is equally bad on any input. In a literal (although useless) sense, it generalizes perfectly. As you train, it gets better at predicting the training set. If you've chosen your training set well, it will also get better at predicting unseen data. But the only thing the optimizer is actually trying to do is improve performance on the training set. As a result, the training performance always improves faster than the performance on unseen data.
That's what people used to think, but the consensus has now gone in the other direction. For one thing, like Gianni said, local minima mostly just aren't an issue. See for example https://arxiv.org/pdf/1406.2572.pdf. There are very few true local minima, and almost all of them are very close to the global minimum.
The modern approach is described in https://arxiv.org/pdf/1812.06162.pdf. Basically, there are two factors limiting the learning rate: noise in the gradients, and the curvature of the surface you're optimizing. At small batch sizes, noise is the dominant one. Increasing the batch size decreases the noise, which lets you use a larger learning rate. That continues up to a crossover point, where curvature becomes the limiting factor. Increasing the batch size further has no benefit. It's generally recommended that you aim for that crossover point. You want to decrease the noise right up to the point where noise is already low enough that there's no benefit to decreasing it further.
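(If I recall the linked paper (arXiv:1812.06162) correctly, it quantifies that crossover with a "gradient noise scale"; roughly, and worth checking against the paper itself:)

```latex
% Simple noise scale from "An Empirical Model of Large-Batch Training", as I recall it:
% G is the true (full-batch) gradient and \Sigma the per-example gradient covariance.
\mathcal{B}_{\mathrm{simple}} = \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```

Batch sizes well below this scale are noise-limited, so increasing them permits a larger learning rate; well above it, curvature dominates and larger batches give little further benefit.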
Does using the training loss in the scheduler improve your performance? I tried training the same model twice with the same seed to compare the difference between scheduling the learning rate on the training vs validation loss. In my case the test loss of the model scheduled on the training loss was slightly higher. I also couldn't observe faster convergence. Also, I was not able to find anything recommending scheduling the learning rate based on the training loss; everything I found uses the validation loss.
Here is a plot of the training loss and validation loss from a training run I did a few days ago. Training loss gives a very clean signal while validation loss gives a very noisy one. In this run, I had
Yes, it definitely converges faster that way. However, I'm noticing that due to quickly reducing the learning rate, it does not reach as low a loss as it does with a larger
In that case, you may want to increase
It also could be interesting to try a learning rate policy that allows the rate to increase as well as decrease. That's pretty common with predefined rate schedules (e.g. linear cosine decay), but PyTorch doesn't seem to have any adaptive methods that do it. I might experiment with writing one.
I did try a couple of different values and wasn't able to improve efficiency or loss.
As it doesn't seem to be established that the lr scheduler should be based on the training loss and it is not a large effort for you to adjust that in your training, we will keep it as is for now.
I agree, this could be interesting. Have you seen these?
Back when I first started using torchmd-net, I spent about a week just running tests of training protocols to try to speed up training, since @giadefa had told me that training on large datasets was prohibitively slow. I eventually settled on a protocol that was anywhere from 2 to 10 times faster than what he was using, depending on the dataset. Basing learning rate on training loss was absolutely essential to make that work.
Can you provide any theoretical justification for using validation loss? I just can't see any at all. Whereas the justifications for using training loss are obvious.
Those are predefined learning rate schedules. You have to specify in advance exactly how the learning rate will change with time. I want to try an adaptive one, similar to ReduceLROnPlateau but allowing the rate to increase as well as decrease.
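(A rough sketch of the kind of adaptive policy being described; this is a hypothetical illustration, not an existing PyTorch scheduler and not part of torchmd-net, and the thresholds are made up. It lowers the LR when the monitored training loss plateaus and cautiously raises it again after a run of steady improvement.)

```python
class AdaptiveLR:
    """Toy plateau scheduler that can raise the learning rate as well as lower it.

    Hypothetical illustration only; works with any torch.optim optimizer via its
    param_groups.
    """

    def __init__(self, optimizer, down_factor=0.5, up_factor=1.2,
                 patience=2, reward=5, min_lr=1e-7, max_lr=1e-3):
        self.opt = optimizer
        self.down, self.up = down_factor, up_factor
        self.patience, self.reward = patience, reward
        self.min_lr, self.max_lr = min_lr, max_lr
        self.best = float("inf")
        self.bad_epochs = 0
        self.good_epochs = 0

    def _scale(self, factor):
        for group in self.opt.param_groups:
            group["lr"] = min(max(group["lr"] * factor, self.min_lr), self.max_lr)

    def step(self, train_loss):
        """Call once per epoch with the mean training loss."""
        if train_loss < self.best:
            self.best = train_loss
            self.bad_epochs = 0
            self.good_epochs += 1
            if self.good_epochs >= self.reward:   # sustained improvement: try larger steps
                self._scale(self.up)
                self.good_epochs = 0
        else:
            self.good_epochs = 0
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:  # plateau: shrink the steps
                self._scale(self.down)
                self.bad_epochs = 0
```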
Peter, can you remind me what you suggested back then? I remember the discussion, not the contents.
The rationale for using validation is, I think, generalization. You really want the training rate to optimize your future test performance, so using a validation set is an obvious choice. Imagine for instance that the distribution of labels is wider in the training set but very narrow in the test set. You can have a validation set which is similar to the test set.
There are two main code changes required. One is to base the learning rate on training loss rather than validation loss. The other (which I also plan to send a PR for) is to provide an option for
That's the reason to use validation loss for early stopping, yes. It tells you when you're starting to overfit. But what's the reason to use it for learning rate decay? If the training loss is still decreasing but the validation loss isn't, that means the model is still learning but it's overfitting to the training set. Reducing the learning rate won't help.
Just to chime in, I am on the same page as Peter. The learning rate, the way I learned it, is defined for the training set. When you are optimizing your network you are optimizing on the training set, and the learning rate is used to improve the convergence of the minimizer on that specific surface that you are minimizing. Adjusting the learning rate by the validation set is weird in the sense that it doesn't necessarily relate to the convergence of the minimization of the training set at all. Adjusting the learning rate is essentially: "was the step I took too big on this surface? Make it smaller." You cannot say: "was my step too big for a different surface? Make it smaller for another surface." This would only make sense to me if the two surfaces were identical.
As I indicated before, I think that the issue is that nobody cares about the convergence on the training set; this is why they change the learning rate on the validation loss. The point is not to converge the optimization on the training data but rather to jump around in the validation set with reasonable gradient steps until we obtain the best loss.
Basing the learning rate on the training set would make the training set converge, but maybe it was better to reduce the learning rate way before. Early stopping on the validation only stops the training and does not get the extra smaller steps.
g
At that point why not optimize on the validation set? (Just kidding.) I understand that nobody cares about the convergence on the training set, but changing the learning rate depending on the validation set does not make the validation set converge either, because you are moving on top of a totally different surface there. So what you are saying is you want to choose the direction of the minimization from the training surface and the size of the step from the validation surface. I don't see how these relate, excepting the case where the two surfaces are identical.
That's the point of reducing the learning rate on plateau, yes. If you prefer not to do that, you can use a fixed scheduler that reduces it every xxxx steps, which is what most people do too, and it probably involves the same level of trial and error.
g
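(A fixed schedule of that kind is just PyTorch's StepLR; a minimal sketch with placeholder values, since no concrete step size was given above.)

```python
# A fixed schedule: reduce the learning rate by a constant factor every N epochs,
# independent of both training and validation loss. The numbers are placeholders.
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # every 30 epochs: lr *= 0.1

for epoch in range(100):
    # ... one epoch of training goes here ...
    scheduler.step()
```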
If the training loss is still decreasing but the validation loss has stopped decreasing, why do you think it would help to decrease the learning rate? That's really the algorithmic question we're talking about here.
I came across this article (https://spell.ml/blog/lr-schedulers-and-adaptive-optimizers-YHmwMhAAACYADm6F) which claims Adam and ReduceLROnPlateau are incompatible with each other, and you should never use them together. I'm not convinced that's actually true. It doesn't match my own experience. Still, I'll try some tests of the method he recommends, which is OneCycleLR with SGD.
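(For reference, that combination looks roughly like this. A minimal self-contained sketch with a dummy model and data, not the article's code; the model, batches, and hyperparameter values are placeholders.)

```python
# Sketch of OneCycleLR driving SGD, the combination the article recommends.
import torch
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(10, 1)
batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# One cycle: the LR ramps up to max_lr and then anneals back down on a fixed schedule.
scheduler = OneCycleLR(optimizer, max_lr=1e-2, epochs=3, steps_per_epoch=len(batches))

for epoch in range(3):
    for x, y in batches:
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped once per batch, not per epoch
```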
It actually makes some sense. We should probably try different ways to do the training. We have been using Adam with ReduceLROnPlateau without a particular motivation, just because it is simple. But we do now spend considerable effort on optimizing the accuracy and performance.
However, it does practically work to reduce the maximum lr even with Adam. Maybe it is due to the fact that Adam has an lr for each parameter, and by setting the top one we provide some sort of global information.
g
Here's my understanding of it. Adam is able to control its effective learning rate, but only over a limited range (typically about an order of magnitude). That makes it easier to pick a learning rate, since it's more tolerant of a choice that isn't quite optimal. But it's very common to use a schedule where the final learning rate is smaller than the initial one by more than an order of magnitude. That's a bigger range than what the optimizer can handle automatically. And even if it could, you would be back to having to precisely tune the hyperparameters, since you would need the optimizer's range to cover both the largest and smallest values you wanted.
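(For reference, the standard Adam update: the global step size alpha multiplies a per-parameter adaptive term, which is why setting the top-level learning rate still provides the kind of global control mentioned above.)

```latex
% Adam update for each parameter, with gradient g_t and global step size \alpha:
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \\
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```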
I tried OneCycleLR, but it didn't seem to work very well. Initially the learning rate is much lower, so it learns slowly. Once the learning rate gets up into the range where I would normally start it, it begins to display the blow-ups I described in #29. But the schedule is fixed rather than adaptive, so it doesn't respond by lowering the learning rate. That means it doesn't recover from them as well, and they keep happening. So far, the best strategy I've found is to start with a very large learning rate, then be quick to reduce it at any sign of loss increasing. That gives really fast training and a very low final loss.
I want to make another push for this change. Can we at least make it a supported option? It makes a huge difference to training speed. I consistently find that the fastest way to train models is to start with a high learning rate, then reduce it aggressively when training stalls (lr_patience set to 1 or 2 and lr_factor around 0.5). But that only works when basing it on training loss. Validation loss is much too noisy, so you have to wait much longer before reducing the learning rate. That leads to much slower training.
I am fine with that, if we can make it an option.
This is superseded by #89.