Base learning rate decay on training loss #27

Closed · wants to merge 1 commit

Conversation

@peastman (Collaborator)

This changes the learning rate scheduler to base its decay on the training loss rather than the validation loss. Training loss gives a much cleaner signal of whether the model is still learning. The practical effect is that you can use a smaller value for lr_patience, which leads to faster training.

In general, training loss tells you whether it is learning, and the difference between training loss and validation loss tells you whether it is overfitting. If the training loss stops decreasing, that means you need to reduce the learning rate. If the training loss is still decreasing but the validation loss stops going down, that means it is overfitting and you should stop. Reducing the learning rate won't help.
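
For concreteness, here is a minimal PyTorch sketch of the idea (not the actual torchmd-net code; `ReduceLROnPlateau`'s `factor` and `patience` arguments play the role of `lr_factor` and `lr_patience`, and the toy model and data are made up):

```python
# Minimal sketch: decay the learning rate when the *training* loss plateaus,
# while the validation loss is kept only for early stopping / model selection.
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)  # lr_factor / lr_patience

x_train, y_train = torch.randn(256, 16), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 16), torch.randn(64, 1)

for epoch in range(100):
    optimizer.zero_grad()
    train_loss = loss_fn(model(x_train), y_train)
    train_loss.backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val)  # monitored for early stopping only
    scheduler.step(train_loss.item())            # the change: step on training loss
```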

@giadefa (Contributor) commented Jun 26, 2021 via email

@peastman (Collaborator, Author)

> The reduction on learning rate should be on validation, not on training loss.

No, it shouldn't! Like I said, validation loss is not a measure of whether it's learning. Validation loss is a measure of whether it's overfitting. Reducing the learning rate does nothing to help overfitting.

> For a large enough network training loss never increases.

That is incorrect. If the learning rate is too high, the training loss will increase. This is precisely why training loss is a useful indicator of when you need to decrease the learning rate.
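
A toy example (my own illustration, not from this thread) makes the point: for the quadratic loss f(x) = x²/2, a gradient step with learning rate η multiplies the loss by (1 − η)², so any η > 2 makes the training loss grow at every step.

```latex
% Gradient descent on f(x) = x^2/2 with learning rate \eta:
% the loss shrinks only when |1 - \eta| < 1, i.e. 0 < \eta < 2.
x_{t+1} = x_t - \eta f'(x_t) = (1 - \eta)\,x_t,
\qquad
f(x_{t+1}) = (1 - \eta)^2 \, f(x_t).
```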

@peastman (Collaborator, Author)

Let me elaborate a bit on the above. All SGD-derived optimizers are first-order methods that provide no guarantees on convergence. In most other fields, we use more sophisticated optimizers (quasi-Newton methods, L-BFGS, etc.). They either use second derivatives to estimate where the minimum is likely to be, or perform line searches to guarantee they never overshoot the minimum, or both. But those methods are only practical if you can accurately calculate the objective function and its derivatives. If you're just estimating them from mini-batches, they don't work well.

So SGD and its descendants just estimate the derivative, then take a fixed-size step in that direction without having any idea how far away the minimum actually is. If they take too large a step, they'll overshoot it and the objective function will increase. That's why setting the learning rate too high prevents it from learning. Early on, when it's far away from any minimum, you can take large steps. As it gets closer to a minimum (or to a saddle point, which is also a minimum along certain directions), the steps need to be smaller. But the optimizer has no idea how close the minimum is, and hence no idea how large its steps should be. So we add learning rate control algorithms on top of it to detect when the steps are too large and the learning rate needs to be decreased.
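
Written out, the basic update (plain SGD; Adam and friends rescale it but share the same structure) is just a fixed-size step along the mini-batch gradient of the training loss, with no estimate of how far away the minimum is:

```latex
% Plain SGD step on a mini-batch B_t drawn from the training set.
\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \hat{L}(\theta_t),
\qquad
\hat{L}(\theta) = \frac{1}{|B_t|} \sum_{i \in B_t} \ell(\theta; x_i, y_i).
```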

All of this is based on the objective function being minimized, that is, the training loss. It has nothing to do with the validation loss which is calculated for a totally different purpose (to detect overfitting). Reducing the learning rate has no effect on overfitting.

@giadefa (Contributor) commented Jun 27, 2021 via email

@peastman (Collaborator, Author)

> In DL the larger the initial learning rate the better, the higher the noise in the gradients the better.

That's very easy to disprove. Just set your learning rate to a huge value and watch the optimizer flail. It will totally fail to learn at all. This article does a nice job of describing the issue. High noise in gradients is also a problem that inhibits learning. That's why the larger the batch size (and hence the lower the noise), the larger your learning rate can be. Here are a couple of papers that discuss this and demonstrate very good training by using large batch sizes.

https://openreview.net/pdf?id=B1Yy1BxCZ
https://arxiv.org/pdf/1706.02677.pdf
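
The rule of thumb from the second paper (Goyal et al.) is the linear scaling rule, which makes the link between batch size and learning rate explicit (the notation here is my own):

```latex
% Linear scaling rule (Goyal et al., arXiv:1706.02677): when the mini-batch
% size is multiplied by k, multiply the learning rate by k as well
% (combined with a warmup phase at the start of training).
\eta(kB) \approx k \, \eta(B)
```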

Let me turn the question around: why do you think validation loss has anything to do with choosing the learning rate?

@PhilippThoelke (Collaborator)

A plateau in the training loss does not necessarily mean that the model has stopped learning. With a constant training loss, the model might still move towards an area with better generalization, i.e. the validation loss is still decreasing. However, when the validation loss plateaus (it doesn't necessarily have to increase), we know that the model is no longer improving generalization and it might be time to decrease the learning rate.
As the overall goal is to maximize generalization and not to minimize training loss, it makes sense to adapt the training behavior based on the validation loss, which is a metric for generalization. Otherwise it is very easy to "overfit" the learning rate on the training loss. This of course means that we can no longer use the validation loss to judge the model's performance, but that is why we have a test set.

Regarding your comments on the initial learning rate and noise:
Of course it only makes sense to increase the learning rate up to the point where the model still learns, but it is definitely beneficial to set it to the highest possible value initially. There are also measures to prevent the loss from exploding at the start of training, where the model might encounter very steep gradients (e.g. lr warmup).

I wouldn't say that noisy gradients are strictly bad; they can also be beneficial for avoiding local minima. There is a trade-off: a large batch size leads to fast training, but smaller batches usually lead to better convergence. Ideally, we would choose a large batch size and keep a certain level of noise in the gradients to avoid getting stuck. This is also the message of the papers you linked: they are trying to use large batches (for fast training) while maintaining noisy gradients.

@peastman (Collaborator, Author)

> With a constant training loss, the model might still move towards an area with better generalization, i.e. the validation loss is still decreasing.

That isn't how optimizers work! An optimizer tries to minimize an objective function. That's the only thing it does. It moves in a direction that reduces its objective function. It has no concept of generalization, because the objective function provides no information about generalization.

In practice, generalization (that is, the gap between performance on training data and performance on other, unseen data) never decreases with training. Before you start training, the model is equally bad on any input. In a literal (although useless) sense, it generalizes perfectly. As you train, it gets better at predicting the training set. If you've chosen your training set well, it will also get better at predicting unseen data. But the only thing the optimizer is actually trying to do is improve performance on the training set. As a result, the training performance always improves faster than the performance on unseen data.

> I wouldn't say that noisy gradients are strictly bad; they can also be beneficial for avoiding local minima. There is a trade-off: a large batch size leads to fast training, but smaller batches usually lead to better convergence.

That's what people used to think, but the consensus has now gone in the other direction. For one thing, like Gianni said, local minima mostly just aren't an issue. See for example https://arxiv.org/pdf/1406.2572.pdf. There are very few true local minima, and almost all of them are very close to the global minimum.

The modern approach is described in https://arxiv.org/pdf/1812.06162.pdf. Basically, there are two factors limiting the learning rate: noise in the gradients, and the curvature of the surface you're optimizing. At small batch sizes, noise is the dominant one. Increasing the batch size decreases the noise, which lets you use a larger learning rate. That continues up to a crossover point, where curvature becomes the limiting factor; increasing the batch size further has no benefit. It's generally recommended that you aim for that crossover point: decrease the noise right up to the point where it stops being the limiting factor, since beyond that there's no benefit to decreasing it further.
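
That crossover point is quantified in the paper by a "simple noise scale"; this is my paraphrase of their result, not a formula from this thread:

```latex
% Simple noise scale from McCandlish et al. (arXiv:1812.06162): \Sigma is the
% per-example gradient covariance and G the true (full-batch) gradient.
% Batch sizes well below B_simple are noise-limited; well above, curvature-limited.
B_{\text{simple}} = \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```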

@PhilippThoelke (Collaborator)

Does using the training loss in the scheduler improve your performance? I tried training the same model twice with the same seed to compare the difference between scheduling the learning rate on the training vs validation loss. In my case the test loss of the model scheduled on the training loss was slightly higher. I also couldn't observe faster convergence.

Also, I was not able to find anything recommending scheduling the learning rate based on the training loss, everything I found is using the validation loss.

@peastman (Collaborator, Author) commented Jul 1, 2021

Here is a plot of the training loss and validation loss from a training run I did a few days ago.

[plot: training loss and validation loss from the run; the training loss curve is smooth, while the validation loss curve is much noisier]

Training loss gives a very clean signal while validation loss gives a very noisy one. In this run, I had lr_patience set to 3. If it were based on validation loss, I would have had to set it much higher because the validation loss is so much noisier. The examples in the repository set it to 15, which means that once the loss stops decreasing, it has to keep running for a minimum of 15 more epochs before it figures that out and reduces the learning rate. This change makes training a lot more efficient.

@PhilippThoelke (Collaborator)

Yes, it definitely converges faster that way. However, I'm noticing that because the learning rate is reduced so quickly, it does not reach as low a loss as it does with a larger lr_patience based on the validation loss. I observed this for multiple values of lr_patience between 3 and 15. Both test and training loss converge too quickly this way; validation loss was around the same or even a bit lower.

@peastman (Collaborator, Author) commented Jul 4, 2021

In that case, you may want to increase lr_factor a bit. Or lr_patience. You still need to tune parameters to get optimal results, but it's easier to do that based on a clean signal than a noisy one.

@peastman (Collaborator, Author) commented Jul 4, 2021

It also could be interesting to try a learning rate policy that allows the rate to increase as well as decrease. That's pretty common with predefined rate schedules (e.g. linear cosine decay), but PyTorch doesn't seem to have any adaptive methods that do it. I might experiment with writing one.

@PhilippThoelke (Collaborator)

> In that case, you may want to increase lr_factor a bit. Or lr_patience. You still need to tune parameters to get optimal results, but it's easier to do that based on a clean signal than a noisy one.

I did try a couple of different values and wasn't able to improve efficiency or loss. As it doesn't seem to be established that the lr scheduler should be based on the training loss and it is not a large effort for you to adjust that in your training, we will keep it as is for now.

> It also could be interesting to try a learning rate policy that allows the rate to increase as well as decrease. That's pretty common with predefined rate schedules (e.g. linear cosine decay), but PyTorch doesn't seem to have any adaptive methods that do it. I might experiment with writing one.

I agree, this could be interesting. Have you seen these?
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR

https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CyclicLR.html#torch.optim.lr_scheduler.CyclicLR
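
(For reference, a minimal usage sketch of those two schedulers; both follow a fixed timetable driven purely by the step count, independent of either loss. The numbers are placeholders.)

```python
# Both schedulers below follow a predefined timetable; neither reacts to the
# training or validation loss.
import torch
from torch import nn, optim

model = nn.Linear(16, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Cosine annealing: decay from lr=0.1 down to eta_min over T_max steps.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-4)
# Alternatively, cycle between base_lr and max_lr on a fixed period:
# scheduler = optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-4, max_lr=0.1,
#                                         step_size_up=500)

for step in range(1000):
    ...                # forward / backward / optimizer.step() would go here
    scheduler.step()   # advances purely by step count
```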

@peastman (Collaborator, Author) commented Jul 5, 2021

> I did try a couple of different values and wasn't able to improve efficiency or loss.

Back when I first started using torchmd-net, I spent about a week just running tests of training protocols to try to speed up training, since @giadefa had told me that training on large datasets was prohibitively slow. I eventually settled on a protocol that was anywhere from 2 to 10 times faster than what he was using, depending on the dataset. Basing learning rate decay on training loss was absolutely essential to making that work.

> As it doesn't seem to be established that the lr scheduler should be based on the training loss

Can you provide any theoretical justification for using validation loss? I just can't see any at all. Whereas the justifications for using training loss are obvious.

> I agree, this could be interesting. Have you seen these?

Those are predefined learning rate schedules. You have to specify in advance exactly how the learning rate will change with time. I want to try an adaptive one, similar to ReduceLROnPlateau but allowing the rate to increase as well as decrease.
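
A rough sketch of what such an adaptive policy could look like (purely illustrative; this is not an existing PyTorch scheduler and not code from this PR): shrink the rate on a plateau as ReduceLROnPlateau does, but also grow it again after a sustained run of improvement.

```python
# Hypothetical adaptive policy: decrease the learning rate when the monitored
# loss plateaus, but also allow it to increase after sustained improvement.
class AdaptiveLR:
    def __init__(self, optimizer, factor=0.5, growth=1.2, patience=2, max_lr=1e-2):
        self.optimizer = optimizer
        self.factor, self.growth = factor, growth
        self.patience, self.max_lr = patience, max_lr
        self.best = float("inf")
        self.bad_epochs = 0    # epochs since the loss last improved
        self.good_epochs = 0   # consecutive epochs of improvement

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.good_epochs += 1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            self.good_epochs = 0
        if self.bad_epochs > self.patience:      # plateau: shrink the rate
            self._scale(self.factor)
            self.bad_epochs = 0
        elif self.good_epochs > self.patience:   # steady progress: grow it
            self._scale(self.growth)
            self.good_epochs = 0

    def _scale(self, s):
        for group in self.optimizer.param_groups:
            group["lr"] = min(group["lr"] * s, self.max_lr)
```

It would be stepped once per epoch with whatever loss it monitors, exactly like ReduceLROnPlateau.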

@giadefa (Contributor) commented Jul 5, 2021 via email

@peastman (Collaborator, Author) commented Jul 5, 2021

There are two main code changes required. One is to base the learning rate on training loss rather than validation loss. The other (which I also plan to send a PR for) is to provide an option for lr_patience and early_stopping_patience to be measured in batches rather than epochs. When training on a really big dataset like ANI, epochs are much too slow. You need to be able to adjust the learning rate multiple times within a single epoch, rather than having to wait many epochs to adjust it once. Otherwise, you artificially force your training time to be proportional to the dataset size.
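
As an illustration of that second change (not the eventual PR itself; `train_loader`, `training_step`, and the ReduceLROnPlateau `scheduler` are hypothetical names assumed to exist), the scheduler would be stepped on a smoothed training loss every fixed number of batches rather than once per epoch:

```python
# Step the scheduler every `interval` batches on the mean training loss, so the
# learning rate can be reduced several times within one pass over a huge dataset.
# `train_loader`, `training_step`, and `scheduler` are hypothetical.
interval = 1000
running_loss, n = 0.0, 0

for batch in train_loader:
    running_loss += training_step(batch)   # forward/backward/optimizer.step(), returns float loss
    n += 1
    if n == interval:
        scheduler.step(running_loss / n)   # mean training loss over the interval
        running_loss, n = 0.0, 0
```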

> You really want the training rate to optimize your future test performance, so using a validation set is an obvious choice.

That's the reason to use validation loss for early stopping, yes. It tells you when you're starting to overfit. But what's the reason to use it for learning rate decay? If the training loss is still decreasing but the validation loss isn't, that means the model is still learning but it's overfitting to the training set. Reducing the learning rate won't help.

@stefdoerr (Collaborator) commented Jul 6, 2021

> The rationale for using validation is, I think, generalization. You really want the training rate to optimize your future test performance, so using a validation set is an obvious choice.

Just to chime in, I am on the same page as Peter. The learning rate, the way I learned it, is defined for the training set. When you are optimizing your network you are optimizing on the training set, and the learning rate is used to improve the convergence of the minimizer on that specific surface you are minimizing. Adjusting the learning rate by the validation set is weird in the sense that it doesn't necessarily relate to the convergence of the minimization on the training set at all.
Adjusting the learning rate is essentially: "was the step I took too big on this surface? make it smaller". You cannot say: "was my step too big on a different surface? make it smaller on this one". That would only make sense if the two surfaces were identical or very similar, and there is no real guarantee of that given the way train/val splits are done.

@giadefa (Contributor) commented Jul 6, 2021 via email

@stefdoerr (Collaborator) commented Jul 6, 2021

At that point why not optimize on the validation set? (Just kidding.) I understand that nobody cares about convergence on the training set, but changing the learning rate depending on the validation set does not make the validation set converge either, because you are moving on top of a totally different surface there.
So what you are saying is that you want to choose the direction of the minimization step from the training surface and the size of the step from the validation surface. I don't see how these relate, except in the case where the two surfaces are identical.

@giadefa (Contributor) commented Jul 6, 2021 via email

@peastman (Collaborator, Author) commented Jul 6, 2021

If the training loss is still decreasing but the validation loss has stopped decreasing, why do you think it would help to decrease the learning rate? That's really the algorithmic question we're talking about here.

@peastman (Collaborator, Author) commented Jul 7, 2021

I came across this article which claims Adam and ReduceLROnPlateau are incompatible with each other, and you should never use them together. I'm not convinced that's actually true. It doesn't match my own experience. Still, I'll try some tests of the method he recommends, which is OneCycleLR with SGD.
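
For reference, the recipe being tested looks roughly like this (a sketch with placeholder numbers, not a tuned configuration):

```python
# OneCycleLR + SGD: a fixed warmup-then-anneal schedule, stepped once per batch.
import torch
from torch import nn, optim

model = nn.Linear(16, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=50, steps_per_epoch=500)

for epoch in range(50):
    for step in range(500):
        ...                # forward / backward / optimizer.step() would go here
        scheduler.step()   # the schedule is fixed; it never reacts to the loss
```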

@giadefa (Contributor) commented Jul 8, 2021 via email

@peastman (Collaborator, Author) commented Jul 8, 2021

Here's my understanding of it. Adam is able to control its effective learning rate, but only over a limited range (typically about an order of magnitude). That makes it easier to pick a learning rate, since it's more tolerant of a choice that isn't quite optimal. But it's very common to use a schedule where the final learning rate is smaller than the initial one by more than an order of magnitude. That's a bigger range than what the optimizer can handle automatically. And even if it could, you would be back to having to precisely tune the hyperparameters, since you would need the optimizer's range to cover both the largest and smallest values you wanted.

@peastman (Collaborator, Author) commented Jul 9, 2021

I tried OneCycleLR, but it didn't seem to work very well. Initially the learning rate is much lower, so it learns slowly. Once the learning rate gets up into the range where I would normally start it, it begins to display the blow-ups I described in #29. But the schedule is fixed rather than adaptive, so it doesn't respond by lowering the learning rate. That means it doesn't recover from them as well, and they keep happening.

So far, the best strategy I've found is to start with a very large learning rate, then be quick to reduce it at any sign of loss increasing. That gives really fast training and a very low final loss.

@peastman (Collaborator, Author)

I want to make another push for this change. Can we at least make it a supported option? It makes a huge difference to training speed.

I consistently find that the fastest way to train models is to start with a high learning rate, then reduce it aggressively when training stalls (lr_patience set to 1 or 2 and lr_factor around 0.5). But that only works when basing it on training loss. Validation loss is much too noisy, so you have to wait much longer before reducing the learning rate. That leads to much slower training.

@giadefa (Contributor) commented May 11, 2022 via email

@peastman (Collaborator, Author)

This is superseded by #89.

@peastman closed this May 17, 2022
@peastman deleted the trainloss branch May 17, 2022