Base learning rate decay on training loss #27

Closed · wants to merge 1 commit

Conversation

@peastman (Collaborator)

This changes the learning rate scheduler to base its decay on the training loss rather than the validation loss. Training loss gives a much cleaner signal of whether the model is still learning. The practical effect is that you can use a smaller value for lr_patience, which leads to faster training.

In general, training loss tells you whether it is learning, and the difference between training loss and validation loss tells you whether it is overfitting. If the training loss stops decreasing, that means you need to reduce the learning rate. If the training loss is still decreasing but the validation loss stops going down, that means it is overfitting and you should stop. Reducing the learning rate won't help.
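
For concreteness, here is a minimal PyTorch sketch of the idea (not the actual torchmd-net code; `ReduceLROnPlateau`'s `factor` and `patience` arguments play the role of `lr_factor` and `lr_patience`, and the toy model and data are made up):

```python
# Minimal sketch: decay the learning rate when the *training* loss plateaus,
# while the validation loss is kept only for early stopping / model selection.
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)  # lr_factor / lr_patience

x_train, y_train = torch.randn(256, 16), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 16), torch.randn(64, 1)

for epoch in range(100):
    optimizer.zero_grad()
    train_loss = loss_fn(model(x_train), y_train)
    train_loss.backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val)  # monitored for early stopping only
    scheduler.step(train_loss.item())            # the change: step on training loss
```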

@giadefa (Contributor) commented Jun 26, 2021 via email

@peastman (Collaborator, Author)

> The reduction on learning rate should be on validation, not on training loss.

No, it shouldn't! Like I said, validation loss is not a measure of whether it's learning. Validation loss is a measure of whether it's overfitting. Reducing the learning rate does nothing to help overfitting.

> For a large enough network training loss never increases.

That is incorrect. If the learning rate is too high, the training loss will increase. This is precisely why training loss is a useful indicator of when you need to decrease the learning rate.
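
A toy example (my own illustration, not from this thread) makes the point: for the quadratic loss f(x) = x²/2, a gradient step with learning rate η multiplies the loss by (1 − η)², so any η > 2 makes the training loss grow at every step.

```latex
% Gradient descent on f(x) = x^2/2 with learning rate \eta:
% the loss shrinks only when |1 - \eta| < 1, i.e. 0 < \eta < 2.
x_{t+1} = x_t - \eta f'(x_t) = (1 - \eta)\,x_t,
\qquad
f(x_{t+1}) = (1 - \eta)^2 \, f(x_t).
```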

@peastman (Collaborator, Author)

Let me elaborate a bit on the above. All SGD-derived optimizers are first-order methods that provide no guarantees on convergence. In most other fields, we use more sophisticated optimizers (quasi-Newton methods, L-BFGS, etc.). They either use second derivatives to estimate where the minimum is likely to be, or perform line searches to guarantee they never overshoot the minimum, or both. But those methods are only practical if you can accurately calculate the objective function and its derivatives. If you're just estimating them from mini-batches, they don't work well.

So SGD and its descendants just estimate the derivative, then take a fixed-size step in that direction without having any idea how far away the minimum actually is. If they take too large a step, they'll overshoot it and the objective function will increase. That's why setting the learning rate too high prevents it from learning. Early on, when it's far away from any minimum, you can take large steps. As it gets closer to a minimum (or to a saddle point, which is also a minimum along certain directions), the steps need to be smaller. But the optimizer has no idea how close the minimum is, and hence no idea how large its steps should be. So we add learning rate control algorithms on top of it to detect when the steps are too large and the learning rate needs to be decreased.
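
Written out, the basic update (plain SGD; Adam and friends rescale it but share the same structure) is just a fixed-size step along the mini-batch gradient of the training loss, with no estimate of how far away the minimum is:

```latex
% Plain SGD step on a mini-batch B_t drawn from the training set.
\theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \hat{L}(\theta_t),
\qquad
\hat{L}(\theta) = \frac{1}{|B_t|} \sum_{i \in B_t} \ell(\theta; x_i, y_i).
```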

All of this is based on the objective function being minimized, that is, the training loss. It has nothing to do with the validation loss which is calculated for a totally different purpose (to detect overfitting). Reducing the learning rate has no effect on overfitting.

@giadefa (Contributor) commented Jun 27, 2021 via email

@peastman (Collaborator, Author)

> In DL the larger the initial learning rate the better, the higher the noise in the gradients the better.

That's very easy to disprove. Just set your learning rate to a huge value and watch the optimizer flail. It will totally fail to learn at all. This article does a nice job of describing the issue. High noise in gradients is also a problem that inhibits learning. That's why the larger the batch size (and hence the lower the noise), the larger your learning rate can be. Here are a couple of papers that discuss this and demonstrate very good training by using large batch sizes.

https://openreview.net/pdf?id=B1Yy1BxCZ
https://arxiv.org/pdf/1706.02677.pdf
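
The rule of thumb from the second paper (Goyal et al.) is the linear scaling rule, which makes the link between batch size and learning rate explicit (the notation here is my own):

```latex
% Linear scaling rule (Goyal et al., arXiv:1706.02677): when the mini-batch
% size is multiplied by k, multiply the learning rate by k as well
% (combined with a warmup phase at the start of training).
\eta(kB) \approx k \, \eta(B)
```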

Let me turn the question around: why do you think validation loss has anything to do with choosing the learning rate?

@PhilippThoelke (Collaborator)

A plateau in the training loss does not necessarily mean that the model has stopped learning. With a constant training loss, the model might still move towards an area with better generalization, i.e. the validation loss is still decreasing. However, when the validation loss plateaus (it doesn't necessarily have to increase), we know that the model is no longer improving generalization and it might be time to decrease the learning rate.
As the overall goal is to maximize generalization and not to minimize training loss, it makes sense to adapt the training behavior based on the validation loss, which is a metric for generalization. Otherwise it is very easy to "overfit" the learning rate on the training loss. This of course means that we can no longer use the validation loss to judge the model's performance, but that is why we have a test set.

Regarding your comments on the initial learning rate and noise:
Of course it only makes sense to increase the learning rate up to the point where the model still learns, but it is definitely beneficial to set it to the highest possible value initially. There are also measures to prevent the loss from exploding at the start of training, where the model might encounter very steep gradients (e.g. lr warmup).

I wouldn't say that noisy gradients are strictly bad; they can also be beneficial for avoiding local minima. There is a trade-off: a large batch size leads to fast training, but smaller batches usually lead to better convergence. Ideally, we would choose a large batch size and keep a certain level of noise in the gradients to avoid getting stuck. This is also the message of the papers you linked: they are trying to use large batches (for fast training) while maintaining noisy gradients.

@peastman (Collaborator, Author)

> With a constant training loss, the model might still move towards an area with better generalization, i.e. the validation loss is still decreasing.

That isn't how optimizers work! An optimizer tries to minimize an objective function. That's the only thing it does. It moves in a direction that reduces its objective function. It has no concept of generalization, because the objective function provides no information about generalization.

In practice, generalization (that is, the gap between performance on training data and performance on other, unseen data) never decreases with training. Before you start training, the model is equally bad on any input. In a literal (although useless) sense, it generalizes perfectly. As you train, it gets better at predicting the training set. If you've chosen your training set well, it will also get better at predicting unseen data. But the only thing the optimizer is actually trying to do is improve performance on the training set. As a result, the training performance always improves faster than the performance on unseen data.

> I wouldn't say that noisy gradients are strictly bad; they can also be beneficial for avoiding local minima. There is a trade-off: a large batch size leads to fast training, but smaller batches usually lead to better convergence.

That's what people used to think, but the consensus has now gone in the other direction. For one thing, like Gianni said, local minima mostly just aren't an issue. See for example https://arxiv.org/pdf/1406.2572.pdf. There are very few true local minima, and almost all of them are very close to the global minimum.

The modern approach is described in https://arxiv.org/pdf/1812.06162.pdf. Basically, there are two factors limiting the learning rate: noise in the gradients, and the curvature of the surface you're optimizing. At small batch sizes, noise is the dominant one. Increasing the batch size decreases the noise, which lets you use a larger learning rate. That continues up to a crossover point, where curvature becomes the limiting factor; increasing the batch size further has no benefit. It's generally recommended that you aim for that crossover point: decrease the noise right up to the point where it stops being the limiting factor, since beyond that there's no benefit to decreasing it further.
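
That crossover point is quantified in the paper by a "simple noise scale"; this is my paraphrase of their result, not a formula from this thread:

```latex
% Simple noise scale from McCandlish et al. (arXiv:1812.06162): \Sigma is the
% per-example gradient covariance and G the true (full-batch) gradient.
% Batch sizes well below B_simple are noise-limited; well above, curvature-limited.
B_{\text{simple}} = \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```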

@PhilippThoelke (Collaborator)

Does using the training loss in the scheduler improve your performance? I tried training the same model twice with the same seed to compare the difference between scheduling the learning rate on the training vs validation loss. In my case the test loss of the model scheduled on the training loss was slightly higher. I also couldn't observe faster convergence.

Also, I was not able to find anything recommending scheduling the learning rate based on the training loss, everything I found is using the validation loss.

@peastman (Collaborator, Author) commented Jul 1, 2021

Here is a plot of the training loss and validation loss from a training run I did a few days ago.

[plot: training loss and validation loss from the run; the training loss curve is smooth, while the validation loss curve is much noisier]

Training loss gives a very clean signal while validation loss gives a very noisy one. In this run, I had lr_patience set to 3. If it were based on validation loss, I would have had to set it much higher because the validation loss is so much noisier. The examples in the repository set it to 15, which means that once the loss stops decreasing, it has to keep running for a minimum of 15 more epochs before it figures that out and reduces the learning rate. This change makes training a lot more efficient.

@PhilippThoelke (Collaborator)

Yes, it definitely converges faster that way. However, I'm noticing that because the learning rate is reduced so quickly, it does not reach as low a loss as it does with a larger lr_patience based on the validation loss. I observed this for multiple values of lr_patience between 3 and 15. Both test and training loss converge too quickly this way; validation loss was around the same or even a bit lower.

@peastman (Collaborator, Author) commented Jul 4, 2021

In that case, you may want to increase lr_factor a bit. Or lr_patience. You still need to tune parameters to get optimal results, but it's easier to do that based on a clean signal than a noisy one.

@peastman (Collaborator, Author) commented Jul 4, 2021

It also could be interesting to try a learning rate policy that allows the rate to increase as well as decrease. That's pretty common with predefined rate schedules (e.g. linear cosine decay), but PyTorch doesn't seem to have any adaptive methods that do it. I might experiment with writing one.

@PhilippThoelke (Collaborator)

> In that case, you may want to increase lr_factor a bit. Or lr_patience. You still need to tune parameters to get optimal results, but it's easier to do that based on a clean signal than a noisy one.

I did try a couple of different values and wasn't able to improve efficiency or loss. As it doesn't seem to be established that the lr scheduler should be based on the training loss and it is not a large effort for you to adjust that in your training, we will keep it as is for now.

> It also could be interesting to try a learning rate policy that allows the rate to increase as well as decrease. That's pretty common with predefined rate schedules (e.g. linear cosine decay), but PyTorch doesn't seem to have any adaptive methods that do it. I might experiment with writing one.

I agree, this could be interesting. Have you seen these?
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html#torch.optim.lr_scheduler.CosineAnnealingLR

https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CyclicLR.html#torch.optim.lr_scheduler.CyclicLR
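
(For reference, a minimal usage sketch of those two schedulers; both follow a fixed timetable driven purely by the step count, independent of either loss. The numbers are placeholders.)

```python
# Both schedulers below follow a predefined timetable; neither reacts to the
# training or validation loss.
import torch
from torch import nn, optim

model = nn.Linear(16, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Cosine annealing: decay from lr=0.1 down to eta_min over T_max steps.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-4)
# Alternatively, cycle between base_lr and max_lr on a fixed period:
# scheduler = optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-4, max_lr=0.1,
#                                         step_size_up=500)

for step in range(1000):
    ...                # forward / backward / optimizer.step() would go here
    scheduler.step()   # advances purely by step count
```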

@peastman (Collaborator, Author) commented Jul 5, 2021

> I did try a couple of different values and wasn't able to improve efficiency or loss.

Back when I first started using torchmd-net, I spent about a week just running tests of training protocols to try to speed up training, since @giadefa had told me that training on large datasets was prohibitively slow. I eventually settled on a protocol that was anywhere from 2 to 10 times faster than what he was using, depending on the dataset. Basing learning rate decay on training loss was absolutely essential to making that work.

> As it doesn't seem to be established that the lr scheduler should be based on the training loss

Can you provide any theoretical justification for using validation loss? I just can't see any at all. Whereas the justifications for using training loss are obvious.

> I agree, this could be interesting. Have you seen these?

Those are predefined learning rate schedules. You have to specify in advance exactly how the learning rate will change with time. I want to try an adaptive one, similar to ReduceLROnPlateau but allowing the rate to increase as well as decrease.
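
A rough sketch of what such an adaptive policy could look like (purely illustrative; this is not an existing PyTorch scheduler and not code from this PR): shrink the rate on a plateau as ReduceLROnPlateau does, but also grow it again after a sustained run of improvement.

```python
# Hypothetical adaptive policy: decrease the learning rate when the monitored
# loss plateaus, but also allow it to increase after sustained improvement.
class AdaptiveLR:
    def __init__(self, optimizer, factor=0.5, growth=1.2, patience=2, max_lr=1e-2):
        self.optimizer = optimizer
        self.factor, self.growth = factor, growth
        self.patience, self.max_lr = patience, max_lr
        self.best = float("inf")
        self.bad_epochs = 0    # epochs since the loss last improved
        self.good_epochs = 0   # consecutive epochs of improvement

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.good_epochs += 1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            self.good_epochs = 0
        if self.bad_epochs > self.patience:      # plateau: shrink the rate
            self._scale(self.factor)
            self.bad_epochs = 0
        elif self.good_epochs > self.patience:   # steady progress: grow it
            self._scale(self.growth)
            self.good_epochs = 0

    def _scale(self, s):
        for group in self.optimizer.param_groups:
            group["lr"] = min(group["lr"] * s, self.max_lr)
```

It would be stepped once per epoch with whatever loss it monitors, exactly like ReduceLROnPlateau.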

@giadefa (Contributor) commented Jul 5, 2021 via email

@peastman (Collaborator, Author) commented Jul 5, 2021

There are two main code changes required. One is to base the learning rate on training loss rather than validation loss. The other (which I also plan to send a PR for) is to provide an option for lr_patience and early_stopping_patience to be measured in batches rather than epochs. When training on a really big dataset like ANI, epochs are much too slow. You need to be able to adjust the learning rate multiple times within a single epoch, rather than having to wait many epochs to adjust it once. Otherwise, you artificially force your training time to be proportional to the dataset size.
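
As an illustration of that second change (not the eventual PR itself; `train_loader`, `training_step`, and the ReduceLROnPlateau `scheduler` are hypothetical names assumed to exist), the scheduler would be stepped on a smoothed training loss every fixed number of batches rather than once per epoch:

```python
# Step the scheduler every `interval` batches on the mean training loss, so the
# learning rate can be reduced several times within one pass over a huge dataset.
# `train_loader`, `training_step`, and `scheduler` are hypothetical.
interval = 1000
running_loss, n = 0.0, 0

for batch in train_loader:
    running_loss += training_step(batch)   # forward/backward/optimizer.step(), returns float loss
    n += 1
    if n == interval:
        scheduler.step(running_loss / n)   # mean training loss over the interval
        running_loss, n = 0.0, 0
```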

> You really want the training rate to optimize your future test performance, so using a validation set is an obvious choice.

That's the reason to use validation loss for early stopping, yes. It tells you when you're starting to overfit. But what's the reason to use it for learning rate decay? If the training loss is still decreasing but the validation loss isn't, that means the model is still learning but it's overfitting to the training set. Reducing the learning rate won't help.

@stefdoerr (Collaborator) commented Jul 6, 2021

> The rationale for using validation is, I think, generalization. You really want the training rate to optimize your future test performance, so using a validation set is an obvious choice.

Just to chime in, I am on the same page as Peter. The learning rate, the way I learned it, is defined for the training set. When you are optimizing your network you are optimizing on the training set, and the learning rate is used to improve the convergence of the minimizer on that specific surface you are minimizing. Adjusting the learning rate by the validation set is weird in the sense that it doesn't necessarily relate to the convergence of the minimization on the training set at all.
Adjusting the learning rate is essentially: "was the step I took too big on this surface? make it smaller". You cannot say: "was my step too big on a different surface? make it smaller on this one". That would only make sense if the two surfaces were identical or very similar, and there is no real guarantee of that given the way train/val splits are done.

@giadefa (Contributor) commented Jul 6, 2021 via email

@stefdoerr (Collaborator) commented Jul 6, 2021

At that point why not optimize on the validation set? (Just kidding.) I understand that nobody cares about convergence on the training set, but changing the learning rate depending on the validation set does not make the validation set converge either, because you are moving on top of a totally different surface there.
So what you are saying is that you want to choose the direction of the minimization step from the training surface and the size of the step from the validation surface. I don't see how these relate, except in the case where the two surfaces are identical.

@giadefa (Contributor) commented Jul 6, 2021 via email

@peastman (Collaborator, Author) commented Jul 6, 2021

If the training loss is still decreasing but the validation loss has stopped decreasing, why do you think it would help to decrease the learning rate? That's really the algorithmic question we're talking about here.

@peastman (Collaborator, Author) commented Jul 7, 2021

I came across this article which claims Adam and ReduceLROnPlateau are incompatible with each other, and you should never use them together. I'm not convinced that's actually true. It doesn't match my own experience. Still, I'll try some tests of the method he recommends, which is OneCycleLR with SGD.
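
For reference, the recipe being tested looks roughly like this (a sketch with placeholder numbers, not a tuned configuration):

```python
# OneCycleLR + SGD: a fixed warmup-then-anneal schedule, stepped once per batch.
import torch
from torch import nn, optim

model = nn.Linear(16, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=50, steps_per_epoch=500)

for epoch in range(50):
    for step in range(500):
        ...                # forward / backward / optimizer.step() would go here
        scheduler.step()   # the schedule is fixed; it never reacts to the loss
```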

@giadefa (Contributor) commented Jul 8, 2021 via email

@peastman (Collaborator, Author) commented Jul 8, 2021

Here's my understanding of it. Adam is able to control its effective learning rate, but only over a limited range (typically about an order of magnitude). That makes it easier to pick a learning rate, since it's more tolerant of a choice that isn't quite optimal. But it's very common to use a schedule where the final learning rate is smaller than the initial one by more than an order of magnitude. That's a bigger range than what the optimizer can handle automatically. And even if it could, you would be back to having to precisely tune the hyperparameters, since you would need the optimizer's range to cover both the largest and smallest values you wanted.

@peastman (Collaborator, Author) commented Jul 9, 2021

I tried OneCycleLR, but it didn't seem to work very well. Initially the learning rate is much lower, so it learns slowly. Once the learning rate gets up into the range where I would normally start it, it begins to display the blow-ups I described in #29. But the schedule is fixed rather than adaptive, so it doesn't respond by lowering the learning rate. That means it doesn't recover from them as well, and they keep happening.

So far, the best strategy I've found is to start with a very large learning rate, then be quick to reduce it at any sign of loss increasing. That gives really fast training and a very low final loss.

@peastman (Collaborator, Author)

I want to make another push for this change. Can we at least make it a supported option? It makes a huge difference to training speed.

I consistently find that the fastest way to train models is to start with a high learning rate, then reduce it aggressively when training stalls (lr_patience set to 1 or 2 and lr_factor around 0.5). But that only works when basing it on training loss. Validation loss is much too noisy, so you have to wait much longer before reducing the learning rate. That leads to much slower training.

@giadefa (Contributor) commented May 11, 2022 via email

@peastman (Collaborator, Author)

This is superseded by #89.

@peastman closed this May 17, 2022
@peastman deleted the trainloss branch May 17, 2022