Hi Andrej,
I recently ran the trainer demo on MNIST and wondered why the Adam optimizer performs so much worse than Adadelta.
I think I found a little bug in the Adam implementation.
According to Algorithm 1 (p. 2) of the Adam paper, v8 (https://arxiv.org/pdf/1412.6980v8.pdf), the bias-corrected moment estimates are computed by dividing by (1 - beta^t), not multiplying by it. The fixed version behaves significantly better when running the trainer demo on MNIST. To get the results below I also changed the learning rate to 0.001 and the beta2 parameter to 0.999 (from 0.01 and 0.99 respectively), as recommended in the paper.
Before:
After:
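For clarity, here is a minimal sketch of a single Adam update with the bias correction applied by division, as in Algorithm 1 of the paper. This is only an illustration of the fix; the function and parameter names are hypothetical and not the actual ConvNetJS code:

```typescript
// Illustrative Adam step (not the ConvNetJS source). The bias-corrected
// estimates divide by (1 - beta^t); the buggy version multiplied instead.
function adamStep(
  w: Float64Array,   // parameters, updated in place
  g: Float64Array,   // gradients dL/dw
  m: Float64Array,   // first-moment estimate (running mean of g)
  v: Float64Array,   // second-moment estimate (running mean of g^2)
  t: number,         // timestep, starting at 1
  lr = 0.001,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8
): void {
  for (let i = 0; i < w.length; i++) {
    // update biased moment estimates
    m[i] = beta1 * m[i] + (1 - beta1) * g[i];
    v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
    // bias correction: divide, not multiply
    const mHat = m[i] / (1 - Math.pow(beta1, t));
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    // parameter update
    w[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
  }
}
```

With multiplication, the correction shrinks the step sizes early in training instead of enlarging them, which would explain the poor performance relative to Adadelta.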