Learning Rate
The learning rate is crucial for the convergence and efficiency of stochastic gradient descent. There are two general forms of the update:

(1) θ_n = θ_{n-1} + α_n ∇l(θ_{n-1}; y_n), where α_n is a scalar.

(2) θ_n = θ_{n-1} + D_n ∇l(θ_{n-1}; y_n), where D_n is a matrix.

By the theory obtained from truncating the Taylor expansion at second order (Newton's method), and also from natural gradients (Amari, 1998), α_n and D_n should approximate the inverse of the Fisher information at the current iterate θ_{n-1}:

I(θ_{n-1})^{-1} = E[∇l(θ_{n-1}; y_n) (∇l(θ_{n-1}; y_n))^T]^{-1}
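As a rough illustration only (not code from this package), here is a minimal numpy sketch of the two update forms, using an empirical Fisher information built from hypothetical per-observation score vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-observation score vectors ∇l(θ; y_n), stacked as rows.
scores = rng.normal(size=(1000, 3))

# Empirical Fisher information: the average outer product of the scores.
fisher = scores.T @ scores / scores.shape[0]
fisher_inv = np.linalg.inv(fisher)       # ideal conditioner for form (2)

theta = np.zeros(3)
g = scores[0]                            # one stochastic gradient

theta_scalar = theta + 0.1 * g           # form (1): scalar alpha_n = 0.1
theta_matrix = theta + fisher_inv @ g    # form (2): D_n approximates I(θ)^{-1}
```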
We adopt the generalized formulation from Xu (2011), with scalar hyperparameters γ, α, c, and scale:

α_n = scale * γ / (1 + αγn)^c

The defaults are scale=1, γ=1, α equal to the minimum eigenvalue of the covariance of the data's design matrix, and c=1 if not averaging, c=2/3 if averaging.
Note that as special cases, this includes the popular learning rates α_n = scale/n (take γ=1, α=1, c=1, which gives α_n = scale/(1 + n)) and the constant rate α_n = scale (take γ=1 and c=0, or equivalently α=0, so the denominator is always 1).
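For concreteness, a small sketch of this schedule; the function name is hypothetical and the defaults shown are only illustrative (the package's default α is data-dependent):

```python
def one_dim_rate(n, scale=1.0, gamma=1.0, alpha=1.0, c=1.0):
    """Xu (2011)-style schedule: alpha_n = scale * gamma / (1 + alpha*gamma*n)**c."""
    return scale * gamma / (1.0 + alpha * gamma * n) ** c

# Special cases:
print(one_dim_rate(10))          # gamma = alpha = c = 1  ->  scale / (1 + n)
print(one_dim_rate(10, c=0.0))   # c = 0 (or alpha = 0)   ->  constant scale * gamma
```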
We also build a scalar approximation to the Fisher information, based on an upper bound of the minimum eigenvalue λ_min of the observed Fisher information:

α_n = 1/(λ_min * n)

where λ_min is bounded by (and thus substituted with) the basic expression

λ_min ≤ sum(I_hat)/d, with I_hat = ∇l(θ_{n-1}; y_n)^2

The trace of the observed Fisher information, i.e., the sum of the diagonal entries of ∇l(θ_{n-1}; y_n)[∇l(θ_{n-1}; y_n)]^T, can be efficiently calculated by taking the sum of the component-wise squares of ∇l(θ_{n-1}; y_n).
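A minimal sketch of this rule, assuming a numpy gradient vector; the function name is hypothetical:

```python
import numpy as np

def one_dim_eigen_rate(grad, n):
    """alpha_n = 1 / (lambda_min * n), with lambda_min replaced by its bound
    trace(I_hat)/d, i.e. the mean of the squared gradient entries."""
    lam_bound = np.mean(grad ** 2)     # sum(I_hat) / d
    return 1.0 / (lam_bound * n)
```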
We also build a d-parameter (diagonal) matrix approximation to the Fisher information, based on the diagonal of the observed Fisher information:

I_hat = ∇l(θ_{n-1}; y_n)^2
D_n = 1/(I_hat + ε)

We save computation by noting that the diagonal of the observed Fisher information, i.e., the diagonal of ∇l(θ_{n-1}; y_n)∇l(θ_{n-1}; y_n)^T, is equal to the component-wise square of ∇l(θ_{n-1}; y_n). The default is ε=1e-6, which prevents division by zero and helps numerical stability.
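A minimal sketch of this diagonal conditioner, with a hypothetical function name:

```python
import numpy as np

def d_dim_conditioner(grad, eps=1e-6):
    """Diagonal conditioner D_n = 1 / (I_hat + eps), where I_hat is the
    component-wise squared gradient (diagonal of the observed Fisher)."""
    I_hat = grad ** 2
    return 1.0 / (I_hat + eps)

# With a diagonal D_n, update form (2) reduces to a component-wise product:
# theta = theta + d_dim_conditioner(grad) * grad
```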
We follow AdaGrad (Duchi et al., 2011). The idea is a diagonal approximation to the Fisher information that accumulates the previous history of the Fisher information together with the new term:

I_hat_new = ∇l(θ_{n-1}; y_n)^2
I_hat = I_hat + I_hat_new
D_n = η/(I_hat + ε)^(1/2)

The observed Fisher information I_hat is the sum of two terms: I_hat from the previous iteration, and I_hat_new, which is the diagonal of ∇l(θ_{n-1}; y_n)∇l(θ_{n-1}; y_n)^T. The defaults are a step size η=1 and ε=1e-6, which prevents division by zero and helps numerical stability.
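A minimal sketch of one AdaGrad-style update of the accumulator and conditioner, with hypothetical function names:

```python
import numpy as np

def adagrad_conditioner(grad, I_hat, eta=1.0, eps=1e-6):
    """Accumulate the squared gradient into I_hat and return the updated
    accumulator together with the diagonal conditioner D_n."""
    I_hat = I_hat + grad ** 2
    D_n = eta / np.sqrt(I_hat + eps)
    return I_hat, D_n

# Inside an SGD loop (theta, grad assumed given):
#   I_hat = np.zeros_like(theta)                  # before the loop
#   I_hat, D_n = adagrad_conditioner(grad, I_hat)
#   theta = theta + D_n * grad
```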
We follow RMSProp (Tieleman and Hinton, 2012). The idea is to make AdaGrad less aggressive, to handle cases where a very noisy gradient would otherwise dominate the accumulated history. It takes a moving average of the previous history of the Fisher information with the new term:

I_hat_new = ∇l(θ_{n-1}; y_n)^2
I_hat = γ * I_hat + (1 - γ) * I_hat_new
D_n = η/(I_hat + ε)^(1/2)

The observed Fisher information I_hat is a weighted average of two terms: I_hat from the previous iteration, and I_hat_new, which is the diagonal of ∇l(θ_{n-1}; y_n)∇l(θ_{n-1}; y_n)^T. The defaults are a step size η=1, a decay rate γ=0.9, which determines how much weight to give previous information, and ε=1e-6, which prevents division by zero and helps numerical stability.
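A minimal sketch of the corresponding RMSProp-style update, again with a hypothetical function name:

```python
import numpy as np

def rmsprop_conditioner(grad, I_hat, eta=1.0, gamma=0.9, eps=1e-6):
    """Like AdaGrad, but I_hat is an exponential moving average of the
    squared gradient rather than a running sum."""
    I_hat = gamma * I_hat + (1.0 - gamma) * grad ** 2
    D_n = eta / np.sqrt(I_hat + eps)
    return I_hat, D_n
```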
We follow Adam (Kingma and Ba, 2015). The idea is to combine the desirable properties of AdaGrad and RMSProp: keep exponential moving averages of both the gradient and its component-wise square, correct their initialization bias, and use them in an RMSProp-style diagonal update.
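A minimal sketch of the standard Adam update as described in the paper (hyperparameter defaults are the paper's, not necessarily this package's):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: moving averages of the gradient (m) and squared
    gradient (v), bias-corrected, then an RMSProp-style update. The sign
    follows this page's ascent convention (gradient of a log-likelihood)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)    # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta + eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```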
We follow K-FAC (Martens and Grosse, 2015). The idea is to efficiently approximate the inverse Fisher information with a block-diagonal matrix, with one (Kronecker-factored) block per layer.
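To illustrate only the block-diagonal structure (this sketch omits the Kronecker factorization of each block that makes K-FAC efficient, and the names are hypothetical):

```python
import numpy as np

def block_diag_fisher_inverse(scores, blocks, damping=1e-6):
    """Invert an empirical Fisher block by block instead of as one d x d
    matrix. `scores` is an (N, d) array of per-observation gradients and
    `blocks` is a list of index arrays, e.g. one per layer."""
    d = scores.shape[1]
    inv = np.zeros((d, d))
    for idx in blocks:
        block = scores[:, idx].T @ scores[:, idx] / scores.shape[0]
        inv[np.ix_(idx, idx)] = np.linalg.inv(block + damping * np.eye(len(idx)))
    return inv
```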