The learning rate is crucial for the convergence and efficiency of stochastic gradient descent. There are two types of learning rates, a scalar one and a matrix-valued one, giving updates of the form:

(1) θ_n = θ_{n-1} + α_n ∇l(θ_{n-1}; yn)

where α_n is a scalar.

(2) θ_n = θ_{n-1} + D_n ∇l(θ_{n-1}; yn)

where D_n is a matrix.

By the theory of second-order Taylor approximations (Newton's method), and also of natural gradients (Amari, 1998), α_n and D_n should approximate the inverse of the Fisher information at the current iterate θ_{n-1}:

I(θ_{n-1})^{-1} = E[∇l(θ_{n-1}; yn)*(∇l(θ_{n-1}; yn))^T]^{-1}
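For concreteness, here is a minimal Python sketch of the two update types (a sketch only, not this project's implementation); grad_loglik is a hypothetical user-supplied function returning ∇l(θ; y) as a NumPy array.

```python
import numpy as np

def sgd_scalar_step(theta, y, alpha_n, grad_loglik):
    # Type (1): scalar learning rate alpha_n.
    return theta + alpha_n * grad_loglik(theta, y)

def sgd_matrix_step(theta, y, D_n, grad_loglik):
    # Type (2): matrix learning rate D_n (d x d), e.g. an approximation
    # to the inverse Fisher information I(theta)^{-1}.
    return theta + D_n @ grad_loglik(theta, y)
```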

one-dim

We adopt the generalized formulation mentioned in Xu (2011) with scalar hyperparameters γ, α, c, scale:

α_n = scale * γ * (1 + αγn)^(-c)

The defaults are scale=1, γ=1, α equal to the minimum eigenvalue of the covariance of the data's design matrix, and c=1 if not averaging or c=2/3 if averaging.

Note that as special cases, this includes the popular learning rates α_n = scale/n (take γ = α = c = 1, which gives scale/(1 + n)) and the constant rate α_n = scale (take γ = 1 and c = 0).
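As a sketch only (Python, with argument names mirroring the hyperparameters above rather than this project's API):

```python
def one_dim_rate(n, scale=1.0, gamma=1.0, alpha=1.0, c=1.0):
    # alpha_n = scale * gamma * (1 + alpha*gamma*n)^(-c)
    return scale * gamma * (1.0 + alpha * gamma * n) ** (-c)

# e.g. one_dim_rate(n) with all defaults reduces to 1/(1 + n).
```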

one-dim-eigen

We build a scalar approximation to the Fisher information based on an upper bound of the minimum eigenvalue of the observed Fisher information:

α_n = 1/(λ_min * n)

λ_min is bounded above by (and thus substituted with) the simple expression

λ_min ≤ sum(I_hat)/d
I_hat = ∇l(θ_{n-1}; yn)^2

That is, the trace of the observed Fisher information, i.e., the sum of the diagonal entries of ∇l(θ_{n-1}; yn)[∇l(θ_{n-1}; yn)]^T, can be efficiently calculated as sum(I_hat), the sum of the component-wise squares of ∇l(θ_{n-1}; yn).
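A Python sketch of this rate (not this project's implementation), where grad stands for ∇l(θ_{n-1}; yn) as a NumPy array:

```python
import numpy as np

def one_dim_eigen_rate(grad, n):
    # Upper bound on lambda_min: trace of the observed Fisher information
    # divided by d, i.e. sum(grad**2) / d.
    lambda_min_bound = np.sum(grad ** 2) / grad.size
    # alpha_n = 1 / (lambda_min * n), with lambda_min replaced by its bound.
    return 1.0 / (lambda_min_bound * n)
```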

d-dim

We build a d-parameter (diagonal) matrix approximation to the Fisher information based on the diagonal of the observed Fisher information:

I_hat = ∇l(θ_{n-1}; yn)^2
D_n = 1/(I_hat + ε)

We save computation by noting that the diagonal of the observed Fisher information, i.e., the diagonal of ∇l(θ_{n-1}; yn)∇l(θ_{n-1}; yn)^T, is equivalent to the component-wise squares of ∇l(θ_{n-1}; yn). The default is ε=1e-6, which prevents division by zero and aids numerical stability.
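A Python sketch of the diagonal rate (again with grad standing for the current gradient as a NumPy array; a sketch only, not this project's implementation):

```python
import numpy as np

def d_dim_rate(grad, eps=1e-6):
    # Diagonal of D_n = 1 / (I_hat + eps), with I_hat = grad**2 the
    # diagonal of the observed Fisher information.
    I_hat = grad ** 2
    return 1.0 / (I_hat + eps)

# Update: theta_new = theta + d_dim_rate(grad) * grad  (element-wise product)
```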

adagrad

We follow Duchi et al. (2011). The idea is to form a diagonal approximation to the Fisher information by accumulating the previous history of the Fisher information with the new estimate.

I_hat_new = ∇l(θ_{n-1}; yn)^2
I_hat = I_hat + I_hat_new
D_n = η/(I_hat + ε)^(1/2)

The observed Fisher information I_hat is the sum of two terms: I_hat from the previous iteration, and I_hat_new, which is the diagonal of ∇l(θ_{n-1}; yn)∇l(θ_{n-1}; yn)^T. The defaults are a step size η=1 and ε=1e-6, which prevents division by zero and aids numerical stability.
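A Python sketch (not this project's implementation), where the accumulated I_hat must be carried across iterations:

```python
import numpy as np

def adagrad_rate(grad, I_hat, eta=1.0, eps=1e-6):
    # Accumulate the history of squared gradients, then form the diagonal D_n.
    I_hat = I_hat + grad ** 2
    D_n = eta / np.sqrt(I_hat + eps)
    return D_n, I_hat  # return the updated history for the next iteration
```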

rmsprop

We follow Tieleman and Hinton (2012). The idea is to make Adagrad less aggressive, accounting for cases where a very noisy gradient would otherwise have too strong and too lasting an effect on the accumulated history. It takes a moving average of the previous history of the Fisher information with the new estimate.

I_hat_new = ∇l(θ_{n-1}; yn)^2
I_hat = γ * I_hat + (1 - γ) * I_hat_new
D_n = η/(I_hat + ε)^(1/2)

The observed Fisher information I_hat is a weighted average of two terms: I_hat from the previous iteration, and I_hat_new, which is the diagonal of ∇l(θ_{n-1}; yn)∇l(θ_{n-1}; yn)^T. The defaults are a step size η=1, a decay rate γ=0.9, which determines how much weight to place on previous information, and ε=1e-6, which prevents division by zero and aids numerical stability.
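A Python sketch, analogous to the Adagrad one but with an exponential moving average (a sketch only, not this project's implementation):

```python
import numpy as np

def rmsprop_rate(grad, I_hat, eta=1.0, gamma=0.9, eps=1e-6):
    # Exponential moving average of squared gradients, then the diagonal D_n.
    I_hat = gamma * I_hat + (1.0 - gamma) * grad ** 2
    D_n = eta / np.sqrt(I_hat + eps)
    return D_n, I_hat  # return the updated average for the next iteration
```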

adam

We follow Kingma and Ba (2015). The idea is to combine the nice properties of both Adagrad and rmsprop, using bias-corrected running estimates of the first and second moments of the gradient.
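Since no formulas are given here, the following is a sketch of the standard Adam update from Kingma and Ba (2015), written in the ascent convention used above; the hyperparameter defaults are the paper's and not necessarily this project's.

```python
import numpy as np

def adam_step(theta, grad, m, v, n, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running estimates of the first and second moments of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction (n is the 1-based iteration count).
    m_hat = m / (1.0 - beta1 ** n)
    v_hat = v / (1.0 - beta2 ** n)
    # Ascent step, matching the sign convention of the updates above.
    theta = theta + eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```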

k-fac

We follow Martens and Grosse (2015). The idea is to efficiently approximate the inverse Fisher information with a block-diagonal approximation whose blocks are themselves Kronecker-factored.
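K-FAC itself is too involved for a short snippet, but as a small illustration of why a block-diagonal approximation helps: applying the inverse of a block-diagonal matrix to the gradient only requires solving one small system per block, rather than inverting the full d × d Fisher matrix. The blocks and gradient slices below are hypothetical inputs, not this project's data structures.

```python
import numpy as np

def block_diag_inverse_times_grad(blocks, grad_slices):
    # blocks: list of small symmetric positive-definite matrices (one per
    # parameter block); grad_slices: the matching pieces of the gradient.
    # Solving block-by-block costs sum_i O(d_i^3) instead of O(d^3).
    return np.concatenate(
        [np.linalg.solve(B, g) for B, g in zip(blocks, grad_slices)]
    )
```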

Recommended resources

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5: rmsprop. COURSERA: Neural Networks for Machine Learning.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.
Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML.
