The old PSGD implementation is deprecated. The new implementation is a superset of the old one and additionally supports three more matmul-only/tri-solver-free methods for updating the preconditioner.
PSGD (Preconditioned SGD) is a general purpose (mathematical and stochastic, convex and nonconvex) 2nd order optimizer. It reformulates a wide range of preconditioner estimation and Hessian fitting problems as a family of strongly convex Lie group optimization problems.
Notations: $\theta$ denotes the (flattened) model parameters, $g$ the gradient of the loss, $v$ an auxiliary probe vector, $h=Hv$ the Hessian-vector product, and $P=Q^TQ$ the preconditioner with factor $Q$ fitted on a Lie group.
The PSGD theory has two orthogonal parts: the criteria for preconditioner fitting, and the methods for fitting the preconditioner on Lie groups.
PSGD was originally designed for preconditioning the gradient such that the metrics of the spaces of preconditioned gradient and parameters are matched, i.e., $P$ minimizes a criterion of the form $E[\delta g^T P\,\delta g + \delta\theta^T P^{-1}\delta\theta]$, where $\delta g$ is the change of the gradient induced by a small parameter perturbation $\delta\theta$.
The above preconditioner fitting criteria are always convex in the Euclidean space, on the manifold of SPD matrices, and in the Lie groups. But they are strongly convex only in the Lie groups (see arXiv:2402.11858). Table I below lists several variations of the fitting criterion together with their solutions.
Table I: Preconditioner fitting criteria

| Criterion | Solution | Notes |
|---|---|---|
| $h^TPh + v^TP^{-1}v$ | $Phh^TP = vv^T$ | Reduces to the secant equation $Ph=v$ |
| $E_v[h^TPh + v^TP^{-1}v]$ | $P=\left(E_v[hh^T]\right)^{-0.5}$ (for $E[vv^T]=I$) | Reduces to Newton's method when $H\succ 0$ |
| $E_{v,z}[g^TPg + v^TP^{-1}v]$ | $P=\left(E_z[gg^T]\right)^{-0.5}$ | Relates to the AdaGrad family, e.g., Adam(W), RMSProp, Shampoo, etc. |
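As a quick check of the Newton/whitening rows (a minimal derivation, assuming $E[vv^T]=I$), setting the derivative of $E[h^TPh + v^TP^{-1}v]$ with respect to an SPD $P$ to zero gives

$$E[hh^T] - P^{-1}E[vv^T]P^{-1} = 0 \quad\Longrightarrow\quad P\,E[hh^T]\,P = E[vv^T] = I,$$

so the optimum is $P=\left(E[hh^T]\right)^{-0.5}$. For a deterministic $H\succ 0$ this equals $H^{-1}$ (Newton's method), and replacing $h$ with the gradient $g$ yields the $\left(E[gg^T]\right)^{-0.5}$ whitening preconditioner related to the AdaGrad family.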
Table II: Lie group ($dQ=EQ$) preconditioners with storage and computation numbers for $\theta={\rm vec}(\Theta)$ with $\Theta\in\mathbb{R}^{m\times m}$

| Lie Group | Update of $Q$ | Storages | Computations | Class |
|---|---|---|---|---|
| | | | | DenseNewton |
| Tri matrices | | | | DenseNewton |
| | | | | LRAWhiten/Newton |
| | | | | KronWhiten/Newton |
| | | | | KronWhiten/Newton |
| | | | | LRAWhiten/Newton |
| same as kron | | | | KronWhiten/Newton |
Note 1: The three additional matmul-only (tri-solver-free) preconditioner update methods have similar forms and complexities (ref to be updated).
Note 2: For the gradient/momentum whitening preconditioner, we simply replace the pair $(v, h)$ with $(v, g)$, where $g$ is the gradient or its moving average (momentum).
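To make the $dQ=EQ$ fitting step concrete, here is a minimal PyTorch sketch of one update of a dense upper-triangular $Q$ with $P=Q^TQ$ (the DenseNewton case). The function name `update_dense_Q` and the step-size normalization are illustrative assumptions, not the library's exact code.

```python
import torch

def update_dense_Q(Q, v, h, mu=0.1):
    """One PSGD-style fitting step for an upper-triangular Q with P = Q^T Q.
    v: probe vector; h: Hessian-vector product Hv (or a gradient g for whitening)."""
    a = Q @ h                                                  # a = Q h
    b = torch.linalg.solve_triangular(Q.T, v.unsqueeze(1), upper=False).squeeze(1)  # b = Q^{-T} v
    grad = torch.triu(torch.outer(a, a) - torch.outer(b, b))  # Lie-group gradient of the criterion
    step = mu / (a.dot(a) + b.dot(b)).clamp(min=1e-12)        # simplified normalization (assumption)
    Q -= step * (grad @ Q)                                    # dQ = E Q, with E = -step * grad
    return Q
```

Feeding pairs $(v, h=Hv)$ drives $Q^TQ$ toward the Newton-type preconditioner of Table I; feeding $(v, g)$ instead yields the gradient-whitening preconditioner of Note 2.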
This script generates the following plot showing the typical behaviors of different Hessian fitting methods.
- With a static and noise-free Hessian-vector product model, both BFGS and PSGD converge linearly to the optimal preconditioner, while the closed-form solution $P=\left(E[hh^T]\right)^{-0.5}$ only converges sublinearly with rate $\mathcal{O}(1/t)$.
- With a static additive-noise Hessian-vector model $h=Hv+\epsilon$, BFGS diverges easily. With a constant step size $\mu$, the steady-state fitting errors of PSGD are proportional to $\mu$.
- With a time-varying Hessian $H_{t+1}=H_t + uu^T$, $u\sim\mathcal{U}(0,1)$, PSGD locks onto good preconditioner estimates more quickly than BFGS and without a divergence stage. The closed-form solution $P=\left(E[hh^T]\right)^{-0.5}$ is poor at tracking due to its sublinear rate of convergence.
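For reference, here is a minimal sketch of the kind of synthetic Hessian-vector stream such a benchmark uses (the dimensions, noise level, and step size are illustrative assumptions; it reuses the `update_dense_Q` sketch above):

```python
import torch

torch.manual_seed(0)
n, noise, steps = 50, 0.1, 1000

A = torch.randn(n, n)
H = A @ A.T + torch.eye(n)               # a static SPD Hessian
Q = torch.eye(n)                         # preconditioner factor, P = Q^T Q

for t in range(steps):
    v = torch.randn(n)                   # probe vector
    h = H @ v + noise * torch.randn(n)   # noisy Hessian-vector product, h = Hv + eps
    Q = update_dense_Q(Q, v, h, mu=0.05)
    # time-varying variant: H += torch.outer(u, u) with u ~ U(0, 1)

P = Q.T @ Q                              # fitted preconditioner; compare against torch.linalg.inv(H)
```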
Optimizers with the criteria in Table I and preconditioner forms in Table II are wrapped into classes KronWhiten/Newton, LRAWhiten/Newton and DenseNewton for easy use.
Three main differences from torch.optim.SGD:
- The loss to be minimized is passed to the optimizer as a closure so as to support more dynamic behaviors, notably approximating Hessian-vector products with the finite-difference method when 2nd-order derivatives are unavailable. The closure should return the loss, or an iterator whose first element is the loss (see the usage sketch after this list).
- Momentum here is the moving average of the gradient, so its setting is decoupled from the learning rate, which is always normalized in PSGD.
- As with any other regularization, (coupled) weight decay should be realized explicitly by adding an $L_2$ regularization term to the loss. Similarly, decoupled weight decay is not built into the PSGD implementations.
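A minimal usage sketch of the closure interface (the module import and the constructor arguments `lr_params` and `lr_preconditioner` are assumptions for illustration; check the class docstrings for the actual signatures):

```python
import torch
import preconditioned_stochastic_gradient_descent as psgd  # module name is an assumption

model = torch.nn.Linear(10, 1)
X, y = torch.randn(32, 10), torch.randn(32, 1)

# KronWhiten is one of the wrapped classes; its exact arguments may differ.
opt = psgd.KronWhiten(model.parameters(), lr_params=0.01, lr_preconditioner=0.1)

for _ in range(100):
    def closure():
        # The closure only returns the loss; any (coupled) weight decay is added
        # here explicitly as an L2 term.
        return torch.nn.functional.mse_loss(model(X), y)
    loss = opt.step(closure)  # assuming step(closure) returns the loss, as in torch.optim
```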
A few more details. The Hessian-vector products are calculated as a vector-Jacobian product (vjp), i.e., $h = Hv = \partial\,(v^T g)/\partial\theta$ with gradient $g = \partial f/\partial\theta$, as sketched below.
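A minimal sketch of this double-backward vjp pattern with plain `torch.autograd` (the helper name `hvp` is just for illustration):

```python
import torch

def hvp(loss, params, vs):
    """Hessian-vector products via double backward: h = d(v^T g)/d(theta)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)  # g = df/dtheta
    dot = sum((g * v).sum() for g, v in zip(grads, vs))           # v^T g
    return torch.autograd.grad(dot, params)                       # h = H v

# usage: vs = [torch.randn_like(p) for p in params]; hs = hvp(loss, params, vs)
```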
There are plenty of demos: Rosenbrock function minimization, vision transformer, generative pre-trained transformer, logistic regression, tensor rank decomposition, etc. For this tiny vision transformer demo, the following results show that all four PSGD-Kron gradient-whitening preconditioners improve convergence considerably compared with Adam(W).
- Preconditioned stochastic gradient descent, arXiv:1512.04202, 2015. (General ideas of PSGD, preconditioner fitting criteria and Kronecker product preconditioners.)
- Preconditioner on matrix Lie group for SGD, arXiv:1809.10232, 2018. (Focus on affine Lie group preconditioners, including feature normalization or whitening (per batch or layer) as special affine preconditioners. Use PSGD for gradient whitening.)
- Black box Lie group preconditioners for SGD, arXiv:2211.04422, 2022. (Mainly about the LRA preconditioner. I also have prepared these supplementary materials for detailed math derivations.)
- Stochastic Hessian fittings with Lie groups, arXiv:2402.11858, 2024 (to be updated). (Convergence properties of PSGD, and also a good summary of PSGD. The Hessian fitting problem is convex in the Euclidean space and on the manifold of SPD matrices, but it is strongly convex only in the quotient set ${\rm GL}(n, \mathbb{R})/R_{\rm polar}$, or in the group ${\rm GL}(n, \mathbb{R})$ if we don't care about $Q$'s rotation ambiguity.)
- Curvature-informed SGD via general purpose Lie-group preconditioners, arXiv:2402.04553, 2024. (Plenty of benchmark results and analyses for PSGD vs. other optimizers.)
- There are a few more efficient and specialized PSGD implementations: Evan's JAX and Torch versions, Lucas' Heavyball. There is also my outdated and unmaintained TensorFlow code: TF 1.x and TF 2.x.