Merge pull request #337 from kozistr/feature/psgd-optimizer

[Feature] implement PSGD Kron optimizer

kozistr authored Feb 1, 2025
2 parents c7496b0 + 68a02f8 commit 4b439ab
Showing 22 changed files with 759 additions and 92 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -10,7 +10,7 @@

## The reasons why you should use `pytorch-optimizer`.

* Wide range of supported optimizers. Currently, **93 optimizers (+ `bitsandbytes`, `qgalore`, `torchao`)**, **16 lr schedulers**, and **13 loss functions** are supported!
* Wide range of supported optimizers. Currently, **94 optimizers (+ `bitsandbytes`, `qgalore`, `torchao`)**, **16 lr schedulers**, and **13 loss functions** are supported!
* Including many variants such as `ADOPT`, `Cautious`, `AdamD`, `StableAdamW`, and `Gradient Centralization`
* Easy to use, clean, and tested code
* Active maintenance
@@ -201,6 +201,7 @@ get_supported_optimizers(['adam*', 'ranger*'])
| SPAM | *Spike-Aware Adam with Momentum Reset for Stable LLM Training* | [github](https://github.com/TianjinYellow/SPAM-Optimizer) | <https://arxiv.org/abs/2501.06842> | [cite](https://ui.adsabs.harvard.edu/abs/2025arXiv250106842H/exportcitation) |
| TAM | *Torque-Aware Momentum* | | <https://arxiv.org/abs/2412.18790> | [cite](https://ui.adsabs.harvard.edu/abs/2024arXiv241218790M/exportcitation) |
| FOCUS | *First Order Concentrated Updating Scheme* | [github](https://github.com/liuyz0/FOCUS) | <https://arxiv.org/abs/2501.12243> | [cite](https://ui.adsabs.harvard.edu/abs/2025arXiv250112243M/exportcitation) |
| PSGD | *Preconditioned Stochastic Gradient Descent* | [github](https://github.com/lixilinx/psgd_torch) | <https://arxiv.org/abs/1512.04202> | [cite](https://github.com/lixilinx/psgd_torch?tab=readme-ov-file#resources) |
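
The new entry can also be looked up programmatically. A minimal discovery sketch, assuming `get_supported_optimizers` (used in the diff context above) accepts wildcard filters and that `load_optimizer` resolves a name to an optimizer class; the lookup key `'kron'` is an assumption:

```python
# Hedged sketch: discovering the newly added PSGD Kron optimizer by name.
from pytorch_optimizer import get_supported_optimizers, load_optimizer

print(get_supported_optimizers())            # all supported optimizer names
print(get_supported_optimizers(['kron*']))   # wildcard filtering, as in the README example above

optimizer_class = load_optimizer('kron')     # assumed registry key for PSGD Kron
```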

## Supported LR Scheduler

20 changes: 0 additions & 20 deletions docs/changelogs/v3.3.5.md

This file was deleted.

25 changes: 25 additions & 0 deletions docs/changelogs/v3.4.0.md
@@ -0,0 +1,25 @@
### Change Log

### Feature

* Implement `FOCUS` optimizer. (#330, #331)
    * [First Order Concentrated Updating Scheme](https://arxiv.org/abs/2501.12243)
* Implement the `PSGD Kron` optimizer. (#337)
    * [Preconditioned Stochastic Gradient Descent with a Kronecker-factored (Kron) preconditioner](https://arxiv.org/abs/1512.04202); a usage sketch follows below.
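
A minimal usage sketch for the new optimizer. The class name `Kron` matches the entry added to `docs/optimizer.md`; hyper-parameters other than `lr` are omitted here, and the default settings are assumed to be reasonable:

```python
# Hedged sketch: training a toy model with the new PSGD Kron optimizer.
import torch
import torch.nn as nn
from pytorch_optimizer import Kron

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = Kron(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```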

### Update

* Support the `OrthoGrad` variant in `Ranger25`. (#332) See the sketch below.
    * `Ranger25` is my experimental optimizer, which mixes many optimizer variants such as `ADOPT` + `AdEMAMix` + `Cautious` + `StableAdamW` + `Adam-Atan2` + `OrthoGrad`.
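
A sketch of enabling the variant; the keyword name `orthograd` is hypothetical and should be checked against the actual `Ranger25` signature:

```python
# Hedged sketch: turning on the OrthoGrad variant in Ranger25.
import torch.nn as nn
from pytorch_optimizer import Ranger25

model = nn.Linear(16, 4)
optimizer = Ranger25(model.parameters(), lr=1e-3, orthograd=True)  # `orthograd` is a hypothetical flag name
```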

### Fix

* Add the missing `state` property to the `OrthoGrad` optimizer. (#326, #327)
* Add the missing `state_dict` and `load_state_dict` methods to the `TRAC` and `OrthoGrad` optimizers. (#332) A checkpointing sketch follows below.
* Skip sparse gradients in the `OrthoGrad` optimizer. (#332)
* Support alternative-precision training in the `SOAP` optimizer. (#333)
* Store the `SOAP` condition matrices in the dtype of their parameters. (#335)
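
A checkpointing sketch built on the newly added `state_dict` / `load_state_dict` methods. It assumes, as with the library's other wrapper optimizers, that `OrthoGrad` accepts a base optimizer instance; the exact constructor signature may differ:

```python
# Hedged sketch: saving and restoring the state of a wrapped optimizer.
import torch
import torch.nn as nn
from pytorch_optimizer import OrthoGrad

model = nn.Linear(10, 2)
optimizer = OrthoGrad(torch.optim.AdamW(model.parameters(), lr=1e-3))  # assumed wrapper signature

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

checkpoint = optimizer.state_dict()      # now available on the wrapper (#332)
optimizer.load_state_dict(checkpoint)    # restore, e.g. when resuming training
```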

### Contributions

thanks to @Vectorrent, @kylevedder
3 changes: 2 additions & 1 deletion docs/index.md
@@ -10,7 +10,7 @@

## The reasons why you should use `pytorch-optimizer`.

* Wide range of supported optimizers. Currently, **93 optimizers (+ `bitsandbytes`, `qgalore`, `torchao`)**, **16 lr schedulers**, and **13 loss functions** are supported!
* Wide range of supported optimizers. Currently, **94 optimizers (+ `bitsandbytes`, `qgalore`, `torchao`)**, **16 lr schedulers**, and **13 loss functions** are supported!
* Including many variants such as `ADOPT`, `Cautious`, `AdamD`, `StableAdamW`, and `Gradient Centralization`
* Easy to use, clean, and tested code
* Active maintenance
@@ -201,6 +201,7 @@ get_supported_optimizers(['adam*', 'ranger*'])
| SPAM | *Spike-Aware Adam with Momentum Reset for Stable LLM Training* | [github](https://github.com/TianjinYellow/SPAM-Optimizer) | <https://arxiv.org/abs/2501.06842> | [cite](https://ui.adsabs.harvard.edu/abs/2025arXiv250106842H/exportcitation) |
| TAM | *Torque-Aware Momentum* | | <https://arxiv.org/abs/2412.18790> | [cite](https://ui.adsabs.harvard.edu/abs/2024arXiv241218790M/exportcitation) |
| FOCUS | *First Order Concentrated Updating Scheme* | [github](https://github.com/liuyz0/FOCUS) | <https://arxiv.org/abs/2501.12243> | [cite](https://ui.adsabs.harvard.edu/abs/2025arXiv250112243M/exportcitation) |
| PSGD | *Preconditioned Stochastic Gradient Descent* | [github](https://github.com/lixilinx/psgd_torch) | <https://arxiv.org/abs/1512.04202> | [cite](https://github.com/lixilinx/psgd_torch?tab=readme-ov-file#resources) |

## Supported LR Scheduler

4 changes: 4 additions & 0 deletions docs/optimizer.md
@@ -284,6 +284,10 @@
:docstring:
:members:

::: pytorch_optimizer.Kron
:docstring:
:members:

::: pytorch_optimizer.QHAdam
:docstring:
:members: