From 678724fcca5002313839e977847da5e24273775f Mon Sep 17 00:00:00 2001
From: Renu Singh
Date: Wed, 22 Jan 2025 18:36:39 +0100
Subject: [PATCH] Document hydra args

---
 docs/user_guide/args.md     | 64 +++++++++++++++++++++++++++++++++++++
 docs/user_guide/evaluate.md |  5 ++-
 docs/user_guide/index.md    |  2 +-
 docs/user_guide/train.md    |  2 ++
 mkdocs.yaml                 |  1 +
 5 files changed, 70 insertions(+), 4 deletions(-)
 create mode 100644 docs/user_guide/args.md

diff --git a/docs/user_guide/args.md b/docs/user_guide/args.md
new file mode 100644
index 0000000..8584b68
--- /dev/null
+++ b/docs/user_guide/args.md
@@ -0,0 +1,64 @@
+# Hydra Parameters
+
+Full list of Hydra arguments that can either be modified in the Hydra config files or overridden on the CLI.
+
+CLI usage:
+```sh
+python -m geoarches.main_hydra ++{arg_name}={arg_value}
+```
+
+The two arguments you should absolutely know are:
+
+1. `mode` chooses between running training (mode=`train`) and evaluation (mode=`test`).
+2. `name` is the unique run id. It must be updated for each new run. Tip: make it readable.
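+
+For example, a minimal training launch (the run name is illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run
+```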
+
+## Pipeline args
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `mode` | 'train' | `train` to run training, i.e. runs `LightningModule.fit()`.<br>`test` to run evaluation, i.e. runs `LightningModule.test()`. |
+| `accumulate_grad_batches` | 1 | Accumulates gradients over k batches before stepping the optimizer. Used by the [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#accumulate-grad-batches). |
+| `batchsize` | 1 | Batch size of the train, val, and test dataloaders. |
+| `limit_train_batches`<br>`limit_val_batches`<br>`limit_test_batches` | Optional | Limits the number of batches loaded by the dataloaders. See the [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#limit-train-batches). |
+| `log_freq` | 100 | Frequency at which to log metrics. |
+| `max_steps` | 300000 | Maximum number of steps to run training. |
+| `save_step_frequency` | 50000 | Save a checkpoint every N steps. |
+| `seed` | 0 | Seeds Lightning with `L.seed_everything(cfg.seed)`. |
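+
+For example, a short debugging run that trains on a fraction of the data (all values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=debug_run \
+    ++batchsize=2 ++accumulate_grad_batches=4 \
+    ++limit_train_batches=0.1 ++max_steps=1000
+```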
+
+## Args to save and load checkpoints
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `exp_dir` | 'modelstore/${name}' | During training, the folder where model checkpoints and the Hydra config are stored. If the run already exists, the pipeline will try to resume training instead.<br>During evaluation, the folder to load the checkpoint and config from.<br>By default, the latest checkpoint in the folder is chosen (unless `ckpt_filename_match` is specified). Recommendation: do not change this arg; change `name` for each new run instead. |
+| `name` | 'default-run' | The default `exp_dir` uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. This is also the WandB run name. A unique display name: update it every time you launch a new training run. |
+| `resume` | `True` | Set `True` to resume training from a checkpoint when mode=`train`. |
+| `ckpt_filename_match` | Optional | Substring to match checkpoint files under `exp_dir/checkpoints/` when resuming training or running evaluation. The pipeline will choose the latest checkpoint under `exp_dir/checkpoints/` whose filename contains the substring `ckpt_filename_match`. |
+| `load_ckpt` | Optional | Path to load a PyTorch Lightning module checkpoint from, without resuming the run. Not compatible with `ckpt_filename_match`. |
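+
+For example, evaluating a specific checkpoint of an existing run (the run name and filename substring are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=test ++name=my_readable_run \
+    ++ckpt_filename_match=250000
+```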
+
+## Logging args
+
+Currently, only logging to WandB is supported. See the [User Guide](../user_guide/index.md#weights-and-biases-wandb) for more info.
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `log` | `False` | Set `True` to log metrics. |
+| `cluster.wandb_mode` | 'offline' | `online` allows a machine with an internet connection to log directly to WandB.<br>`offline` logs locally and requires a separate step to sync with WandB. |
+| `entity` | Optional | WandB [entity](https://docs.wandb.ai/ref/python/init/). If not set, WandB assumes your username. |
+| `project` | `False` | WandB [project](https://docs.wandb.ai/ref/python/init/) to log the run under. |
+| `name` | 'default-run' | WandB run [name](https://docs.wandb.ai/ref/python/init/). A unique display name: update it every time you launch a new training run. Note: the default `exp_dir` also uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. If the run already exists, the pipeline (in `train` mode) will try to resume training from a checkpoint instead. |
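+
+For example, to log a training run directly to WandB (the entity and project values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run \
+    ++log=True ++cluster.wandb_mode=online \
+    ++entity=my_team ++project=my_project
+```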
+
+## Module args
+
+See the module class for its available arguments.
+
+## Dataloader args
+
+See the dataloader and backbone classes for their available arguments.
+
+## Cluster args
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `cluster.cpus` | 1 | Number of CPUs to use. Used for dataloader multi-threading. |
+| `cluster.precision` | '16-mixed' | Lightning [precision](https://lightning.ai/docs/pytorch/stable/common/trainer.html#precision). |
+| `cluster.use_custom_requeue` | `False` | Set `True` to handle jobs being preempted prematurely on the compute node. Before exiting, the pipeline will save a checkpoint and requeue the job. |
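+
+For example, a run on a node with more CPUs and preemption handling enabled (values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run \
+    ++cluster.cpus=8 ++cluster.use_custom_requeue=True
+```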
\ No newline at end of file
diff --git a/docs/user_guide/evaluate.md b/docs/user_guide/evaluate.md
index 21a330a..d2bdaeb 100644
--- a/docs/user_guide/evaluate.md
+++ b/docs/user_guide/evaluate.md
@@ -19,9 +19,6 @@ python -m geoarches.main_hydra ++mode=test ++name=$MODEL \
 ++limit_test_batches=0.1 \ # run test on only a fraction of test set for debugging
 ++module.module.rollout_iterations=10 \ # autoregressive rollout horizon, in which case the line below is also needed
 ++dataloader.test_args.multistep=10 \ # allow the dataloader to load trajectories of size 10
-
-++dataloader.test_args.
-
 ```
 
 For testing the generative models, you can also use the following options:
@@ -31,6 +28,8 @@ For testing the generative models, you can also use the following options:
 ++module.inference.rollout_iterations=10 \ # number of auto-regressive steps, 10 days by default.
 ```
 
+See the [Pipeline API](args.md) for the full list of arguments.
+
 ## Compute model outputs and metrics separately
 
 You can compute model outputs and metrics separately. In that case, you first run evaluation as follows:
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
index 5158800..d1db085 100644
--- a/docs/user_guide/index.md
+++ b/docs/user_guide/index.md
@@ -14,7 +14,7 @@ The main python script (`main_hydra.py` that runs a model pipeline), is pointed
 
 The config is constructed from the base config `configs/config.yaml` and is extended with configs under each folder such as `config/module/` and `config/dataloader/`.
 
-You can also override arguments by CLI (see [Arguments]() for useful arguments).
+You can also override arguments by CLI (see the [Pipeline API](args.md) for the full list of arguments).
 
 Example:
 ```sh
diff --git a/docs/user_guide/train.md b/docs/user_guide/train.md
index 2ee9b43..5f3c03b 100644
--- a/docs/user_guide/train.md
+++ b/docs/user_guide/train.md
@@ -21,6 +21,8 @@ python -m geoarches.main_hydra \
 ++max_steps=300000 \ # maximum number of steps for training, but it's good to leave this at 300k for era5 trainings
 ++save_step_frequency=50000 \ # if you need to save checkpoints at a higher frequency
 ```
+
+See the [Pipeline API](args.md) for the full list of arguments.
 
 ## Run on SLURM
 
 To run on a SLURM cluster, you can create a `configs/cluster` folder inside your working directory and put a `custom_slurm.yaml` configuration file in it with custom arguments. Then you can tell geoarches to use this configuration file with
diff --git a/mkdocs.yaml b/mkdocs.yaml
index eee7013..1dd84ca 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -39,6 +39,7 @@ nav:
     - Train: user_guide/train.md
     - Run and evaluate: user_guide/evaluate.md
     - Custom models: user_guide/custom_models.md
+    - Pipeline API: user_guide/args.md
   - Contributing:
     - Contribute to project: contributing/contribute.md
     - Report bug or feature request: contributing/bug.md