From 678724fcca5002313839e977847da5e24273775f Mon Sep 17 00:00:00 2001
From: Renu Singh
Date: Wed, 22 Jan 2025 18:36:39 +0100
Subject: [PATCH] Document hydra args

---
 docs/user_guide/args.md     | 64 +++++++++++++++++++++++++++++++++++++
 docs/user_guide/evaluate.md |  5 ++-
 docs/user_guide/index.md    |  2 +-
 docs/user_guide/train.md    |  2 ++
 mkdocs.yaml                 |  1 +
 5 files changed, 70 insertions(+), 4 deletions(-)
 create mode 100644 docs/user_guide/args.md

diff --git a/docs/user_guide/args.md b/docs/user_guide/args.md
new file mode 100644
index 0000000..8584b68
--- /dev/null
+++ b/docs/user_guide/args.md
@@ -0,0 +1,64 @@
+# Hydra Parameters
+
+Full list of Hydra arguments that can either be modified in the Hydra config files or overridden on the CLI.
+
+CLI usage:
+```sh
+python -m geoarches.main_hydra ++{arg_name}={arg_value}
+```
+
+The two arguments you should absolutely know are:
+
+1. `mode` chooses between running training (mode=`train`) and evaluation (mode=`test`).
+2. `name` is the unique run id. It must be updated for each new run. Tip: make it readable.
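+
+For example, a minimal training launch (the run name is illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run
+```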
+
+## Pipeline args
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `mode` | 'train' | `train` to run training, i.e. runs `LightningModule.fit()`.<br>`test` to run evaluation, i.e. runs `LightningModule.test()`. |
+| `accumulate_grad_batches` | 1 | Accumulates gradients over k batches before stepping the optimizer. Used by the [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#accumulate-grad-batches). |
+| `batchsize` | 1 | Batch size of the train, val, and test dataloaders. |
+| `limit_train_batches`<br>`limit_val_batches`<br>`limit_test_batches` | Optional | Limits the number of batches loaded by the dataloaders. See the [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#limit-train-batches). |
+| `log_freq` | 100 | Frequency at which to log metrics. |
+| `max_steps` | 300000 | Maximum number of steps to run training. |
+| `save_step_frequency` | 50000 | Save a checkpoint every N steps. |
+| `seed` | 0 | Seeds Lightning with `L.seed_everything(cfg.seed)`. |
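+
+For example, a short debugging run that trains on a fraction of the data (all values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=debug_run \
+    ++batchsize=2 ++accumulate_grad_batches=4 \
+    ++limit_train_batches=0.1 ++max_steps=1000
+```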
+
+## Args to save and load checkpoints
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `exp_dir` | 'modelstore/${name}' | During training, the folder where model checkpoints and the Hydra config are stored. If the run already exists, the pipeline will try to resume training instead.<br>During evaluation, the folder to load the checkpoint and config from.<br>By default, the latest checkpoint in the folder is chosen (unless `ckpt_filename_match` is specified). Recommendation: do not change this arg; change `name` for each new run instead. |
+| `name` | 'default-run' | The default `exp_dir` uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. This is also the WandB run name. A unique display name: update it every time you launch a new training run. |
+| `resume` | `True` | Set `True` to resume training from a checkpoint when mode=`train`. |
+| `ckpt_filename_match` | Optional | Substring to match checkpoint files under `exp_dir/checkpoints/` when resuming training or running evaluation. The pipeline will choose the latest checkpoint under `exp_dir/checkpoints/` whose filename contains the substring `ckpt_filename_match`. |
+| `load_ckpt` | Optional | Path to load a PyTorch Lightning module checkpoint from, without resuming the run. Not compatible with `ckpt_filename_match`. |
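+
+For example, evaluating a specific checkpoint of an existing run (the run name and filename substring are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=test ++name=my_readable_run \
+    ++ckpt_filename_match=250000
+```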
+
+## Logging args
+
+Currently, only logging to WandB is supported. See the [User Guide](../user_guide/index.md#weights-and-biases-wandb) for more info.
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `log` | `False` | Set `True` to log metrics. |
+| `cluster.wandb_mode` | 'offline' | `online` allows a machine with an internet connection to log directly to WandB.<br>`offline` logs locally and requires a separate step to sync with WandB. |
+| `entity` | Optional | WandB [entity](https://docs.wandb.ai/ref/python/init/). If not set, WandB assumes your username. |
+| `project` | `False` | WandB [project](https://docs.wandb.ai/ref/python/init/) to log the run under. |
+| `name` | 'default-run' | WandB run [name](https://docs.wandb.ai/ref/python/init/). A unique display name: update it every time you launch a new training run. Note: the default `exp_dir` also uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. If the run already exists, the pipeline (in `train` mode) will try to resume training from a checkpoint instead. |
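+
+For example, to log a training run directly to WandB (the entity and project values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run \
+    ++log=True ++cluster.wandb_mode=online \
+    ++entity=my_team ++project=my_project
+```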
+
+## Module args
+
+See the module class for its available arguments.
+
+## Dataloader args
+
+See the dataloader and backbone classes for their available arguments.
+
+## Cluster args
+
+| arg_name | Default value | Description |
+| ------------------------------ | -------------------- | ------------ |
+| `cluster.cpus` | 1 | Number of CPUs to use. Used for dataloader multi-threading. |
+| `cluster.precision` | '16-mixed' | Lightning [precision](https://lightning.ai/docs/pytorch/stable/common/trainer.html#precision). |
+| `cluster.use_custom_requeue` | `False` | Set `True` to handle jobs being preempted prematurely on the compute node. Before exiting, the pipeline will save a checkpoint and requeue the job. |
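+
+For example, a run on a node with more CPUs and preemption handling enabled (values are illustrative):
+
+```sh
+python -m geoarches.main_hydra ++mode=train ++name=my_readable_run \
+    ++cluster.cpus=8 ++cluster.use_custom_requeue=True
+```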
\ No newline at end of file
diff --git a/docs/user_guide/evaluate.md b/docs/user_guide/evaluate.md
index 21a330a..d2bdaeb 100644
--- a/docs/user_guide/evaluate.md
+++ b/docs/user_guide/evaluate.md
@@ -19,9 +19,6 @@ python -m geoarches.main_hydra ++mode=test ++name=$MODEL \
 ++limit_test_batches=0.1 \ # run test on only a fraction of test set for debugging
 ++module.module.rollout_iterations=10 \ # autoregressive rollout horizon, in which case the line below is also needed
 ++dataloader.test_args.multistep=10 \ # allow the dataloader to load trajectories of size 10
-
-++dataloader.test_args.
-
 ```
 
 For testing the generative models, you can also use the following options:
@@ -31,6 +28,8 @@ For testing the generative models, you can also use the following options:
 ++module.inference.rollout_iterations=10 \ # number of auto-regressive steps, 10 days by default.
 ```
 
+See the [Pipeline API](args.md) for the full list of arguments.
+
 ## Compute model outputs and metrics separately
 
 You can compute model outputs and metrics separately. In that case, you first run evaluation as follows:
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
index 5158800..d1db085 100644
--- a/docs/user_guide/index.md
+++ b/docs/user_guide/index.md
@@ -14,7 +14,7 @@ The main python script (`main_hydra.py` that runs a model pipeline), is pointed
 
 The config is constructed from the base config `configs/config.yaml` and is extended with configs under each folder such as `config/module/` and `config/dataloader/`.
 
-You can also override arguments by CLI (see [Arguments]() for useful arguments).
+You can also override arguments by CLI (see the [Pipeline API](args.md) for the full list of arguments).
 
 Example:
 ```sh
diff --git a/docs/user_guide/train.md b/docs/user_guide/train.md
index 2ee9b43..5f3c03b 100644
--- a/docs/user_guide/train.md
+++ b/docs/user_guide/train.md
@@ -21,6 +21,8 @@ python -m geoarches.main_hydra \
 ++max_steps=300000 \ # maximum number of steps for training, but it's good to leave this at 300k for era5 trainings
 ++save_step_frequency=50000 \ # if you need to save checkpoints at a higher frequency
 ```
+
+See the [Pipeline API](args.md) for the full list of arguments.
 
 ## Run on SLURM
 
 To run on a SLURM cluster, you can create a `configs/cluster` folder inside your working directory and put a `custom_slurm.yaml` configuration file in it with custom arguments. Then you can tell geoarches to use this configuration file with
diff --git a/mkdocs.yaml b/mkdocs.yaml
index eee7013..1dd84ca 100644
--- a/mkdocs.yaml
+++ b/mkdocs.yaml
@@ -39,6 +39,7 @@ nav:
     - Train: user_guide/train.md
     - Run and evaluate: user_guide/evaluate.md
     - Custom models: user_guide/custom_models.md
+    - Pipeline API: user_guide/args.md
   - Contributing:
     - Contribute to project: contributing/contribute.md
     - Report bug or feature request: contributing/bug.md