Document hydra args
14renus committed Jan 23, 2025
1 parent 39d21bb commit 678724f
Showing 5 changed files with 70 additions and 4 deletions.
64 changes: 64 additions & 0 deletions docs/user_guide/args.md
@@ -0,0 +1,64 @@
# Hydra Parameters

Full list of Hydra arguments that can be either modified in Hydra config files or overridden on the CLI.

CLI Usage:
```sh
python -m geoarches.main_hydra ++{arg_name}={arg_value}
```
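
Overrides use Hydra's `++` prefix, which adds the key if it is missing and overwrites it otherwise; dots descend into nested config groups. As a rough illustration (a toy sketch, not Hydra's actual implementation), an override string maps onto the config like this:

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Toy sketch: apply one '++key=value' override to a nested config dict."""
    key, value = override.removeprefix("++").split("=", 1)
    node = cfg
    *parents, leaf = key.split(".")
    for part in parents:
        # Descend into (or create) nested config groups, one per dot.
        node = node.setdefault(part, {})
    node[leaf] = value
    return cfg

cfg = {"mode": "train"}
apply_override(cfg, "++mode=test")  # overwrite an existing key
apply_override(cfg, "++module.module.rollout_iterations=10")  # add a nested key
```

(Real Hydra also type-converts values and distinguishes `+`/`++`; this only shows the key-path idea.)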

The two arguments you absolutely should know are:

1. `mode` chooses between running training (mode=`train`) or evaluation (mode=`test`).
2. `name` is the unique run id. Must be updated for each new run. Tip: make it readable.
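
For example, a hypothetical training launch (the run name here is made up):

```sh
python -m geoarches.main_hydra ++mode=train ++name=era5-baseline-run1
```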

## Pipeline args

| arg_name | Default value | Description |
| ------------------------------ | -------------------- | ------------ |
| `mode` | 'train' | `train` to run training, i.e. runs `LightningModule.fit()`<br/>`test` to run evaluation, i.e. runs `LightningModule.test()` |
| `accumulate_grad_batches` | 1 | Accumulates gradients over k batches before stepping the optimizer. Used by [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#accumulate-grad-batches). |
| `batchsize` | 1 | Batch size of dataloaders for train, val, and test. |
| `limit_train_batches`<br/>`limit_val_batches`<br/>`limit_test_batches` | Optional | Limit the number of batches loaded by each dataloader. Used by [Lightning API](https://lightning.ai/docs/pytorch/stable/common/trainer.html#limit-train-batches). |
| `log_freq` | 100 | Frequency to log metrics. |
| `max_steps` | 300000 | Max steps to run training. |
| `save_step_frequency` | 50000 | Save checkpoint every N steps. |
| `seed` | 0 | Seeds Lightning with `L.seed_everything(cfg.seed)`. |
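
Putting a few of these together, a hypothetical debug run that shrinks the workload (all values here are illustrative):

```sh
python -m geoarches.main_hydra \
    ++mode=train ++name=debug-run \
    ++batchsize=2 \
    ++max_steps=1000 \
    ++limit_train_batches=0.01 \
    ++limit_val_batches=0.01
```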

## Args to save and load checkpoints

| arg_name | Default value | Description |
| ------------------------------ | -------------------- | ------------ |
| `exp_dir` | 'modelstore/${name}' | During training, the folder where model checkpoints and the Hydra config are stored. If the run already exists, the pipeline will try to resume training instead.<br/>During evaluation, the folder to load the checkpoint and config from. By default, the latest checkpoint in the folder is chosen (unless `ckpt_filename_match` is specified).<br/>Recommendation: leave this arg as-is and change `name` for each new run. |
| `name` | 'default-run' | The default `exp_dir` uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. Also used as the WandB run name. Unique display name; update it every time you launch a new training run. |
| `resume` | `True` | Set `True` to resume training from a checkpoint when mode=`train`. |
| `ckpt_filename_match` | Optional | Set to a substring to match checkpoint files under `exp_dir/checkpoints/` when resuming training or running evaluation. The pipeline will choose the latest checkpoint under `exp_dir/checkpoints/` whose filename contains the substring. |
| `load_ckpt` | Optional | Path to load a PyTorch Lightning module checkpoint from without resuming the run. Not compatible with `ckpt_filename_match`. |
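
For instance, to evaluate a specific checkpoint under a run's folder (run name and substring are made up):

```sh
python -m geoarches.main_hydra ++mode=test ++name=my-run \
    ++ckpt_filename_match=250000
```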

## Logging args

Currently only supports logging to WandB. See [User Guide](../user_guide/index.md#weights-and-biases-wandb) for more info.

| arg_name | Default value | Description |
| ------------------------------ | -------------------- | ------------ |
| `log` | `False` | Set `True` to log metrics. |
| `cluster.wandb_mode` | 'offline' | `online` lets a machine with an internet connection log directly to WandB.<br/>`offline` logs locally and requires a separate step to sync with WandB. |
| `entity` | Optional | WandB [entity](https://docs.wandb.ai/ref/python/init/). If not set, WandB assumes username. |
| `project` | `False` | WandB [project](https://docs.wandb.ai/ref/python/init/) to log run under. |
| `name` | 'default-run' | WandB run [name](https://docs.wandb.ai/ref/python/init/). Unique display name; update it every time you launch a new training run. Note: the default `exp_dir` also uses `name` to set the checkpoint folder to `modelstore/${name}/checkpoints/`. If the run already exists, the pipeline (in `train` mode) will try to resume training from a checkpoint. |
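
A hypothetical online-logging run (the entity and project names are placeholders):

```sh
python -m geoarches.main_hydra ++mode=train ++name=my-run \
    ++log=True ++cluster.wandb_mode=online \
    ++entity=my-team ++project=weather-models
```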

## Module args

Check the module class for its arguments.

## Dataloader args

Check the dataloader and backbone classes for their arguments.

## Cluster args

| arg_name | Default value | Description |
| ------------------------------ | -------------------- | ------------ |
| `cluster.cpus` | 1 | Number of CPUs available. Used for dataloader multi-threading. |
| `cluster.precision` | '16-mixed' | Lightning [precision](https://lightning.ai/docs/pytorch/stable/common/trainer.html#precision). |
| `cluster.use_custom_requeue` | `False` | Set `True` to handle the job being preempted on the compute node: before exiting, it saves a checkpoint and requeues the job. |
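
For example, requesting more dataloader CPUs and full precision (values are illustrative):

```sh
python -m geoarches.main_hydra ++mode=train ++name=my-run \
    ++cluster.cpus=8 ++cluster.precision=32-true
```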
5 changes: 2 additions & 3 deletions docs/user_guide/evaluate.md
Expand Up @@ -19,9 +19,6 @@ python -m geoarches.main_hydra ++mode=test ++name=$MODEL \
++limit_test_batches=0.1 \ # run test on only a fraction of test set for debugging
++module.module.rollout_iterations=10 \ # autoregressive rollout horizon, in which case the line below is also needed
++dataloader.test_args.multistep=10 \ # allow the dataloader to load trajectories of size 10
```

For testing the generative models, you can also use the following options:
Expand All @@ -31,6 +28,8 @@ For testing the generative models, you can also use the following options:
++module.inference.rollout_iterations=10 \ # number of auto-regressive steps, 10 days by default.
```

See [Pipeline API](args.md) for a full list of arguments.

## Compute model outputs and metrics separately

You can compute model outputs and metrics separately. In that case, first run evaluation as follows:
Expand Down
2 changes: 1 addition & 1 deletion docs/user_guide/index.md
Expand Up @@ -14,7 +14,7 @@ The main Python script (`main_hydra.py`, which runs a model pipeline) is pointed

The config is constructed from the base config `configs/config.yaml` and is extended with configs under each folder such as `config/module/` and `config/dataloader/`.

You can also override arguments by CLI (see [Arguments]() for useful arguments).
You can also override arguments by CLI (see [Pipeline API](args.md) for a full list of arguments).

Example:
```sh
Expand Down
2 changes: 2 additions & 0 deletions docs/user_guide/train.md
Expand Up @@ -21,6 +21,8 @@ python -m geoarches.main_hydra \
++max_steps=300000 \ # maximum number of steps for training; it's good to leave this at 300k for ERA5 training
++save_step_frequency=50000 \ # if you need to save checkpoints at a higher frequency
```

See [Pipeline API](args.md) for a full list of arguments.

## Run on SLURM

To run on a SLURM cluster, you can create a `configs/cluster` folder inside your working directory and put a `custom_slurm.yaml` configuration file in it with custom arguments. Then you can tell geoarches to use this configuration file with
Expand Down
1 change: 1 addition & 0 deletions mkdocs.yaml
Expand Up @@ -39,6 +39,7 @@ nav:
- Train: user_guide/train.md
- Run and evaluate: user_guide/evaluate.md
- Custom models: user_guide/custom_models.md
- Pipeline API: user_guide/args.md
- Contributing:
- Contribute to project: contributing/contribute.md
- Report bug or feature request: contributing/bug.md
Expand Down
