Merge pull request #127 from raimis/cli
Implement CLI
Raimondas Galvelis committed Oct 3, 2022
2 parents 554d45f + 2995f78 commit c1c4fcf
Showing 6 changed files with 6 additions and 8 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -40,7 +40,7 @@ url={https://openreview.net/forum?id=zNHzqZ9wrRB}
Training arguments can be specified either via a configuration YAML file or directly through command-line arguments. An example configuration file for a TorchMD Graph Network can be found in [examples/](https://github.com/compsciencelab/torchmd-net/blob/main/examples). For an example of how to train the network on the QM9 dataset, see [examples/](https://github.com/compsciencelab/torchmd-net/blob/main/examples). GPUs can be selected by index by listing the device IDs (as reported by `nvidia-smi`) in the `CUDA_VISIBLE_DEVICES` environment variable. Alternatively, the argument `--ngpus` can be used to select the number of GPUs to train on (`-1` uses all available GPUs or the ones specified in `CUDA_VISIBLE_DEVICES`).
```
mkdir output
-CUDA_VISIBLE_DEVICES=0 python torchmd-net/scripts/train.py --conf torchmd-net/examples/ET-QM9.yaml --log-dir output/
+CUDA_VISIBLE_DEVICES=0 tmn-train --conf torchmd-net/examples/ET-QM9.yaml --log-dir output/
```
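The GPU selection described above works because CUDA reads the `CUDA_VISIBLE_DEVICES` environment variable to decide which devices a process may see. A minimal Python sketch of parsing it (the helper `visible_gpu_ids` is illustrative, not part of torchmd-net):

```python
import os

def visible_gpu_ids(default=None):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES, or `default` if unset."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return default
    return [int(i) for i in raw.split(",") if i.strip()]

# Mirror the command above: expose only device 0 to this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(visible_gpu_ids())  # [0]
```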

## Pretrained models
@@ -60,7 +60,7 @@ As an example, have a look at `torchmdnet.priors.Atomref`.

## Multi-Node Training

-In order to train models on multiple nodes some environment variables have to be set, which provide all necessary information to PyTorch Lightning. In the following we provide an example bash script to start training on two machines with two GPUs each. The script has to be started once on each node. Once [`train.py`](https://github.com/compsciencelab/torchmd-net/blob/main/scripts/train.py) is started on all nodes, a network connection between the nodes will be established using NCCL.
+In order to train models on multiple nodes some environment variables have to be set, which provide all necessary information to PyTorch Lightning. In the following we provide an example bash script to start training on two machines with two GPUs each. The script has to be started once on each node. Once `tmn-train` is started on all nodes, a network connection between the nodes will be established using NCCL.

In addition to the environment variables, the argument `--num-nodes` has to be set to the number of nodes involved in training.

Expand All @@ -70,7 +70,7 @@ export MASTER_ADDR=hostname1
export MASTER_PORT=12910
mkdir -p output
-CUDA_VISIBLE_DEVICES=0,1 python torchmd-net/scripts/train.py --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/
+CUDA_VISIBLE_DEVICES=0,1 tmn-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/
```

- `NODE_RANK` : Integer indicating the node index. Must be `0` for the main node and incremented by one for each additional node.
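For concreteness, the launch script on a second machine would differ only in `NODE_RANK`. A sketch assuming the same `hostname1`/`12910` values as in the example above (the final `tmn-train` invocation is shown as a comment, since it requires the installed package and GPUs):

```shell
# Hypothetical launch script for the second node; values mirror the example above.
export NODE_RANK=1            # main node uses 0; each additional node increments by one
export MASTER_ADDR=hostname1  # must resolve to the main node from every machine
export MASTER_PORT=12910
mkdir -p output
echo "node ${NODE_RANK} will connect to ${MASTER_ADDR}:${MASTER_PORT}"
# CUDA_VISIBLE_DEVICES=0,1 tmn-train --conf torchmd-net/examples/ET-QM9.yaml --num-nodes 2 --log-dir output/
```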
2 changes: 1 addition & 1 deletion examples/README.md
@@ -3,7 +3,7 @@
## Training
We provide three example config files for the ET, for training on QM9, MD17, and ANI1 respectively. To train on a QM9 target other than `energy_U0`, change the parameter `dataset_arg` in the QM9 config file. Changing the MD17 molecule to train on works analogously. To train an ET from scratch, run the following from the torchmd-net directory:
```bash
-CUDA_VISIBLE_DEVICES=0,1 python scripts/train.py --conf examples/ET-{QM9,MD17,ANI1}.yaml
+CUDA_VISIBLE_DEVICES=0,1 tmn-train --conf examples/ET-{QM9,MD17,ANI1}.yaml
```
Use the `CUDA_VISIBLE_DEVICES` environment variable to select which and how many GPUs to train on. The example above selects the GPUs with indices 0 and 1. By default, the training code saves checkpoints and config files to a directory called `logs/`, which you can change either in the config `.yaml` file or via an additional command-line argument: `--log-dir path/to/log-dir`.

1 change: 1 addition & 0 deletions scripts
5 changes: 1 addition & 4 deletions setup.py
@@ -11,12 +11,9 @@
    print("Failed to retrieve the current version, defaulting to 0")
    version = "0"

-with open("requirements.txt") as f:
-    requirements = f.read().splitlines()

setup(
    name="torchmd-net",
    version=version,
    packages=find_packages(),
-    install_requires=requirements,
+    entry_points={"console_scripts": ["tmn-train = torchmdnet.scripts.train:main"]},
)
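The new `entry_points` stanza is what creates the `tmn-train` command: on install, setuptools generates a console script that imports `torchmdnet.scripts.train` and calls its `main()`. A minimal sketch of the pattern with a hypothetical package (`mypkg` and its `main` are illustrative, not the actual TorchMD-NET code):

```python
# mypkg/cli.py -- a hypothetical stand-in for torchmdnet/scripts/train.py
import argparse

def main(argv=None):
    # A console_scripts target is just a callable entry point;
    # accepting argv explicitly keeps the function easy to test.
    parser = argparse.ArgumentParser(prog="tmn-train")
    parser.add_argument("--conf", help="path to a YAML configuration file")
    parser.add_argument("--log-dir", default="logs/", help="output directory")
    args = parser.parse_args(argv)
    return f"conf={args.conf} log_dir={args.log_dir}"

# setup.py for mypkg would then declare:
# entry_points={"console_scripts": ["tmn-train = mypkg.cli:main"]}
if __name__ == "__main__":
    print(main())
```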
File renamed without changes.
File renamed without changes.

