
Error resuming training #152

Open
peastman opened this issue Nov 18, 2022 · 1 comment

peastman (Collaborator)

I just encountered an error I've never seen before. I used the --load-model command line argument to resume training from a checkpoint. At first everything seemed to be working correctly, but after completing four epochs it exited with this error.

Traceback (most recent call last):
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 164, in <module>
    main()
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 160, in main
    trainer.test(model, data)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 936, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 983, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1222, in _run
    self._log_hyperparams()
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1277, in _log_hyperparams
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Error while merging hparams: the keys ['load_model'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.
PhilippThoelke (Collaborator) commented Nov 18, 2022

It looks like training stopped after 4 epochs, and the error occurred while calling trainer.test in the training script. After fit we load the best model checkpoint and then evaluate it on the test set. The best checkpoint that was loaded was probably saved in a previous training run, so its stored value of load_model is likely None, while the current DataModule contains a different value for load_model.
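The conflict described above can be illustrated with a small stand-alone sketch. This is a simplified stand-in for the check Lightning performs before logging hparams, not its actual implementation, and the checkpoint/DataModule values below are hypothetical:

```python
def merge_hparams(module_hparams, datamodule_hparams):
    """Simplified version of Lightning's pre-logging check: keys present in
    both the module's and the datamodule's hparams must carry equal values."""
    conflicting = sorted(
        k for k in module_hparams
        if k in datamodule_hparams and module_hparams[k] != datamodule_hparams[k]
    )
    if conflicting:
        raise ValueError(
            f"Error while merging hparams: the keys {conflicting} are present "
            "in both the LightningModule's and LightningDataModule's hparams "
            "but have different values."
        )
    return {**datamodule_hparams, **module_hparams}

# A checkpoint saved in a run without --load-model stored load_model=None,
# while the resumed run's DataModule holds a checkpoint path (made-up value):
module_hparams = {"load_model": None}
datamodule_hparams = {"load_model": "checkpoints/last.ckpt"}
```

Calling merge_hparams on these two dicts raises the same kind of error seen in the traceback, while identical values for shared keys merge cleanly.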

The actual problem in this specific case is probably something else, since stopping training after 4 epochs was presumably not intended. This error is definitely not very intuitive, though.

A potential fix would be to just pass the test dataloader instead of the full DataModule.
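Sketched in isolation, the reason this fix would sidestep the error: if trainer.test receives only a dataloader, no DataModule hparams enter the merge, so there is nothing to conflict with. The helper below is illustrative, not Lightning or TorchMD-NET code:

```python
def hparams_to_log(module_hparams, datamodule_hparams=None):
    """With no DataModule involved, the result is trivially the module's own
    hparams; a conflict can only arise when both sources are present."""
    if datamodule_hparams is None:
        return dict(module_hparams)
    for key in module_hparams.keys() & datamodule_hparams.keys():
        if module_hparams[key] != datamodule_hparams[key]:
            raise ValueError(f"conflicting hparam: {key}")
    return {**datamodule_hparams, **module_hparams}

# In the training script this would roughly correspond to calling
#   trainer.test(model, dataloaders=data.test_dataloader())
# instead of
#   trainer.test(model, data)
```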
