
Error resuming training #152

Open
peastman opened this issue Nov 18, 2022 · 1 comment

peastman (Collaborator)

I just encountered an error I've never seen before. I used the --load-model command line argument to resume training from a checkpoint. At first everything seemed to be working correctly, but after completing four epochs it exited with this error.

Traceback (most recent call last):
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 164, in <module>
    main()
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 160, in main
    trainer.test(model, data)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 936, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 983, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1222, in _run
    self._log_hyperparams()
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1277, in _log_hyperparams
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Error while merging hparams: the keys ['load_model'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.
PhilippThoelke (Collaborator) commented Nov 18, 2022

It looks like training stopped after 4 epochs, and the error occurred while calling trainer.test in the training script. After fit we load the best model checkpoint and then evaluate it on the test set. The best checkpoint that was loaded was probably saved in a previous training run, so its stored value of load_model is likely None, while the current DataModule contains a different value for load_model.
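The conflict described above can be illustrated with a small stand-alone sketch. This is a simplified stand-in for the check Lightning performs before logging hparams, not its actual implementation, and the checkpoint/DataModule values below are hypothetical:

```python
def merge_hparams(module_hparams, datamodule_hparams):
    """Simplified version of Lightning's pre-logging check: keys present in
    both the module's and the datamodule's hparams must carry equal values."""
    conflicting = sorted(
        k for k in module_hparams
        if k in datamodule_hparams and module_hparams[k] != datamodule_hparams[k]
    )
    if conflicting:
        raise ValueError(
            f"Error while merging hparams: the keys {conflicting} are present "
            "in both the LightningModule's and LightningDataModule's hparams "
            "but have different values."
        )
    return {**datamodule_hparams, **module_hparams}

# A checkpoint saved in a run without --load-model stored load_model=None,
# while the resumed run's DataModule holds a checkpoint path (made-up value):
module_hparams = {"load_model": None}
datamodule_hparams = {"load_model": "checkpoints/last.ckpt"}
```

Calling merge_hparams on these two dicts raises the same kind of error seen in the traceback, while identical values for shared keys merge cleanly.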

The actual problem in this specific case is probably something else, since stopping training after 4 epochs was presumably not intended. This error is definitely not very intuitive, though.

A potential fix would be to just pass the test dataloader instead of the full DataModule.
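Sketched in isolation, the reason this fix would sidestep the error: if trainer.test receives only a dataloader, no DataModule hparams enter the merge, so there is nothing to conflict with. The helper below is illustrative, not Lightning or TorchMD-NET code:

```python
def hparams_to_log(module_hparams, datamodule_hparams=None):
    """With no DataModule involved, the result is trivially the module's own
    hparams; a conflict can only arise when both sources are present."""
    if datamodule_hparams is None:
        return dict(module_hparams)
    for key in module_hparams.keys() & datamodule_hparams.keys():
        if module_hparams[key] != datamodule_hparams[key]:
            raise ValueError(f"conflicting hparam: {key}")
    return {**datamodule_hparams, **module_hparams}

# In the training script this would roughly correspond to calling
#   trainer.test(model, dataloaders=data.test_dataloader())
# instead of
#   trainer.test(model, data)
```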
