Logging issue when using custom metrics #6698
Unanswered
cemde asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 1 reply
-
Can you share the complete stacktrace? In my case the …
-
I am training a NN with PyTorch Lightning and would like to calculate the `BrierScore` at every step. I implemented this using the `torchmetrics.Metric` class. This breaks the logging.

In the `LightningModule`, I initialise a dictionary `self.metrics_dict` in which I store the metric functions. It looks like this (see the snippet below). At the end of each step I call:
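The actual snippet is not included in this extract. As an illustration only, here is a self-contained, plain-Python sketch of the pattern being described (a dict of metric objects, each evaluated at the end of a step); all names here are assumptions, and in the real module the entries would be `torchmetrics` objects, ideally held in a `torch.nn.ModuleDict`:

```python
# Hypothetical sketch of the metrics-dict pattern from the question.
# A toy Accuracy stands in for a torchmetrics object.
class Accuracy:
    def __call__(self, preds, targets):
        # Fraction of predictions that match their targets.
        return sum(int(p == t) for p, t in zip(preds, targets)) / len(targets)

metrics_dict = {"train_accuracy": Accuracy()}

def on_step_end(preds, targets):
    # At the end of each step, evaluate every metric; in a LightningModule
    # each value would then be passed to self.log(name, value).
    return {name: fn(preds, targets) for name, fn in metrics_dict.items()}

print(on_step_end([1, 0, 1, 1], [1, 0, 0, 1]))  # → {'train_accuracy': 0.75}
```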
I have a `ModelCheckpoint` configured:

The `ModelCheckpoint` works fine when I only use metrics that ship with the `torchmetrics` package, e.g. `Accuracy` and `F1` (a subset of the dict shown above). As soon as I add my own metric `BrierScore` (or any other custom one, for that matter), the `ModelCheckpoint` raises an exception (line 495 of `p_l/callbacks/model_checkpoint.py`):

My own metric works fine during debugging:
The odd thing is that this error arises after 5 training steps in epoch 0. Furthermore, the metric functions themselves work fine.
What am I doing wrong?
Snippets:
Creating the `self.metrics_dict`: