
Metrics are in random order #255

Open
peastman opened this issue Jan 18, 2024 · 10 comments
@peastman
Collaborator

At some point, the metrics.csv file seems to have started putting the columns in a random order. Literally: every training run puts them in a different order! This is confusing and makes the file hard to read.

How about sorting the columns alphabetically? That will give a consistent and logical order: epoch first, all the training losses grouped together, then all the validation losses grouped together in the same order as the training losses.

Bonus points if it can output the epoch as an integer rather than a floating point number.
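
A minimal sketch of what the suggestion amounts to (the key names below are illustrative, not the project's real ones): sort the metrics dict once before handing it to the logger, and the column order is pinned, with "epoch" landing first alphabetically.

# Sketch of the proposed ordering, with made-up key names.
result_dict = {
    "val_total_mse_loss": 0.21,
    "lr": 1e-4,
    "train_total_mse_loss": 0.18,
    "epoch": 3,
}
result_dict = dict(sorted(result_dict.items()))
print(list(result_dict))
# ['epoch', 'lr', 'train_total_mse_loss', 'val_total_mse_loss']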

@RaulPPelaez
Collaborator

That is curious behavior indeed...
Lightning takes care of writing the metrics; all TMDNet does is call log_dict:

self.log_dict(result_dict, sync_dist=True)

The fact that it is random makes me think of this: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations
But that looks like old behavior from before Python 3.7...
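
For illustration of the hash-seed angle: dicts have preserved insertion order since Python 3.7, but iteration order over a set of strings still varies with PYTHONHASHSEED, so hash randomization could only matter if a set were involved somewhere. (The script name here is just illustrative.)

# Run twice with different seeds, e.g.
#   PYTHONHASHSEED=1 python demo.py
#   PYTHONHASHSEED=2 python demo.py
metrics = {"epoch", "lr", "train_y_mse_loss", "val_y_mse_loss"}
print(list(metrics))                   # set order depends on the hash seed
print(list({"epoch": 0, "lr": 1e-4}))  # dict order is always ['epoch', 'lr']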

This is how CSVLogger writes metrics down the line:

    def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
        """Record metrics."""

        def _handle_value(value: Union[Tensor, Any]) -> Any:
            if isinstance(value, Tensor):
                return value.item()
            return value

        if step is None:
            step = len(self.metrics)

        metrics = {k: _handle_value(v) for k, v in metrics_dict.items()}
        metrics["step"] = step
        self.metrics.append(metrics)

    def save(self) -> None:
        """Save recorded metrics into files."""
        if not self.metrics:
            return

        new_keys = self._record_new_keys()
        file_exists = self._fs.isfile(self.metrics_file_path)

        if new_keys and file_exists:
            # we need to re-write the file if the keys (header) change
            self._rewrite_with_new_header(self.metrics_keys)

        with self._fs.open(self.metrics_file_path, mode=("a" if file_exists else "w"), newline="") as file:
            writer = csv.DictWriter(file, fieldnames=self.metrics_keys)
            if not file_exists:
                # only write the header if we're writing a fresh file
                writer.writeheader()
            writer.writerows(self.metrics)

        self.metrics = []  # reset

I do not see anything there that would make this random after Python 3.6.

AFAICT there is also nothing random in how LNNP constructs the metrics dict:

def on_validation_epoch_end(self):
    if not self.trainer.sanity_checking:
        # construct dict of logged metrics
        result_dict = {
            "epoch": float(self.current_epoch),
            "lr": self.trainer.optimizers[0].param_groups[0]["lr"],
        }
        result_dict.update(self._get_mean_loss_dict_for_type("total"))
        result_dict.update(self._get_mean_loss_dict_for_type("y"))
        result_dict.update(self._get_mean_loss_dict_for_type("neg_dy"))
        self.log_dict(result_dict, sync_dist=True)
        self._reset_losses_dict()
@RaulPPelaez
Collaborator

BTW I cannot think of a reason for epoch to be a float; let's change that... I will include it in #231

@RaulPPelaez
Collaborator

Found this issue: Lightning-AI/pytorch-lightning#18978

@RaulPPelaez
Collaborator

The fix is already merged and will be included in Lightning 2.2, so we just have to wait for that release. Lightning-AI/pytorch-lightning#19159
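
In the meantime, a possible stopgap (my sketch, not part of the codebase): rewrite an existing metrics.csv with alphabetically sorted columns so runs become comparable again. Assumes pandas is available and the file is the usual metrics.csv written by CSVLogger.

import pandas as pd

df = pd.read_csv("metrics.csv")
df = df[sorted(df.columns)]            # "epoch" sorts to the front
df.to_csv("metrics.csv", index=False)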

@giadefa
Contributor

giadefa commented Jan 19, 2024 via email

@RaulPPelaez
Collaborator

The different losses and other metadata are packed into a dictionary at the end of each epoch and sent to Lightning's log_dict.

@stefdoerr
Collaborator

Dictionaries preserve insertion order since Python 3.7, so that's not the issue.
I guess the issue is that the metrics are added to the dict in a random order.
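
A small demonstration of that reading: the dict faithfully reflects whatever order its keys were inserted in, so shuffled insertion upstream yields a shuffled CSV header; the dict itself adds no randomness. (The shuffle here is a stand-in for the upstream bug, not the actual mechanism.)

import random

keys = ["epoch", "lr", "train_total_mse_loss", "val_total_mse_loss"]
random.shuffle(keys)                 # stand-in for the upstream bug
metrics = {k: 0.0 for k in keys}
print(list(metrics))                 # matches the shuffled insertion order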

@RaulPPelaez
Collaborator

The issue is explained here: Lightning-AI/pytorch-lightning#18978
It is an internal bug in PyTorch Lightning.

@peastman
Collaborator Author

Ok, good to know.

@guillemsimeon
Collaborator

Thanks, Raul
