
Metrics are in random order #255

Open
peastman opened this issue Jan 18, 2024 · 10 comments
@peastman
Collaborator

At some point, the metrics.csv file seems to have started putting the columns in a random order. Literally: every training run puts them in a different order! This is confusing and makes the file hard to read.

How about sorting the columns alphabetically? That will give a consistent and logical order: epoch first, all the training losses grouped together, then all the validation losses grouped together in the same order as the training losses.

Bonus points if it can output the epoch as an integer rather than a floating point number.
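
A minimal sketch of what the suggestion amounts to (the key names below are illustrative, not the project's real ones): sort the metrics dict once before handing it to the logger, and the column order is pinned, with "epoch" landing first alphabetically.

# Sketch of the proposed ordering, with made-up key names.
result_dict = {
    "val_total_mse_loss": 0.21,
    "lr": 1e-4,
    "train_total_mse_loss": 0.18,
    "epoch": 3,
}
result_dict = dict(sorted(result_dict.items()))
print(list(result_dict))
# ['epoch', 'lr', 'train_total_mse_loss', 'val_total_mse_loss']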

@RaulPPelaez
Collaborator

That is curious behavior indeed...
Lightning takes care of writing the metrics; all TMDNet does is call log_dict:

self.log_dict(result_dict, sync_dist=True)

The fact that it is random makes me think of this: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations
But that looks like old behavior from before Python 3.7...
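
For illustration of the hash-seed angle: dicts have preserved insertion order since Python 3.7, but iteration order over a set of strings still varies with PYTHONHASHSEED, so hash randomization could only matter if a set were involved somewhere. (The script name here is just illustrative.)

# Run twice with different seeds, e.g.
#   PYTHONHASHSEED=1 python demo.py
#   PYTHONHASHSEED=2 python demo.py
metrics = {"epoch", "lr", "train_y_mse_loss", "val_y_mse_loss"}
print(list(metrics))                   # set order depends on the hash seed
print(list({"epoch": 0, "lr": 1e-4}))  # dict order is always ['epoch', 'lr']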

This is how CSVLogger writes metrics down the line:

    def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
        """Record metrics."""

        def _handle_value(value: Union[Tensor, Any]) -> Any:
            if isinstance(value, Tensor):
                return value.item()
            return value

        if step is None:
            step = len(self.metrics)

        metrics = {k: _handle_value(v) for k, v in metrics_dict.items()}
        metrics["step"] = step
        self.metrics.append(metrics)

    def save(self) -> None:
        """Save recorded metrics into files."""
        if not self.metrics:
            return

        new_keys = self._record_new_keys()
        file_exists = self._fs.isfile(self.metrics_file_path)

        if new_keys and file_exists:
            # we need to re-write the file if the keys (header) change
            self._rewrite_with_new_header(self.metrics_keys)

        with self._fs.open(self.metrics_file_path, mode=("a" if file_exists else "w"), newline="") as file:
            writer = csv.DictWriter(file, fieldnames=self.metrics_keys)
            if not file_exists:
                # only write the header if we're writing a fresh file
                writer.writeheader()
            writer.writerows(self.metrics)

        self.metrics = []  # reset

I do not see anything there that would make this random after Python 3.6.

AFAICT there is also nothing random in how LNNP constructs the metrics dict:

def on_validation_epoch_end(self):
    if not self.trainer.sanity_checking:
        # construct dict of logged metrics
        result_dict = {
            "epoch": float(self.current_epoch),
            "lr": self.trainer.optimizers[0].param_groups[0]["lr"],
        }
        result_dict.update(self._get_mean_loss_dict_for_type("total"))
        result_dict.update(self._get_mean_loss_dict_for_type("y"))
        result_dict.update(self._get_mean_loss_dict_for_type("neg_dy"))
        self.log_dict(result_dict, sync_dist=True)
        self._reset_losses_dict()
@RaulPPelaez
Collaborator

BTW I cannot think of a reason for epoch to be a float; let's change that... I will include it in #231

@RaulPPelaez
Collaborator

Found this issue: Lightning-AI/pytorch-lightning#18978

@RaulPPelaez
Collaborator

The fix is already merged and will be included in Lightning 2.2, so we just have to wait for that release. Lightning-AI/pytorch-lightning#19159
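
In the meantime, a possible stopgap (my sketch, not part of the codebase): rewrite an existing metrics.csv with alphabetically sorted columns so runs become comparable again. Assumes pandas is available and the file is the usual metrics.csv written by CSVLogger.

import pandas as pd

df = pd.read_csv("metrics.csv")
df = df[sorted(df.columns)]            # "epoch" sorts to the front
df.to_csv("metrics.csv", index=False)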

@giadefa
Contributor

giadefa commented Jan 19, 2024 via email

@RaulPPelaez
Collaborator

The different losses and other metadata are packed into a dictionary at the end of each epoch and sent to Lightning's log_dict.

@stefdoerr
Collaborator

Dictionaries preserve insertion order since Python 3.7, so that's not the issue.
I guess the issue is that the metrics are added to the dict in a random order.
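
A small demonstration of that reading: the dict faithfully reflects whatever order its keys were inserted in, so shuffled insertion upstream yields a shuffled CSV header; the dict itself adds no randomness. (The shuffle here is a stand-in for the upstream bug, not the actual mechanism.)

import random

keys = ["epoch", "lr", "train_total_mse_loss", "val_total_mse_loss"]
random.shuffle(keys)                 # stand-in for the upstream bug
metrics = {k: 0.0 for k in keys}
print(list(metrics))                 # matches the shuffled insertion order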

@RaulPPelaez
Collaborator

The issue is explained here: Lightning-AI/pytorch-lightning#18978
It is an internal bug in PyTorch Lightning.

@peastman
Collaborator Author

Ok, good to know.

@guillemsimeon
Collaborator

Thanks, Raul
