Clarifying how `log` with `sync_dist` and `on_epoch` from the `training_step` works? #20123
Unanswered
golmschenk asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hello, the documentation about `log` doesn't seem to specify quite what happens with a couple of combinations of settings. Notably, I'm interested in calculating a metric on each step (inside `training_step`), with the values reduced at the end of the epoch (using `on_step=False, on_epoch=True`). However, when running with DDP, I would also like to reduce across the DDP group. But, for performance, I only want this reduction to happen at the end of the epoch, not on each step. If I set `sync_dist=True`, does the sync happen on each step before accumulating? Or does it happen during the reduction at the end of the epoch, since that is when the logging occurs? If it happens on each step, is there a good built-in way to have this sync happen on the values which were already accumulated? To note, using the `on_train_epoch_end` hook will not work well in my case, as a single epoch would contain billions of values if they were not accumulated during the training step but kept as separate values. Thank you for your time!
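
For context, the logging call I'm asking about looks roughly like this (the module and loss are just placeholders for illustration):

```python
import torch
import lightning.pytorch as pl


class SketchModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        predictions = self.layer(inputs)
        loss = torch.nn.functional.mse_loss(predictions, targets)
        # Accumulate the per-step value and reduce it at the end of the epoch.
        # The question: does sync_dist=True trigger a cross-process sync on
        # every step, or only once during the epoch-end reduction?
        self.log('train_loss', loss, on_step=False, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```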
Replies: 1 comment 2 replies

- If you want to compute a metric under the DDP strategy, you can use TorchMetrics and implement the metric yourself. It will all-reduce the accumulated states across processes when you call compute().
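
A minimal sketch of that suggestion, assuming TorchMetrics is installed (the metric name and update logic here are made up for illustration). States are accumulated locally on every step with no communication; the cross-process sync only happens when compute() is called, e.g. once per epoch:

```python
import torch
from torchmetrics import Metric


class MeanOfValues(Metric):
    def __init__(self):
        super().__init__()
        # dist_reduce_fx tells TorchMetrics how to reduce each state across processes.
        self.add_state('total', default=torch.tensor(0.0), dist_reduce_fx='sum')
        self.add_state('count', default=torch.tensor(0.0), dist_reduce_fx='sum')

    def update(self, values: torch.Tensor) -> None:
        # Called on every step; purely local, no communication here.
        self.total += values.sum()
        self.count += values.numel()

    def compute(self) -> torch.Tensor:
        # Under DDP, the states are synced across processes before this runs,
        # so the result is the global mean.
        return self.total / self.count
```

If you log the metric object itself from `training_step` (e.g. `self.log('my_metric', self.my_metric, on_step=False, on_epoch=True)`), Lightning should call compute() for you at epoch end, so the distributed sync happens once per epoch rather than on every step.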