Clarifying how `log` with `sync_dist` and `on_epoch` from the `training_step` works? #20123
Unanswered
golmschenk asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hello, the documentation about `log` doesn't seem to specify quite what happens with a couple of combinations of settings. Notably, I'm interested in calculating a metric on each step (inside `training_step`), with the values reduced at the end of the epoch (using `on_step=False, on_epoch=True`). However, when running with DDP, I would also like to reduce across the DDP group. But, for performance, I only want this reduction to happen at the end of the epoch, not on each step. If I set `sync_dist=True`, does the sync happen on each step before accumulating? Or does it happen during the reduction at the end of the epoch, since that is when the logging occurs? If it happens on each step, is there a good built-in way to have this sync happen on the values which were already accumulated? To note, using the `on_train_epoch_end` hook will not work well in my case, as a single epoch would contain billions of values if they were not accumulated during the training step but kept as separate values. Thank you for your time!
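
For context, the logging call I'm asking about looks roughly like this (the module and loss are just placeholders for illustration):

```python
import torch
import lightning.pytorch as pl


class SketchModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        predictions = self.layer(inputs)
        loss = torch.nn.functional.mse_loss(predictions, targets)
        # Accumulate the per-step value and reduce it at the end of the epoch.
        # The question: does sync_dist=True trigger a cross-process sync on
        # every step, or only once during the epoch-end reduction?
        self.log('train_loss', loss, on_step=False, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```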
Replies: 1 comment 2 replies

- If you want to compute a metric under the DDP strategy, you can use TorchMetrics and implement the metric yourself. It will all-reduce the accumulated states across processes when you call compute().
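
A minimal sketch of that suggestion, assuming TorchMetrics is installed (the metric name and update logic here are made up for illustration). States are accumulated locally on every step with no communication; the cross-process sync only happens when compute() is called, e.g. once per epoch:

```python
import torch
from torchmetrics import Metric


class MeanOfValues(Metric):
    def __init__(self):
        super().__init__()
        # dist_reduce_fx tells TorchMetrics how to reduce each state across processes.
        self.add_state('total', default=torch.tensor(0.0), dist_reduce_fx='sum')
        self.add_state('count', default=torch.tensor(0.0), dist_reduce_fx='sum')

    def update(self, values: torch.Tensor) -> None:
        # Called on every step; purely local, no communication here.
        self.total += values.sum()
        self.count += values.numel()

    def compute(self) -> torch.Tensor:
        # Under DDP, the states are synced across processes before this runs,
        # so the result is the global mean.
        return self.total / self.count
```

If you log the metric object itself from `training_step` (e.g. `self.log('my_metric', self.my_metric, on_step=False, on_epoch=True)`), Lightning should call compute() for you at epoch end, so the distributed sync happens once per epoch rather than on every step.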