How does pl synchronise objects and metrics when using DDP? #6334
Unanswered
heng-yuwen
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Metric will only sync after each step if you initialize them with |
I want to accumulate a confusion matrix at each step and then, at the end of each epoch, calculate the accuracy and other segmentation metrics from it. However, it seems that in DDP mode the metrics are synchronised at each step, so after calling .compute(), the final value is multiplied by the number of processes. Here is the sample code I use to create the ConfusionMatrix object and update its value:
It seems that the object self.confmat_valid is automatically synchronised at each step, so when compute is called, pl sums self.confmat_valid.confmat across the GPUs even though every copy of the confmat is already the same. As a result, the final result is multiplied by the number of processes. Is this behaviour intended, or did I make a mistake?
By the way, I use ddp_cpu mode to debug.
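The arithmetic behind the suspicion above can be sketched without Lightning at all. The following is a minimal pure-Python simulation, not the library's implementation: the helper names (`update_confmat`, `sum_reduce`, `accuracy`) and the toy data are hypothetical, and it assumes that every process accumulates an identical confusion matrix before a sum reduction is applied. Whether that assumption matches what happens in this particular ddp_cpu run is not confirmed by the thread.

```python
# Pure-Python simulation of a sum reduction over per-process confusion matrices.
# No Lightning or torch calls are used; every name here is hypothetical.

def update_confmat(confmat, preds, targets):
    """Accumulate one batch into a square confusion matrix (rows = target class)."""
    for p, t in zip(preds, targets):
        confmat[t][p] += 1
    return confmat


def sum_reduce(per_rank_confmats):
    """Element-wise sum over ranks, mimicking a 'sum' reduction across processes."""
    n = len(per_rank_confmats[0])
    return [
        [sum(cm[i][j] for cm in per_rank_confmats) for j in range(n)]
        for i in range(n)
    ]


def accuracy(confmat):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confmat)
    correct = sum(confmat[i][i] for i in range(len(confmat)))
    return correct / total


world_size = 2      # assumed number of processes
num_classes = 2
preds = [0, 1, 1, 0]    # toy predictions
targets = [0, 1, 0, 0]  # toy labels

# Assumption: every rank sees the same batch, so all local matrices are identical.
per_rank = [
    update_confmat([[0] * num_classes for _ in range(num_classes)], preds, targets)
    for _ in range(world_size)
]

single = per_rank[0]            # what one process accumulated
reduced = sum_reduce(per_rank)  # what a sum across processes produces

print(single)
print(reduced)
print(accuracy(single), accuracy(reduced))
```

Under that assumption the raw counts scale with the number of processes, while ratio-based metrics such as accuracy come out the same, because the scaling factor cancels between numerator and denominator. If instead each process received a disjoint shard of the data, the summed matrix would be the correct global one.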