Repeated training not deterministic despite identical setup and reproducibility flags #4260
Comments
I am facing a very similar issue. Did you find a reason for this behaviour, and do you have any suggestions on how to fix it?
I'm still facing the issue. Without having debugged this in more detail, and judging only by the losses of the three runs, the divergence already shows up in the first logged iterations. There have been other issues about this that were closed in the past (e.g. #2480).
Is there any news or advice on possible reasons for this issue?
Maybe I missed something in your setup, but I suspect you need to find a way to ensure the dataloader works deterministically as well.
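In case it helps, the pattern for deterministic dataloading from the PyTorch reproducibility notes (linked below in the original report) looks roughly like this; the `dataset` here is just a stand-in:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Each worker re-derives its seed from torch's per-worker base seed,
    # so NumPy/random-based augmentations are reproducible across runs.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)  # fixes the shuffling order across runs

dataset = torch.arange(16)  # stand-in for a real dataset

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,
)
```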
Also, depending on the model you use, you might face:
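Whatever the specific operations are, a general way to surface model-dependent nondeterminism is to let PyTorch flag it at runtime: with `torch.use_deterministic_algorithms(True)`, operations that have no deterministic implementation raise a `RuntimeError` instead of silently producing run-to-run differences. A small sketch:

```python
import torch

# Fail loudly on ops that have no deterministic implementation.
torch.use_deterministic_algorithms(True)

x = torch.randn(1, 3, 8, 8, requires_grad=True)
# Bilinear upsampling is a typical example: its CUDA backward pass is
# nondeterministic, so on a GPU the backward() below raises a
# RuntimeError under the flag above (on CPU it runs deterministically).
y = torch.nn.functional.interpolate(
    x, scale_factor=2, mode="bilinear", align_corners=False
)
y.sum().backward()
```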
Hi, I'm working on an experiment where I noticed large differences between models trained with identical configs and random seeds. I'm trying to understand what causes this.
I've upgraded to a more recent PyTorch version that introduced flags for deterministic training across multiple executions:
https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility
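For reference, the flags described on that page amount to roughly the following (a sketch; the seed value and call sites are up to the training script):

```python
import os
import random

import numpy as np
import torch

# Required by cuBLAS for deterministic GEMMs on CUDA >= 10.2; must be
# set before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds all CUDA devices

torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # disable nondeterministic autotuning
```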
However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ greatly on a custom dataset of mine (~2 AP).
These differences occur in multiple runs on the same machine (identical device, code, config, random seed).
I've been looking into reproducing this problem and can also observe it with the unaltered detectron2 demo training code. I've added a minimal script that reproduces the training; the first logged losses of three subsequent runs already differ considerably.
Instructions To Reproduce the Issue:

Script to reproduce the experiment (deterministic_example.py):
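The script itself isn't reproduced here; below is a hypothetical minimal sketch of a detectron2 training run of this kind, assuming the stock DefaultTrainer flow on a COCO Faster R-CNN config (the config file, seed, and iteration count are illustrative, not taken from the original script):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")
)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"
)
cfg.SEED = 42                   # detectron2 seeds torch/numpy/random from this
cfg.SOLVER.MAX_ITER = 100       # a few iterations suffice to compare losses
cfg.DATALOADER.NUM_WORKERS = 0  # rule out worker-related nondeterminism

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```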
run1:
run2:
run3:
Expected behavior:
I would expect the losses to be (largely) identical in the default training setup when using an identical machine, code, random seed, and config together with the PyTorch flags for deterministic training.