A simple trick for a fully deterministic ROIAlign, and thus MaskRCNN training and inference #4723
Comments
Thank you for sharing your tip. I have tried your solution, but the loss values across runs were not identical. Could you please let me know which version of PyTorch you are using?
Thank you @ASDen.
Oh, my code became reproducible after applying the following snippet. Thank you!!

```python
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
```
You are a real lifesaver.
Works like a charm!
Thank you @ASDen. I have tried your solution with Sparse R-CNN (built on detectron2) on an RTX 3090 GPU. Although training is now fully reproducible, the loss did not decrease. My modifications to ROIAlign are as follows.
Non-determinism of MaskRCNN
There have been many discussions and inquiries in this repo about a fully deterministic MaskRCNN (e.g. #4260, #3203, #2615, #2480), and also in other detection repositories (e.g. MMDetection here and here, and torchvision here). Unfortunately, even after seeding everything and setting PyTorch's deterministic flags, results are still not repeatable.
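The non-repeatability ultimately stems from floating-point addition not being associative: parallel GPU kernels commit partial results in a thread-scheduling-dependent order, so the rounded total can differ from run to run. A minimal NumPy illustration of the non-associativity itself:

```python
import numpy as np

# Floating-point addition is not associative, so the order in which
# contributions are committed (e.g. by atomicAdd) changes the result.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (a + b) + c   # -> 1.0
right = a + (b + c)  # -> 0.0 (1.0 is lost next to 1e8 at float32 precision)
print(left, right)
```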
It boils down to the fact that some of the PyTorch / torchvision ops used don't have a deterministic GPU implementation (most notably due to using `atomicAdd` in the backward pass). So the only workaround has been to train for as long as possible to reduce the variance in the results. It is worth noting that not only training but also evaluation (see #2480) of MaskRCNN (and in fact most detectron2 models) is non-deterministic.

Based on the minimal example in #4260, I analyzed the ops used by MaskRCNN and found that the main source of non-determinism is the backward pass of
ROIAlign
(see here).

Proposed solution
I am proposing here a simple trick that makes `ROIAlign` practically fully reproducible, without touching the CUDA kernel!! It introduces only trivial additional memory and computation. It can be summarized as: truncate each gradient contribution to half precision, then accumulate it in double precision via `atomicAdd`.
In terms of code, this translates into simply modifying this function call to
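Since the exact kernel change is only linked rather than inlined here, the following NumPy sketch (a CPU-only illustration, not the actual CUDA code) shows why truncating each contribution to half precision and accumulating in double makes the sum order-independent:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-thread gradient contributions (float32, as in the kernel).
grads = rng.uniform(0.5, 1.0, size=4096).astype(np.float32)

def atomic_sum(values, order, acc_dtype):
    # Mimic atomicAdd: sequential accumulation in a given "thread" order.
    total = acc_dtype(0.0)
    for i in order:
        total = acc_dtype(total + acc_dtype(values[i]))
    return total

orders = [np.arange(grads.size), rng.permutation(grads.size)]

# Original kernel: float32 accumulation -- the result depends on the order.
plain = {float(atomic_sum(grads, o, np.float32)) for o in orders}

# Proposed trick: truncate to half, then accumulate in double.
# Each half value in [0.5, 1) is a multiple of 2**-11, so every partial
# sum is exactly representable in double and the order cannot matter.
halved = grads.astype(np.float16)
trick = {float(atomic_sum(halved, o, np.float64)) for o in orders}

print(len(plain), len(trick))  # the trick yields a single value for every order
```

The key point is that the truncation bounds the precision of each term so tightly that the wider accumulator holds every partial sum exactly, making the reduction associative in practice.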
Test
The conversion to `double` results in a trivial increase in memory & computation, but performing it after the truncation significantly increases reproducibility.

This solution was tested and found fully deterministic (loss values, and evaluation results on COCO) up to tens of thousands of steps (using the same code as in #4260) for:
Note on A100
Ampere by default uses the TF32 format for tensor-core computations, which means that the above truncation is done implicitly! So on Ampere-based devices it is enough to just cast to double, i.e.

Note: this is the default mode in PyTorch, but if TF32 is disabled for some reason (i.e. `torch.backends.cudnn.allow_tf32 = False`), then the above truncation with `.half()` is still necessary.

Note

`F.interpolate` and `F.grid_sample` also use `atomicAdd` in their backward passes, so the same trick should apply to them as well.
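To see why the explicit truncation is redundant under TF32: TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa, the same mantissa width as half precision. A NumPy simulation (the bit-masking here is only an illustration of TF32-style mantissa truncation, not NVIDIA's actual rounding):

```python
import numpy as np

def truncate_to_tf32(x):
    # TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa.
    # Illustrate this by zeroing the low 13 mantissa bits of each float32.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.0, 1000).astype(np.float32)
tf32 = truncate_to_tf32(x)

# Within float16's exponent range, a TF32-truncated value round-trips
# through half exactly -- so implicit TF32 rounding subsumes .half().
roundtrip = tf32.astype(np.float16).astype(np.float32)
print(bool(np.array_equal(tf32, roundtrip)))  # True
```

The caveat is the exponent: float16 has only a 5-bit exponent, so values outside half's dynamic range would not round-trip, which is why the demo restricts itself to [0.5, 1).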
Would love to hear what people think about this!
@ppwwyyxx @fmassa