A simple trick for a fully deterministic ROIAlign, and thus MaskRCNN training and inference #4723
Comments
Thank you for sharing your tip. I have tried your solution, but the loss values across runs were not identical. Could you please let me know which version of PyTorch you are using?
Thank you @ASDen.
Oh, my code became reproducible after applying the following snippet. Thank you!!

```python
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
```
You are a real lifesaver.
Works like a charm!
Thank you @ASDen. I have tried your solution with Sparse R-CNN (built on detectron2) on an RTX 3090 GPU. Although training is now fully reproducible, the loss did not decrease. My modifications to ROIAlign are as follows.
Non-determinism of MaskRCNN
There have been many discussions and inquiries in this repo about a fully deterministic MaskRCNN (e.g. #4260, #3203, #2615, #2480), and also in other detection repositories (e.g. MMDetection here and here, and torchvision here). Unfortunately, even after seeding everything and setting PyTorch's deterministic flags, results are still not repeatable.
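The non-repeatability ultimately stems from floating-point addition not being associative: parallel GPU kernels commit partial results in a thread-scheduling-dependent order, so the rounded total can differ from run to run. A minimal NumPy illustration of the non-associativity itself:

```python
import numpy as np

# Floating-point addition is not associative, so the order in which
# contributions are committed (e.g. by atomicAdd) changes the result.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (a + b) + c   # -> 1.0
right = a + (b + c)  # -> 0.0 (1.0 is lost next to 1e8 at float32 precision)
print(left, right)
```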
It boils down to the fact that some of the PyTorch / torchvision ops used don't have a deterministic GPU implementation (most notably due to using `atomicAdd` in the backward pass). So the only workaround has been to train for as long as possible to reduce the variance in the results. It is worth noting that not only training but also evaluation (see #2480) of MaskRCNN (and in fact most detectron2 models) is non-deterministic.

Based on the minimal example in #4260, I analyzed the ops used by MaskRCNN and found that the main source of non-determinism is the backward pass of
ROIAlign
(see here).

Proposed solution
I am proposing here a simple trick that makes `ROIAlign` practically fully reproducible, without touching the CUDA kernel!! It introduces only trivial additional memory and computation. It can be summarized as: truncate each gradient contribution to half precision, then accumulate it in double precision via `atomicAdd`.
In terms of code, this translates into simply modifying this function call to
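Since the exact kernel change is only linked rather than inlined here, the following NumPy sketch (a CPU-only illustration, not the actual CUDA code) shows why truncating each contribution to half precision and accumulating in double makes the sum order-independent:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-thread gradient contributions (float32, as in the kernel).
grads = rng.uniform(0.5, 1.0, size=4096).astype(np.float32)

def atomic_sum(values, order, acc_dtype):
    # Mimic atomicAdd: sequential accumulation in a given "thread" order.
    total = acc_dtype(0.0)
    for i in order:
        total = acc_dtype(total + acc_dtype(values[i]))
    return total

orders = [np.arange(grads.size), rng.permutation(grads.size)]

# Original kernel: float32 accumulation -- the result depends on the order.
plain = {float(atomic_sum(grads, o, np.float32)) for o in orders}

# Proposed trick: truncate to half, then accumulate in double.
# Each half value in [0.5, 1) is a multiple of 2**-11, so every partial
# sum is exactly representable in double and the order cannot matter.
halved = grads.astype(np.float16)
trick = {float(atomic_sum(halved, o, np.float64)) for o in orders}

print(len(plain), len(trick))  # the trick yields a single value for every order
```

The key point is that the truncation bounds the precision of each term so tightly that the wider accumulator holds every partial sum exactly, making the reduction associative in practice.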
Test
The conversion to `double` results in a trivial increase in memory & computation, but performing it after the truncation significantly increases reproducibility.

This solution was tested and found fully deterministic (loss values, and evaluation results on COCO) up to tens of thousands of steps (using the same code as in #4260) for:
Note on A100
Ampere by default uses the TF32 format for tensor-core computations, which means that the above truncation is done implicitly! So on Ampere-based devices it is enough to just cast to double, i.e.

Note: this is the default mode in PyTorch, but if TF32 is disabled for some reason (i.e. `torch.backends.cudnn.allow_tf32 = False`), then the above truncation with `.half()` is still necessary.

Note

`F.interpolate` and `F.grid_sample` also use `atomicAdd` in their backward passes, so the same trick should apply to them as well.
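To see why the explicit truncation is redundant under TF32: TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa, the same mantissa width as half precision. A NumPy simulation (the bit-masking here is only an illustration of TF32-style mantissa truncation, not NVIDIA's actual rounding):

```python
import numpy as np

def truncate_to_tf32(x):
    # TF32 keeps float32's 8-bit exponent but only a 10-bit mantissa.
    # Illustrate this by zeroing the low 13 mantissa bits of each float32.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.0, 1000).astype(np.float32)
tf32 = truncate_to_tf32(x)

# Within float16's exponent range, a TF32-truncated value round-trips
# through half exactly -- so implicit TF32 rounding subsumes .half().
roundtrip = tf32.astype(np.float16).astype(np.float32)
print(bool(np.array_equal(tf32, roundtrip)))  # True
```

The caveat is the exponent: float16 has only a 5-bit exponent, so values outside half's dynamic range would not round-trip, which is why the demo restricts itself to [0.5, 1).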
Would love to hear what people think about this!
@ppwwyyxx @fmassa