Some CNN-based single-stage object detectors significantly change the spatial locations of (salient) features as information moves from the input image space towards the feature map space. In particular, in the intermediate feature maps, activations might be spatially shifted, in many cases towards the center of the object.
The image below highlights salient features in the input image space (middle column) and in the feature map space (right column). Salient features are the ones the model uses to detect a bbox of the 'person' class.
See more models and saliency maps in examples.
This kind of behaviour does not occur in CNN-based classification architectures (ResNet, MobileNet, etc.), which is why CAM-based Explainable AI (XAI) methods are so well developed for classifiers. On the other hand, many object detectors, despite being designed to precisely estimate object locations, actually scramble the spatial location of object features in the latent space.
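To show why spatial alignment matters for classifiers, here is a minimal CAM sketch on a torchvision ResNet-50. It follows the standard CAM recipe (class-weighted sum of the last feature maps); the random input tensor is just a placeholder, and the hook/variable names are mine, not part of any library API:

```python
import torch
import torch.nn.functional as F
import torchvision

# Minimal CAM sketch for a classifier. CAM only works because the last
# feature map stays spatially aligned with the input image (up to the
# downsampling factor of the backbone).
model = torchvision.models.resnet50(weights="DEFAULT").eval()

features = {}
def hook(_, __, output):
    features["last_conv"] = output          # (1, 2048, H/32, W/32)
model.layer4.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224)          # placeholder input
with torch.no_grad():
    logits = model(image)
class_id = logits.argmax(dim=1).item()

# CAM = class-specific weighted sum of the final feature maps.
fc_weights = model.fc.weight[class_id]      # (2048,)
cam = torch.einsum("c,chw->hw", fc_weights, features["last_conv"][0])
cam = F.relu(cam)
cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```

If a detector's feature maps shifted activations towards object centers, the same weighted-sum trick would highlight the wrong pixels, which is the core issue discussed here.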
Judging by the examples, the following models (mostly) shift activations towards the center of the object:
While the following models mostly tend to preserve the spatial location of the activations (although not in all cases):
XAI can be used to estimate which part of the input (which features) contributes most to the model prediction. To visualize the most salient features, I applied D-RISE to the input image and to the feature map (activation tensor). See implementation results here. Similar results can be obtained with a different approach: visualizing normalized per-class slices of the raw classification head output (if available).
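As a rough illustration of that second approach, the sketch below normalizes one class slice of a dense head's raw output and upsamples it to image resolution. The `(num_classes, H, W)` layout, the sigmoid activation, and the function name are assumptions about a generic dense head, not any particular detector's API:

```python
import torch
import torch.nn.functional as F

def per_class_saliency(cls_logits: torch.Tensor, class_id: int,
                       image_size: tuple) -> torch.Tensor:
    """Visualize one class slice of a raw classification head output.

    cls_logits: (num_classes, H, W) raw per-cell class logits from a single
    level of a dense detection head (the exact layout depends on the model).
    """
    scores = cls_logits.sigmoid()[class_id]                    # (H, W) per-cell scores
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    # Upsample to image resolution so the map can be overlaid on the input.
    return F.interpolate(scores[None, None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
```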
Due to the loss design: only cells located close to the center of the object receive a gradient signal - IoU(target, prediction) is estimated only for them, see the iou_loss implementation in mmdetection. Therefore, the model effectively learns to move features towards the center of the object. This is less of an issue for e.g. two-stage detectors (Faster R-CNN) or RetinaNet, see examples.
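To make the mechanism concrete, here is a toy sketch of a center-prior assignment: only cells near the box center are marked positive, so only they would receive the IoU loss gradient. It is a simplification with made-up names and a made-up radius, not the actual mmdetection assigner code:

```python
import torch

def center_cell_mask(grid_h: int, grid_w: int, stride: int,
                     bbox_xyxy: torch.Tensor, radius_cells: float = 1.5) -> torch.Tensor:
    """Toy center-prior assignment ("center sampling"-style).

    Cells whose centers fall within `radius_cells` cells of the box center
    (and inside the box) are positives; only positives get the IoU loss.
    """
    ys = (torch.arange(grid_h, dtype=torch.float32) + 0.5) * stride
    xs = (torch.arange(grid_w, dtype=torch.float32) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")             # cell centers in pixels

    x1, y1, x2, y2 = bbox_xyxy.tolist()
    box_cx, box_cy = (x1 + x2) / 2, (y1 + y2) / 2

    close_to_center = ((cx - box_cx).abs() < radius_cells * stride) & \
                      ((cy - box_cy).abs() < radius_cells * stride)
    inside_box = (cx > x1) & (cx < x2) & (cy > y1) & (cy < y2)
    return close_to_center & inside_box

# Example: 640x640 image, stride-32 level, one ground-truth box.
mask = center_cell_mask(20, 20, 32, torch.tensor([100., 120., 400., 480.]))
print(mask.sum().item(), "cells out of", mask.numel(), "get the IoU loss gradient")
```

Because the classification/IoU targets are concentrated in those few central cells, the most discriminative activations tend to end up there as well.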
This can limit anything that leverages internal network activations to recover spatial insights, e.g.:
- Activation-based XAI methods, which rely on feature maps in one way or another, cannot always be directly applied to explain object detectors, because the feature map space might not preserve spatial information well (a quick check for this is sketched after this list).
- Obtaining a class activation map, as done in Fig. 2 of the YOLO paper.
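A quick way to check this on your own detector is to compare the strongest feature-map activation against the ground-truth box. The helper below is a hypothetical sketch: the function name, the stride argument, and the way the feature map is extracted (e.g. via a forward hook) are all assumptions:

```python
import torch

def activation_peak_vs_box(feature_map: torch.Tensor, stride: int,
                           bbox_xyxy: torch.Tensor) -> dict:
    """Locate the strongest activation and compare it with the GT box.

    feature_map: (C, H, W) activations from one backbone/neck level.
    """
    energy = feature_map.abs().sum(dim=0)                      # (H, W) per-cell magnitude
    flat_idx = energy.flatten().argmax().item()
    py, px = divmod(flat_idx, energy.shape[1])
    peak_xy = ((px + 0.5) * stride, (py + 0.5) * stride)       # back to image pixels

    x1, y1, x2, y2 = bbox_xyxy.tolist()
    center_xy = ((x1 + x2) / 2, (y1 + y2) / 2)
    inside = x1 < peak_xy[0] < x2 and y1 < peak_xy[1] < y2
    dist = ((peak_xy[0] - center_xy[0]) ** 2 + (peak_xy[1] - center_xy[1]) ** 2) ** 0.5
    return {"peak_xy": peak_xy, "box_center_xy": center_xy,
            "peak_inside_box": inside, "peak_to_center_px": dist}
```

If the peak consistently sits near the box center regardless of where the salient image evidence is, the model likely belongs to the "shifts activations towards the center" group above.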