Some CNN-based single-stage object detectors significantly change the spatial locations of (salient) features as information moves from the input image space towards the feature map space. In particular, in the intermediate feature maps, activations might be spatially shifted, in many cases towards the center of the object.
The image below highlights salient features in the input image space (middle column) and in the feature map space (right column). Salient features are the ones the model uses to detect a bbox of the 'person' class.
See more models and saliency maps in examples.
This kind of behaviour does not occur in CNN-based classification architectures (ResNet, MobileNet, etc.), which is why CAM-based Explainable AI (XAI) methods are so well developed for classifiers. On the other hand, many object detectors, despite being designed to precisely estimate object locations, actually scramble the spatial location of object features in the latent space.
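To show why spatial alignment matters for classifiers, here is a minimal CAM sketch on a torchvision ResNet-50. It follows the standard CAM recipe (class-weighted sum of the last feature maps); the random input tensor is just a placeholder, and the hook/variable names are mine, not part of any library API:

```python
import torch
import torch.nn.functional as F
import torchvision

# Minimal CAM sketch for a classifier. CAM only works because the last
# feature map stays spatially aligned with the input image (up to the
# downsampling factor of the backbone).
model = torchvision.models.resnet50(weights="DEFAULT").eval()

features = {}
def hook(_, __, output):
    features["last_conv"] = output          # (1, 2048, H/32, W/32)
model.layer4.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224)          # placeholder input
with torch.no_grad():
    logits = model(image)
class_id = logits.argmax(dim=1).item()

# CAM = class-specific weighted sum of the final feature maps.
fc_weights = model.fc.weight[class_id]      # (2048,)
cam = torch.einsum("c,chw->hw", fc_weights, features["last_conv"][0])
cam = F.relu(cam)
cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```

If a detector's feature maps shifted activations towards object centers, the same weighted-sum trick would highlight the wrong pixels, which is the core issue discussed here.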
Judging by the examples, the following models (mostly) shift activations towards the center of the object:
While the following models mostly tend to preserve the spatial location of the activations (although not in all cases):
XAI can be used to estimate which part of the input (which features) contributes most to the model prediction. To visualize the most salient features, I applied D-RISE to the input image and to the feature map (activation tensor). See implementation results here. Similar results can be obtained with a different approach: visualizing normalized per-class slices of the raw classification head output (if available).
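As a rough illustration of that second approach, the sketch below normalizes one class slice of a dense head's raw output and upsamples it to image resolution. The `(num_classes, H, W)` layout, the sigmoid activation, and the function name are assumptions about a generic dense head, not any particular detector's API:

```python
import torch
import torch.nn.functional as F

def per_class_saliency(cls_logits: torch.Tensor, class_id: int,
                       image_size: tuple) -> torch.Tensor:
    """Visualize one class slice of a raw classification head output.

    cls_logits: (num_classes, H, W) raw per-cell class logits from a single
    level of a dense detection head (the exact layout depends on the model).
    """
    scores = cls_logits.sigmoid()[class_id]                    # (H, W) per-cell scores
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    # Upsample to image resolution so the map can be overlaid on the input.
    return F.interpolate(scores[None, None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
```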
Due to the loss design: only cells located close to the center of the object receive a gradient signal - IoU(target, prediction) is estimated only for them, see the iou_loss implementation in mmdetection. Therefore, the model effectively learns to move features towards the center of the object. This is less of an issue for e.g. two-stage detectors (Faster R-CNN) or RetinaNet, see examples.
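To make the mechanism concrete, here is a toy sketch of a center-prior assignment: only cells near the box center are marked positive, so only they would receive the IoU loss gradient. It is a simplification with made-up names and a made-up radius, not the actual mmdetection assigner code:

```python
import torch

def center_cell_mask(grid_h: int, grid_w: int, stride: int,
                     bbox_xyxy: torch.Tensor, radius_cells: float = 1.5) -> torch.Tensor:
    """Toy center-prior assignment ("center sampling"-style).

    Cells whose centers fall within `radius_cells` cells of the box center
    (and inside the box) are positives; only positives get the IoU loss.
    """
    ys = (torch.arange(grid_h, dtype=torch.float32) + 0.5) * stride
    xs = (torch.arange(grid_w, dtype=torch.float32) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")             # cell centers in pixels

    x1, y1, x2, y2 = bbox_xyxy.tolist()
    box_cx, box_cy = (x1 + x2) / 2, (y1 + y2) / 2

    close_to_center = ((cx - box_cx).abs() < radius_cells * stride) & \
                      ((cy - box_cy).abs() < radius_cells * stride)
    inside_box = (cx > x1) & (cx < x2) & (cy > y1) & (cy < y2)
    return close_to_center & inside_box

# Example: 640x640 image, stride-32 level, one ground-truth box.
mask = center_cell_mask(20, 20, 32, torch.tensor([100., 120., 400., 480.]))
print(mask.sum().item(), "cells out of", mask.numel(), "get the IoU loss gradient")
```

Because the classification/IoU targets are concentrated in those few central cells, the most discriminative activations tend to end up there as well.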
This can limit anything that leverages internal network activations to recover spatial insights, e.g.:
- Activation-based XAI methods, which rely on feature maps in one way or another, cannot always be directly applied to explain object detectors, because the feature map space might not preserve spatial information well (a quick check for this is sketched after this list).
- Obtaining a class activation map, as done in Fig. 2 of the YOLO paper.
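A quick way to check this on your own detector is to compare the strongest feature-map activation against the ground-truth box. The helper below is a hypothetical sketch: the function name, the stride argument, and the way the feature map is extracted (e.g. via a forward hook) are all assumptions:

```python
import torch

def activation_peak_vs_box(feature_map: torch.Tensor, stride: int,
                           bbox_xyxy: torch.Tensor) -> dict:
    """Locate the strongest activation and compare it with the GT box.

    feature_map: (C, H, W) activations from one backbone/neck level.
    """
    energy = feature_map.abs().sum(dim=0)                      # (H, W) per-cell magnitude
    flat_idx = energy.flatten().argmax().item()
    py, px = divmod(flat_idx, energy.shape[1])
    peak_xy = ((px + 0.5) * stride, (py + 0.5) * stride)       # back to image pixels

    x1, y1, x2, y2 = bbox_xyxy.tolist()
    center_xy = ((x1 + x2) / 2, (y1 + y2) / 2)
    inside = x1 < peak_xy[0] < x2 and y1 < peak_xy[1] < y2
    dist = ((peak_xy[0] - center_xy[0]) ** 2 + (peak_xy[1] - center_xy[1]) ** 2) ** 0.5
    return {"peak_xy": peak_xy, "box_center_xy": center_xy,
            "peak_inside_box": inside, "peak_to_center_px": dist}
```

If the peak consistently sits near the box center regardless of where the salient image evidence is, the model likely belongs to the "shifts activations towards the center" group above.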