This project was created by a two-person team. It implements the RetinaNet architecture for a face detection task, with training and evaluation on the WIDER FACE dataset.
- project documentation: https://github.com/jkwiatk1/retinanet-face-detection/blob/main/data/Dokumentacja_ko%C5%84cowa_GSN.pdf
- result presentation: https://github.com/jkwiatk1/retinanet-face-detection/blob/main/data/GSN%20-%20Etap%203.pptx
The backbone is built from a PyTorch model with pre-trained weights.
Created based on:
Low-resolution feature maps capture more global information about the image and carry richer semantic meaning, while high-resolution feature maps focus on local information and provide more accurate spatial information. The goal of a Feature Pyramid Network (FPN) is to combine the high- and low-resolution feature maps so that the resulting features have both accurate spatial information and rich semantic meaning. The FPN extracts feature maps that are then fed into a detector, such as a Region Proposal Network (RPN).
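The way coarse and fine maps are combined can be sketched as a top-down merge: each coarser map is upsampled by 2 and added to the next finer map. This is a minimal illustration in plain Python on single-channel toy maps, not the project's PyTorch code. The names `upsample2x` and `merge_top_down` are hypothetical, nearest-neighbour upsampling is an assumption, and the 1 × 1 lateral convolutions of a real FPN are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge_top_down(laterals):
    """Merge maps ordered fine -> coarse; each map is half the size of the previous."""
    merged = [laterals[-1]]                        # coarsest level is used as-is
    for lateral in reversed(laterals[:-1]):
        up = upsample2x(merged[0])                 # bring coarser map to this size
        summed = [[a + b for a, b in zip(r1, r2)]  # element-wise addition
                  for r1, r2 in zip(lateral, up)]
        merged.insert(0, summed)
    return merged


# Toy single-channel maps: 4x4 (fine), 2x2, 1x1 (coarse).
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = merge_top_down([c3, c4, c5])
```

In this toy run every cell of `p3` ends up containing contributions from all three levels, which is the point of the merge: the finest map keeps its resolution but also receives the coarser, more semantic signal.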
The RPN slides a window over the feature maps and predicts, at each location, an objectness score (object or no object) and the object's bounding box.
For each scale level (say P4), a 3 × 3 convolution is applied over the feature maps, followed by separate 1 × 1 convolutions for objectness prediction and bounding box regression. These 3 × 3 and 1 × 1 convolutional layers are called the RPN head. The same head is applied to the feature maps of every scale level.
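Because the same head runs on every level, only the spatial size of its outputs changes from level to level. The sketch below tabulates those output shapes in plain Python. It assumes the 3 × 3 convolution uses padding 1 (so the spatial size is preserved), one objectness score and four box offsets per anchor, and a hypothetical set of level sizes. `rpn_head_outputs` is an illustrative name, not a function from the project.

```python
def rpn_head_outputs(level_sizes, num_anchors):
    """Output shapes (channels, height, width) of a shared RPN head per level.

    Assumes the 3x3 conv keeps the spatial size (padding 1); the two 1x1
    convs then give one objectness score and four box offsets per anchor.
    """
    shapes = {}
    for name, (h, w) in level_sizes.items():
        shapes[name] = {
            "objectness": (num_anchors, h, w),
            "box_deltas": (4 * num_anchors, h, w),
        }
    return shapes


# Hypothetical pyramid for a 256x256 input with strides 4, 8, 16, 32, 64.
levels = {"P2": (64, 64), "P3": (32, 32), "P4": (16, 16), "P5": (8, 8), "P6": (4, 4)}
shapes = rpn_head_outputs(levels, num_anchors=3)
```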
The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes.
The subnet is a small FCN that applies four 3 × 3 conv layers, each with C filters and each followed by a ReLU activation, followed by a final 3 × 3 conv layer with KA filters (K classes, A = 9 anchors, C = 256 filters).
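The layer layout above can be written down as a small specification and used to count parameters. This is a plain-Python sketch, not the project's implementation. `classification_subnet_layers` and `conv_params` are hypothetical helpers, and K = 1 is an assumption made here because the task has a single "face" class.

```python
def classification_subnet_layers(num_classes, num_anchors=9, channels=256):
    """Return (in_channels, out_channels, kernel_size) for each conv layer."""
    # Four 3x3 convs with C filters; each is followed by a ReLU.
    layers = [(channels, channels, 3) for _ in range(4)]
    # Final 3x3 conv with K*A filters: one score per class and anchor.
    layers.append((channels, num_classes * num_anchors, 3))
    return layers


def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution."""
    return in_ch * out_ch * k * k + out_ch


layers = classification_subnet_layers(num_classes=1)  # K = 1 is an assumption
total_params = sum(conv_params(i, o, k) for i, o, k in layers)
```

With these values the final layer emits KA = 9 channels per spatial location, and almost all of the subnet's parameters sit in the four 256-to-256 layers.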
The regression subnet is a small FCN attached to each pyramid level to regress the offset from each anchor box to a nearby ground-truth object, if one exists.
It is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location.
It is a class-agnostic bounding box regressor: it uses fewer parameters than a class-specific one and was found to be equally effective.
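To make the "offset from an anchor box to a ground-truth object" concrete, here is a plain-Python sketch of one common way to encode and decode the four values per anchor. The centre/size parameterization with log-scaled width and height is an assumption borrowed from the R-CNN family; the text above does not specify which one this project uses. `encode` and `decode` are hypothetical names.

```python
import math


def _center_size(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return box[0] + 0.5 * w, box[1] + 0.5 * h, w, h


def encode(anchor, gt):
    """Offsets (dx, dy, dw, dh) that map an anchor onto a ground-truth box."""
    ax, ay, aw, ah = _center_size(anchor)
    gx, gy, gw, gh = _center_size(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(anchor, deltas):
    """Apply predicted offsets to an anchor and return (x1, y1, x2, y2)."""
    ax, ay, aw, ah = _center_size(anchor)
    dx, dy, dw, dh = deltas
    cx, cy = ax + dx * aw, ay + dy * ah          # shift the centre
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # rescale width and height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (0.0, 0.0, 10.0, 10.0)
gt_box = (2.0, 2.0, 12.0, 22.0)
deltas = encode(anchor, gt_box)       # regression target for this anchor
recovered = decode(anchor, deltas)    # round trip back to the box
```

Under this scheme the 4A outputs per location are simply one `(dx, dy, dw, dh)` tuple for each of the A anchors, shared across classes.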
During training, the total focal loss of an image is computed as the sum of the focal loss over all of its roughly 100k anchors, normalized by the number of anchors assigned to a ground-truth box.
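The normalization described above can be sketched in plain Python for the single-class case. The values alpha = 0.25 and gamma = 2 are the defaults from the focal loss paper and are assumptions here, since the text does not state which values the project uses. The guard against zero positive anchors is also an assumption, and `focal_loss` and `image_focal_loss` are hypothetical names.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss of one anchor; p is the predicted object probability, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p                # probability of the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha    # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def image_focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Sum over all anchors, divided by the number of positive anchors."""
    total = sum(focal_loss(p, y, alpha, gamma) for p, y in zip(probs, labels))
    num_pos = sum(labels)                         # anchors assigned to a ground-truth box
    return total / max(num_pos, 1)                # assumed guard for images without faces


# Toy image with five anchors, two of them assigned to a face.
probs = [0.9, 0.6, 0.1, 0.05, 0.3]
labels = [1, 1, 0, 0, 0]
loss = image_focal_loss(probs, labels)
```

Normalizing by the positives rather than by all anchors keeps the loss scale stable, because the vast majority of anchors are easy negatives that contribute almost nothing once the `(1 - p_t) ** gamma` factor is applied.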