Change of bin loss computation to avoid learning from empty annotations. #1011
TLDR: This fix leads to better performance in rotation prediction.
First of all, thank you very much for publishing your work. It was super helpful for my own work as well. While using it, I found a slight hiccup in the loss function for the rotation bin classification. The change is minor in code, but it has a rather big impact on the convergence and overall performance of the network when predicting the rotation of objects and therefore, at inference, also when predicting velocities.
I was using the nuScenes dataset, so I can't (without significant effort) be sure what the situation looks like for the KITTI dataset. However, the GenericDataset class has the parameter max_objs for both datasets. For the nuScenes dataset, this parameter means that there can be at most max_objs annotations per image. If fewer annotations are present in the image, the rest is simply filled up with default labels (basically zeros for every parameter). For simplicity, let's call this rest "placeholder annotations". This concept is completely fine as long as those "placeholder annotations" are not used for anything, except of course to not (!) predict anything, because it would not make sense for the network to always predict the maximum number of objects. However, this principle of not predicting objects is already trained for in the heatmap head.
I propose removing these placeholder annotations, or to be more precise, removing the indices from the output and target tensors where the mask tensor is zero. In all other loss functions this would have the same effect as masking the output with zeros, but for the entropy loss it is different. When we mask the output of the rotation bin classification, both logits become zero, so we basically say we are undecided whether the angle is in bin 1 or not in bin 1. But since the target defaults to 0 (which means not in bin 1), the cross-entropy of such an entry is not zero, so it still enters the averaged loss and affects the backward pass. Thus we do not (!) ignore the placeholder annotations when we mask the output. In most other loss functions, for example L1Loss
Loss = |pred - target| = |0 - 0| = 0,
masking the output has the effect of ignoring the placeholder annotations, but not so in the entropy loss since
e^0 != 0.
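To make the difference concrete, here is a minimal numeric sketch in plain Python (no PyTorch, so it is not the repository's code). The helper `cross_entropy` and the variable names are hypothetical and only mirror the reasoning above: a masked placeholder entry has logits (0, 0) and the default target class 0.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single sample (natural log)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# A placeholder annotation after output masking: both bin logits are 0,
# and the default target class is 0 ("not in bin 1").
masked_pred = [0.0, 0.0]
placeholder_target = 0

l1_term = abs(0.0 - 0.0)                                  # regression-style loss
ce_term = cross_entropy(masked_pred, placeholder_target)  # bin classification loss

print(l1_term)  # zero: the placeholder vanishes from an L1 loss
print(ce_term)  # log(2), about 0.693: the placeholder does not vanish
```

The cross-entropy term is log(e^0 + e^0) - 0 = log(2), which is where the e^0 != 0 argument shows up in numbers.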
Also, in other loss functions that share this problem, for example the WeightedBCELoss, you masked the unreduced loss instead of the output, which has the same effect as removing the indices.
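The three strategies can be compared in a small plain-Python sketch. All function names and the toy data below are made up for illustration and are not taken from the repository. The normalisation by the number of valid entries in `bin_loss_masked_loss` is an assumption on my side.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single sample (natural log)."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]

def bin_loss_masked_output(outputs, targets, mask):
    """Zero the logits of placeholder entries, then average over ALL entries."""
    losses = [cross_entropy([x * m for x in o], t)
              for o, t, m in zip(outputs, targets, mask)]
    return sum(losses) / len(losses)

def bin_loss_selected(outputs, targets, mask):
    """Drop placeholder entries, then average over the remaining ones."""
    losses = [cross_entropy(o, t)
              for o, t, m in zip(outputs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0

def bin_loss_masked_loss(outputs, targets, mask):
    """Mask the unreduced loss and normalise by the number of valid entries
    (the normalisation is an assumption of this sketch)."""
    losses = [m * cross_entropy(o, t)
              for o, t, m in zip(outputs, targets, mask)]
    n_valid = sum(mask)
    return sum(losses) / n_valid if n_valid else 0.0

# Toy example: max_objs = 4, two real annotations, two placeholders.
outputs = [[2.0, -1.0], [0.5, 1.5], [0.3, -0.2], [1.0, 1.0]]
targets = [0, 1, 0, 0]   # placeholder targets default to 0
mask = [1, 1, 0, 0]

print(bin_loss_masked_output(outputs, targets, mask))  # includes 2 * log(2) from placeholders
print(bin_loss_selected(outputs, targets, mask))       # real annotations only
print(bin_loss_masked_loss(outputs, targets, mask))    # same value as the line above
```

In this sketch, selecting the valid indices and masking the unreduced loss give the same value, while masking the output adds a log(2) term per placeholder and divides by max_objs instead of the number of real annotations.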