In self-navigation problems for autonomous vehicles, the variability of environmental conditions, complex scenes with vehicles and pedestrians, and the high-dimensional or real-time nature of the tasks make semantic segmentation challenging. Sensor fusion can significantly improve performance. Thus, this work highlights a late-fusion concept for semantic segmentation tasks in such perception systems. It is based on two approaches for merging the information coming from two neural networks, one trained on camera data and one on LiDAR frames. The first approach fuses the probabilities while calculating partial conflicts and redistributing them. The second technique makes individual decisions per source and fuses them later, weighted using Shannon entropies. The two segmentation models are trained and evaluated on a particular KITTI semantic dataset. For the multi-class segmentation task, the two fusion techniques are compared and evaluated with illustrative examples. The intersection-over-union metric and the quality of decision are computed to assess the performance of each methodology.
This repository presents approaches for fusing two identical segmentation models. Both are convolutional neural networks inspired by a cross-fusion model. One instance of the architecture is trained on camera images, and the same architecture is used to learn features from dense-map LiDAR data.
The code will be uploaded once the work is recognized as representative and the writing advances in the publication procedure.
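In the meantime, a minimal, hypothetical sketch of the late-fusion setup can look as follows (PyTorch assumed; `TinySegNet`, its layers, and the input shapes are placeholders only, not the actual cross-fusion-inspired architecture of this work). It only illustrates the idea of two identical networks, one per modality, producing per-pixel class probabilities that are fused afterwards:

```python
# Hypothetical placeholder, not the repository's model: two identical networks,
# one for camera images and one for dense-map LiDAR frames, each producing
# per-pixel class probabilities that a late-fusion step combines afterwards.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, kernel_size=1),
        )

    def forward(self, x):
        # Softmax over the class dimension gives Bayesian masses per pixel.
        return torch.softmax(self.body(x), dim=1)

camera_net = TinySegNet(in_channels=3, num_classes=3)  # RGB camera images
lidar_net = TinySegNet(in_channels=3, num_classes=3)   # dense-map LiDAR frames (assumed 3-channel so both nets are identical)

cam_probs = camera_net(torch.rand(1, 3, 64, 64))    # shape (1, num_classes, H, W)
lidar_probs = lidar_net(torch.rand(1, 3, 64, 64))   # fused pixel-wise in a later step
```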
The second approach works by making decisions based on the Bayesian outputs of the architectures and then using entropies to check how consistent the information is. Suppose that, for the camera model, a pixel (i, j) has the following mass values for each class:
m1(R) = 0.80, m1(V) = 0.15, m1(B) = 0.05
In this situation, deciding that pixel (i, j) belongs to class R (from the camera model) is relevant, but not 100% certain because m1(R) < 1. Similarly, for a LiDAR frame, suppose the same pixel has the mass values:
m2(R) = 0.55, m2(V) = 0.25, m2(B) = 0.20
The decision is the same, pixel (i, j) = R, which again is relevant, but the decision is riskier because m2(R) is only just above 0.5. Instead of fusing the probabilities directly, another way is to fuse the decisions weighted by their quality, calculated from entropy.
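As a quick illustration (NumPy assumed; the class order R, V, B and the array values are taken from the example above), both Bayesian outputs lead to the same hard decision, even though the LiDAR one is less confident:

```python
import numpy as np

classes = ["R", "V", "B"]                 # frame of discernment for this example
m1 = np.array([0.80, 0.15, 0.05])         # camera masses for pixel (i, j)
m2 = np.array([0.55, 0.25, 0.20])         # LiDAR masses for the same pixel

print(classes[int(np.argmax(m1))])        # R, with a clear margin
print(classes[int(np.argmax(m2))])        # R, but m2(R) is only just above 0.5
```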
In the previous example, based on m1, the initial decision of the camera segmentation model is the (R) road class:
md1(R) = 1, md1(V) = 0, md1(B) = 0
This decision is then weighted accordingly. The weight of source 1 for this pixel is calculated from the quality measure as:
w1 = 1 - H(m1)/Hmax
where H(m1) is the Shannon entropy of m1, since m1 is Bayesian (a probability mass function), and Hmax is the maximum Shannon entropy, obtained for a uniform probability mass function. (In a more general, non-probabilistic context with non-Bayesian BBAs, the generalized entropy for belief functions defined in DezertEntropy could be used instead.)
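A small sketch of this quality weight is given below (NumPy assumed; the function name quality_weight is illustrative only). Note that the logarithm base cancels in H(m)/Hmax, so any base yields the same weight:

```python
import numpy as np

def quality_weight(m: np.ndarray, eps: float = 1e-12) -> float:
    """w = 1 - H(m)/Hmax for a Bayesian mass function m over the FoD."""
    m = np.asarray(m, dtype=float)
    h = -np.sum(m * np.log2(m + eps))      # Shannon entropy (bits)
    h_max = np.log2(len(m))                # entropy of the uniform distribution
    return float(1.0 - h / h_max)

w1 = quality_weight([0.80, 0.15, 0.05])    # camera source, roughly 0.44
w2 = quality_weight([0.55, 0.25, 0.20])    # LiDAR source, roughly 0.09
```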
Based on m2, the (R) class is decided as well. Therefore, the decision based on the LiDAR data is:
md2(R) = 1, md2(V) = 0, md2(B) = 0
with the weight of source 2 (LiDAR) provided by the quality:
w2 = 1 - H(m2)/Hmax
The decisions are fused by a simple weighted averaging rule as follows:
md(R) = (w1 / (w1 + w2)) * md1(R) + (w2 / (w1 + w2)) * md2(R)
md(V) = (w1 / (w1 + w2)) * md1(V) + (w2 / (w1 + w2)) * md2(V)
md(B) = (w1 / (w1 + w2)) * md1(B) + (w2 / (w1 + w2)) * md2(B)
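Putting the steps together, a minimal sketch of the weighted-decision fusion for this pixel could be (NumPy assumed; quality_weight and hard_decision are illustrative helpers, not code from this repository):

```python
import numpy as np

def quality_weight(m, eps=1e-12):
    # w = 1 - H(m)/Hmax with Shannon entropy H and Hmax = log(|Theta|).
    m = np.asarray(m, dtype=float)
    return float(1.0 - (-np.sum(m * np.log2(m + eps))) / np.log2(len(m)))

def hard_decision(m):
    # One-hot decision vector md obtained from the Bayesian masses (argmax).
    d = np.zeros(len(m))
    d[int(np.argmax(m))] = 1.0
    return d

m1 = np.array([0.80, 0.15, 0.05])   # camera
m2 = np.array([0.55, 0.25, 0.20])   # LiDAR

md1, md2 = hard_decision(m1), hard_decision(m2)
w1, w2 = quality_weight(m1), quality_weight(m2)   # ~0.44 and ~0.09

md = (w1 * md1 + w2 * md2) / (w1 + w2)            # weighted averaging rule
print(md)   # [1. 0. 0.] -> fused decision: class R
```

With these example masses, both hard decisions already agree on class R, so the fused decision remains R; the weighting becomes decisive when the two sources pick different classes.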
In this simple example, |Theta| = 3 since the frame of discernment (FoD) contains only three singletons. Here, w1 is greater than w2 because the entropy H(m1) is lower than H(m2). Consequently, the camera source shows greater confidence.