@@ -46,20 +47,13 @@ <h2>Multimodal Transformer for Material Segmentation</h2>
<div class="section">
<h2>Abstract</h2>
<hr>
- <p>Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different combinations of four different modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR). We also propose a new model named Multi-Modal Segmentation Transformer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material segmentation. MMSFormer achieves 52.05% mIoU outperforming the current state-of-the-art on Multimodal Material Segmentation (MCubeS) dataset. For instance, our method provides significant improvement in detecting gravel (+10.4%) and human (+9.1%) classes. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials.</p>
+ <p>Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting with a single input modality, performance improves progressively as we incorporate additional modalities, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies highlight the capacity of different input modalities to improve performance in the identification of different types of materials.</p>
<p class="text-left"><b>Figure 1:</b> Overall architecture of the proposed MMSFormer model. Each image passes through a modality-specific encoder where we extract hierarchical features. Then we fuse the extracted features using the proposed fusion block and pass the fused features to the decoder for predicting the segmentation maps.</p>
<p class="text-left"><b>Figure 2:</b> Proposed multimodal fusion block. We first concatenate all the features along the channel dimension and pass the concatenated features through an MLP layer to fuse them. Then a mixer layer captures and mixes multi-scale features using parallel convolutions and MLP layers. We use a Squeeze-and-Excitation block as channel attention in the residual connection.</p>
<p class="text-left"><b>Figure 1:</b> (a) Overall architecture of the MMSFormer model. Each image passes through a modality-specific encoder where we extract hierarchical features. Then we fuse the extracted features using the proposed fusion block and pass the fused features to the decoder for predicting the segmentation map. (b) Illustration of the mix transformer block. Each block applies a spatial reduction before applying multi-head attention to reduce computational cost. (c) Proposed multimodal fusion block. We first concatenate all the features along the channel dimension and pass the concatenated features through a linear fusion layer to fuse them. Then the feature tensor is fed to linear projection and parallel convolution layers to capture multi-scale features. We use a Squeeze-and-Excitation block [28] as channel attention in the residual connection to dynamically re-calibrate the features along the channel dimension.</p>
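<p class="text-left"><b>Code sketch:</b> A minimal PyTorch sketch of the fusion block described in the caption above. The layer choices are assumptions for illustration and are not taken from the released MMSFormer code: the linear fusion and projection layers are modeled as 1&times;1 convolutions, the parallel multi-scale branch uses depth-wise 3/5/7 convolutions, and the Squeeze-and-Excitation reduction ratio is 16. The actual implementation may arrange these layers differently.</p>
<pre>
import torch
import torch.nn as nn


class SqueezeExcitation(nn.Module):
    """Channel attention: global average pool (squeeze), two-layer MLP (excite)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # re-calibrate features along the channel dimension


class FusionBlock(nn.Module):
    """Channel-wise concatenation -> linear fusion -> parallel multi-scale
    convolutions, with an SE-weighted residual connection (illustrative sketch)."""
    def __init__(self, channels: int, num_modalities: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.linear_fusion = nn.Conv2d(num_modalities * channels, channels, kernel_size=1)
        self.proj_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )
        self.proj_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.se = SqueezeExcitation(channels)

    def forward(self, feats):
        # feats: list of (B, C, H, W) feature maps, one per input modality
        x = self.linear_fusion(torch.cat(feats, dim=1))   # fuse along the channel dimension
        y = self.proj_in(x)
        y = sum(branch(y) for branch in self.branches)    # multi-scale parallel convolutions
        y = self.proj_out(y)
        return y + self.se(x)                             # channel attention in the residual path


# Example: fuse one encoder stage for four modalities (RGB, AoLP, DoLP, NIR)
# feats = [torch.randn(2, 64, 128, 128) for _ in range(4)]
# fused = FusionBlock(channels=64, num_modalities=4)(feats)   # -> (2, 64, 128, 128)
</pre>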
</div>
</div>
@@ -76,22 +70,36 @@ <h2>Comparison with Current State of the Art Models</h2>
<p class="text-left"><b>Table 1:</b> Performance comparison on the MCubeS dataset. Here A, D and N stand for Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR) respectively.</p>
<p class="text-left"><b>Table 1:</b> Performance comparison on the FMB (left) and MCubeS (right) datasets. Here A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR) respectively.</p>
<p class="text-left"><b>Figure 2:</b> Visualization of predictions on the MCubeS and PST900 datasets. Figure 2(a) shows RGB and all-modality (RGB-A-D-N) predictions from CMNeXt and our model on the MCubeS dataset. For brevity, we only show the RGB image and ground truth material segmentation maps along with the predictions. Figure 2(b) shows predictions from RTFNet, FDCNet and our model for RGB-thermal input modalities on the PST900 dataset. Our model shows better predictions on both datasets.</p>
<p class="text-left"><b>Table 2:</b> Per-class % IoU comparison on the MCubeS dataset. Our proposed MMSFormer model shows better performance in detecting most of the classes compared to the current state-of-the-art models. ∗ indicates that the code and pretrained model from the authors were used to generate the results.</p>
<p class="text-left"><b>Figure 3:</b> Visualization of results for RGB and all-modality (RGB-A-D-N) predictions with CMNeXt and our proposed MMSFormer. For brevity, we only show the RGB image and ground truth material segmentation maps. Our model provides overall better results and correctly identifies asphalt, gravel, and road markings as indicated in the rectangular bounding boxes.</p>
<p class="text-left"><b>Table 3:</b> Per-class % IoU comparison on the FMB dataset for both RGB-only and RGB-infrared modalities. We show the comparison for the 8 classes (out of 14) for which results are published. T-Lamp and T-Sign stand for Traffic Lamp and Traffic Sign respectively. Our model outperforms all the methods for all the classes except for the truck class.</p>
<p class="text-left"><b>Table 2:</b> Per-class % IoU comparison on the Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
<p class="text-left"><b>Table 3:</b> Per-class % IoU comparison on the Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
<p class="text-left"><b>Table 5:</b> Per-class % IoU comparison on the Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
<p class="text-left"><b>Figure 4:</b> Visualization of results for prediction using different modality combinations of our proposed MMSFormer model. Detection of concrete, gravel and road markings becomes more accurate as we add more modalities, as shown in the rectangular bounding boxes.</p>
<p class="text-left"><b>Figure 3:</b> Visualization of predicted segmentation maps for different modality combinations on the MCubeS and FMB datasets. Both figures show that prediction accuracy increases as we incrementally add new modalities. They also illustrate the fusion block’s ability to effectively combine information from different modality combinations.</p>
</div>
</div>
</div>
@@ -133,7 +141,7 @@ <h2>Bibtex</h2>
<hr>
<div class="bibtexsection">
@misc{reza2023multimodal,
- title={Multimodal Transformer for Material Segmentation},
+ title={MMSFormer: Multimodal Transformer for Material and Semantic Segmentation},
author={Md Kaykobad Reza and Ashley Prater-Bennette and M. Salman Asif},
author={Md Kaykobad Reza and Ashley Prater-Bennette and M. Salman Asif},