Commit 3ccafe0

Update MMSformer
1 parent 4371067 commit 3ccafe0

22 files changed: 35 additions, 27 deletions

MMSFormer/.DS_Store (6 KB, binary file not shown)

MMSFormer/img/.DS_Store (6 KB, binary file not shown)

MMSFormer/img/FMB-performance-min.png (102 KB)

MMSFormer/img/MMSFormer-min.png (189 KB)

MMSFormer/img/PST-performance-min.png (148 KB)

MMSFormer/img/fmb-mcubes.jpeg (219 KB)

MMSFormer/img/fmb-sota-min.png (45.9 KB / 48.5 KB)

MMSFormer/img/mcubes-sota-min.png (41.7 KB / 343 KB)

MMSFormer/index.html

Lines changed: 33 additions & 25 deletions
@@ -4,10 +4,10 @@
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
- <meta name="description" content="Multimodal Transformer for Material Segmentation">
+ <meta name="description" content="MMSFormer: Multimodal Transformer for Material and Semantic Segmentation">
  <meta name="author" content="Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif">

- <title>Multimodal Transformer for Material Segmentation</title>
+ <title>MMSFormer: Multimodal Transformer for Material and Semantic Segmentation</title>
  <!-- Bootstrap core CSS -->
  <!--link href="bootstrap.min.css" rel="stylesheet"-->
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"
@@ -22,8 +22,9 @@

  <div class="jumbotron jumbotron-fluid">
  <div class="container"></div>
- <h2>Multimodal Transformer for Material Segmentation</h2>
- <p class="abstract"><b>Performing multimodal material segmentation with transformer</b></p>
+ <h2>MMSFormer: Multimodal Transformer for </h2>
+ <h2>Material and Semantic Segmentation</h2>
+ <p class="abstract"><b>Performing multimodal material and semantic segmentation with transformer</b></p>
  <hr>
  <p class="authors">
  <a href="https://kaykobad.github.io/" target="_blank">Md Kaykobad Reza<sup> 1</sup></a>,
@@ -46,20 +47,13 @@ <h2>Multimodal Transformer for Material Segmentation</h2>
  <div class="section">
  <h2>Abstract</h2>
  <hr>
- <p>Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different combinations of four different modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR). We also propose a new model named Multi-Modal Segmentation Transformer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material segmentation. MMSFormer achieves 52.05% mIoU outperforming the current state-of-the-art on Multimodal Material Segmentation (MCubeS) dataset. For instance, our method provides significant improvement in detecting gravel (+10.4%) and human (+9.1%) classes. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials.</p>
+ <p>Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. As we begin with only one input modality, performance improves progressively as additional modalities are incorporated, showcasing the effectiveness of the fusion block in combining useful information from diverse input modalities. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials.</p>
  <br>

  <div class="row">
  <div class="col text-center">
- <img src="./img/MMSFormer-Overall-2.png" style="width:75%" alt="Banner">
- <p class="text-left"><b>Figure 1:</b> Overall architecture of the proposed MMSFormer model. Each image passes through a modality-specific encoder where we extract hierarchical features. Then we fuse the extracted features using the proposed fusion block and pass the fused features to the decoder for predicting the segmentation maps.</p>
- </div>
- </div>
-
- <div class="row">
- <div class="col text-center">
- <img src="./img/MMSFormer-Fusion.png" style="width:70%" alt="Banner">
- <p class="text-left"><b>Figure 2:</b> Proposed multimodal fusion block. We first concatenate all the features along the channel dimension and pass it through MLP layer to fuse them. Then a mixer layer captures and mixes multi-scale features using parallel convolutions and MLP layers. We use Squeeze and Excitation block as channel attention in the residual connection.</p>
+ <img src="./img/MMSFormer-min.png" style="width:85%" alt="Banner">
+ <p class="text-left"><b>Figure 1:</b> (a) Overall architecture of MMSFormer model. Each image passes through a modality-specific encoder where we extract hierarchical features. Then we fuse the extracted features using the proposed fusion block and pass the fused features to the decoder for predicting the segmentation map. (b) Illustration of the mix transformer block. Each block applies a spatial reduction before applying multi-head attention to reduce computational cost. (c) Proposed multimodal fusion block. We first concatenate all the features along the channel dimension and pass it through a linear fusion layer to fuse them. Then the feature tensor is fed to linear projection and parallel convolution layers to capture multi-scale features. We use Squeeze and Excitation block [28] as channel attention in the residual connection to dynamically re-calibrate the features along the channel dimension.</p>
  </div>
  </div>

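The new Figure 1(c) caption above describes the fusion block as: concatenate the per-modality features along the channel dimension, fuse them with a linear layer, capture multi-scale context with parallel convolutions, and re-calibrate channels with a Squeeze-and-Excitation block on the residual connection. Below is a minimal PyTorch sketch of that description; the class names, kernel sizes, and reduction ratio are illustrative assumptions, not the authors' released implementation.

# Hedged sketch of the fusion block described in the Figure 1(c) caption.
# Assumed names (SqueezeExcite, FusionBlock) and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: global average pool -> bottleneck MLP -> sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-calibrate each channel

class FusionBlock(nn.Module):
    """Fuse per-modality feature maps of shape (B, C, H, W) into a single map."""
    def __init__(self, channels, num_modalities, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Linear fusion of the concatenated modalities (a 1x1 conv acts per pixel).
        self.fuse = nn.Conv2d(channels * num_modalities, channels, kernel_size=1)
        # Parallel depth-wise convolutions capture multi-scale context.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.se = SqueezeExcite(channels)

    def forward(self, feats):
        x = self.fuse(torch.cat(feats, dim=1))          # concat along channels, then fuse
        y = sum(branch(x) for branch in self.branches)  # multi-scale mixing
        return self.proj(y) + self.se(x)                # SE-gated residual connection

# Example: fuse one encoder stage's features from RGB, AoLP, DoLP and NIR.
if __name__ == "__main__":
    feats = [torch.randn(2, 256, 32, 32) for _ in range(4)]
    fused = FusionBlock(channels=256, num_modalities=4)(feats)
    print(fused.shape)  # torch.Size([2, 256, 32, 32])

In the full model, one such fused map per encoder stage would be passed to the decoder; since this follows the caption rather than the released code, layer counts and activation choices may differ from the actual MMSFormer implementation.
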
@@ -76,22 +70,36 @@ <h2>Comparison with Current State of the Art Models</h2>

  <div class="row">
  <div class="col text-center">
- <img src="./img/comparison-with-sota.png" style="width:50%" alt="Banner">
- <p class="text-left"><b>Table 1:</b> Performance comparison on MCubeS dataset. Here A, D and N stand for Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR) respectively.</p>
+ <img src="./img/fmb-mcubes.jpeg" style="width:90%" alt="Banner">
+ <p class="text-left"><b>Table 1:</b> Performance comparison on FMB (left) and MCubeS (right) datasets. Here A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR) respectively.</p>
+ </div>
+ </div>
+
+ <div class="row">
+ <div class="col text-center">
+ <img src="./img/visualization-with-sota-min.png" style="width:85%" alt="Banner">
+ <p class="text-left"><b>Figure 2:</b> Visualization of predictions on MCubeS and PST900 datasets. Figure 2(a) shows RGB and all modalities (RGB-A-D-N) prediction from CMNeXt and our model on MCubeS dataset. For brevity, we only show the RGB image and ground truth material segmentation maps along with the predictions. Figure 2(b) shows predictions from RTFNet, FDCNet and our model for RGB-thermal input modalities on PST900 dataset. Our model shows better predictions on both of the datasets.</p>
+ </div>
+ </div>
+
+ <div class="row">
+ <div class="col text-center">
+ <img src="./img/mcubes-per-class-sota-min.png" style="width:100%" alt="Banner">
+ <p class="text-left"><b>Table 2:</b> Per-class % IoU comparison on MCubeS dataset. Our proposed MMSFormer model shows better performance in detecting most of the classes compared to the current state-of-the-art models. ∗ indicates that the code and pretrained model from the authors were used to generate the results.</p>
  </div>
  </div>

  <div class="row">
  <div class="col text-center">
- <img src="./img/ours-vs-cmnext.png" style="width:85%" alt="Banner">
- <p class="text-left"><b>Figure 3:</b> Visualization of results for RGB and all modalities (RGB-A-D-N) prediction with CMNeXt and our proposed MMSFormer. For brevity, we only show the RGB image and Ground Truth material segmentation maps. Our model provides overall better results and correctly identifies asphalt, gravel, and road markings as indicated in the rectangular bounding boxes.</p>
+ <img src="./img/FMB-performance-min.png" style="width:100%" alt="Banner">
+ <p class="text-left"><b>Table 3:</b> Per-class % IoU comparison on FMB dataset for both RGB only and RGB-infrared modalities. We show the comparison for 8 classes (out of 14) that are published. T-Lamp and T-Sign stand for Traffic Lamp and Traffic Sign respectively. Our model outperforms all the methods for all the classes except for the truck class.</p>
  </div>
  </div>

  <div class="row">
  <div class="col text-center">
- <img src="./img/perclass-iou-comparison-with-sota.png" style="width:100%" alt="Banner">
- <p class="text-left"><b>Table 2:</b> Per class % IoU comparison on Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
+ <img src="./img/PST-performance-min.png" style="width:100%" alt="Banner">
+ <p class="text-left"><b>Table 4:</b> Performance comparison on PST900 dataset. We show per-class % IoU as well as % mIoU for all the classes.</p>
  </div>
  </div>
  </div>
@@ -101,15 +109,15 @@ <h2>Effect of Adding Different Modalities</h2>
  <hr>
  <div class="row">
  <div class="col text-center">
- <img src="./img/perclass-iou-when-incrementally-adding-modality.png" style="width:100%" alt="Banner">
- <p class="text-left"><b>Table 3:</b> Per class % IoU comparison on Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
+ <img src="./img/mcubes-per-class-modality-combination-min.png" style="width:100%" alt="Banner">
+ <p class="text-left"><b>Table 5:</b> Per-class % IoU comparison on Multimodal Material Segmentation (MCubeS) dataset for different modality combinations. As we add modalities incrementally, overall performance increases gradually. This table also shows that specific modality combinations assist in identifying specific types of materials better.</p>
  </div>
  </div>

  <div class="row">
  <div class="col text-center">
- <img src="./img/effect-of-modality-addition.png" style="width:85%" alt="Banner">
- <p class="text-left"><b>Figure 4:</b> Visualization of results for prediction using different modality combinations of our proposed MMSFormer model. Detection of concrete, gravel and road markings become more accurate as we add more modalities as shown in the rectangular bounding boxes.</p>
+ <img src="./img/visualization-modality-combination-min.png" style="width:85%" alt="Banner">
+ <p class="text-left"><b>Figure 3:</b> Visualization of predicted segmentation maps for different modality combinations on MCubeS and FMB datasets. Both figures show that prediction accuracy increases as we incrementally add new modalities. They also illustrate the fusion block’s ability to effectively combine information from different modality combinations.</p>
  </div>
  </div>
  </div>
@@ -133,7 +141,7 @@ <h2>Bibtex</h2>
  <hr>
  <div class="bibtexsection">
  @misc{reza2023multimodal,
- title={Multimodal Transformer for Material Segmentation},
+ title={MMSFormer: Multimodal Transformer for Material and Semantic Segmentation},
  author={Md Kaykobad Reza and Ashley Prater-Bennette and M. Salman Asif},
  year={2023},
  eprint={2309.04001},

index.html

Lines changed: 2 additions & 2 deletions
@@ -41,9 +41,9 @@ <h2>Computational Sensing and Information Processing Lab</h2>
  <div class="row">
  <div class="col-sm-6">
  <div class="card">
- <img class="card-img-top p-4" src="./MMSFormer/img/MMSFormer-Fusion.png" alt="MMSFormer for Multimodal Material Segmentation">
+ <img class="card-img-top p-4" src="./MMSFormer/img/MMSFormer-min.png" alt="MMSFormer: Multimodal Transformer for Material and Semantic Segmentation">
  <div class="card-body text-center">
- <h5 class="card-title">Multimodal Transformer for Material Segmentation</h5>
+ <h5 class="card-title">MMSFormer: Multimodal Transformer for Material and Semantic Segmentation</h5>
  <p class="card-text">Md Kaykobad Reza, Ashley Prater-Bennette, and M. Salman Asif</p>
  <a href="MMSFormer/" class="btn btn-primary text-center">Details</a>
  </div>
