Merge branch 'main' of https://github.com/SMIL-SPCRAS/AVCR-Net into main

SMIL-SPCRAS · Feb 5, 2024 · fde6417
2 parents b4f5b2c + 213c646
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# AVCR-Net
+# AVCR-Former
 
-The official repository for AVCR-Net
+The official repository for AVCR-Former
 
 ## Abstract
 > The article presents a methodology and evaluation for audio-visual speech recognition in driver assistive systems. Driver assistive systems require continuous interaction with the driver, and for safety reasons this interaction should be voice-controlled while driving. The article introduces the audio-visual command recognition transformer (AVCR-Former), specifically designed for robust audio-visual speech recognition (AVSR). We propose (1) a multimodal fusion strategy based on spatio-temporal fusion of audio and video feature matrices, (2) a regulated transformer based on an iterative model refinement module with multiple encoders, and (3) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves the contextual information of both modalities and achieves their synchronization. The iterative model refinement module can bridge the gap between acoustic and visual data by compensating for the weaknesses of unimodal information. The proposed multi-prediction strategy demonstrates superior performance compared to the traditional single-prediction strategy, showcasing the model's adaptability across diverse audio-visual contexts. Our proposed transformer achieved the highest accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human-computer interaction. The capabilities of AVCR-Former extend beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.