abstract upd
MiSTeR1995 committed Feb 5, 2024
1 parent 7a8c97b commit ea3a75c
Showing 2 changed files with 5 additions and 5 deletions.
README.md (2 changes: 1 addition & 1 deletion)
@@ -3,7 +3,7 @@
The official repository for AVCR-Net

## Abstract
> This research introduces the audio-visual command recognition network (AVCR-Net), an advanced model specifically designed for robust audio-visual speech recognition (AVSR). By fine-tuning its architecture, iterative refinement, and the incorporation of a gated mechanism, our model achieves the highest accuracy of 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. The proposed multi-prediction strategy demonstrates superior performance compared to traditional single-prediction approaches, showcasing the model's adaptability across diverse audio-visual contexts. The AVCR-Net architecture is founded on the well-established encoder-decoder paradigm with a transformer architecture. The model's uniqueness lies in its ability to bridge the gap between acoustic and visual data, enhancing the recognition process through an iterative refinement step. The AVCR-Net architecture encompasses four primary modules: feature extraction, multimodal fusion, model initialization, and iterative model refinement. Pre-trained extractors transform audio and visual inputs into spatial-temporal features (STF) matrices. A multimodal fusion strategy merges these STFs, creating a comprehensive representation capturing both modalities' information. This representation are use as input data to AVCR-Net's encoder-decoder architecture. This generates an initial data representation and probability prediction, laying the foundation for the model. Iterative model refinement introduces, operating on the initial data representation. This refinement involves multiple steps, producing the sequences of data representations and probability predictions. The final prediction vector results from averaging all probability predictions, ensuring enhanced stability and robustness. This research also presents a comprehensive review of recent audio-visual speech corpora and state-of-the-art approaches. In addition, its relevance to AVSR, the research has wider implications for advancing human-computer interaction. The capabilities of AVCR-Net extend its impact beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.
> The article presents a methodology and evaluation for audio-visual speech recognition in driver assistive systems. Driver assistive systems require continuous interaction with the driver, and for safety reasons this interaction should be based on voice control while driving. The article introduces the audio-visual command recognition transformer (AVCR-Former), specifically designed for robust audio-visual speech recognition (AVSR). We propose (1) a multimodal fusion strategy based on the spatio-temporal fusion of audio and video feature matrices, (2) a regulated transformer based on an iterative model refinement module with multiple encoders, and (3) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves the contextual information of both modalities and achieves their synchronization. The iterative model refinement module bridges the gap between acoustic and visual data by compensating for the weaknesses of unimodal information. The proposed multi-prediction strategy demonstrates superior performance compared to the traditional single-prediction strategy, showcasing the model's adaptability across diverse audio-visual contexts. Our proposed transformer achieves the highest accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human-computer interaction. The capabilities of AVCR-Former extend beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.
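
The sketch below is a minimal, hypothetical PyTorch illustration of the pipeline described in the abstract: spatio-temporal fusion of time-aligned audio and video feature matrices, iterative refinement with multiple transformer encoders, and a classifier ensemble of multiple decoders whose probability predictions are averaged into the final prediction. All module names, shapes, and hyperparameters (e.g. `d_model`, `num_refinement_steps`, `num_classes`) are assumptions chosen for illustration; this is not the repository's AVCR-Former implementation.

```python
# Illustrative sketch only -- not the AVCR-Former implementation from this
# repository. Module names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class AVCRFormerSketch(nn.Module):
    """Toy pipeline: spatio-temporal fusion -> iterative refinement with
    multiple encoders -> ensemble of decoders with averaged probabilities."""

    def __init__(self, d_model=256, num_classes=50, num_refinement_steps=3):
        super().__init__()
        # (1) Spatio-temporal fusion: concatenate per-frame audio and video
        # features along the feature axis and project to a shared space.
        self.fusion = nn.Linear(2 * d_model, d_model)
        # (2) Iterative model refinement: one transformer encoder per step,
        # each refining the representation produced by the previous step.
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=1,
            )
            for _ in range(num_refinement_steps)
        )
        # (3) Classifier ensemble: one decoder (classification head) per
        # refinement step; their probability predictions are averaged.
        self.decoders = nn.ModuleList(
            nn.Linear(d_model, num_classes) for _ in range(num_refinement_steps)
        )

    def forward(self, audio_stf, video_stf):
        # audio_stf, video_stf: (batch, time, d_model), assumed time-aligned.
        x = self.fusion(torch.cat([audio_stf, video_stf], dim=-1))
        probs = []
        for encoder, decoder in zip(self.encoders, self.decoders):
            x = encoder(x)                    # refine the shared representation
            logits = decoder(x.mean(dim=1))   # temporal pooling -> class logits
            probs.append(logits.softmax(dim=-1))
        # Final prediction: average of the per-step probability predictions.
        return torch.stack(probs).mean(dim=0)


if __name__ == "__main__":
    model = AVCRFormerSketch()
    audio = torch.randn(2, 50, 256)   # dummy audio feature matrices
    video = torch.randn(2, 50, 256)   # dummy video feature matrices
    print(model(audio, video).shape)  # torch.Size([2, 50])
```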

## Acknowledgments

index.html (8 changes: 4 additions & 4 deletions)
@@ -3,10 +3,10 @@
<head>
<meta charset="utf-8">
<meta name="description"
content="This research introduces the Audio-Visual Command Recognition Network (AVCR-Net), an advanced model specifically designed for robust audio-visual speech recognition">
content="The article presents a methodology and evaluation for audio-visual speech recognition in driver assistive systems. Driver assistive systems require permanent interaction with driver and during the driving such interaction should be implemented based on voice control due to safety issues. The article introduces the audio-visual command recognition transformer (AVCR-Former) specifically designed for robust audio-visual speech recognition (AVSR).">
<meta name="keywords" content="AVCR-Net, AVSR">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>AVCR-Net</title>
<title>AVCR-Former</title>

<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
@@ -55,7 +55,7 @@
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">
AVCR-Net
Audio-Visual Command Recognition Based on Regulated Transformer <br> and Spatio-Temporal Fusion Strategy <br> for Driver Assistive Systems
</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
@@ -171,7 +171,7 @@ <h2 class="title is-5">TODO List</h2>
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
This research introduces the audio-visual command recognition network (AVCR-Net), an advanced model specifically designed for robust audio-visual speech recognition (AVSR). By fine-tuning its architecture, iterative refinement, and the incorporation of a gated mechanism, our model achieves the highest accuracy of 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. The proposed multi-prediction strategy demonstrates superior performance compared to traditional single-prediction approaches, showcasing the model's adaptability across diverse audio-visual contexts. The AVCR-Net architecture is founded on the well-established encoder-decoder paradigm with a transformer architecture. The model's uniqueness lies in its ability to bridge the gap between acoustic and visual data, enhancing the recognition process through an iterative refinement step. The AVCR-Net architecture encompasses four primary modules: feature extraction, multimodal fusion, model initialization, and iterative model refinement. Pre-trained extractors transform audio and visual inputs into spatial-temporal features (STF) matrices. A multimodal fusion strategy merges these STFs, creating a comprehensive representation capturing both modalities' information. This representation are use as input data to AVCR-Net's encoder-decoder architecture. This generates an initial data representation and probability prediction, laying the foundation for the model. Iterative model refinement introduces, operating on the initial data representation. This refinement involves multiple steps, producing the sequences of data representations and probability predictions. The final prediction vector results from averaging all probability predictions, ensuring enhanced stability and robustness. This research also presents a comprehensive review of recent audio-visual speech corpora and state-of-the-art approaches. In addition, its relevance to AVSR, the research has wider implications for advancing human-computer interaction. The capabilities of AVCR-Net extend its impact beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.
The article presents a methodology and evaluation for audio-visual speech recognition in driver assistive systems. Driver assistive systems require continuous interaction with the driver, and for safety reasons this interaction should be based on voice control while driving. The article introduces the audio-visual command recognition transformer (AVCR-Former), specifically designed for robust audio-visual speech recognition (AVSR). We propose (1) a multimodal fusion strategy based on the spatio-temporal fusion of audio and video feature matrices, (2) a regulated transformer based on an iterative model refinement module with multiple encoders, and (3) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves the contextual information of both modalities and achieves their synchronization. The iterative model refinement module bridges the gap between acoustic and visual data by compensating for the weaknesses of unimodal information. The proposed multi-prediction strategy demonstrates superior performance compared to the traditional single-prediction strategy, showcasing the model's adaptability across diverse audio-visual contexts. Our proposed transformer achieves the highest accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human-computer interaction. The capabilities of AVCR-Former extend beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.
</p>
</div>
</div>
