Multi-level Attention Fusion Network (MAFnet) is a multimodal network that dynamically fuses visual and audio information for audio-visual event recognition.
We release the testing code along with trained models.
- Mathilde Brousmiche ([email protected])
- Stéphane Dupont ([email protected])
- Jean Rouat ([email protected])
The proposed MAFnet architecture is shown below. Each video is split into T non-overlapping clips. Audio and visual information are then extracted with two pretrained CNNs: DenseNet [3] for visual features and VGGish [2] for audio features. The clip features are fed into the modality & temporal attention module to build a global feature containing multimodal and temporal information. This global feature is then used to predict the label of the video. A lateral connection between the visual and audio pathways is added through a FiLM layer.
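For illustration, the following is a minimal Keras sketch of the attention-based fusion idea described above; it is not the released MAFnet code. The clip count, feature dimensions, embedding size, and layer choices are assumptions, and the FiLM lateral connection is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T = 10                   # non-overlapping clips per video (assumed)
D_A, D_V = 128, 1024     # VGGish / DenseNet feature sizes (illustrative)
D, N_CLASSES = 256, 28   # shared embedding size (assumed); AVE has 28 event categories

audio_in = layers.Input(shape=(T, D_A), name="audio_features")
visual_in = layers.Input(shape=(T, D_V), name="visual_features")

# Project each modality into a shared embedding space (applied per clip).
a = layers.Dense(D, activation="relu")(audio_in)   # (batch, T, D)
v = layers.Dense(D, activation="relu")(visual_in)  # (batch, T, D)

# Stack the T audio and T visual embeddings into 2T modality/time "slots".
x = layers.Concatenate(axis=1)([a, v])             # (batch, 2T, D)

# One attention score per slot, normalised over all slots.
scores = layers.Dense(1)(x)                        # (batch, 2T, 1)
weights = layers.Softmax(axis=1)(scores)           # (batch, 2T, 1)

# Global feature = attention-weighted sum over modalities and time.
global_feat = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

output = layers.Dense(N_CLASSES, activation="softmax")(global_feat)
model = Model(inputs=[audio_in, visual_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```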
The trained model can be downloaded here.
We train and test our model on the AVE Dataset [1].
Audio and visual features can be downloaded here. Audio features are extracted with a VGGish network [2] and visual features are extracted with DenseNet [3].
Scripts for generating audio and visual features are in the feature_extractor folder (feel free to modify and use them to process your own audio-visual data); a rough sketch of the visual side is shown below.
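The sketch below averages pretrained DenseNet features over the frames of one clip. The scripts in feature_extractor are the reference; the DenseNet variant, input size, and frame sampling here are assumptions, and the audio side follows the VGGish pipeline of [2].

```python
import numpy as np
from tensorflow.keras.applications.densenet import DenseNet121, preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained DenseNet without its classification head; global average pooling
# yields one feature vector per frame (DenseNet121 is used here only as an
# illustrative variant).
cnn = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def clip_visual_feature(frame_paths):
    """Average DenseNet features over the frames sampled from one clip."""
    feats = []
    for path in frame_paths:
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        feats.append(cnn.predict(x)[0])
    return np.mean(feats, axis=0)  # shape: (1024,) for DenseNet121
```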
- Python-3.6
- Tensorflow-gpu-1.15
- Keras
- Scikit-learn
- pillow
- resampy
- ffmpeg
- pickle
To train the network:
python train.py --train
To test the network:
python train.py
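For reference, the --train switch typically toggles between fitting and evaluation along these lines. This is only a hypothetical sketch of the entry point, not the actual contents of train.py; build_mafnet, train_model, evaluate_model, and the checkpoint name are placeholders.

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="MAFnet training / evaluation")
    parser.add_argument("--train", action="store_true",
                        help="train the network; omit the flag to only test")
    args = parser.parse_args()

    model = build_mafnet()                        # placeholder: build the fusion model
    if args.train:
        train_model(model)                        # placeholder: fit on the AVE training split
    else:
        model.load_weights("mafnet_weights.h5")   # assumed checkpoint file name
        evaluate_model(model)                     # placeholder: report test accuracy

if __name__ == "__main__":
    main()
```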
[1] Y. Tian, J. Shi, B. Li, et al. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247-263. Paper / Download link
[2] S. Hershey, S. Chaudhuri, D. P. W. Ellis, et al. CNN architectures for large-scale audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131-135. Paper
[3] G. Huang, Z. Liu, L. van der Maaten, et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700-4708. Paper
Thanks to CHISTERA IGLU and the European Regional Development Fund (ERDF) for funding.
Audio features are extracted using VGGish and visual features are extracted using DenseNet. We thank the authors for sharing their code.