
Multimodal Deep Learning for movie genre classification (MulT-GMU)

The task is to predict movie genres from movie trailers (video frames and audio spectrograms), the movie plot (text), the poster (image), and metadata, using the Moviescope dataset. A new multimodal transformer architecture, MulT-GMU, is proposed; it extends the MulT model with dynamic modality fusion.
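Since a movie can belong to several genres at once (note the --task_type multilabel flag in the usage example below), the model predicts one independent probability per genre and is trained with binary cross-entropy. A minimal PyTorch sketch of such a multilabel head; the dimensions and variable names here are illustrative, not taken from this repo:

    import torch
    import torch.nn as nn

    # Hypothetical fused multimodal representation: (batch, hidden)
    fused = torch.randn(4, 768)
    num_genres = 13  # size of the genre label set; adjust to the dataset

    head = nn.Linear(768, num_genres)   # one logit per genre
    logits = head(fused)

    # Multilabel targets: each movie may have several genres set to 1
    targets = torch.randint(0, 2, (4, num_genres)).float()
    loss = nn.BCEWithLogitsLoss()(logits, targets)

    # At inference, threshold each genre probability independently
    preds = (torch.sigmoid(logits) > 0.5).int()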

Publications

This repo contains the code for the paper published at the NAACL 2021 MAI Workshop: Multimodal Weighted Fusion of Transformers for Movie Genre Classification (MulT-GMU)

Usage

Example command to run the training script:

>> python mmbt/train.py \
     --batch_sz 4 --gradient_accumulation_steps 32 \
     --savedir /home/user/mmbt_experiments/model_save_mmtr \
     --name moviescope_VideoTextPosterGMU_mmtr_model_run \
     --data_path /home/user --task moviescope --task_type multilabel \
     --model mmtrvpp --num_image_embeds 3 --patience 5 --dropout 0.1 \
     --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed 1 --num_heads 6 \
     --orig_d_v 4096 --output_gates
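With --batch_sz 4 and --gradient_accumulation_steps 32, gradients are accumulated over 32 mini-batches before each optimizer step, for an effective batch size of 4 × 32 = 128.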

MulT-GMU architecture diagram

[mult-gmu-diagram image]

Experiments mainly based on:

  • MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences.
  • MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text.
  • Moviescope dataset: Moviescope: Large-scale Analysis of Movies using Multiple Modalities.
  • GMU: Gated Multimodal Units for Information Fusion by Arevalo et al. (see the sketch after this list).
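
For reference, a GMU fuses modality-specific representations through a learned, feature-wise gate. Below is a minimal two-modality PyTorch sketch of the idea from Arevalo et al.; it is an illustration, not this repo's implementation, and the class and variable names are invented for the example:

    import torch
    import torch.nn as nn

    class GMU(nn.Module):
        """Gated Multimodal Unit for two modalities (illustrative sketch)."""

        def __init__(self, dim_a, dim_b, dim_out):
            super().__init__()
            self.proj_a = nn.Linear(dim_a, dim_out)        # h_a = tanh(W_a x_a)
            self.proj_b = nn.Linear(dim_b, dim_out)        # h_b = tanh(W_b x_b)
            self.gate = nn.Linear(dim_a + dim_b, dim_out)  # z = sigmoid(W_z [x_a; x_b])

        def forward(self, x_a, x_b):
            h_a = torch.tanh(self.proj_a(x_a))
            h_b = torch.tanh(self.proj_b(x_b))
            z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
            # Feature-wise convex combination controlled by the gate
            return z * h_a + (1.0 - z) * h_b

    # Example: fuse a 4096-d video feature (cf. --orig_d_v 4096) with a 768-d text feature
    video, text = torch.randn(4, 4096), torch.randn(4, 768)
    fused = GMU(4096, 768, 512)(video, text)   # shape: (4, 512)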

Versions

  • python 3.7.6
  • torch 1.5.1
  • tokenizers 0.9.4
  • transformers 4.2.2
  • Pillow 7.0.0
