This list collects recent works on video-language understanding.
To access the full version, click here.
- VIOLETv2 (EmpiricalMVM) [Paper][Code] @Microsoft
  An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (CVPR 2023)
- LAVENDER [Paper][Code] @Microsoft
  LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (CVPR 2023)
- Flamingo [Paper] @DeepMind
  Flamingo: a Visual Language Model for Few-Shot Learning (NeurIPS 2022)
- ALPRO [Paper][Code] @Salesforce
  Align and Prompt: Video-and-Language Pre-training with Entity Prompts (CVPR 2022)
- VL-Adapter [Paper][Code] @UNC
  VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (CVPR 2022)
- VIOLET [Paper][Code] @Microsoft
  VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (arXiv 2021)
- HERO [Paper][Code] @Microsoft
  HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (EMNLP 2020)
- UniVL [Paper][Code] @Microsoft
  UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation (arXiv 2020)
- FiT [Paper][Code][Website][Demo] @Oxford
  Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (ICCV 2021)
- FrozenBiLM [Paper][Code][Website][Poster][Slides] @Inria
  Zero-Shot Video Question Answering via Frozen Bidirectional Language Models (NeurIPS 2022)
- MERLOT Reserve [Paper][Code][Website][Demo] @AI2
  MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound (CVPR 2022)
- MERLOT [Paper][Code][Website] @AI2
  MERLOT: Multimodal Neural Script Knowledge Models (NeurIPS 2021)
- JustAsk [Paper/Journal][Code][Website][Demo][Poster][Slides][Oral] @Inria
  Just Ask: Learning to Answer Questions from Millions of Narrated Videos (ICCV 2021)
  Learning to Answer Visual Questions from Web Videos (TPAMI 2022)
- Video ChatCaptioner [Paper][Code] @KAUST
  Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions (arXiv 2023)
- Vid2Seq [Paper][Code][Website][Blog] @Google
  Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (CVPR 2023)
- MV-GPT [Paper] @Google
  End-to-end Generative Pretraining for Multimodal Video Captioning (CVPR 2022)
- SwinBERT [Paper][Code] @Microsoft
  SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (CVPR 2022)
- WebVid-10M [Paper][Code][Website] @Oxford
  Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (ICCV 2021)
- HowTo100M [Paper][Code][Website] @Inria
  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (ICCV 2019)
- STAR [Paper][Code][Website][PapersWithCode] @MIT-IBM
  STAR: A Benchmark for Situated Reasoning in Real-World Videos (NeurIPS 2021)
- TVQA [Paper][Code][Website][PapersWithCode] @UNC
  TVQA: Localized, Compositional Video Question Answering (EMNLP 2018)
- YouCook2 [Paper][Website][PapersWithCode] @UMich
  Towards Automatic Learning of Procedures from Web Instructional Videos (AAAI 2018)
- ActivityNet Captions [Paper][Code][Website][PapersWithCode] @Stanford
  Dense-Captioning Events in Videos (ICCV 2017)
- Charades-STA [Paper][Code][PapersWithCode] @USC
  TALL: Temporal Activity Localization via Language Query (ICCV 2017)
- DiDeMo [Paper][Code][PapersWithCode] @Adobe
  Localizing Moments in Video with Natural Language (ICCV 2017)
- MSVD [Paper][PapersWithCode] @Microsoft
  Collecting Highly Parallel Data for Paraphrase Evaluation (ACL 2011)
- LSMDC [Paper][Website][PapersWithCode] @MPII
  Movie Description (IJCV 2017)
- MSR-VTT [Paper][PapersWithCode] @Microsoft
  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
- MPII-MD [Paper][Website][PapersWithCode] @MPII
  A Dataset for Movie Description (CVPR 2015)