"UnIVAL: Unified Model for Image, Video, Audio and Language Tasks", Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord pdf
"Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel O Connor pdf
"Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models", Junting Pan, Ziyi Lin pdf
"Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection", Wei-Jhe Huang, Jheng Hsien Yeh, Min-Hung Chen, Gueter Josmy Faure, Shang-Hong Lai pdf
"MMIG: Multi-Modal Image Generator", Wenjin Liu, Lijuan Zhou, Ning Luo, Bing Wei, Min Xu, Shudong Zhang pdf
"ClipCrop: Conditioned Cropping Driven by Vision-Language Model", Zhihang Zhong, Mingxi Cheng, Zhirong Wu, Yuhui Yuan, Yinqiang Zheng, Ji Li, Han Hu, Stephen Lin, Yoichi Sato, Imari Sato pdf
"Video Generation with Consistency Tuning", Chaoyi Wang, Yaozhe Song, Yafeng Zhang, Jun Pei, Lijie Xia, Jianpo Liu pdf
"HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models", eslam mohamed abdelrahman, Pengzhan Sun, xiaoqian shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny pdf
"VQA Therapy: Exploring Answer Differences by Visually Grounding Answers", Chongyan Chen, Samreen Anjum, Danna Gurari pdf
"Waffling around for Performance: Visual Classification with Random Words and Broad Concepts", Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata pdf
"Going Beyond Nouns With Vision & Language Models Using Synthetic Data", Paola Cascante-Bonilla, Khaled Shehada, James S Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gul Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky pdf
"Semi-supervised Mixture Model for Visual Language Multitask", Wenjin Liu, dan zhou, Lijuan Zhou, Ning Luo, Shudong Zhang pdf
"Painter: Teaching Auto-regressive Language Models to Draw Sketches", Reza Pourreza, Apratim Bhattacharyya, Sunny P Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic pdf
"Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification", Bo Wang, Kaili Zhao, Hongyang Zhao, Shi Pu, Bo Xiao, Jun Guo pdf
"Video-and-Language (VidL) models and their cognitive relevance", Anne W Zonneveld, Albert Gatt, Iacer Calixto pdf
"Towards an Exhaustive Evaluation of Vision-Language Foundation Models", Emmanuelle J Salin, Stephane Ayache, Benoît Favre pdf
"TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation", Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, kilho son, Tae-Hyun Oh pdf
"Coarse to Fine Frame Selection for Online Open-ended Video Question Answering", Anirudh Tunga, Sai Vidyaranya Nuthalapati pdf
"StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model", Zipeng Xu, Enver Sangineto, Nicu Sebe pdf
"Look, Remember and Reason: Visual Reasoning with Grounded Rationales", Apratim Bhattacharyya, Sunny P Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic pdf
"Divide & Bind Your Attention for Improved Generative Semantic Nursing", Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva pdf
"Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models", Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, Volker Tresp pdf
"MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof pdf
"Faithful Text-to-Image Generation via Selection", Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata pdf
"VICT: Visual In-Context Tuning", Young Kyun Jang, Dat B Huynh, Zihang Meng, Ser-Nam Lim pdf
"Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective", Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh pdf
"DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners", Clarence Lee, M Ganesh Kumar, Cheston Tan pdf
"Learning Human-Human Interactions in Images from Weak Textual Supervision", Morris Alper pdf
"Sound Source Localization is All about Cross-Modal Alignment", Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung pdf
"Multimodal Laughter Reasoning with Language Models", Lee Hyun, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh pdf
"3D Captioning for Multiple Objects with Relation", Yawen Liu, XINHAN DI, Zewen Jin, Xinrong Chen pdf
"SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", Vishaal Udandarao, Ankush Gupta, Samuel Albanie pdf
"MOVSeg: Open Vocabulary Segmentation from Multi-Modal Inputs", Gonca Yilmaz, Songyou Peng, Hermann Blum pdf
"Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image", Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Danny Cohen-Or, Ariel Shamir, Amit Bermano pdf
"Language as the Medium: Multimodal Video Classification through text only", Laura Hanu, Anita Verő, James Thewlis pdf
"Teaching Structured Vision & Language Concepts to Vision & Language Models", Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, Leonid Karlinsky pdf
"LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections", Muhammad Jehanzeb Mirza pdf
"Text-only training for image captioning using noise-injected clip", David Nukrai pdf