arxiv_visual_reasoning.jsonl

{"entry_id": "2410.22995", "title": "VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning", "authors": ["Jingkun Ma", "Runzhe Zhan", "Derek F. Wong", "Yang Li", "Di Sun", "Hou Pong Chan", "Lidia S. Chao"], "published": "2024-10-30 13:19:44", "updated": "2024-10-30 13:19:44", "summary": "Although previous research on large language models (LLMs) and large\nmulti-modal models (LMMs) has systematically explored mathematical\nproblem-solving (MPS) within visual contexts, the analysis of how these models\nprocess visual information during problem-solving remains insufficient. To\naddress this gap, we present VisAidMath, a benchmark for evaluating the MPS\nprocess related to visual information. We follow a rigorous data curation\npipeline involving both automated processes and manual annotations to ensure\ndata quality and reliability. Consequently, this benchmark includes 1,200\nchallenging problems from various mathematical branches, vision-aid\nformulations, and difficulty levels, collected from diverse sources such as\ntextbooks, examination papers, and Olympiad problems. Based on the proposed\nbenchmark, we conduct comprehensive evaluations on ten mainstream LLMs and\nLMMs, highlighting deficiencies in the visual-aided reasoning process. For\nexample, GPT-4V only achieves 45.33% accuracy in the visual-aided reasoning\ntask, even with a drop of 2 points when provided with golden visual aids.\nIn-depth analysis reveals that the main cause of deficiencies lies in\nhallucination regarding the implicit visual reasoning process, shedding light\non future research directions in the visual-aided MPS process.", "comment": "58 pages, 28 figures", "links": []}
{"entry_id": "2405.17503", "title": "Code Repair with LLMs gives an Exploration-Exploitation Tradeoff", "authors": ["Hao Tang", "Keya Hu", "Jin Peng Zhou", "Sicheng Zhong", "Wei-Long Zheng", "Xujie Si", "Kevin Ellis"], "published": "2024-05-26 04:00:30", "updated": "2024-10-29 20:01:16", "summary": "Iteratively improving and repairing source code with large language models\n(LLMs), known as refinement, has emerged as a popular way of generating\nprograms that would be too complex to construct in one shot. Given a bank of\ntest cases, together with a candidate program, an LLM can improve that program\nby being prompted with failed test cases. But it remains an open question how\nto best iteratively refine code, with prior work employing simple greedy or\nbreadth-first strategies. We show here that refinement exposes an\nexplore-exploit tradeoff: exploit by refining the program that passes the most\ntest cases, or explore by refining a lesser considered program. We frame this\nas an arm-acquiring bandit problem, which we solve with Thompson Sampling. The\nresulting LLM-based program synthesis algorithm is broadly applicable: Across\nloop invariant synthesis, visual reasoning puzzles, and competition programming\nproblems, we find that our new method can solve more problems using fewer\nlanguage model calls.", "comment": null, "links": []}
{"entry_id": "2410.20883", "title": "Improving Generalization in Visual Reasoning via Self-Ensemble", "authors": ["Tien-Huy Nguyen", "Quang-Khai Tran", "Anh-Tuan Quang-Hoang"], "published": "2024-10-28 10:04:40", "updated": "2024-10-28 10:04:40", "summary": "The cognitive faculty of visual reasoning necessitates the integration of\nmultimodal perceptual processing and commonsense and external knowledge of the\nworld. In recent years, a plethora of large vision-language models (LVLMs) have\nbeen proposed, demonstrating outstanding power and exceptional proficiency in\ncommonsense reasoning across diverse domains and tasks. Nevertheless, training\nsuch LVLMs requires a lot of costly resources. Recent approaches, instead of\ntraining LVLMs from scratch on various large datasets, focus on exploring ways\nto take advantage of the capabilities of many different LVLMs, such as ensemble\nmethods. In this work, we propose self-ensemble, a novel method that improves\nthe generalization and visual reasoning of the model without updating any\nparameters, a training-free method. Our key insight is that we realized that\nLVLM itself can ensemble without the need for any other LVLMs, which helps to\nunlock their internal capabilities. Extensive experiments on various benchmarks\ndemonstrate the effectiveness of our method in achieving state-of-the-art\n(SOTA) performance on SketchyVQA, Outside Knowledge VQA, and\nout-of-distribution VQA tasks.", "comment": null, "links": []}
{"entry_id": "2410.19546", "title": "Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?", "authors": ["Antonia Wüst", "Tim Tobiasch", "Lukas Helff", "Devendra S. Dhami", "Constantin A. Rothkopf", "Kristian Kersting"], "published": "2024-10-25 13:19:26", "updated": "2024-10-25 13:19:26", "summary": "Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's\nGPT-4o, have emerged, seemingly demonstrating advanced reasoning capabilities\nacross text and image modalities. Yet, the depth of these advances in\nlanguage-guided perception and abstract reasoning remains underexplored, and it\nis unclear whether these models can truly live up to their ambitious promises.\nTo assess the progress and identify shortcomings, we enter the wonderland of\nBongard problems, a set of classical visual reasoning puzzles that require\nhuman-like abilities of pattern recognition and abstract reasoning. While VLMs\noccasionally succeed in identifying discriminative concepts and solving some of\nthe problems, they frequently falter, failing to understand and reason about\nvisual concepts. Surprisingly, even elementary concepts that may seem trivial\nto humans, such as simple spirals, pose significant challenges. Moreover, even\nwhen asked to explicitly focus on and analyze these concepts, they continue to\nfalter, suggesting not only a lack of understanding of these elementary visual\nconcepts but also an inability to generalize to unseen concepts. These\nobservations underscore the current limitations of VLMs, emphasize that a\nsignificant gap remains between human-like visual reasoning and machine\ncognition, and highlight the ongoing need for innovation in this area.", "comment": null, "links": []}
{"entry_id": "2410.18976", "title": "CAMEL-Bench: A Comprehensive Arabic LMM Benchmark", "authors": ["Sara Ghaboura", "Ahmed Heakl", "Omkar Thawakar", "Ali Alharthi", "Ines Riahi", "Abduljalil Saif", "Jorma Laaksonen", "Fahad S. Khan", "Salman Khan", "Rao M. Anwer"], "published": "2024-10-24 17:59:38", "updated": "2024-10-24 17:59:38", "summary": "Recent years have witnessed a significant interest in developing large\nmultimodal models (LMMs) capable of performing various visual reasoning and\nunderstanding tasks. This has led to the introduction of multiple LMM\nbenchmarks to evaluate LMMs on different tasks. However, most existing LMM\nevaluation benchmarks are predominantly English-centric. In this work, we\ndevelop a comprehensive LMM evaluation benchmark for the Arabic language to\nrepresent a large population of over 400 million speakers. The proposed\nbenchmark, named CAMEL-Bench, comprises eight diverse domains and 38\nsub-domains including, multi-image understanding, complex visual perception,\nhandwritten document understanding, video understanding, medical imaging, plant\ndiseases, and remote sensing-based land use understanding to evaluate broad\nscenario generalizability. Our CAMEL-Bench comprises around 29,036 questions\nthat are filtered from a larger pool of samples, where the quality is manually\nverified by native speakers to ensure reliable model assessment. We conduct\nevaluations of both closed-source, including GPT-4 series, and open-source\nLMMs. Our analysis reveals the need for substantial improvement, especially\namong the best open-source models, with even the closed-source GPT-4o achieving\nan overall score of 62%. Our benchmark and evaluation scripts are open-sourced.", "comment": "10 pages, 5 figures, NAACL", "links": []}
{"entry_id": "2410.18798", "title": "Distill Visual Chart Reasoning Ability from LLMs to MLLMs", "authors": ["Wei He", "Zhiheng Xi", "Wanxu Zhao", "Xiaoran Fan", "Yiwen Ding", "Zifei Shan", "Tao Gui", "Qi Zhang", "Xuanjing Huang"], "published": "2024-10-24 14:50:42", "updated": "2024-10-24 14:50:42", "summary": "Solving complex chart Q&A tasks requires advanced visual reasoning abilities\nin multimodal large language models (MLLMs). Recent studies highlight that\nthese abilities consist of two main parts: recognizing key information from\nvisual inputs and conducting reasoning over it. Thus, a promising approach to\nenhance MLLMs is to construct relevant training data focusing on the two\naspects. However, collecting and annotating complex charts and questions is\ncostly and time-consuming, and ensuring the quality of annotated answers\nremains a challenge. In this paper, we propose Code-as-Intermediary Translation\n(CIT), a cost-effective, efficient and easily scalable data synthesis method\nfor distilling visual reasoning abilities from LLMs to MLLMs. The code serves\nas an intermediary that translates visual chart representations into textual\nrepresentations, enabling LLMs to understand cross-modal information.\nSpecifically, we employ text-based synthesizing techniques to construct\nchart-plotting code and produce ReachQA, a dataset containing 3k\nreasoning-intensive charts and 20k Q&A pairs to enhance both recognition and\nreasoning abilities. Experiments show that when fine-tuned with our data,\nmodels not only perform well on chart-related benchmarks, but also demonstrate\nimproved multimodal reasoning abilities on general mathematical benchmarks like\nMathVista. The code and dataset are publicly available at\nhttps://github.com/hewei2001/ReachQA.", "comment": "Under review. The code and dataset are publicly available at\n  https://github.com/hewei2001/ReachQA", "links": []}
{"entry_id": "2410.12381", "title": "HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks", "authors": ["Fengji Zhang", "Linquan Wu", "Huiyu Bai", "Guancheng Lin", "Xiao Li", "Xiao Yu", "Yue Wang", "Bei Chen", "Jacky Keung"], "published": "2024-10-16 09:04:57", "updated": "2024-10-24 13:33:58", "summary": "Coding tasks have been valuable for evaluating Large Language Models (LLMs),\nas they demand the comprehension of high-level instructions, complex reasoning,\nand the implementation of functional programs -- core capabilities for\nadvancing Artificial General Intelligence. Despite the progress in Large\nMultimodal Models (LMMs), which extend LLMs with visual perception and\nunderstanding capabilities, there remains a notable lack of coding benchmarks\nthat rigorously assess these models, particularly in tasks that emphasize\nvisual reasoning. To address this gap, we introduce HumanEval-V, a novel and\nlightweight benchmark specifically designed to evaluate LMMs' visual\nunderstanding and reasoning capabilities through code generation. HumanEval-V\nincludes 108 carefully crafted, entry-level Python coding tasks derived from\nplatforms like CodeForces and Stack Overflow. Each task is adapted by modifying\nthe context and algorithmic patterns of the original problems, with visual\nelements redrawn to ensure distinction from the source, preventing potential\ndata leakage. LMMs are required to complete the code solution based on the\nprovided visual context and a predefined Python function signature outlining\nthe task requirements. Every task is equipped with meticulously handcrafted\ntest cases to ensure a thorough and reliable evaluation of model-generated\nsolutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering\nsignificant challenges. Proprietary models like GPT-4o achieve only 13% pass@1\nand 36.4% pass@10, while open-weight models with 70B parameters score below 4%\npass@1. Ablation studies further reveal the limitations of current LMMs in\nvision reasoning and coding capabilities. These results underscore key areas\nfor future research to enhance LMMs' capabilities. We have open-sourced our\ncode and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.", "comment": "homepage https://humaneval-v.github.io/", "links": []}
{"entry_id": "2406.09949", "title": "Neural Concept Binder", "authors": ["Wolfgang Stammer", "Antonia Wüst", "David Steinmann", "Kristian Kersting"], "published": "2024-06-14 11:52:09", "updated": "2024-10-24 12:13:54", "summary": "The challenge in object-based visual reasoning lies in generating concept\nrepresentations that are both descriptive and distinct. Achieving this in an\nunsupervised manner requires human users to understand the model's learned\nconcepts and, if necessary, revise incorrect ones. To address this challenge,\nwe introduce the Neural Concept Binder (NCB), a novel framework for deriving\nboth discrete and continuous concept representations, which we refer to as\n\"concept-slot encodings\". NCB employs two types of binding: \"soft binding\",\nwhich leverages the recent SysBinder mechanism to obtain object-factor\nencodings, and subsequent \"hard binding\", achieved through hierarchical\nclustering and retrieval-based inference. This enables obtaining expressive,\ndiscrete representations from unlabeled images. Moreover, the structured nature\nof NCB's concept representations allows for intuitive inspection and the\nstraightforward integration of external knowledge, such as human input or\ninsights from other AI models like GPT-4. Additionally, we demonstrate that\nincorporating the hard binding mechanism preserves model performance while\nenabling seamless integration into both neural and symbolic modules for complex\nreasoning tasks. We validate the effectiveness of NCB through evaluations on\nour newly introduced CLEVR-Sudoku dataset.", "comment": null, "links": []}
{"entry_id": "2406.18925", "title": "Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding", "authors": ["Jiwan Chung", "Sungjae Lee", "Minseo Kim", "Seungju Han", "Ashkan Yousefpour", "Jack Hessel", "Youngjae Yu"], "published": "2024-06-27 06:32:56", "updated": "2024-10-23 02:57:31", "summary": "Visual arguments, often used in advertising or social causes, rely on images\nto persuade viewers to do or believe something. Understanding these arguments\nrequires selective vision: only specific visual stimuli within an image are\nrelevant to the argument, and relevance can only be understood within the\ncontext of a broader argumentative structure. While visual arguments are\nreadily appreciated by human audiences, we ask: are today's AI capable of\nsimilar understanding? We present VisArgs, a dataset of 1,611 images annotated\nwith 5,112 visual premises (with regions), 5,574 commonsense premises, and\nreasoning trees connecting them into structured arguments. We propose three\ntasks for evaluating visual argument understanding: premise localization,\npremise identification, and conclusion deduction. Experiments show that 1)\nmachines struggle to capture visual cues: GPT-4-O achieved 78.5% accuracy,\nwhile humans reached 98.0%. Models also performed 19.5% worse when\ndistinguishing between irrelevant objects within the image compared to external\nobjects. 2) Providing relevant visual premises improved model performance\nsignificantly.", "comment": "12 pages, 6 figures. Accepted as main paper in EMNLP 2024", "links": []}
{"entry_id": "2403.10853", "title": "Just Say the Name: Online Continual Learning with Category Names Only via Data Generation", "authors": ["Minhyuk Seo", "Seongwon Cho", "Minjae Lee", "Diganta Misra", "Hyeonbeom Choi", "Seon Joo Kim", "Jonghyun Choi"], "published": "2024-03-16 08:28:42", "updated": "2024-10-19 14:51:45", "summary": "Requiring extensive human supervision is often impractical for continual\nlearning due to its cost, leading to the emergence of 'name-only continual\nlearning' that only provides the name of new concepts (e.g., classes) without\nproviding supervised samples. To address the task, recent approach uses\nweb-scraped data but results in issues such as data imbalance, copyright, and\nprivacy concerns. To overcome the limitations of both human supervision and\nwebly supervision, we propose Generative name only Continual Learning (GenCL)\nusing generative models for the name only continual learning. But na\\\"ive\napplication of generative models results in limited diversity of generated\ndata. So, we specifically propose a diverse prompt generation method,\nHIerarchical Recurrent Prompt Generation (HIRPG) as well as\nCOmplexity-NAvigating eNsembler (CONAN) that selects samples with minimal\noverlap from multiple generative models. We empirically validate that the\nproposed GenCL outperforms prior arts, even a model trained with fully\nsupervised data, in various tasks including image recognition and multi-modal\nvisual reasoning. Data generated by GenCL is available at\nhttps://anonymous.4open.science/r/name-only-continual-E079.", "comment": null, "links": []}
{"entry_id": "2410.14138", "title": "ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom", "authors": ["Jingqi Zhou", "Sheng Wang", "Jingwei Dong", "Lei Li", "Jiahui Gao", "Lingpeng Kong", "Chuan Wu"], "published": "2024-10-18 03:22:06", "updated": "2024-10-18 03:22:06", "summary": "Large vision-language models (LVLMs) have witnessed significant progress on\nvisual understanding tasks. However, they often prioritize language knowledge\nover image information on visual reasoning tasks, incurring performance\ndegradation. To tackle this issue, we first identify the drawbacks of existing\nsolutions (i.e., insufficient and irrelevant visual descriptions, and limited\nmulti-modal capacities). We then decompose visual reasoning process into two\nstages: visual perception (i.e., eyesight) and textual reasoning (i.e.,\nwisdom), and introduce a novel visual reasoning framework named ProReason. This\nframework features multi-run proactive perception and decoupled\nvision-reasoning capabilities. Briefly, given a multi-modal question, ProReason\niterates proactive information collection and reasoning until the answer can be\nconcluded with necessary and sufficient visual descriptions. Notably, the\ndisassociation of capabilities allows seamless integration of existing large\nlanguage models (LLMs) to compensate for the reasoning deficits of LVLMs. Our\nextensive experiments demonstrate that ProReason outperforms both existing\nmulti-step reasoning frameworks and passive peer methods on a wide range of\nbenchmarks for both open-source and closed-source models. In addition, with the\nassistance of LLMs, ProReason achieves a performance improvement of up to 15%\non MMMU benchmark. Our insights into existing solutions and the decoupled\nperspective for feasible integration of LLMs illuminate future research on\nvisual reasoning techniques, especially LLM-assisted ones.", "comment": null, "links": []}
{"entry_id": "2409.02253", "title": "How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?", "authors": ["Saeid Asgari Taghanaki", "Joseph Lambourne", "Alana Mongkhounsavath"], "published": "2024-09-03 19:26:13", "updated": "2024-10-15 18:40:09", "summary": "Large foundation models have revolutionized the field, yet challenges remain\nin optimizing multi-modal models for specialized visual tasks. We propose a\nnovel, generalizable methodology to identify preferred image distributions for\nblack-box Vision-Language Models (VLMs) by measuring output consistency across\nvaried input prompts. Applying this to different rendering types of 3D objects,\nwe demonstrate its efficacy across various domains requiring precise\ninterpretation of complex structures, with a focus on Computer-Aided Design\n(CAD) as an exemplar field. We further refine VLM outputs using in-context\nlearning with human feedback, significantly enhancing explanation quality. To\naddress the lack of benchmarks in specialized domains, we introduce CAD-VQA, a\nnew dataset for evaluating VLMs on CAD-related visual question answering tasks.\nOur evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline\nperformance levels, providing a framework for advancing VLM capabilities in\ncomplex visual reasoning tasks across various fields requiring expert-level\nvisual interpretation. We release the dataset and evaluation codes at\n\\url{https://github.com/asgsaeid/cad_vqa}.", "comment": "Accepted to NeurIPS 2024, Safe Generative AI", "links": []}
{"entry_id": "2410.11538", "title": "MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark", "authors": ["Bin Shan", "Xiang Fei", "Wei Shi", "An-Lan Wang", "Guozhi Tang", "Lei Liao", "Jingqun Tang", "Xiang Bai", "Can Huang"], "published": "2024-10-15 12:13:42", "updated": "2024-10-15 12:13:42", "summary": "The comprehension of text-rich visual scenes has become a focal point for\nevaluating Multi-modal Large Language Models (MLLMs) due to their widespread\napplications. Current benchmarks tailored to the scenario emphasize perceptual\ncapabilities, while overlooking the assessment of cognitive abilities. To\naddress this limitation, we introduce a Multimodal benchmark towards Text-rich\nvisual scenes, to evaluate the Cognitive capabilities of MLLMs through visual\nreasoning and content-creation tasks (MCTBench). To mitigate potential\nevaluation bias from the varying distributions of datasets, MCTBench\nincorporates several perception tasks (e.g., scene text recognition) to ensure\na consistent comparison of both the cognitive and perceptual capabilities of\nMLLMs. To improve the efficiency and fairness of content-creation evaluation,\nwe conduct an automatic evaluation pipeline. Evaluations of various MLLMs on\nMCTBench reveal that, despite their impressive perceptual capabilities, their\ncognition abilities require enhancement. We hope MCTBench will offer the\ncommunity an efficient resource to explore and enhance cognitive capabilities\ntowards text-rich visual scenes.", "comment": "12 pages, 5 figures, project page:\n  https://github.com/xfey/MCTBench?tab=readme-ov-file", "links": []}
{"entry_id": "2410.10238", "title": "ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization", "authors": ["Jiawei Li", "Fanrui Zhang", "Jiaying Zhu", "Esther Sun", "Qiang Zhang", "Zheng-Jun Zha"], "published": "2024-10-14 07:56:51", "updated": "2024-10-14 07:56:51", "summary": "Multimodal Large Language Models (MLLMs), such as GPT4o, have shown strong\ncapabilities in visual reasoning and explanation generation. However, despite\nthese strengths, they face significant challenges in the increasingly critical\ntask of Image Forgery Detection and Localization (IFDL). Moreover, existing\nIFDL methods are typically limited to the learning of low-level\nsemantic-agnostic clues and merely provide a single outcome judgment. To tackle\nthese issues, we propose ForgeryGPT, a novel framework that advances the IFDL\ntask by capturing high-order forensics knowledge correlations of forged images\nfrom diverse linguistic feature spaces, while enabling explainable generation\nand interactive dialogue through a newly customized Large Language Model (LLM)\narchitecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating\nthe Mask-Aware Forgery Extractor, which enables the excavating of precise\nforgery mask information from input images and facilitating pixel-level\nunderstanding of tampering artifacts. The Mask-Aware Forgery Extractor consists\nof a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the\nFL-Expert is augmented with an Object-agnostic Forgery Prompt and a\nVocabulary-enhanced Vision Encoder, allowing for effectively capturing of\nmulti-scale fine-grained forgery details. To enhance its performance, we\nimplement a three-stage training strategy, supported by our designed Mask-Text\nAlignment and IFDL Task-Specific Instruction Tuning datasets, which align\nvision-language modalities and improve forgery detection and\ninstruction-following capabilities. Extensive experiments demonstrate the\neffectiveness of the proposed method.", "comment": "16 pages, 14 figures", "links": []}
{"entry_id": "2402.06118", "title": "ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling", "authors": ["Siming Yan", "Min Bai", "Weifeng Chen", "Xiong Zhou", "Qixing Huang", "Li Erran Li"], "published": "2024-02-09 01:00:14", "updated": "2024-10-13 14:06:12", "summary": "By combining natural language understanding, generation capabilities, and\nbreadth of knowledge of large language models with image perception, recent\nlarge vision language models (LVLMs) have shown unprecedented visual reasoning\ncapabilities. However, the generated text often suffers from inaccurate\ngrounding in the visual input, resulting in errors such as hallucination of\nnonexistent scene elements, missing significant parts of the scene, and\ninferring incorrect attributes of and relationships between objects. To address\nthese issues, we introduce a novel framework, ViGoR (Visual Grounding Through\nFine-Grained Reward Modeling) that utilizes fine-grained reward modeling to\nsignificantly enhance the visual grounding of LVLMs over pre-trained baselines.\nThis improvement is efficiently achieved using much cheaper human evaluations\ninstead of full supervisions, as well as automated methods. We show the\neffectiveness of our approach through a variety of evaluation methods and\nbenchmarks. Additionally, we released our human annotation\n(https://github.com/amazon-science/vigor) comprising 15,440 images and\ngenerated text pairs with fine-grained evaluations to contribute to related\nresearch in the community.", "comment": "10 pages, 3 figures", "links": []}
{"entry_id": "2410.09489", "title": "Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks", "authors": ["Sungkyung Kim", "Adam Lee", "Junyoung Park", "Andrew Chung", "Jusang Oh", "Jay-Yoon Lee"], "published": "2024-10-12 10:51:05", "updated": "2024-10-12 10:51:05", "summary": "Recent advancements in large language models have demonstrated enhanced\ncapabilities in visual reasoning tasks by employing additional encoders for\naligning different modalities. While the Q-Former has been widely used as a\ngeneral encoder for aligning several modalities including image, video, audio,\nand 3D with large language models, previous works on its efficient training and\nthe analysis of its individual components have been limited. In this work, we\ninvestigate the effectiveness of parameter efficient fine-tuning (PEFT) the\nQ-Former using InstructBLIP with visual reasoning benchmarks ScienceQA and\nIconQA. We observe that applying PEFT to the Q-Former achieves comparable\nperformance to full fine-tuning using under 2% of the trainable parameters.\nAdditionally, we employ AdaLoRA for dynamic parameter budget reallocation to\nexamine the relative importance of the Q-Former's sublayers with 4 different\nbenchmarks. Our findings reveal that the self-attention layers are noticeably\nmore important in perceptual visual-language reasoning tasks, and relative\nimportance of FFN layers depends on the complexity of visual-language patterns\ninvolved in tasks. The code is available at\nhttps://github.com/AttentionX/InstructBLIP_PEFT.", "comment": "EMNLP 2024 Findings", "links": []}
{"entry_id": "2405.15683", "title": "Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs", "authors": ["Sreyan Ghosh", "Chandra Kiran Reddy Evuru", "Sonal Kumar", "Utkarsh Tyagi", "Oriol Nieto", "Zeyu Jin", "Dinesh Manocha"], "published": "2024-05-24 16:21:59", "updated": "2024-10-12 06:17:23", "summary": "Large Vision-Language Models (LVLMs) often produce responses that misalign\nwith factual information, a phenomenon known as hallucinations. While\nhallucinations are well-studied, the exact causes behind them remain\nunderexplored. In this paper, we first investigate the root causes of\nhallucinations in LVLMs. Our findings reveal that existing mitigation\ntechniques primarily reduce hallucinations for visual recognition prompts-those\nthat require simple descriptions of visual elements-but fail for cognitive\nprompts that demand deliberate reasoning. We identify the core issue as a lack\nof true visual perception in LVLMs: although they can accurately recognize\nvisual elements, they struggle to fully interpret these elements in the context\nof the input prompt and effectively link this recognition to their internal\nknowledge, which is critical for reasoning. To address this gap, we introduce\nVisual Description Grounded Decoding (VDGD), a simple, robust, and\ntraining-free method designed to enhance visual perception and improve\nreasoning capabilities in LVLMs. VDGD works by first generating a detailed\ndescription of the image and appending it as a prefix to the instruction.\nDuring response generation, tokens are sampled based on their KL divergence to\nthe description, favoring candidates with lower divergence. Experimental\nresults on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD\nconsistently outperforms existing baselines 2% - 33%. Finally, we introduce\nVaLLu, a benchmark designed for comprehensive evaluation of the cognitive\ncapabilities of LVLMs.", "comment": "Preprint. Under review", "links": []}
{"entry_id": "2406.19934", "title": "From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis", "authors": ["Chuanqi Cheng", "Jian Guan", "Wei Wu", "Rui Yan"], "published": "2024-06-28 14:04:10", "updated": "2024-10-11 15:41:23", "summary": "We explore multi-step reasoning in vision-language models (VLMs). The problem\nis challenging, as reasoning data consisting of multiple steps of visual and\nlanguage processing are barely available. To overcome the challenge, we first\nintroduce a least-to-most visual reasoning paradigm, which interleaves steps of\ndecomposing a question into sub-questions and invoking external tools for\nresolving sub-questions. Based on the paradigm, we further propose a novel data\nsynthesis approach that can automatically create questions and multi-step\nreasoning paths for an image in a bottom-up manner. Our approach divides the\ncomplex synthesis task into a few simple sub-tasks, and (almost entirely)\nrelies on open-sourced models to accomplish the sub-tasks. Therefore, the\nentire synthesis process is reproducible and cost-efficient, and the\nsynthesized data is quality guaranteed. With the approach, we construct $50$k\nvisual reasoning examples. Then, we develop a visual reasoner through\nsupervised fine-tuning, which is capable of generally enhancing the reasoning\nabilities of a wide range of existing VLMs in a plug-and-play fashion.\nExtensive experiments indicate that the visual reasoner can consistently and\nsignificantly improve four VLMs on four VQA benchmarks. Our code and dataset\nare available at https://github.com/steven-ccq/VisualReasoner.", "comment": "Accepted by EMNLP 2024", "links": []}
{"entry_id": "2407.19094", "title": "Solving Robotics Problems in Zero-Shot with Vision-Language Models", "authors": ["Zidan Wang", "Rui Shen", "Bradly Stadie"], "published": "2024-07-26 21:18:57", "updated": "2024-10-11 04:58:00", "summary": "We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM)\nframework designed to solve robotics problems in a zero-shot regime. In our\ncontext, zero-shot means that for a novel environment, we provide a VLLM with\nan image of the robot's surroundings and a task description, and the VLLM\noutputs the sequence of actions necessary for the robot to complete the task.\nUnlike prior work that requires fine-tuning parts of the pipeline -- such as\nadjusting an LLM on robot-specific data or training separate vision encoders --\nour approach demonstrates that with careful engineering, a single off-the-shelf\nVLLM can autonomously handle all aspects of a robotics task, from high-level\nplanning to low-level location extraction and action execution. Crucially,\ncompared to using GPT-4o alone, Wonderful Team is self-corrective and capable\nof iteratively fixing its own mistakes, enabling it to solve challenging\nlong-horizon tasks. We validate our framework through extensive experiments,\nboth in simulated environments using VIMABench and in real-world settings. Our\nsystem showcases the ability to handle diverse tasks such as manipulation,\ngoal-reaching, and visual reasoning -- all in a zero-shot manner. These results\nunderscore a key point: vision-language models have progressed rapidly in the\npast year and should be strongly considered as a backbone for many robotics\nproblems moving forward.", "comment": "aka Wonderful Team", "links": []}
{"entry_id": "2406.13246", "title": "GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs", "authors": ["Navid Rajabi", "Jana Kosecka"], "published": "2024-06-19 06:15:26", "updated": "2024-10-10 22:22:52", "summary": "The ability to understand and reason about spatial relationships between\nobjects in images is an important component of visual reasoning. This skill\nrests on the ability to recognize and localize objects of interest and\ndetermine their spatial relation. Early vision and language models (VLMs) have\nbeen shown to struggle to recognize spatial relations. We extend the previously\nreleased What'sUp dataset and propose a novel comprehensive evaluation for\nspatial relationship understanding that highlights the strengths and weaknesses\nof 27 different models. In addition to the VLMs evaluated in What'sUp, our\nextensive evaluation encompasses 3 classes of Multimodal LLMs (MLLMs) that vary\nin their parameter sizes (ranging from 7B to 110B), training/instruction-tuning\nmethods, and visual resolution to benchmark their performances and scrutinize\nthe scaling laws in this task.", "comment": "Accepted to NeurIPS 2024 Workshop on Compositional Learning", "links": []}
{"entry_id": "2410.07752", "title": "TVBench: Redesigning Video-Language Evaluation", "authors": ["Daniel Cores", "Michael Dorkenwald", "Manuel Mucientes", "Cees G. M. Snoek", "Yuki M. Asano"], "published": "2024-10-10 09:28:36", "updated": "2024-10-10 09:28:36", "summary": "Large language models have demonstrated impressive performance when\nintegrated with vision models even enabling video understanding. However,\nevaluating these video models presents its own unique challenges, for which\nseveral benchmarks have been proposed. In this paper, we show that the\ncurrently most used video-language benchmarks can be solved without requiring\nmuch temporal reasoning. We identified three main issues in existing datasets:\n(i) static information from single frames is often sufficient to solve the\ntasks (ii) the text of the questions and candidate answers is overly\ninformative, allowing models to answer correctly without relying on any visual\ninput (iii) world knowledge alone can answer many of the questions, making the\nbenchmarks a test of knowledge replication rather than visual reasoning. In\naddition, we found that open-ended question-answering benchmarks for video\nunderstanding suffer from similar issues while the automatic evaluation process\nwith LLMs is unreliable, making it an unsuitable alternative. As a solution, we\npropose TVBench, a novel open-source video multiple-choice question-answering\nbenchmark, and demonstrate through extensive evaluations that it requires a\nhigh level of temporal understanding. Surprisingly, we find that most recent\nstate-of-the-art video-language models perform similarly to random performance\non TVBench, with only Gemini-Pro and Tarsier clearly surpassing this baseline.", "comment": null, "links": []}
{"entry_id": "2410.06405", "title": "Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects", "authors": ["Wenhao Li", "Yudong Xu", "Scott Sanner", "Elias Boutros Khalil"], "published": "2024-10-08 22:25:34", "updated": "2024-10-08 22:25:34", "summary": "The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on\nvisual reasoning in the evaluation of Artificial Intelligence systems. In its\noriginal framing, an ARC task requires solving a program synthesis problem over\nsmall 2D images using a few input-output training pairs. In this work, we adopt\nthe recently popular data-driven approach to the ARC and ask whether a Vision\nTransformer (ViT) can learn the implicit mapping, from input image to output\nimage, that underlies the task. We show that a ViT -- otherwise a\nstate-of-the-art model for images -- fails dramatically on most ARC tasks even\nwhen trained on one million examples per task. This points to an inherent\nrepresentational deficiency of the ViT architecture that makes it incapable of\nuncovering the simple structured mappings underlying the ARC tasks. Building on\nthese insights, we propose ViTARC, a ViT-style architecture that unlocks some\nof the visual reasoning capabilities required by the ARC. Specifically, we use\na pixel-level input representation, design a spatially-aware tokenization\nscheme, and introduce a novel object-based positional encoding that leverages\nautomatic segmentation, among other enhancements. Our task-specific ViTARC\nmodels achieve a test solve rate close to 100% on more than half of the 400\npublic ARC tasks strictly through supervised learning from input-output grids.\nThis calls attention to the importance of imbuing the powerful (Vision)\nTransformer with the correct inductive biases for abstract visual reasoning\nthat are critical even when the training data is plentiful and the mapping is\nnoise-free. Hence, ViTARC provides a strong foundation for future research in\nvisual reasoning using transformer-based architectures.", "comment": null, "links": []}
{"entry_id": "2410.13883", "title": "Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends", "authors": ["Mirna Al-Shetairy", "Hanan Hindy", "Dina Khattab", "Mostafa M. Aref"], "published": "2024-10-05 16:26:44", "updated": "2024-10-05 16:26:44", "summary": "In recent years, interest in vision-language tasks has grown, especially\nthose involving chart interactions. These tasks are inherently multimodal,\nrequiring models to process chart images, accompanying text, underlying data\ntables, and often user queries. Traditionally, Chart Understanding (CU) relied\non heuristics and rule-based systems. However, recent advancements that have\nintegrated transformer architectures significantly improved performance. This\npaper reviews prominent research in CU, focusing on State-of-The-Art (SoTA)\nframeworks that employ transformers within End-to-End (E2E) solutions. Relevant\nbenchmarking datasets and evaluation techniques are analyzed. Additionally,\nthis article identifies key challenges and outlines promising future directions\nfor advancing CU solutions. Following the PRISMA guidelines, a comprehensive\nliterature search is conducted across Google Scholar, focusing on publications\nfrom Jan'20 to Jun'24. After rigorous screening and quality assessment, 32\nstudies are selected for in-depth analysis. The CU tasks are categorized into a\nthree-layered paradigm based on the cognitive task required. Recent\nadvancements in the frameworks addressing various CU tasks are also reviewed.\nFrameworks are categorized into single-task or multi-task based on the number\nof tasks solvable by the E2E solution. Within multi-task frameworks,\npre-trained and prompt-engineering-based techniques are explored. This review\noverviews leading architectures, datasets, and pre-training tasks. Despite\nsignificant progress, challenges remain in OCR dependency, handling\nlow-resolution images, and enhancing visual reasoning. Future directions\ninclude addressing these challenges, developing robust benchmarks, and\noptimizing model efficiency. Additionally, integrating explainable AI\ntechniques and exploring the balance between real and synthetic data are\ncrucial for advancing CU research.", "comment": null, "links": []}
{"entry_id": "2406.13444", "title": "VDebugger: Harnessing Execution Feedback for Debugging Visual Programs", "authors": ["Xueqing Wu", "Zongyu Lin", "Songyan Zhao", "Te-Lin Wu", "Pan Lu", "Nanyun Peng", "Kai-Wei Chang"], "published": "2024-06-19 11:09:16", "updated": "2024-10-04 04:56:35", "summary": "Visual programs are executable code generated by large language models to\naddress visual reasoning problems. They decompose complex questions into\nmultiple reasoning steps and invoke specialized models for each step to solve\nthe problems. However, these programs are prone to logic errors, with our\npreliminary evaluation showing that 58% of the total errors are caused by\nprogram logic errors. Debugging complex visual programs remains a major\nbottleneck for visual reasoning. To address this, we introduce VDebugger, a\nnovel critic-refiner framework trained to localize and debug visual programs by\ntracking execution step by step. VDebugger identifies and corrects program\nerrors leveraging detailed execution feedback, improving interpretability and\naccuracy. The training data is generated through an automated pipeline that\ninjects errors into correct visual programs using a novel mask-best decoding\ntechnique. Evaluations on six datasets demonstrate VDebugger's effectiveness,\nshowing performance improvements of up to 3.2% in downstream task accuracy.\nFurther studies show VDebugger's ability to generalize to unseen tasks,\nbringing a notable improvement of 2.3% on the unseen COVR task. Code, data and\nmodels are made publicly available at https://github.com/shirley-wu/vdebugger/", "comment": "EMNLP 2024 Findings", "links": []}
{"entry_id": "2404.06479", "title": "Visually Descriptive Language Model for Vector Graphics Reasoning", "authors": ["Zhenhailong Wang", "Joy Hsu", "Xingyao Wang", "Kuan-Hao Huang", "Manling Li", "Jiajun Wu", "Heng Ji"], "published": "2024-04-09 17:30:18", "updated": "2024-10-03 21:59:32", "summary": "Despite significant advancements, large multimodal models (LMMs) still\nstruggle to bridge the gap between low-level visual perception -- focusing on\nshapes, sizes, and layouts -- and high-level language reasoning, such as\nsemantics and logic. This limitation is evident in tasks that require precise\nvisual perception, like comparing geometric properties or solving visual\nreasoning problems. To study this failure mode, we focus on vector graphics --\nimages composed of 2D objects and shapes, prevalent in LMM-based tasks in web,\ndesign, and OS environments. We identify two key research questions: how can we\nenable precise visual perception, and how can we facilitate high-level\nreasoning based on such low-level perceptions? To capture fine visual details,\nwe use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes.\nHowever, SVGs are not readily interpretable by LMMs in a zero-shot manner. To\ntackle this, we propose the Visually Descriptive Language Model (VDLM), which\nintroduces a Primal Visual Description (PVD) as an intermediate textual\nrepresentation. PVD translates SVGs into a text-based abstraction consisting of\nprimitive attributes (e.g., shape, position, measurement) and their\ncorresponding values. PVD can be learned using task-agnostic synthesized data\nand represents visual primitives that are universal across vector graphics.\nThis abstraction is more structured, allowing for direct interpretation by\nfoundation models for zero-shot generalization. Without human-annotated data,\nempirical results show that VDLM significantly improves state-of-the-art LMMs\nlike GPT-4o on various multimodal perception and reasoning tasks. Extensive\nanalyses of VDLM show improved interpretability due to its disentangled\nperception and reasoning. We also demonstrate a positive correlation between\nPVD quality and task performance. Project page:\nhttps://mikewangwzhl.github.io/VDLM/", "comment": "Project page: https://mikewangwzhl.github.io/VDLM/", "links": []}
{"entry_id": "2410.02613", "title": "NL-Eye: Abductive NLI for Images", "authors": ["Mor Ventura", "Michael Toker", "Nitay Calderon", "Zorik Gekhman", "Yonatan Bitton", "Roi Reichart"], "published": "2024-10-03 15:51:36", "updated": "2024-10-03 15:51:36", "summary": "Will a Visual Language Model (VLM)-based bot warn us about slipping if it\ndetects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet\ntheir ability to infer outcomes and causes remains underexplored. To address\nthis, we introduce NL-Eye, a benchmark designed to assess VLMs' visual\nabductive reasoning skills. NL-Eye adapts the abductive Natural Language\nInference (NLI) task to the visual domain, requiring models to evaluate the\nplausibility of hypothesis images based on a premise image and explain their\ndecisions. NL-Eye consists of 350 carefully curated triplet examples (1,050\nimages) spanning diverse reasoning categories: physical, functional, logical,\nemotional, cultural, and social. The data curation process involved two steps -\nwriting textual descriptions and generating images using text-to-image models,\nboth requiring substantial human involvement to ensure high-quality and\nchallenging scenes. Our experiments show that VLMs struggle significantly on\nNL-Eye, often performing at random baseline levels, while humans excel in both\nplausibility prediction and explanation quality. This demonstrates a deficiency\nin the abductive reasoning capabilities of modern VLMs. NL-Eye represents a\ncrucial step toward developing VLMs capable of robust multimodal reasoning for\nreal-world applications, including accident-prevention bots and generated video\nverification.", "comment": null, "links": []}
{"entry_id": "2407.07053", "title": "Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model", "authors": ["Wenqi Zhang", "Zhenglin Cheng", "Yuanyu He", "Mengna Wang", "Yongliang Shen", "Zeqi Tan", "Guiyang Hou", "Mingqian He", "Yanna Ma", "Weiming Lu", "Yueting Zhuang"], "published": "2024-07-09 17:18:27", "updated": "2024-10-03 13:12:06", "summary": "Although most current large multimodal models (LMMs) can already understand\nphotos of natural scenes and portraits, their understanding of abstract images,\ne.g., charts, maps, or layouts, and visual reasoning capabilities remains quite\nrudimentary. They often struggle with simple daily tasks, such as reading time\nfrom a clock, understanding a flowchart, or planning a route using a road map.\nIn light of this, we design a multi-modal self-instruct, utilizing large\nlanguage models and their code capabilities to synthesize massive abstract\nimages and visual reasoning instructions across daily scenarios. Our strategy\neffortlessly creates a multimodal benchmark with 11,193 instructions for eight\nvisual scenarios: charts, tables, simulated maps, dashboards, flowcharts,\nrelation graphs, floor plans, and visual puzzles. \\textbf{This benchmark,\nconstructed with simple lines and geometric elements, exposes the shortcomings\nof most advanced LMMs} like Claude-3.5-Sonnet and GPT-4o in abstract image\nunderstanding, spatial relations reasoning, and visual element induction.\nBesides, to verify the quality of our synthetic data, we fine-tune an LMM using\n62,476 synthetic chart, table and road map instructions. The results\ndemonstrate improved chart understanding and map navigation performance, and\nalso demonstrate potential benefits for other visual reasoning tasks. Our code\nis available at: \\url{https://github.com/zwq2018/Multi-modal-Self-instruct}.", "comment": "The paper is accepted by EMNLP-24. Code:\n  https://github.com/zwq2018/Multi-modal-Self-instruct dataset:\n  https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct\n  Leaderboard: https://multi-modal-self-instruct.github.io/", "links": []}
{"entry_id": "2408.04102", "title": "ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling", "authors": ["William Yicheng Zhu", "Keren Ye", "Junjie Ke", "Jiahui Yu", "Leonidas Guibas", "Peyman Milanfar", "Feng Yang"], "published": "2024-08-07 21:44:29", "updated": "2024-10-02 12:48:44", "summary": "Recognizing and disentangling visual attributes from objects is a foundation\nto many computer vision applications. While large vision language\nrepresentations like CLIP had largely resolved the task of zero-shot object\nrecognition, zero-shot visual attribute recognition remains a challenge because\nCLIP's contrastively-learned vision-language representation cannot effectively\ncapture object-attribute dependencies. In this paper, we target this weakness\nand propose a sentence generation-based retrieval formulation for attribute\nrecognition that is novel in 1) explicitly modeling a to-be-measured and\nretrieved object-attribute relation as a conditional probability graph, which\nconverts the recognition problem into a dependency-sensitive language-modeling\nproblem, and 2) applying a large pretrained Vision-Language Model (VLM) on this\nreformulation and naturally distilling its knowledge of image-object-attribute\nrelations to use towards attribute recognition. Specifically, for each\nattribute to be recognized on an image, we measure the visual-conditioned\nprobability of generating a short sentence encoding the attribute's relation to\nobjects on the image. Unlike contrastive retrieval, which measures likelihood\nby globally aligning elements of the sentence to the image, generative\nretrieval is sensitive to the order and dependency of objects and attributes in\nthe sentence. We demonstrate through experiments that generative retrieval\nconsistently outperforms contrastive retrieval on two visual reasoning\ndatasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual\nGenome Attribute Ranking (VGARank).", "comment": "Accepted at ECCV 2024. Contact: zhuwilliam[at]google[dot]com. GitHub:\n  https://github.com/google-research/google-research/tree/master/attribute_with_prefixlm", "links": []}
{"entry_id": "2407.10341", "title": "Affordance-Guided Reinforcement Learning via Visual Prompting", "authors": ["Olivia Y. Lee", "Annie Xie", "Kuan Fang", "Karl Pertsch", "Chelsea Finn"], "published": "2024-07-14 21:41:29", "updated": "2024-10-02 00:40:38", "summary": "Robots equipped with reinforcement learning (RL) have the potential to learn\na wide range of skills solely from a reward signal. However, obtaining a robust\nand dense reward signal for general manipulation tasks remains a challenge.\nExisting learning-based approaches require significant data, such as human\ndemonstrations of success and failure, to learn task-specific reward functions.\nRecently, there is also a growing adoption of large multi-modal foundation\nmodels for robotics that can perform visual reasoning in physical contexts and\ngenerate coarse robot motions for manipulation tasks. Motivated by this range\nof capability, in this work, we present Keypoint-based Affordance Guidance for\nImprovements (KAGI), a method leveraging rewards shaped by vision-language\nmodels (VLMs) for autonomous RL. State-of-the-art VLMs have demonstrated\nimpressive reasoning about affordances through keypoints in zero-shot, and we\nuse these to define dense rewards that guide autonomous robotic learning. On\nreal-world manipulation tasks specified by natural language descriptions, KAGI\nimproves the sample efficiency of autonomous RL and enables successful task\ncompletion in 20K online fine-tuning steps. Additionally, we demonstrate the\nrobustness of KAGI to reductions in the number of in-domain demonstrations used\nfor pre-training, reaching similar performance in 35K online fine-tuning steps.\nProject website: https://sites.google.com/view/affordance-guided-rl", "comment": "8 pages, 6 figures. Robotics: Science and Systems (RSS) 2024, Task\n  Specification for General-Purpose Intelligent Robots & Lifelong Robot\n  Learning Workshops", "links": []}
{"entry_id": "2403.04732", "title": "How Far Are We from Intelligent Visual Deductive Reasoning?", "authors": ["Yizhe Zhang", "He Bai", "Ruixiang Zhang", "Jiatao Gu", "Shuangfei Zhai", "Josh Susskind", "Navdeep Jaitly"], "published": "2024-03-07 18:35:54", "updated": "2024-10-01 04:41:53", "summary": "Vision-Language Models (VLMs) have recently demonstrated incredible strides\non diverse vision language tasks. We dig into vision-based deductive reasoning,\na more sophisticated but less explored realm, and find previously unexposed\nblindspots in the current SOTA VLMs. Specifically, we leverage Raven's\nProgressive Matrices (RPMs), to assess VLMs' abilities to perform multi-hop\nrelational and deductive reasoning relying solely on visual clues. We perform\ncomprehensive evaluations of several popular VLMs employing standard strategies\nsuch as in-context learning, self-consistency, and Chain-of-thoughts (CoT) on\nthree diverse datasets, including the Mensa IQ test, IntelligenceTest, and\nRAVEN. The results reveal that despite the impressive capabilities of LLMs in\ntext-based reasoning, we are still far from achieving comparable proficiency in\nvisual deductive reasoning. We found that certain standard strategies that are\neffective when applied to LLMs do not seamlessly translate to the challenges\npresented by visual reasoning tasks. A detailed analysis reveals that VLMs\nstruggle to solve these tasks mainly because they are unable to perceive and\ncomprehend multiple, confounding abstract patterns in RPM examples.", "comment": "COLM 2024. https://github.com/apple/ml-rpm-bench", "links": []}
{"entry_id": "2409.20213", "title": "Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning", "authors": ["Oleh Kolner", "Thomas Ortner", "Stanisław Woźniak", "Angeliki Pantazi"], "published": "2024-09-30 11:48:11", "updated": "2024-09-30 11:48:11", "summary": "Human capabilities in understanding visual relations are far superior to\nthose of AI systems, especially for previously unseen objects. For example,\nwhile AI systems struggle to determine whether two such objects are visually\nthe same or different, humans can do so with ease. Active vision theories\npostulate that the learning of visual relations is grounded in actions that we\ntake to fixate objects and their parts by moving our eyes. In particular, the\nlow-dimensional spatial information about the corresponding eye movements is\nhypothesized to facilitate the representation of relations between different\nimage parts. Inspired by these theories, we develop a system equipped with a\nnovel Glimpse-based Active Perception (GAP) that sequentially glimpses at the\nmost salient regions of the input image and processes them at high resolution.\nImportantly, our system leverages the locations stemming from the glimpsing\nactions, along with the visual content around them, to represent relations\nbetween different parts of the image. The results suggest that the GAP is\nessential for extracting visual relations that go beyond the immediate visual\ncontent. Our approach reaches state-of-the-art performance on several visual\nreasoning tasks being more sample-efficient, and generalizing better to\nout-of-distribution visual inputs than prior models.", "comment": "10 pages of main text and 8 pages appendices", "links": []}
{"entry_id": "2409.18938", "title": "From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding", "authors": ["Heqing Zou", "Tianze Luo", "Guiyang Xie", "Victor", "Zhang", "Fengmao Lv", "Guangcong Wang", "Juanyang Chen", "Zhuochen Wang", "Hansheng Zhang", "Huaijian Zhang"], "published": "2024-09-27 17:38:36", "updated": "2024-09-27 17:38:36", "summary": "The integration of Large Language Models (LLMs) with visual encoders has\nrecently shown promising performance in visual understanding tasks, leveraging\ntheir inherent capability to comprehend and generate human-like text for visual\nreasoning. Given the diverse nature of visual data, MultiModal Large Language\nModels (MM-LLMs) exhibit variations in model designing and training for\nunderstanding images, short videos, and long videos. Our paper focuses on the\nsubstantial differences and unique challenges posed by long video understanding\ncompared to static image and short video understanding. Unlike static images,\nshort videos encompass sequential frames with both spatial and within-event\ntemporal information, while long videos consist of multiple events with\nbetween-event and long-term temporal information. In this survey, we aim to\ntrace and summarize the advancements of MM-LLMs from image understanding to\nlong video understanding. We review the differences among various visual\nunderstanding tasks and highlight the challenges in long video understanding,\nincluding more fine-grained spatiotemporal details, dynamic events, and\nlong-term dependencies. We then provide a detailed summary of the advancements\nin MM-LLMs in terms of model design and training methodologies for\nunderstanding long videos. Finally, we compare the performance of existing\nMM-LLMs on video understanding benchmarks of various lengths and discuss\npotential future directions for MM-LLMs in long video understanding.", "comment": "11 pages", "links": []}
{"entry_id": "2409.18286", "title": "Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing", "authors": ["Huthaifa I. Ashqar", "Ahmed Jaber", "Taqwa I. Alhadidi", "Mohammed Elhenawy"], "published": "2024-09-26 20:58:11", "updated": "2024-09-26 20:58:11", "summary": "This study aims to comprehensively review and empirically evaluate the\napplication of multimodal large language models (MLLMs) and Large Vision Models\n(VLMs) in object detection for transportation systems. In the first fold, we\nprovide a background about the potential benefits of MLLMs in transportation\napplications and conduct a comprehensive review of current MLLM technologies in\nprevious studies. We highlight their effectiveness and limitations in object\ndetection within various transportation scenarios. The second fold involves\nproviding an overview of the taxonomy of end-to-end object detection in\ntransportation applications and future directions. Building on this, we\nproposed empirical analysis for testing MLLMs on three real-world\ntransportation problems that include object detection tasks namely, road safety\nattributes extraction, safety-critical event detection, and visual reasoning of\nthermal images. Our findings provide a detailed assessment of MLLM performance,\nuncovering both strengths and areas for improvement. Finally, we discuss\npractical limitations and challenges of MLLMs in enhancing object detection in\ntransportation, thereby offering a roadmap for future research and development\nin this critical area.", "comment": null, "links": []}
{"entry_id": "2409.18084", "title": "GSON: A Group-based Social Navigation Framework with Large Multimodal Model", "authors": ["Shangyi Luo", "Ji Zhu", "Peng Sun", "Yuhong Deng", "Cunjun Yu", "Anxing Xiao", "Xueqian Wang"], "published": "2024-09-26 17:27:15", "updated": "2024-09-26 17:27:15", "summary": "As the number of service robots and autonomous vehicles in human-centered\nenvironments grows, their requirements go beyond simply navigating to a\ndestination. They must also take into account dynamic social contexts and\nensure respect and comfort for others in shared spaces, which poses significant\nchallenges for perception and planning. In this paper, we present a group-based\nsocial navigation framework GSON to enable mobile robots to perceive and\nexploit the social group of their surroundings by leveling the visual reasoning\ncapability of the Large Multimodal Model (LMM). For perception, we apply visual\nprompting techniques to zero-shot extract the social relationship among\npedestrians and combine the result with a robust pedestrian detection and\ntracking pipeline to alleviate the problem of low inference speed of the LMM.\nGiven the perception result, the planning system is designed to avoid\ndisrupting the current social structure. We adopt a social structure-based\nmid-level planner as a bridge between global path planning and local motion\nplanning to preserve the global context and reactive response. The proposed\nmethod is validated on real-world mobile robot navigation tasks involving\ncomplex social structure understanding and reasoning. Experimental results\ndemonstrate the effectiveness of the system in these scenarios compared with\nseveral baselines.", "comment": null, "links": []}
{"entry_id": "2404.09486", "title": "MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems", "authors": ["Kaixin Li", "Yuchen Tian", "Qisheng Hu", "Ziyang Luo", "Zhiyong Huang", "Jing Ma"], "published": "2024-04-15 06:15:46", "updated": "2024-09-26 09:31:48", "summary": "Programming often involves converting detailed and complex specifications\ninto code, a process during which developers typically utilize visual aids to\nmore effectively convey concepts. While recent developments in Large Multimodal\nModels have demonstrated remarkable abilities in visual reasoning and\nmathematical tasks, there is little work on investigating whether these models\ncan effectively interpret visual elements for code generation. To this end, we\npresent MMCode, the first multi-modal coding dataset for evaluating algorithmic\nproblem-solving skills in visually rich contexts. MMCode contains 3,548\nquestions and 6,620 images collected from real-world programming challenges\nharvested from 10 code competition websites, presenting significant challenges\ndue to the extreme demand for reasoning abilities. Our experiment results show\nthat current state-of-the-art models struggle to solve these problems. The\nresults highlight the lack of powerful vision-code models, and we hope MMCode\ncan serve as an inspiration for future works in this domain. The data and code\nare publicly available at https://github.com/likaixin2000/MMCode.", "comment": "EMNLP 2024", "links": []}
{"entry_id": "2406.00307", "title": "HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model", "authors": ["Khoa Vo", "Thinh Phan", "Kashu Yamazaki", "Minh Tran", "Ngan Le"], "published": "2024-06-01 05:41:12", "updated": "2024-09-25 19:17:53", "summary": "Current video-language models (VLMs) rely extensively on instance-level\nalignment between video and language modalities, which presents two major\nlimitations: (1) visual reasoning disobeys the natural perception that humans\ndo in first-person perspective, leading to a lack of reasoning interpretation;\nand (2) learning is limited in capturing inherent fine-grained relationships\nbetween two modalities.\n  In this paper, we take an inspiration from human perception and explore a\ncompositional approach for egocentric video representation. We introduce HENASY\n(Hierarchical ENtities ASsemblY), which includes a spatiotemporal token\ngrouping mechanism to explicitly assemble dynamically evolving scene entities\nthrough time and model their relationship for video representation. By\nleveraging compositional structure understanding, HENASY possesses strong\ninterpretability via visual grounding with free-form text queries. We further\nexplore a suite of multi-grained contrastive losses to facilitate\nentity-centric understandings. This comprises three alignment types:\nvideo-narration, noun-entity, verb-entities alignments.\n  Our method demonstrates strong interpretability in both quantitative and\nqualitative experiments; while maintaining competitive performances on five\ndownstream tasks via zero-shot transfer or as video/text representation,\nincluding video/text retrieval, action recognition, multi-choice query, natural\nlanguage query, and moments query.", "comment": "Accepted in NeurIPS 2024", "links": []}
{"entry_id": "2409.12953", "title": "JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images", "authors": ["Zhecan Wang", "Junzhang Liu", "Chia-Wei Tang", "Hani Alomari", "Anushka Sivakumar", "Rui Sun", "Wenhao Li", "Md. Atabuzzaman", "Hammad Ayyubi", "Haoxuan You", "Alvi Ishmam", "Kai-Wei Chang", "Shih-Fu Chang", "Chris Thomas"], "published": "2024-09-19 17:58:16", "updated": "2024-09-25 01:46:10", "summary": "Existing vision-language understanding benchmarks largely consist of images\nof objects in their usual contexts. As a consequence, recent multimodal large\nlanguage models can perform well with only a shallow visual understanding by\nrelying on background language biases. Thus, strong performance on these\nbenchmarks does not necessarily correlate with strong visual understanding. In\nthis paper, we release JourneyBench, a comprehensive human-annotated benchmark\nof generated images designed to assess the model's fine-grained multimodal\nreasoning abilities across five tasks: complementary multimodal chain of\nthought, multi-image VQA, imaginary image captioning, VQA with hallucination\ntriggers, and fine-grained retrieval with sample-specific distractors. Unlike\nexisting benchmarks, JourneyBench explicitly requires fine-grained multimodal\nreasoning in unusual imaginary scenarios where language bias and holistic image\ngist are insufficient. We benchmark state-of-the-art models on JourneyBench and\nanalyze performance along a number of fine-grained dimensions. Results across\nall five tasks show that JourneyBench is exceptionally challenging for even the\nbest models, indicating that models' visual reasoning abilities are not as\nstrong as they first appear. We discuss the implications of our findings and\npropose avenues for further research.", "comment": null, "links": []}
{"entry_id": "2409.15505", "title": "Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs", "authors": ["Angelos Mavrogiannis", "Dehao Yuan", "Yiannis Aloimonos"], "published": "2024-09-23 19:50:33", "updated": "2024-09-23 19:50:33", "summary": "There has been a lot of interest in grounding natural language to physical\nentities through visual context. While Vision Language Models (VLMs) can ground\nlinguistic instructions to visual sensory information, they struggle with\ngrounding non-visual attributes, like the weight of an object. Our key insight\nis that non-visual attribute detection can be effectively achieved by active\nperception guided by visual reasoning. To this end, we present a\nperception-action programming API that consists of VLMs and Large Language\nModels (LLMs) as backbones, together with a set of robot control functions.\nWhen prompted with this API and a natural language query, an LLM generates a\nprogram to actively identify attributes given an input image. Offline testing\non the Odd-One-Out dataset demonstrates that our framework outperforms vanilla\nVLMs in detecting attributes like relative object location, size, and weight.\nOnline testing in realistic household scenes on AI2-THOR and a real robot\ndemonstration on a DJI RoboMaster EP robot highlight the efficacy of our\napproach.", "comment": null, "links": []}
{"entry_id": "2409.14750", "title": "FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension", "authors": ["Junzhuo Liu", "Xuzheng Yang", "Weiwei Li", "Peng Wang"], "published": "2024-09-23 06:56:51", "updated": "2024-09-23 06:56:51", "summary": "Referring Expression Comprehension (REC) is a crucial cross-modal task that\nobjectively evaluates the capabilities of language understanding, image\ncomprehension, and language-to-image grounding. Consequently, it serves as an\nideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit\nof this goal, we have established a new REC dataset characterized by two key\nfeatures: Firstly, it is designed with controllable varying levels of\ndifficulty, necessitating multi-level fine-grained reasoning across object\ncategories, attributes, and multi-hop relationships. Secondly, it includes\nnegative text and images created through fine-grained editing and generation\nbased on existing data, thereby testing the model's ability to correctly reject\nscenarios where the target object is not visible in the image--an essential\naspect often overlooked in existing datasets and approaches. Utilizing this\nhigh-quality dataset, we conducted comprehensive evaluations of both\nstate-of-the-art specialist models and MLLMs. Our findings indicate that there\nremains a significant gap in achieving satisfactory grounding performance. We\nanticipate that our dataset will inspire new approaches to enhance visual\nreasoning and develop more advanced cross-modal interaction strategies,\nultimately unlocking the full potential of MLLMs. Our code and the datasets are\navailable at https://github.com/liujunzhuo/FineCops-Ref.", "comment": "19 pages, EMNLP 2024", "links": []}
{"entry_id": "2402.14899", "title": "Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image", "authors": ["Zefeng Wang", "Zhen Han", "Shuo Chen", "Fan Xue", "Zifeng Ding", "Xun Xiao", "Volker Tresp", "Philip Torr", "Jindong Gu"], "published": "2024-02-22 17:36:34", "updated": "2024-09-22 14:46:30", "summary": "Multimodal LLMs (MLLMs) with a great ability of text and image understanding\nhave received great attention. To achieve better reasoning with MLLMs,\nChain-of-Thought (CoT) reasoning has been widely explored, which further\npromotes MLLMs' explainability by giving intermediate reasoning steps. Despite\nthe strong power demonstrated by MLLMs in multimodal reasoning, recent studies\nshow that MLLMs still suffer from adversarial images. This raises the following\nopen questions: Does CoT also enhance the adversarial robustness of MLLMs? What\ndo the intermediate reasoning steps of CoT entail under adversarial attacks? To\nanswer these questions, we first generalize existing attacks to CoT-based\ninferences by attacking the two main components, i.e., rationale and answer. We\nfind that CoT indeed improves MLLMs' adversarial robustness against the\nexisting attack methods by leveraging the multi-step reasoning process, but not\nsubstantially. Based on our findings, we further propose a novel attack method,\ntermed as stop-reasoning attack, that attacks the model while bypassing the CoT\nreasoning process. Experiments on three MLLMs and two visual reasoning datasets\nverify the effectiveness of our proposed method. We show that stop-reasoning\nattack can result in misled predictions and outperform baseline attacks by a\nsignificant margin.", "comment": null, "links": []}
{"entry_id": "2409.13980", "title": "Enhancing Advanced Visual Reasoning Ability of Large Language Models", "authors": ["Zhiyuan Li", "Dongnan Liu", "Chaoyi Zhang", "Heng Wang", "Tengfei Xue", "Weidong Cai"], "published": "2024-09-21 02:10:19", "updated": "2024-09-21 02:10:19", "summary": "Recent advancements in Vision-Language (VL) research have sparked new\nbenchmarks for complex visual reasoning, challenging models' advanced reasoning\nability. Traditional Vision-Language Models (VLMs) perform well in visual\nperception tasks while struggling with complex reasoning scenarios. Conversely,\nLarge Language Models (LLMs) demonstrate robust text reasoning capabilities;\nhowever, they lack visual acuity. To bridge this gap, we propose Complex Visual\nReasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual\nperception proficiency and LLMs' extensive reasoning capability. Unlike recent\nmultimodal large language models (MLLMs) that require a projection layer, our\napproach transforms images into detailed, context-aware descriptions using an\niterative self-refinement loop and leverages LLMs' text knowledge for accurate\npredictions without extra training. We also introduce a novel multi-modal\nin-context learning (ICL) methodology to enhance LLMs' contextual understanding\nand reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a\nstep-by-step comparison technique enabling contrasting various aspects of\npredictions. Our CVR-LLM presents the first comprehensive study across a wide\narray of complex visual reasoning tasks and achieves SOTA performance among\nall.", "comment": "EMNLP 2024 Main", "links": []}
{"entry_id": "2409.12878", "title": "Impact of ML Optimization Tactics on Greener Pre-Trained ML Models", "authors": ["Alexandra González Álvarez", "Joel Castaño", "Xavier Franch", "Silverio Martínez-Fernández"], "published": "2024-09-19 16:23:03", "updated": "2024-09-19 16:23:03", "summary": "Background: Given the fast-paced nature of today's technology, which has\nsurpassed human performance in tasks like image classification, visual\nreasoning, and English understanding, assessing the impact of Machine Learning\n(ML) on energy consumption is crucial. Traditionally, ML projects have\nprioritized accuracy over energy, creating a gap in energy consumption during\nmodel inference.\n  Aims: This study aims to (i) analyze image classification datasets and\npre-trained models, (ii) improve inference efficiency by comparing optimized\nand non-optimized models, and (iii) assess the economic impact of the\noptimizations.\n  Method: We conduct a controlled experiment to evaluate the impact of various\nPyTorch optimization techniques (dynamic quantization, torch.compile, local\npruning, and global pruning) to 42 Hugging Face models for image\nclassification. The metrics examined include GPU utilization, power and energy\nconsumption, accuracy, time, computational complexity, and economic costs. The\nmodels are repeatedly evaluated to quantify the effects of these software\nengineering tactics.\n  Results: Dynamic quantization demonstrates significant reductions in\ninference time and energy consumption, making it highly suitable for\nlarge-scale systems. Additionally, torch.compile balances accuracy and energy.\nIn contrast, local pruning shows no positive impact on performance, and global\npruning's longer optimization times significantly impact costs.\n  Conclusions: This study highlights the role of software engineering tactics\nin achieving greener ML models, offering guidelines for practitioners to make\ninformed decisions on optimization methods that align with sustainability\ngoals.", "comment": null, "links": []}
{"entry_id": "2409.05840", "title": "MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct", "authors": ["Run Luo", "Haonan Zhang", "Longze Chen", "Ting-En Lin", "Xiong Liu", "Yuchuan Wu", "Min Yang", "Minzheng Wang", "Pengpeng Zeng", "Lianli Gao", "Heng Tao Shen", "Yunshui Li", "Xiaobo Xia", "Fei Huang", "Jingkuan Song", "Yongbin Li"], "published": "2024-09-09 17:44:00", "updated": "2024-09-19 16:17:38", "summary": "The development of Multimodal Large Language Models (MLLMs) has seen\nsignificant advancements with increasing demands in various fields (e.g.,\nmultimodal agents, embodied intelligence). While model-driven approaches\nattempt to enhance MLLMs capabilities through diverse architectures, the gains\nhave become increasingly marginal. Conversely, data-driven methods, which scale\nup image-text instruction data, are more effective but face limited data\ndiversity and complexity challenges. The absence of high-quality data\nconstitutes a significant development barrier for MLLMs. To address the data\nquality bottleneck, we propose MMEvol, a novel multimodal instruction data\nevolution framework. This framework iteratively improve data quality through a\nrefined combination of fine-grained perception, cognitive reasoning, and\ninteraction evolution, generating a more complex and diverse image-text\ninstruction dataset that empowers MLLMs with enhanced capabilities. Beginning\nwith an initial set of instructions, SEED-163K, we utilize MMEvol to\nsystematically broaden the diversity of instruction types, extend visual\nreasoning steps to improve cognitive reasoning abilities, and thoroughly\nexplore fine-grained information within images to enhance visual understanding\nand robustness. To comprehensively evaluate the effectiveness of our approach,\nwe conduct extensive qualitative analysis and quantitative experiments across\n13 vision-language tasks. Compared to baseline models trained with the initial\nseed data, the results demonstrate that our method achieves an average accuracy\nimprovement of 3.1 percentage points. Furthermore, our approach reaches\nstate-of-the-art (SOTA) performance in nine tasks using significantly less data\ncompared to state-of-the-art models.", "comment": null, "links": []}
{"entry_id": "2409.08202", "title": "What Makes a Maze Look Like a Maze?", "authors": ["Joy Hsu", "Jiayuan Mao", "Joshua B. Tenenbaum", "Noah D. Goodman", "Jiajun Wu"], "published": "2024-09-12 16:41:47", "updated": "2024-09-12 16:41:47", "summary": "A unique aspect of human visual understanding is the ability to flexibly\ninterpret abstract concepts: acquiring lifted rules explaining what they\nsymbolize, grounding them across familiar and unfamiliar contexts, and making\npredictions or reasoning about them. While off-the-shelf vision-language models\nexcel at making literal interpretations of images (e.g., recognizing object\ncategories such as tree branches), they still struggle to make sense of such\nvisual abstractions (e.g., how an arrangement of tree branches may form the\nwalls of a maze). To address this challenge, we introduce Deep Schema Grounding\n(DSG), a framework that leverages explicit structured representations of visual\nabstractions for grounding and reasoning. At the core of DSG are\nschemas--dependency graph descriptions of abstract concepts that decompose them\ninto more primitive-level symbols. DSG uses large language models to extract\nschemas, then hierarchically grounds concrete to abstract components of the\nschema onto images with vision-language models. The grounded schema is used to\naugment visual abstraction understanding. We systematically evaluate DSG and\ndifferent methods in reasoning on our new Visual Abstractions Dataset, which\nconsists of diverse, real-world images of abstract concepts and corresponding\nquestion-answer pairs labeled by humans. We show that DSG significantly\nimproves the abstract visual reasoning performance of vision-language models,\nand is a step toward human-aligned understanding of visual abstractions.", "comment": null, "links": []}
{"entry_id": "2407.14133", "title": "I Know About \"Up\"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction", "authors": ["Zaiqiao Meng", "Hao Zhou", "Yifang Chen"], "published": "2024-07-19 09:03:30", "updated": "2024-09-12 11:17:46", "summary": "Visual Language Models (VLMs) are essential for various tasks, particularly\nvisual reasoning tasks, due to their robust multi-modal information\nintegration, visual reasoning capabilities, and contextual awareness. However,\nexisting \\VLMs{}' visual spatial reasoning capabilities are often inadequate,\nstruggling even with basic tasks such as distinguishing left from right. To\naddress this, we propose the \\ours{} model, designed to enhance the visual\nspatial reasoning abilities of VLMS. ZeroVLM employs Zero-1-to-3, a 3D\nreconstruction model for obtaining different views of the input images and\nincorporates a prompting mechanism to further improve visual spatial reasoning.\nExperimental results on four visual spatial reasoning datasets show that our\n\\ours{} achieves up to 19.48% accuracy improvement, which indicates the\neffectiveness of the 3D reconstruction and prompting mechanisms of our ZeroVLM.", "comment": null, "links": []}
{"entry_id": "2409.06638", "title": "Critical Features Tracking on Triangulated Irregular Networks by a Scale-Space Method", "authors": ["Haoan Feng", "Yunting Song", "Leila De Floriani"], "published": "2024-09-10 16:48:05", "updated": "2024-09-10 16:48:05", "summary": "The scale-space method is a well-established framework that constructs a\nhierarchical representation of an input signal and facilitates coarse-to-fine\nvisual reasoning. Considering the terrain elevation function as the input\nsignal, the scale-space method can identify and track significant topographic\nfeatures across different scales. The number of scales a feature persists,\ncalled its life span, indicates the importance of that feature. In this way,\nimportant topographic features of a landscape can be selected, which are useful\nfor many applications, including cartography, nautical charting, and land-use\nplanning. The scale-space methods developed for terrain data use gridded\nDigital Elevation Models (DEMs) to represent the terrain. However, gridded DEMs\nlack the flexibility to adapt to the irregular distribution of input data and\nthe varied topological complexity of different regions. Instead, Triangulated\nIrregular Networks (TINs) can be directly generated from irregularly\ndistributed point clouds and accurately preserve important features. In this\nwork, we introduce a novel scale-space analysis pipeline for TINs, addressing\nthe multiple challenges in extending grid-based scale-space methods to TINs.\nOur pipeline can efficiently identify and track topologically important\nfeatures on TINs. Moreover, it is capable of analyzing terrains with irregular\nboundaries, which poses challenges for grid-based methods. Comprehensive\nexperiments show that, compared to grid-based methods, our TIN-based pipeline\nis more efficient, accurate, and has better resolution robustness.", "comment": "13pages, ACM SIGSPATIAL 2024", "links": ["http://dx.doi.org/10.1145/3678717.3691218"]}
{"entry_id": "2311.18799", "title": "X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning", "authors": ["Artemis Panagopoulou", "Le Xue", "Ning Yu", "Junnan Li", "Dongxu Li", "Shafiq Joty", "Ran Xu", "Silvio Savarese", "Caiming Xiong", "Juan Carlos Niebles"], "published": "2023-11-30 18:43:51", "updated": "2024-09-09 16:00:04", "summary": "Recent research has achieved significant advancements in visual reasoning\ntasks through learning image-to-language projections and leveraging the\nimpressive reasoning abilities of Large Language Models (LLMs). This paper\nintroduces an efficient and effective framework that integrates multiple\nmodalities (images, 3D, audio and video) to a frozen LLM and demonstrates an\nemergent ability for cross-modal reasoning (2+ modality inputs). Our approach\nexplores two distinct projection mechanisms: Q-Formers and Linear Projections\n(LPs). Through extensive experimentation across all four modalities on 16\nbenchmarks, we explore both methods and assess their adaptability in integrated\nand separate cross-modal reasoning. The Q-Former projection demonstrates\nsuperior performance in single modality scenarios and adaptability in joint\nversus discriminative reasoning involving two or more modalities. However, it\nexhibits lower generalization capabilities than linear projection in contexts\nwhere task-modality data are limited. To enable this framework, we devise a\nscalable pipeline that automatically generates high-quality, instruction-tuning\ndatasets from readily available captioning data across different modalities,\nand contribute 24K QA data for audio and 250K QA data for 3D. To facilitate\nfurther research in cross-modal reasoning, we introduce the DisCRn\n(Discriminative Cross-modal Reasoning) benchmark comprising 9K audio-video QA\nsamples and 28K image-3D QA samples that require the model to reason\ndiscriminatively across disparate input modalities.", "comment": null, "links": []}
{"entry_id": "2407.02392", "title": "TokenPacker: Efficient Visual Projector for Multimodal LLM", "authors": ["Wentong Li", "Yuqian Yuan", "Jian Liu", "Dongqi Tang", "Song Wang", "Jie Qin", "Jianke Zhu", "Lei Zhang"], "published": "2024-07-02 16:10:55", "updated": "2024-08-28 08:49:57", "summary": "The visual projector serves as an essential bridge between the visual encoder\nand the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs\nadopt a simple MLP to preserve all visual contexts via one-to-one\ntransformation. However, the visual tokens are redundant and can be\nconsiderably increased when dealing with high-resolution images, impairing the\nefficiency of MLLMs significantly. Some recent works have introduced resampler\nor abstractor to reduce the number of resulting visual tokens. Unfortunately,\nthey fail to capture finer details and undermine the visual reasoning\ncapabilities of MLLMs. In this work, we propose a novel visual projector, which\nadopts a coarse-to-fine scheme to inject the enriched characteristics to\ngenerate the condensed visual tokens. In specific, we first interpolate the\nvisual features as a low-resolution point query, providing the overall visual\nrepresentation as the foundation. Then, we introduce a region-to-point\ninjection module that utilizes high-resolution, multi-level region-based cues\nas fine-grained reference keys and values, allowing them to be fully absorbed\nwithin the corresponding local context region. This step effectively updates\nthe coarse point query, transforming it into an enriched one for the subsequent\nLLM reasoning. Extensive experiments demonstrate that our approach compresses\nthe visual tokens by 75%~89%, while achieves comparable or even better\nperformance across diverse benchmarks with significantly higher efficiency. The\nsource codes can be found at https://github.com/CircleRadon/TokenPacker.", "comment": "16 pages, Codes:https://github.com/CircleRadon/TokenPacker", "links": []}
{"entry_id": "2409.00106", "title": "Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis", "authors": ["Aishik Nagar", "Shantanu Jaiswal", "Cheston Tan"], "published": "2024-08-27 14:43:54", "updated": "2024-08-27 14:43:54", "summary": "Vision-language models (VLMs) have shown impressive zero- and few-shot\nperformance on real-world visual question answering (VQA) benchmarks, alluding\nto their capabilities as visual reasoning engines. However, the benchmarks\nbeing used conflate \"pure\" visual reasoning with world knowledge, and also have\nquestions that involve a limited number of reasoning steps. Thus, it remains\nunclear whether a VLM's apparent visual reasoning performance is due to its\nworld knowledge, or due to actual visual reasoning capabilities.\n  To clarify this ambiguity, we systematically benchmark and dissect the\nzero-shot visual reasoning capabilities of VLMs through synthetic datasets that\nrequire minimal world knowledge, and allow for analysis over a broad range of\nreasoning steps. We focus on two novel aspects of zero-shot visual reasoning:\ni) evaluating the impact of conveying scene information as either visual\nembeddings or purely textual scene descriptions to the underlying large\nlanguage model (LLM) of the VLM, and ii) comparing the effectiveness of\nchain-of-thought prompting to standard prompting for zero-shot visual\nreasoning.\n  We find that the underlying LLMs, when provided textual scene descriptions,\nconsistently perform better compared to being provided visual embeddings. In\nparticular, 18% higher accuracy is achieved on the PTR dataset. We also find\nthat CoT prompting performs marginally better than standard prompting only for\nthe comparatively large GPT-3.5-Turbo (175B) model, and does worse for\nsmaller-scale models. This suggests the emergence of CoT abilities for visual\nreasoning in LLMs at larger scales even when world knowledge is limited.\nOverall, we find limitations in the abilities of VLMs and LLMs for more complex\nvisual reasoning, and highlight the important role that LLMs can play in visual\nreasoning.", "comment": "21 pages", "links": []}
{"entry_id": "2403.03190", "title": "Triple-CFN: Separating Concept and Feature Extraction Enhances Machine Abstract Reasoning Ability", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-03-05 18:29:17", "updated": "2024-08-23 10:30:55", "summary": "Visual abstract reasoning poses challenges to AI algorithms, requiring\ncognitive abilities beyond perception. For methodology, this study emphasizes\nthe need to separately extract concepts and features from visual abstract\nreasoning problems, employing the responses of features to concepts as elements\nin the reasoning process. For technology, we introduce the Cross-Feature\nNetwork (CFN), a framework that separately extracts concepts and features from\nreasoning problems, utilizing their responses as reasoning representations. The\nCFN integrates a dual Expectation-Maximization process to actively seek an\nideal concept space for problem-solving, yielding notable results despite\nlimitations in generalization tasks. To overcome these limitations, we propose\nthe Triple-CFN, enhancing feature extraction and demonstrating effectiveness in\nBongard-Logo and Raven's Progressive Matrices (RPM) problems. Additionally, we\npresent Meta Triple-CFN, which constructs a promising concept space for RPM,\nensuring high reasoning accuracy and concept interpretability. Furthermore, we\ndesign the Re-space layer, defining a clear feature space for (Meta)\nTriple-CFN, with its unique warm-start process aiding generalization. Overall,\nthis work advances machine intelligence through innovative network designs for\nabstract reasoning.", "comment": "13 pages, 15 figures, 7 tables", "links": []}
{"entry_id": "2406.12272", "title": "Slot State Space Models", "authors": ["Jindong Jiang", "Fei Deng", "Gautam Singh", "Minseung Lee", "Sungjin Ahn"], "published": "2024-06-18 04:59:14", "updated": "2024-08-21 20:54:33", "summary": "Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown\nremarkable computational benefits in long-range temporal dependency modeling.\nHowever, in many sequence modeling problems, the underlying process is\ninherently modular and it is of interest to have inductive biases that mimic\nthis modular structure. In this paper, we introduce SlotSSMs, a novel framework\nfor incorporating independent mechanisms into SSMs to preserve or encourage\nseparation of information. Unlike conventional SSMs that maintain a monolithic\nstate vector, SlotSSMs maintains the state as a collection of multiple vectors\ncalled slots. Crucially, the state transitions are performed independently per\nslot with sparse interactions across slots implemented via the bottleneck of\nself-attention. In experiments, we evaluate our model in object-centric video\nunderstanding, 3D visual reasoning, and video prediction tasks, which involve\nmodeling multiple objects and their long-range temporal dependencies. We find\nthat our proposed design offers substantial performance gains over existing\nsequence modeling methods. Project page is available at\nhttps://slotssms.github.io/", "comment": null, "links": []}
{"entry_id": "2408.08431", "title": "Multi-Modal Dialogue State Tracking for Playing GuessWhich Game", "authors": ["Wei Pang", "Ruixue Duan", "Jinfu Yang", "Ning Li"], "published": "2024-08-15 21:46:19", "updated": "2024-08-15 21:46:19", "summary": "GuessWhich is an engaging visual dialogue game that involves interaction\nbetween a Questioner Bot (QBot) and an Answer Bot (ABot) in the context of\nimage-guessing. In this game, QBot's objective is to locate a concealed image\nsolely through a series of visually related questions posed to ABot. However,\neffectively modeling visually related reasoning in QBot's decision-making\nprocess poses a significant challenge. Current approaches either lack visual\ninformation or rely on a single real image sampled at each round as decoding\ncontext, both of which are inadequate for visual reasoning. To address this\nlimitation, we propose a novel approach that focuses on visually related\nreasoning through the use of a mental model of the undisclosed image. Within\nthis framework, QBot learns to represent mental imagery, enabling robust visual\nreasoning by tracking the dialogue state. The dialogue state comprises a\ncollection of representations of mental imagery, as well as representations of\nthe entities involved in the conversation. At each round, QBot engages in\nvisually related reasoning using the dialogue state to construct an internal\nrepresentation, generate relevant questions, and update both the dialogue state\nand internal representation upon receiving an answer. Our experimental results\non the VisDial datasets (v0.5, 0.9, and 1.0) demonstrate the effectiveness of\nour proposed model, as it achieves new state-of-the-art performance across all\nmetrics and datasets, surpassing previous state-of-the-art models. Codes and\ndatasets from our experiments are freely available at\n\\href{https://github.com/xubuvd/GuessWhich}.", "comment": "Published at CICAI 2023 (CAAI-A), codes at\n  https://github.com/xubuvd/GuessWhich", "links": ["http://dx.doi.org/10.1007/978-981-99-8850-1_45"]}
{"entry_id": "2406.07546", "title": "Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?", "authors": ["Xingyu Fu", "Muyu He", "Yujie Lu", "William Yang Wang", "Dan Roth"], "published": "2024-06-11 17:59:48", "updated": "2024-08-12 19:33:52", "summary": "We present a novel task and benchmark for evaluating the ability of\ntext-to-image(T2I) generation models to produce images that align with\ncommonsense in real life, which we call Commonsense-T2I. Given two adversarial\ntext prompts containing an identical set of action words with minor\ndifferences, such as \"a lightbulb without electricity\" v.s. \"a lightbulb with\nelectricity\", we evaluate whether T2I models can conduct visual-commonsense\nreasoning, e.g. produce images that fit \"the lightbulb is unlit\" vs. \"the\nlightbulb is lit\" correspondingly. Commonsense-T2I presents an adversarial\nchallenge, providing pairwise text prompts along with expected outputs. The\ndataset is carefully hand-curated by experts and annotated with fine-grained\nlabels, such as commonsense type and likelihood of the expected outputs, to\nassist analyzing model behavior. We benchmark a variety of state-of-the-art\n(sota) T2I models and surprisingly find that, there is still a large gap\nbetween image synthesis and real life photos--even the DALL-E 3 model could\nonly achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only\nachieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot\nsolve this challenge, and we include a detailed analysis about possible reasons\nfor such deficiency. We aim for Commonsense-T2I to serve as a high-quality\nevaluation benchmark for T2I commonsense checking, fostering advancements in\nreal life image generation.", "comment": "COLM 2024, Project Url: https://zeyofu.github.io/CommonsenseT2I/", "links": []}
{"entry_id": "2408.05924", "title": "Adapting a Foundation Model for Space-based Tasks", "authors": ["Matthew Foutter", "Praneet Bhoj", "Rohan Sinha", "Amine Elhafsi", "Somrita Banerjee", "Christopher Agia", "Justin Kruger", "Tommaso Guffanti", "Daniele Gammelli", "Simone D'Amico", "Marco Pavone"], "published": "2024-08-12 05:07:24", "updated": "2024-08-12 05:07:24", "summary": "Foundation models, e.g., large language models, possess attributes of\nintelligence which offer promise to endow a robot with the contextual\nunderstanding necessary to navigate complex, unstructured tasks in the wild. In\nthe future of space robotics, we see three core challenges which motivate the\nuse of a foundation model adapted to space-based applications: 1) Scalability\nof ground-in-the-loop operations; 2) Generalizing prior knowledge to novel\nenvironments; and 3) Multi-modality in tasks and sensor data. Therefore, as a\nfirst-step towards building a foundation model for space-based applications, we\nautomatically label the AI4Mars dataset to curate a language annotated dataset\nof visual-question-answer tuples. We fine-tune a pretrained LLaVA checkpoint on\nthis dataset to endow a vision-language model with the ability to perform\nspatial reasoning and navigation on Mars' surface. In this work, we demonstrate\nthat 1) existing vision-language models are deficient visual reasoners in\nspace-based applications, and 2) fine-tuning a vision-language model on\nextraterrestrial data significantly improves the quality of responses even with\na limited training dataset of only a few thousand samples.", "comment": null, "links": []}
{"entry_id": "2308.03729", "title": "TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models", "authors": ["Wenqi Shao", "Meng Lei", "Yutao Hu", "Peng Gao", "Kaipeng Zhang", "Fanqing Meng", "Peng Xu", "Siyuan Huang", "Hongsheng Li", "Yu Qiao", "Ping Luo"], "published": "2023-08-07 17:17:05", "updated": "2024-08-10 08:51:58", "summary": "Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated\nsignificant progress in tackling complex multimodal tasks. Among these\ncutting-edge developments, Google's Bard stands out for its remarkable\nmultimodal capabilities, promoting comprehensive comprehension and reasoning\nacross various domains. This work presents an early and holistic evaluation of\nLVLMs' multimodal abilities, with a particular focus on Bard, by proposing a\nlightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the\nvanilla version, Tiny LVLM-eHub possesses several appealing properties.\nFirstly, it provides a systematic assessment of six categories of multimodal\ncapabilities, including visual perception, visual knowledge acquisition, visual\nreasoning, visual commonsense, object hallucination, and embodied intelligence,\nthrough quantitative evaluation of $42$ standard text-related visual\nbenchmarks. Secondly, it conducts an in-depth analysis of LVLMs' predictions\nusing the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and\naccurate evaluation and exhibits improved alignment with human evaluation\ncompared to the word matching approach. Thirdly, it comprises a mere $2.1$K\nimage-text pairs, facilitating ease of use for practitioners to evaluate their\nown offline LVLMs. Through extensive experimental analysis, this study\ndemonstrates that Bard outperforms previous LVLMs in most multimodal\ncapabilities except object hallucination, to which Bard is still susceptible.\nTiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages\ninnovative strategies aimed at advancing multimodal techniques. Our project is\npublicly available at \\url{https://github.com/OpenGVLab/Multi-Modality-Arena}.", "comment": "accepted to IEEE Transactions on Big Data. Project Page:\n  http://lvlm-ehub.opengvlab.com/", "links": []}
{"entry_id": "2408.04810", "title": "UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling", "authors": ["Haider Al-Tahan", "Quentin Garrido", "Randall Balestriero", "Diane Bouchacourt", "Caner Hazirbas", "Mark Ibrahim"], "published": "2024-08-09 01:41:05", "updated": "2024-08-09 01:41:05", "summary": "Significant research efforts have been made to scale and improve\nvision-language model (VLM) training approaches. Yet, with an ever-growing\nnumber of benchmarks, researchers are tasked with the heavy burden of\nimplementing each protocol, bearing a non-trivial computational cost, and\nmaking sense of how all these benchmarks translate into meaningful axes of\nprogress. To facilitate a systematic evaluation of VLM progress, we introduce\nUniBench: a unified implementation of 50+ VLM benchmarks spanning a\ncomprehensive range of carefully categorized capabilities from object\nrecognition to spatial awareness, counting, and much more. We showcase the\nutility of UniBench for measuring progress by evaluating nearly 60 publicly\navailable vision-language models, trained on scales of up to 12.8B samples. We\nfind that while scaling training data or model size can boost many\nvision-language model capabilities, scaling offers little benefit for reasoning\nor relations. Surprisingly, we also discover today's best VLMs struggle on\nsimple digit recognition and counting tasks, e.g. MNIST, which much simpler\nnetworks can solve. Where scale falls short, we find that more precise\ninterventions, such as data quality or tailored-learning objectives offer more\npromise. For practitioners, we also offer guidance on selecting a suitable VLM\nfor a given application. Finally, we release an easy-to-run UniBench code-base\nwith the full set of 50+ benchmarks and comparisons across 59 models as well as\na distilled, representative set of benchmarks that runs in 5 minutes on a\nsingle GPU.", "comment": null, "links": []}
{"entry_id": "2303.10428", "title": "RCA: Region Conditioned Adaptation for Visual Abductive Reasoning", "authors": ["Hao Zhang", "Yeo Keat Ee", "Basura Fernando"], "published": "2023-03-18 14:46:44", "updated": "2024-08-07 13:44:06", "summary": "Visual abductive reasoning aims to make likely explanations for visual\nobservations. We propose a simple yet effective Region Conditioned Adaptation,\na hybrid parameter-efficient fine-tuning method that equips the frozen CLIP\nwith the ability to infer explanations from local visual cues. We encode\n``local hints'' and ``global contexts'' into visual prompts of the CLIP model\nseparately at fine and coarse-grained levels. Adapters are used for fine-tuning\nCLIP models for downstream tasks and we design a new attention adapter, that\ndirectly steers the focus of the attention map with trainable query and key\nprojections of a frozen CLIP model. Finally, we train our new model with a\nmodified contrastive loss to regress the visual feature simultaneously toward\nfeatures of literal description and plausible explanations. The loss enables\nCLIP to maintain both perception and reasoning abilities. Experiments on the\nSherlock visual abductive reasoning benchmark show that the RCA significantly\noutstands previous SOTAs, ranking the \\nth{1} on the leaderboards (e.g., Human\nAcc: RCA 31.74 \\textit{vs} CPT-CLIP 29.58, higher =better). We also validate\nthe RCA is generalizable to local perception benchmarks like RefCOCO. We\nopen-source our project at\n\\textit{\\color{magenta}{\\url{https://github.com/LUNAProject22/RPA}}}.", "comment": "13 pages, 11 figures, ACM Multimedia 2024", "links": []}
{"entry_id": "2408.02882", "title": "Compromising Embodied Agents with Contextual Backdoor Attacks", "authors": ["Aishan Liu", "Yuguang Zhou", "Xianglong Liu", "Tianyuan Zhang", "Siyuan Liang", "Jiakai Wang", "Yanjun Pu", "Tianlin Li", "Junqi Zhang", "Wenbo Zhou", "Qing Guo", "Dacheng Tao"], "published": "2024-08-06 01:20:12", "updated": "2024-08-06 01:20:12", "summary": "Large language models (LLMs) have transformed the development of embodied\nintelligence. By providing a few contextual demonstrations, developers can\nutilize the extensive internal knowledge of LLMs to effortlessly translate\ncomplex tasks described in abstract language into sequences of code snippets,\nwhich will serve as the execution logic for embodied agents. However, this\npaper uncovers a significant backdoor security threat within this process and\nintroduces a novel method called \\method{}. By poisoning just a few contextual\ndemonstrations, attackers can covertly compromise the contextual environment of\na black-box LLM, prompting it to generate programs with context-dependent\ndefects. These programs appear logically sound but contain defects that can\nactivate and induce unintended behaviors when the operational agent encounters\nspecific triggers in its interactive environment. To compromise the LLM's\ncontextual environment, we employ adversarial in-context generation to optimize\npoisoned demonstrations, where an LLM judge evaluates these poisoned prompts,\nreporting to an additional LLM that iteratively optimizes the demonstration in\na two-player adversarial game using chain-of-thought reasoning. To enable\ncontext-dependent behaviors in downstream agents, we implement a dual-modality\nactivation strategy that controls both the generation and execution of program\ndefects through textual and visual triggers. We expand the scope of our attack\nby developing five program defect modes that compromise key aspects of\nconfidentiality, integrity, and availability in embodied agents. To validate\nthe effectiveness of our approach, we conducted extensive experiments across\nvarious tasks, including robot planning, robot manipulation, and compositional\nvisual reasoning. Additionally, we demonstrate the potential impact of our\napproach by successfully attacking real-world autonomous driving systems.", "comment": null, "links": []}
{"entry_id": "2408.02210", "title": "ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning", "authors": ["Yuxuan Wang", "Alan Yuille", "Zhuowan Li", "Zilong Zheng"], "published": "2024-08-05 03:22:10", "updated": "2024-08-05 03:22:10", "summary": "Compositional visual reasoning methods, which translate a complex query into\na structured composition of feasible visual tasks, have exhibited a strong\npotential in complicated multi-modal tasks. Empowered by recent advances in\nlarge language models (LLMs), this multi-modal challenge has been brought to a\nnew stage by treating LLMs as few-shot/zero-shot planners, i.e.,\nvision-language (VL) programming. Such methods, despite their numerous merits,\nsuffer from challenges due to LLM planning mistakes or inaccuracy of visual\nexecution modules, lagging behind the non-compositional models. In this work,\nwe devise a \"plug-and-play\" method, ExoViP, to correct errors in both the\nplanning and execution stages through introspective verification. We employ\nverification modules as \"exoskeletons\" to enhance current VL programming\nschemes. Specifically, our proposed verification module utilizes a mixture of\nthree sub-verifiers to validate predictions after each reasoning step,\nsubsequently calibrating the visual module predictions and refining the\nreasoning trace planned by LLMs. Experimental results on two representative VL\nprogramming methods showcase consistent improvements on five compositional\nreasoning tasks on standard benchmarks. In light of this, we believe that\nExoViP can foster better performance and generalization on open-domain\nmulti-modal challenges.", "comment": "To Appear at COLM 2024", "links": []}
{"entry_id": "2404.17672", "title": "BlenderAlchemy: Editing 3D Graphics with Vision-Language Models", "authors": ["Ian Huang", "Guandao Yang", "Leonidas Guibas"], "published": "2024-04-26 19:37:13", "updated": "2024-08-02 21:33:21", "summary": "Graphics design is important for various applications, including movie\nproduction and game design. To create a high-quality scene, designers usually\nneed to spend hours in software like Blender, in which they might need to\ninterleave and repeat operations, such as connecting material nodes, hundreds\nof times. Moreover, slightly different design goals may require completely\ndifferent sequences, making automation difficult. In this paper, we propose a\nsystem that leverages Vision-Language Models (VLMs), like GPT-4V, to\nintelligently search the design action space to arrive at an answer that can\nsatisfy a user's intent. Specifically, we design a vision-based edit generator\nand state evaluator to work together to find the correct sequence of actions to\nachieve the goal. Inspired by the role of visual imagination in the human\ndesign process, we supplement the visual reasoning capabilities of VLMs with\n\"imagined\" reference images from image-generation models, providing visual\ngrounding of abstract language descriptions. In this paper, we provide\nempirical evidence suggesting our system can produce simple but tedious Blender\nediting sequences for tasks such as editing procedural materials and geometry\nfrom text and/or reference images, as well as adjusting lighting configurations\nfor product renderings in complex scenes.", "comment": null, "links": []}
{"entry_id": "2407.21438", "title": "A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap", "authors": ["Lijun Zhang", "Wei Suo", "Peng Wang", "Yanning Zhang"], "published": "2024-07-31 08:42:48", "updated": "2024-07-31 08:42:48", "summary": "Human-object interactions (HOI) detection aims at capturing human-object\npairs in images and corresponding actions. It is an important step toward\nhigh-level visual reasoning and scene understanding. However, due to the\nnatural bias from the real world, existing methods mostly struggle with rare\nhuman-object pairs and lead to sub-optimal results. Recently, with the\ndevelopment of the generative model, a straightforward approach is to construct\na more balanced dataset based on a group of supplementary samples.\nUnfortunately, there is a significant domain gap between the generated data and\nthe original data, and simply merging the generated images into the original\ndataset cannot significantly boost the performance. To alleviate the above\nproblem, we present a novel model-agnostic framework called\n\\textbf{C}ontext-\\textbf{E}nhanced \\textbf{F}eature \\textbf{A}lignment (CEFA)\nmodule, which can effectively align the generated data with the original data\nat the feature level and bridge the domain gap. Specifically, CEFA consists of\na feature alignment module and a context enhancement module. On one hand,\nconsidering the crucial role of human-object pairs information in HOI tasks,\nthe feature alignment module aligns the human-object pairs by aggregating\ninstance information. On the other hand, to mitigate the issue of losing\nimportant context information caused by the traditional discriminator-style\nalignment method, we employ a context-enhanced image reconstruction module to\nimprove the model's learning ability of contextual cues. Extensive experiments\nhave shown that our method can serve as a plug-and-play module to improve the\ndetection performance of HOI models on rare\ncategories\\footnote{https://github.com/LijunZhang01/CEFA}.", "comment": null, "links": []}
{"entry_id": "2407.21333", "title": "Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM", "authors": ["Can Wang", "Hongliang Zhong", "Menglei Chai", "Mingming He", "Dongdong Chen", "Jing Liao"], "published": "2024-07-31 04:49:46", "updated": "2024-07-31 04:49:46", "summary": "Automatic furniture layout is long desired for convenient interior design.\nLeveraging the remarkable visual reasoning capabilities of multimodal large\nlanguage models (MLLMs), recent methods address layout generation in a static\nmanner, lacking the feedback-driven refinement essential for interactive user\nengagement. We introduce Chat2Layout, a novel interactive furniture layout\ngeneration system that extends the functionality of MLLMs into the realm of\ninteractive layout design. To achieve this, we establish a unified\nvision-question paradigm for in-context learning, enabling seamless\ncommunication with MLLMs to steer their behavior without altering model\nweights. Within this framework, we present a novel training-free visual\nprompting mechanism. This involves a visual-text prompting technique that\nassist MLLMs in reasoning about plausible layout plans, followed by an\nOffline-to-Online search (O2O-Search) method, which automatically identifies\nthe minimal set of informative references to provide exemplars for visual-text\nprompting. By employing an agent system with MLLMs as the core controller, we\nenable bidirectional interaction. The agent not only comprehends the 3D\nenvironment and user requirements through linguistic and visual perception but\nalso plans tasks and reasons about actions to generate and arrange furniture\nwithin the virtual space. Furthermore, the agent iteratively updates based on\nvisual feedback from execution results. Experimental results demonstrate that\nour approach facilitates language-interactive generation and arrangement for\ndiverse and complex 3D furniture.", "comment": "Main paper with supplemental materials", "links": []}
{"entry_id": "2407.20563", "title": "Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering", "authors": ["Ruoyue Shen", "Nakamasa Inoue", "Koichi Shinoda"], "published": "2024-07-30 05:36:43", "updated": "2024-07-30 05:36:43", "summary": "Visual question answering (VQA) is the task of providing accurate answers to\nnatural language questions based on visual input. Programmatic VQA (PVQA)\nmodels have been gaining attention recently. These use large language models\n(LLMs) to formulate executable programs that address questions requiring\ncomplex visual reasoning. However, there are challenges in enabling LLMs to\ncomprehend the usage of image processing modules and generate relevant code. To\novercome these challenges, this paper introduces PyramidCoder, a novel\nprompting framework for PVQA models. PyramidCoder consists of three\nhierarchical levels, each serving a distinct purpose: query rephrasing, code\ngeneration, and answer aggregation. Notably, PyramidCoder utilizes a single\nfrozen LLM and pre-defined prompts at each level, eliminating the need for\nadditional training and ensuring flexibility across various LLM architectures.\nCompared to the state-of-the-art PVQA model, our approach improves accuracy by\nat least 0.5% on the GQA dataset, 1.4% on the VQAv2 dataset, and 2.9% on the\nNLVR2 dataset.", "comment": "Accepted to the IEEE International Conference on Image Processing\n  (IEEE ICIP) 2024", "links": []}
{"entry_id": "2407.19666", "title": "Take A Step Back: Rethinking the Two Stages in Visual Reasoning", "authors": ["Mingyu Zhang", "Jiting Cai", "Mingyu Liu", "Yue Xu", "Cewu Lu", "Yong-Lu Li"], "published": "2024-07-29 02:56:19", "updated": "2024-07-29 02:56:19", "summary": "Visual reasoning, as a prominent research area, plays a crucial role in AI by\nfacilitating concept formation and interaction with the world. However, current\nworks are usually carried out separately on small datasets thus lacking\ngeneralization ability. Through rigorous evaluation of diverse benchmarks, we\ndemonstrate the shortcomings of existing ad-hoc methods in achieving\ncross-domain reasoning and their tendency to data bias fitting. In this paper,\nwe revisit visual reasoning with a two-stage perspective: (1) symbolization and\n(2) logical reasoning given symbols or their representations. We find that the\nreasoning stage is better at generalization than symbolization. Thus, it is\nmore efficient to implement symbolization via separated encoders for different\ndata domains while using a shared reasoner. Given our findings, we establish\ndesign principles for visual reasoning frameworks following the separated\nsymbolization and shared reasoning. The proposed two-stage framework achieves\nimpressive generalization ability on various visual reasoning tasks, including\npuzzles, physical prediction, and visual question answering (VQA), encompassing\nboth 2D and 3D modalities. We believe our insights will pave the way for\ngeneralizable visual reasoning.", "comment": "ECCV 2024, Project page:\n  https://mybearyzhang.github.io/projects/TwoStageReason/", "links": []}
{"entry_id": "2407.17791", "title": "Investigating learning-independent abstract reasoning in artificial neural networks", "authors": ["Tomer Barak", "Yonatan Loewenstein"], "published": "2024-07-25 05:58:58", "updated": "2024-07-25 05:58:58", "summary": "Humans are capable of solving complex abstract reasoning tests. Whether this\nability reflects a learning-independent inference mechanism applicable to any\nnovel unlearned problem or whether it is a manifestation of extensive training\nthroughout life is an open question. Addressing this question in humans is\nchallenging because it is impossible to control their prior training. However,\nassuming a similarity between the cognitive processing of Artificial Neural\nNetworks (ANNs) and humans, the extent to which training is required for ANNs'\nabstract reasoning is informative about this question in humans. Previous\nstudies demonstrated that ANNs can solve abstract reasoning tests. However,\nthis success required extensive training. In this study, we examined the\nlearning-independent abstract reasoning of ANNs. Specifically, we evaluated\ntheir performance without any pretraining, with the ANNs' weights being\nrandomly-initialized, and only change in the process of problem solving. We\nfound that naive ANN models can solve non-trivial visual reasoning tests,\nsimilar to those used to evaluate human learning-independent reasoning. We\nfurther studied the mechanisms that support this ability. Our results suggest\nthe possibility of learning-independent abstract reasoning that does not\nrequire extensive training.", "comment": null, "links": []}
{"entry_id": "2407.17773", "title": "KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models", "authors": ["Eunice Yiu", "Maan Qraitem", "Charlie Wong", "Anisa Noor Majhi", "Yutong Bai", "Shiry Ginosar", "Alison Gopnik", "Kate Saenko"], "published": "2024-07-25 05:02:39", "updated": "2024-07-25 05:02:39", "summary": "This paper investigates visual analogical reasoning in large multimodal\nmodels (LMMs) compared to human adults and children. A \"visual analogy\" is an\nabstract rule inferred from one image and applied to another. While benchmarks\nexist for testing visual reasoning in LMMs, they require advanced skills and\nomit basic visual analogies that even young children can make. Inspired by\ndevelopmental psychology, we propose a new benchmark of 1,400 visual\ntransformations of everyday objects to test LMMs on visual analogical reasoning\nand compare them to children and adults. We structure the evaluation into three\nstages: identifying what changed (e.g., color, number, etc.), how it changed\n(e.g., added one object), and applying the rule to new scenarios. Our findings\nshow that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the \"what\"\neffectively, they struggle with quantifying the \"how\" and extrapolating this\nrule to new objects. In contrast, children and adults exhibit much stronger\nanalogical reasoning at all three stages. Additionally, the strongest tested\nmodel, GPT-4V, performs better in tasks involving simple visual attributes like\ncolor and size, correlating with quicker human adult response times.\nConversely, more complex tasks such as number, rotation, and reflection, which\nnecessitate extensive cognitive processing and understanding of the 3D physical\nworld, present more significant challenges. Altogether, these findings\nhighlight the limitations of training models on data that primarily consists of\n2D images and text.", "comment": "9 pages. For the KiVA benchmark, see https://github.com/ey242/KiVA", "links": []}
{"entry_id": "2403.16921", "title": "PropTest: Automatic Property Testing for Improved Visual Programming", "authors": ["Jaywon Koo", "Ziyan Yang", "Paola Cascante-Bonilla", "Baishakhi Ray", "Vicente Ordonez"], "published": "2024-03-25 16:39:15", "updated": "2024-07-22 23:21:33", "summary": "Visual Programming has recently emerged as an alternative to end-to-end\nblack-box visual reasoning models. This type of method leverages Large Language\nModels (LLMs) to generate the source code for an executable computer program\nthat solves a given problem. This strategy has the advantage of offering an\ninterpretable reasoning path and does not require finetuning a model with\ntask-specific data. We propose PropTest, a general strategy that improves\nvisual programming by further using an LLM to generate code that tests for\nvisual properties in an initial round of proposed solutions. Our method\ngenerates tests for data-type consistency, output syntax, and semantic\nproperties. PropTest achieves comparable results to state-of-the-art methods\nwhile using publicly available LLMs. This is demonstrated across different\nbenchmarks on visual question answering and referring expression comprehension.\nParticularly, PropTest improves ViperGPT by obtaining 46.1\\% accuracy (+6.0\\%)\non GQA using Llama3-8B and 59.5\\% (+8.1\\%) on RefCOCO+ using CodeLlama-34B.", "comment": "Project Page: https://jaywonkoo17.github.io/PropTest/", "links": []}
{"entry_id": "2403.12884", "title": "HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning", "authors": ["Fucai Ke", "Zhixi Cai", "Simindokht Jahangard", "Weiqing Wang", "Pari Delir Haghighi", "Hamid Rezatofighi"], "published": "2024-03-19 16:31:30", "updated": "2024-07-21 08:48:55", "summary": "Recent advances in visual reasoning (VR), particularly with the aid of Large\nVision-Language Models (VLMs), show promise but require access to large-scale\ndatasets and face challenges such as high computational costs and limited\ngeneralization capabilities. Compositional visual reasoning approaches have\nemerged as effective strategies; however, they heavily rely on the commonsense\nknowledge encoded in Large Language Models (LLMs) to perform planning,\nreasoning, or both, without considering the effect of their decisions on the\nvisual reasoning process, which can lead to errors or failed procedures. To\naddress these challenges, we introduce HYDRA, a multi-stage dynamic\ncompositional visual reasoning framework designed for reliable and\nincrementally progressive general reasoning. HYDRA integrates three essential\nmodules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive\ncontroller, and a reasoner. The planner and reasoner modules utilize an LLM to\ngenerate instruction samples and executable code from the selected instruction,\nrespectively, while the RL agent dynamically interacts with these modules,\nmaking high-level decisions on selection of the best instruction sample given\ninformation from the historical state stored through a feedback loop. This\nadaptable design enables HYDRA to adjust its actions based on previous feedback\nreceived during the reasoning process, leading to more reliable reasoning\noutputs and ultimately enhancing its overall effectiveness. Our framework\ndemonstrates state-of-the-art performance in various VR tasks on four different\nwidely-used datasets.", "comment": "Accepted by ECCV2024. Project page: https://hydra-vl4ai.github.io/", "links": []}
{"entry_id": "2407.14834", "title": "Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators", "authors": ["Harsh Lunia"], "published": "2024-07-20 10:26:28", "updated": "2024-07-20 10:26:28", "summary": "Recent advancements have introduced multiple vision-language models (VLMs)\ndemonstrating impressive commonsense reasoning across various domains. Despite\ntheir individual capabilities, the potential of synergizing these complementary\nVLMs remains underexplored. The Cola Framework addresses this by showcasing how\na large language model (LLM) can efficiently coordinate multiple VLMs through\nnatural language communication, leveraging their distinct strengths. We have\nverified this claim on the challenging A-OKVQA dataset, confirming the\neffectiveness of such coordination. Building on this, our study investigates\nwhether the same methodology can be applied to surveillance videos for action\nrecognition. Specifically, we explore if leveraging the combined knowledge base\nof VLMs and LLM can effectively deduce actions from a video when presented with\nonly a few selectively important frames and minimal temporal information. Our\nexperiments demonstrate that LLM, when coordinating different VLMs, can\nsuccessfully recognize patterns and deduce actions in various scenarios despite\nthe weak temporal signals. However, our findings suggest that to enhance this\napproach as a viable alternative solution, integrating a stronger temporal\nsignal and exposing the models to slightly more frames would be beneficial.", "comment": "LLMs, VLMs, Action Recognition", "links": []}
{"entry_id": "2407.13851", "title": "X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs", "authors": ["Sirnam Swetha", "Jinyu Yang", "Tal Neiman", "Mamshad Nayeem Rizve", "Son Tran", "Benjamin Yao", "Trishul Chilimbi", "Mubarak Shah"], "published": "2024-07-18 18:39:54", "updated": "2024-07-18 18:39:54", "summary": "Recent advancements in Multimodal Large Language Models (MLLMs) have\nrevolutionized the field of vision-language understanding by integrating visual\nperception capabilities into Large Language Models (LLMs). The prevailing trend\nin this field involves the utilization of a vision encoder derived from\nvision-language contrastive learning (CL), showing expertise in capturing\noverall representations while facing difficulties in capturing detailed local\npatterns. In this work, we focus on enhancing the visual representations for\nMLLMs by combining high-frequency and detailed visual representations, obtained\nthrough masked image modeling (MIM), with semantically-enriched low-frequency\nrepresentations captured by CL. To achieve this goal, we introduce X-Former\nwhich is a lightweight transformer module designed to exploit the complementary\nstrengths of CL and MIM through an innovative interaction mechanism.\nSpecifically, X-Former first bootstraps vision-language representation learning\nand multimodal-to-multimodal generative learning from two frozen vision\nencoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further\nbootstraps vision-to-language generative learning from a frozen LLM to ensure\nvisual features from X-Former can be interpreted by the LLM. To demonstrate the\neffectiveness of our approach, we assess its performance on tasks demanding\ndetailed visual understanding. Extensive evaluations indicate that X-Former\nexcels in visual reasoning tasks involving both structural and semantic\ncategories in the GQA dataset. Assessment on fine-grained visual perception\nbenchmark further confirms its superior capabilities in visual understanding.", "comment": "Accepted at ECCV2024", "links": []}
{"entry_id": "2407.13382", "title": "Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols", "authors": ["Gertjan Burghouts", "Fieke Hillerström", "Erwin Walraven", "Michael van Bekkum", "Frank Ruis", "Joris Sijs", "Jelle van Mil", "Judith Dijk"], "published": "2024-07-18 10:40:22", "updated": "2024-07-18 10:40:22", "summary": "We consider the problem of finding spatial configurations of multiple objects\nin images, e.g., a mobile inspection robot is tasked to localize abandoned\ntools on the floor. We define the spatial configuration of objects by\nfirst-order logic in terms of relations and attributes. A neuro-symbolic\nprogram matches the logic formulas to probabilistic object proposals for the\ngiven image, provided by language-vision models by querying them for the\nsymbols. This work is the first to combine neuro-symbolic programming\n(reasoning) and language-vision models (learning) to find spatial\nconfigurations of objects in images in an open world setting. We show the\neffectiveness by finding abandoned tools on floors and leaking pipes. We find\nthat most prediction errors are due to biases in the language-vision model.", "comment": "12 pages", "links": []}
{"entry_id": "2401.13311", "title": "ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models", "authors": ["Rohan Wadhawan", "Hritik Bansal", "Kai-Wei Chang", "Nanyun Peng"], "published": "2024-01-24 09:07:11", "updated": "2024-07-16 03:36:29", "summary": "Many real-world tasks require an agent to reason jointly over text and visual\nobjects, (e.g., navigating in public spaces), which we refer to as\ncontext-sensitive text-rich visual reasoning. Specifically, these tasks require\nan understanding of the context in which the text interacts with visual\nelements within an image. However, there is a lack of existing datasets to\nbenchmark the state-of-the-art multimodal models' capability on\ncontext-sensitive text-rich visual reasoning. In this paper, we introduce\nConTextual, a novel dataset featuring human-crafted instructions that require\ncontext-sensitive reasoning for text-rich images. We conduct experiments to\nassess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision,\nLLaVA-Next) and establish a human performance baseline. Further, we perform\nhuman evaluations of the model responses and observe a significant performance\ngap of 30.8% between GPT-4V (the current best-performing Large Multimodal\nModel) and human performance. Our fine-grained analysis reveals that GPT-4V\nencounters difficulties interpreting time-related data and infographics.\nHowever, it demonstrates proficiency in comprehending abstract visual contexts\nsuch as memes and quotes. Finally, our qualitative analysis uncovers various\nfactors contributing to poor performance including lack of precise visual\nperception and hallucinations. Our dataset, code, and leaderboard can be found\non the project page https://con-textual.github.io/", "comment": null, "links": []}
{"entry_id": "2407.10380", "title": "NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models", "authors": ["Pranshu Pandya", "Agney S Talwarr", "Vatsal Gupta", "Tushar Kataria", "Vivek Gupta", "Dan Roth"], "published": "2024-07-15 01:21:56", "updated": "2024-07-15 01:21:56", "summary": "Cognitive textual and visual reasoning tasks, such as puzzles, series, and\nanalogies, demand the ability to quickly reason, decipher, and evaluate\npatterns both textually and spatially. While LLMs and VLMs, through extensive\ntraining on large amounts of human-curated data, have attained a high level of\npseudo-human intelligence in some common sense reasoning tasks, they still\nstruggle with more complex reasoning tasks that require cognitive\nunderstanding. In this work, we introduce a new dataset, NTSEBench, designed to\nevaluate the cognitive multi-modal reasoning and problem-solving skills of\nlarge models. The dataset comprises 2,728 multiple-choice questions comprising\nof a total of 4,642 images across 26 categories sampled from the NTSE\nexamination conducted nationwide in India, featuring both visual and textual\ngeneral aptitude questions that do not rely on rote learning. We establish\nbaselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a\ncomparison between open source and propriety models, we propose four distinct\nmodeling strategies to handle different modalities (text and images) in the\ndataset instances.", "comment": "15 pages, 2 figures, 5 tables", "links": []}
{"entry_id": "2306.06094", "title": "Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding", "authors": ["Mu Cai", "Zeyi Huang", "Yuheng Li", "Utkarsh Ojha", "Haohan Wang", "Yong Jae Lee"], "published": "2023-06-09 17:57:01", "updated": "2024-07-11 17:59:53", "summary": "Large language models (LLMs) have made significant advancements in natural\nlanguage understanding. However, through that enormous semantic representation\nthat the LLM has learnt, is it somehow possible for it to understand images as\nwell? This work investigates this question. To enable the LLM to process\nimages, we convert them into a representation given by Scalable Vector Graphics\n(SVG). To study what the LLM can do with this XML-based textual description of\nimages, we test the LLM on three broad computer vision tasks: (i) visual\nreasoning and question answering, (ii) image classification under distribution\nshift, few-shot learning, and (iii) generating new images using visual\nprompting. Even though we do not naturally associate LLMs with any visual\nunderstanding capabilities, our results indicate that the LLM can often do a\ndecent job in many of these tasks, potentially opening new avenues for research\ninto LLMs' ability to understand image data. Our code, data, and models can be\nfound here https://github.com/mu-cai/svg-llm.", "comment": null, "links": []}
{"entry_id": "2407.08672", "title": "NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning", "authors": ["Yi Zhang", "Chun-Wun Cheng", "Ke Yu", "Zhihai He", "Carola-Bibiane Schönlieb", "Angelica I. Aviles-Rivero"], "published": "2024-07-11 17:04:19", "updated": "2024-07-11 17:04:19", "summary": "In this paper, we consider the problem of prototype-based vision-language\nreasoning problem. We observe that existing methods encounter three major\nchallenges: 1) escalating resource demands and prolonging training times, 2)\ncontending with excessive learnable parameters, and 3) fine-tuning based only\non a single modality. These challenges will hinder their capability to adapt\nVision-Language Models (VLMs) to downstream tasks. Motivated by this critical\nobservation, we propose a novel method called NODE-Adapter, which utilizes\nNeural Ordinary Differential Equations for better vision-language reasoning. To\nfully leverage both visual and textual modalities and estimate class prototypes\nmore effectively and accurately, we divide our method into two stages:\ncross-modal prototype construction and cross-modal prototype optimization using\nneural ordinary differential equations. Specifically, we exploit VLM to encode\nhand-crafted prompts into textual features and few-shot support images into\nvisual features. Then, we estimate the textual prototype and visual prototype\nby averaging the textual features and visual features, respectively, and\nadaptively combine the textual prototype and visual prototype to construct the\ncross-modal prototype. To alleviate the prototype bias, we then model the\nprototype optimization process as an initial value problem with Neural ODEs to\nestimate the continuous gradient flow. Our extensive experimental results,\nwhich cover few-shot classification, domain generalization, and visual\nreasoning on human-object interaction, demonstrate that the proposed method\nsignificantly outperforms existing state-of-the-art approaches.", "comment": null, "links": []}
{"entry_id": "2406.09403", "title": "Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models", "authors": ["Yushi Hu", "Weijia Shi", "Xingyu Fu", "Dan Roth", "Mari Ostendorf", "Luke Zettlemoyer", "Noah A Smith", "Ranjay Krishna"], "published": "2024-06-13 17:59:31", "updated": "2024-07-10 18:09:56", "summary": "Humans draw to facilitate reasoning: we draw auxiliary lines when solving\ngeometry problems; we mark and circle when reasoning on maps; we use sketches\nto amplify our ideas and relieve our limited-capacity working memory. However,\nsuch actions are missing in current multimodal language models (LMs). Current\nchain-of-thought and tool-use paradigms only use text as intermediate reasoning\nsteps. In this work, we introduce Sketchpad, a framework that gives multimodal\nLMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts\nplanning and reasoning according to the visual artifacts it has drawn.\nDifferent from prior work, which uses text-to-image models to enable LMs to\ndraw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is\ncloser to human sketching and better facilitates reasoning. Sketchpad can also\nuse specialist vision models during the sketching process (e.g., draw bounding\nboxes with object detection models, draw masks with segmentation models), to\nfurther enhance visual perception and reasoning. We experiment with a wide\nrange of math tasks (including geometry, functions, graphs, and chess) and\ncomplex visual reasoning tasks. Sketchpad substantially improves performance on\nall tasks over strong base models with no sketching, yielding an average gain\nof 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a\nnew state of the art on all tasks, including V*Bench (80.3%), BLINK spatial\nreasoning (83.9%), and visual correspondence (80.8%). All codes and data are in\nhttps://visualsketchpad.github.io/.", "comment": "Project and codes url: https://visualsketchpad.github.io/", "links": []}
{"entry_id": "2407.02688", "title": "Funny-Valen-Tine: Planning Solution Distribution Enhances Machine Abstract Reasoning Ability", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-07-02 22:04:20", "updated": "2024-07-07 12:25:33", "summary": "Visual abstract reasoning problems hold immense importance in the field of\nimage processing. Both Bongard-Logo and Raven's Progressive Matrices (RPM)\nbelong to this domain, with Bongard-Logo categorized as image clustering\nreasoning and RPM involving image progression pattern reasoning. This paper\nintroduces Valen, a novel baseline model under probabilistic highlighting\nmodels. Valen exhibits remarkable performance in solving both RPM and\nBongard-Logo problems, offering a versatile solution. Our investigation delves\ninto the underlying mechanisms of probability-highlighting solvers, realizing\nthey approximate solutions to reasoning problem instances as distributions\ndelineated by primary and auxiliary samples. We propose that the learning\nobjective is not the distribution of correct solutions but one defined by both\nprimary and auxiliary samples. To bridge discrepancies, we introduced the Tine\nmethod, an adversarial learning-based approach to assist Valen in estimating a\nsolution distribution closer to the correct one, albeit with issues like\nunstable training. Reflecting on Tine, we propose modeling the sample\ndistribution of reasoning problems as a mixture of Gaussian distributions,\nleading to the Funny method. This effectively enables Valen to capture the true\nform of the correct solution distribution. Furthermore, we designed the SBR\nmethod to model the distribution of progressive patterns representation\nsimilarly. Overall, the Funny, Tine, and SBR methods significantly improve\nValen's performance, providing new ideas and methods for studying visual\nabstract reasoning problems.", "comment": "14 pages, 20 figures, 3 tables", "links": []}
{"entry_id": "2407.01284", "title": "We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?", "authors": ["Runqi Qiao", "Qiuna Tan", "Guanting Dong", "Minhui Wu", "Chong Sun", "Xiaoshuai Song", "Zhuoma GongQue", "Shanglin Lei", "Zhe Wei", "Miaoxuan Zhang", "Runfeng Qiao", "Yifan Zhang", "Xiao Zong", "Yida Xu", "Muxi Diao", "Zhimin Bao", "Chen Li", "Honggang Zhang"], "published": "2024-07-01 13:39:08", "updated": "2024-07-01 13:39:08", "summary": "Visual mathematical reasoning, as a fundamental visual reasoning ability, has\nreceived widespread attention from the Large Multimodal Models (LMMs)\ncommunity. Existing benchmarks, such as MathVista and MathVerse, focus more on\nthe result-oriented performance but neglect the underlying principles in\nknowledge acquisition and generalization. Inspired by human-like mathematical\nreasoning, we introduce WE-MATH, the first benchmark specifically designed to\nexplore the problem-solving principles beyond end-to-end performance. We\nmeticulously collect and categorize 6.5K visual math problems, spanning 67\nhierarchical knowledge concepts and five layers of knowledge granularity. We\ndecompose composite problems into sub-problems according to the required\nknowledge concepts and introduce a novel four-dimensional metric, namely\nInsufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery\n(CM), and Rote Memorization (RM), to hierarchically assess inherent issues in\nLMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of\nexisting LMMs in visual mathematical reasoning and reveal a negative\ncorrelation between solving steps and problem-specific performance. We confirm\nthe IK issue of LMMs can be effectively improved via knowledge augmentation\nstrategies. More notably, the primary challenge of GPT-4o has significantly\ntransitioned from IK to IG, establishing it as the first LMM advancing towards\nthe knowledge generalization stage. In contrast, other LMMs exhibit a marked\ninclination towards Rote Memorization - they correctly solve composite problems\ninvolving multiple knowledge concepts yet fail to answer sub-problems. We\nanticipate that WE-MATH will open new pathways for advancements in visual\nmathematical reasoning for LMMs. The WE-MATH data and evaluation code are\navailable at https://github.com/We-Math/We-Math.", "comment": "Work in progress", "links": []}
{"entry_id": "2310.04671", "title": "Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction", "authors": ["Korawat Charoenpitaks", "Van-Quang Nguyen", "Masanori Suganuma", "Masahiro Takahashi", "Ryoma Niihara", "Takayuki Okatani"], "published": "2023-10-07 03:16:30", "updated": "2024-07-01 09:29:39", "summary": "This paper addresses the problem of predicting hazards that drivers may\nencounter while driving a car. We formulate it as a task of anticipating\nimpending accidents using a single input image captured by car dashcams. Unlike\nexisting approaches to driving hazard prediction that rely on computational\nsimulations or anomaly detection from videos, this study focuses on high-level\ninference from static images. The problem needs predicting and reasoning about\nfuture events based on uncertain observations, which falls under visual\nabductive reasoning. To enable research in this understudied area, a new\ndataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is\ncreated. The dataset consists of 15K dashcam images of street scenes, and each\nimage is associated with a tuple containing car speed, a hypothesized hazard\ndescription, and visual entities present in the scene. These are annotated by\nhuman annotators, who identify risky scenes and provide descriptions of\npotential accidents that could occur a few seconds later. We present several\nbaseline methods and evaluate their performance on our dataset, identifying\nremaining issues and discussing future directions. This study contributes to\nthe field by introducing a novel problem formulation and dataset, enabling\nresearchers to explore the potential of multi-modal AI for driving hazard\nprediction.", "comment": "Main Paper: 11 pages, Supplementary Materials: 25 pages", "links": []}
{"entry_id": "2406.19693", "title": "MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?", "authors": ["Jinming Li", "Yichen Zhu", "Zhiyuan Xu", "Jindong Gu", "Minjie Zhu", "Xin Liu", "Ning Liu", "Yaxin Peng", "Feifei Feng", "Jian Tang"], "published": "2024-06-28 07:09:06", "updated": "2024-06-28 07:09:06", "summary": "It is fundamentally challenging for robots to serve as useful assistants in\nhuman environments because this requires addressing a spectrum of sub-problems\nacross robotics, including perception, language understanding, reasoning, and\nplanning. The recent advancements in Multimodal Large Language Models (MLLMs)\nhave demonstrated their exceptional abilities in solving complex mathematical\nproblems, mastering commonsense and abstract reasoning. This has led to the\nrecent utilization of MLLMs as the brain in robotic systems, enabling these\nmodels to conduct high-level planning prior to triggering low-level control\nactions for task execution. However, it remains uncertain whether existing\nMLLMs are reliable in serving the brain role of robots. In this study, we\nintroduce the first benchmark for evaluating Multimodal LLM for Robotic (MMRo)\nbenchmark, which tests the capability of MLLMs for robot applications.\nSpecifically, we identify four essential capabilities perception, task\nplanning, visual reasoning, and safety measurement that MLLMs must possess to\nqualify as the robot's central processing unit. We have developed several\nscenarios for each capability, resulting in a total of 14 metrics for\nevaluation. We present experimental results for various MLLMs, including both\ncommercial and open-source models, to assess the performance of existing\nsystems. Our findings indicate that no single model excels in all areas,\nsuggesting that current MLLMs are not yet trustworthy enough to serve as the\ncognitive core for robots. Our data can be found in\nhttps://mm-robobench.github.io/.", "comment": null, "links": []}
{"entry_id": "2401.15847", "title": "Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA", "authors": ["Yue Fan", "Jing Gu", "Kaiwen Zhou", "Qianqi Yan", "Shan Jiang", "Ching-Chen Kuo", "Xinze Guan", "Xin Eric Wang"], "published": "2024-01-29 02:43:40", "updated": "2024-06-27 15:38:17", "summary": "Multipanel images, commonly seen as web screenshots, posters, etc., pervade\nour daily lives. These images, characterized by their composition of multiple\nsubfigures in distinct layouts, effectively convey information to people.\nToward building advanced multimodal AI applications, such as agents that\nunderstand complex scenes and navigate through webpages, the skill of\nmultipanel visual reasoning is essential, and a comprehensive evaluation of\nmodels in this regard is important. Therefore, we introduce Multipanel Visual\nQuestion Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets\nof questions, answers, and multipanel images that specifically challenge models\nin comprehending multipanel images. Our evaluation shows that questions in the\nMultipanelVQA benchmark pose significant challenges to the state-of-the-art\nMultimodal Large Language Models (MLLMs) tested, even though humans can attain\napproximately 99% accuracy on these questions. Distinctively, the MultipanelVQA\nbenchmark features synthetically generated multipanel images specifically\ncrafted to isolate and assess the impact of various factors, such as the\nlayout, on MLLMs' multipanel image comprehension abilities. As a result, in\naddition to benchmarking the capabilities of MLLMs in understanding multipanel\nimages, we analyze various factors of the multipanel image that affect MLLMs'\nperformance with synthetic data and offer insights for enhancement. Code and\ndata are released at https://sites.google.com/view/multipanelvqa/home.", "comment": "ACL 2024", "links": []}
{"entry_id": "2406.19217", "title": "Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos", "authors": ["Zhimin Shao", "Jialang Xu", "Danail Stoyanov", "Evangelos B. Mazomenos", "Yueming Jin"], "published": "2024-06-27 14:43:50", "updated": "2024-06-27 14:43:50", "summary": "Despite significant advancements in robotic systems and surgical data\nscience, ensuring safe and optimal execution in robot-assisted minimally\ninvasive surgery (RMIS) remains a complex challenge. Current surgical error\ndetection methods involve two parts: identifying surgical gestures and then\ndetecting errors within each gesture clip. These methods seldom consider the\nrich contextual and semantic information inherent in surgical videos, limiting\ntheir performance due to reliance on accurate gesture identification. Motivated\nby the chain-of-thought prompting in natural language processing, this letter\npresents a novel and real-time end-to-end error detection framework,\nChain-of-Thought (COG) prompting, leveraging contextual information from\nsurgical videos. This encompasses two reasoning modules designed to mimic the\ndecision-making processes of expert surgeons. Concretely, we first design a\nGestural-Visual Reasoning module, which utilizes transformer and attention\narchitectures for gesture prompting, while the second, a Multi-Scale Temporal\nReasoning module, employs a multi-stage temporal convolutional network with\nboth slow and fast paths for temporal information extraction. We extensively\nvalidate our method on the public benchmark RMIS dataset JIGSAWS. Our method\nencapsulates the reasoning processes inherent to surgical activities enabling\nit to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy,\nand 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on\naverage, demonstrating the great potential of our approach in enhancing the\nsafety and efficacy of RMIS procedures and surgical education. The code will be\navailable.", "comment": "8 pages, 4 figures", "links": []}
{"entry_id": "2406.18839", "title": "Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA", "authors": ["Elham J. Barezi", "Parisa Kordjamshidi"], "published": "2024-06-27 02:19:38", "updated": "2024-06-27 02:19:38", "summary": "We study the Knowledge-Based visual question-answering problem, for which\ngiven a question, the models need to ground it into the visual modality to find\nthe answer. Although many recent works use question-dependent captioners to\nverbalize the given image and use Large Language Models to solve the VQA\nproblem, the research results show they are not reasonably performing for\nmulti-hop questions. Our study shows that replacing a complex question with\nseveral simpler questions helps to extract more relevant information from the\nimage and provide a stronger comprehension of it. Moreover, we analyze the\ndecomposed questions to find out the modality of the information that is\nrequired to answer them and use a captioner for the visual questions and LLMs\nas a general knowledge source for the non-visual KB-based questions. Our\nresults demonstrate the positive impact of using simple questions before\nretrieving visual or non-visual information. We have provided results and\nanalysis on three well-known VQA datasets including OKVQA, A-OKVQA, and KRVQA,\nand achieved up to 2% improvement in accuracy.", "comment": null, "links": []}
{"entry_id": "2407.00092", "title": "Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges", "authors": ["Mohammed Elhenawy", "Ahmad Abutahoun", "Taqwa I. Alhadidi", "Ahmed Jaber", "Huthaifa I. Ashqar", "Shadi Jaradat", "Ahmed Abdelhay", "Sebastien Glaser", "Andry Rakotonirainy"], "published": "2024-06-26 07:12:06", "updated": "2024-06-26 07:12:06", "summary": "Multimodal Large Language Models (MLLMs) harness comprehensive knowledge\nspanning text, images, and audio to adeptly tackle complex problems, including\nzero-shot in-context learning scenarios. This study explores the ability of\nMLLMs in visually solving the Traveling Salesman Problem (TSP) and Multiple\nTraveling Salesman Problem (mTSP) using images that portray point distributions\non a two-dimensional plane. We introduce a novel approach employing multiple\nspecialized agents within the MLLM framework, each dedicated to optimizing\nsolutions for these combinatorial challenges. Our experimental investigation\nincludes rigorous evaluations across zero-shot settings and introduces\ninnovative multi-agent zero-shot in-context scenarios. The results demonstrated\nthat both multi-agent models. Multi-Agent 1, which includes the Initializer,\nCritic, and Scorer agents, and Multi-Agent 2, which comprises only the\nInitializer and Critic agents; significantly improved solution quality for TSP\nand mTSP problems. Multi-Agent 1 excelled in environments requiring detailed\nroute refinement and evaluation, providing a robust framework for sophisticated\noptimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by\nthe Initializer and Critic, proved effective for rapid decision-making\nscenarios. These experiments yield promising outcomes, showcasing the robust\nvisual reasoning capabilities of MLLMs in addressing diverse combinatorial\nproblems. The findings underscore the potential of MLLMs as powerful tools in\ncomputational optimization, offering insights that could inspire further\nadvancements in this promising field. Project link:\nhttps://github.com/ahmed-abdulhuy/Solving-TSP-and-mTSP-Combinatorial-Challenges-using-Visual-Reasoning-and-Multi-Agent-Approach-MLLMs-.git", "comment": null, "links": []}
{"entry_id": "2406.16469", "title": "Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration", "authors": ["Yujin Baek", "ChaeHun Park", "Jaeseok Kim", "Yu-Jung Heo", "Du-Seong Chang", "Jaegul Choo"], "published": "2024-06-24 09:18:15", "updated": "2024-06-24 09:18:15", "summary": "To create culturally inclusive vision-language models (VLMs), the foremost\nrequirement is developing a test benchmark that can diagnose the models'\nability to respond to questions reflecting cultural elements. This paper\naddresses the necessity for such benchmarks, noting that existing research has\nrelied on human annotators' manual efforts, which impedes diversity and\nefficiency. We propose a semi-automated pipeline for constructing cultural VLM\nbenchmarks to enhance diversity and efficiency. This pipeline leverages\nhuman-VLM collaboration, where VLMs generate questions based on guidelines,\nhuman-annotated examples, and image-wise relevant knowledge, which are then\nreviewed by native speakers for quality and cultural relevance. The\neffectiveness of our adaptable pipeline is demonstrated through a specific\napplication: creating a dataset tailored to Korean culture, dubbed K-Viscuit.\nThe resulting benchmark features two types of questions: Type 1 questions\nmeasure visual recognition abilities, while Type 2 assess fine-grained visual\nreasoning skills. This ensures a thorough diagnosis of VLM models across\nvarious aspects. Our evaluation using K-Viscuit revealed that open-source\nmodels notably lag behind proprietary models in understanding Korean culture,\nhighlighting areas for improvement. We provided diverse analyses of VLM\nperformance across different cultural aspects. Besides, we explored the\npotential of incorporating external knowledge retrieval to enhance the\ngeneration process, suggesting future directions for improving cultural\ninterpretation ability of VLMs. Our dataset and code will be made publicly\navailable.", "comment": null, "links": []}
{"entry_id": "2406.15955", "title": "Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects", "authors": ["Michael A. Lepori", "Alexa R. Tartaglini", "Wai Keen Vong", "Thomas Serre", "Brenden M. Lake", "Ellie Pavlick"], "published": "2024-06-22 22:43:10", "updated": "2024-06-22 22:43:10", "summary": "Though vision transformers (ViTs) have achieved state-of-the-art performance\nin a variety of settings, they exhibit surprising failures when performing\ntasks involving visual relations. This begs the question: how do ViTs attempt\nto perform tasks that require computing visual relations between objects? Prior\nefforts to interpret ViTs tend to focus on characterizing relevant low-level\nvisual features. In contrast, we adopt methods from mechanistic\ninterpretability to study the higher-level visual algorithms that ViTs use to\nperform abstract visual reasoning. We present a case study of a fundamental,\nyet surprisingly difficult, relational reasoning task: judging whether two\nvisual entities are the same or different. We find that pretrained ViTs\nfine-tuned on this task often exhibit two qualitatively different stages of\nprocessing despite having no obvious inductive biases to do so: 1) a perceptual\nstage wherein local object features are extracted and stored in a disentangled\nrepresentation, and 2) a relational stage wherein object representations are\ncompared. In the second stage, we find evidence that ViTs can learn to\nrepresent somewhat abstract visual relations, a capability that has long been\nconsidered out of reach for artificial neural networks. Finally, we demonstrate\nthat failure points at either stage can prevent a model from learning a\ngeneralizable solution to our fairly simple tasks. By understanding ViTs in\nterms of discrete processing stages, one can more precisely diagnose and\nrectify shortcomings of existing and future models.", "comment": null, "links": []}
{"entry_id": "2406.14562", "title": "Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities", "authors": ["Sachit Menon", "Richard Zemel", "Carl Vondrick"], "published": "2024-06-20 17:59:45", "updated": "2024-06-20 17:59:45", "summary": "When presented with questions involving visual thinking, humans naturally\nswitch reasoning modalities, often forming mental images or drawing visual\naids. Large language models have shown promising results in arithmetic and\nsymbolic reasoning by expressing intermediate reasoning in text as a chain of\nthought, yet struggle to extend this capability to answer text queries that are\neasily solved by visual reasoning, even with extensive multimodal pretraining.\nWe introduce a simple method, whiteboard-of-thought prompting, to unlock the\nvisual reasoning capabilities of multimodal large language models across\nmodalities. Whiteboard-of-thought prompting provides multimodal large language\nmodels with a metaphorical `whiteboard' to draw out reasoning steps as images,\nthen returns these images back to the model for further processing. We find\nthis can be accomplished with no demonstrations or specialized modules, instead\nleveraging models' existing ability to write code with libraries such as\nMatplotlib and Turtle. This simple approach shows state-of-the-art results on\nfour difficult natural language tasks that involve visual and spatial\nreasoning. We identify multiple settings where GPT-4o using chain-of-thought\nfails dramatically, including more than one where it achieves $0\\%$ accuracy,\nwhile whiteboard-of-thought enables up to $92\\%$ accuracy in these same\nsettings. We present a detailed exploration of where the technique succeeds as\nwell as its sources of error.", "comment": "Project website: whiteboard.cs.columbia.edu/", "links": []}
{"entry_id": "2406.13621", "title": "Improving Visual Commonsense in Language Models via Multiple Image Generation", "authors": ["Guy Yariv", "Idan Schwartz", "Yossi Adi", "Sagie Benaim"], "published": "2024-06-19 15:17:10", "updated": "2024-06-19 15:17:10", "summary": "Commonsense reasoning is fundamentally based on multimodal knowledge.\nHowever, existing large language models (LLMs) are primarily trained using\ntextual data only, limiting their ability to incorporate essential visual\ninformation. In contrast, Visual Language Models, which excel at\nvisually-oriented tasks, often fail at non-visual tasks such as basic\ncommonsense reasoning. This divergence highlights a critical challenge - the\nintegration of robust visual understanding with foundational text-based\nlanguage reasoning. To this end, we introduce a method aimed at enhancing LLMs'\nvisual commonsense. Specifically, our method generates multiple images based on\nthe input text prompt and integrates these into the model's decision-making\nprocess by mixing their prediction probabilities. To facilitate multimodal\ngrounded language modeling, we employ a late-fusion layer that combines the\nprojected visual features with the output of a pre-trained LLM conditioned on\ntext only. This late-fusion layer enables predictions based on comprehensive\nimage-text knowledge as well as text only when this is required. We evaluate\nour approach using several visual commonsense reasoning tasks together with\ntraditional NLP tasks, including common sense reasoning and reading\ncomprehension. Our experimental results demonstrate significant superiority\nover existing baselines. When applied to recent state-of-the-art LLMs (e.g.,\nLlama3), we observe improvements not only in visual common sense but also in\ntraditional NLP benchmarks. Code and models are available under\nhttps://github.com/guyyariv/vLMIG.", "comment": null, "links": []}
{"entry_id": "2312.15915", "title": "ChartBench: A Benchmark for Complex Visual Reasoning in Charts", "authors": ["Zhengzhuo Xu", "Sinan Du", "Yiyan Qi", "Chengjin Xu", "Chun Yuan", "Jian Guo"], "published": "2023-12-26 07:20:55", "updated": "2024-06-19 03:58:32", "summary": "Multimodal Large Language Models (MLLMs) have shown impressive capabilities\nin image understanding and generation. However, current benchmarks fail to\naccurately evaluate the chart comprehension of MLLMs due to limited chart types\nand inappropriate metrics. To address this, we propose ChartBench, a\ncomprehensive benchmark designed to assess chart comprehension and data\nreliability through complex visual reasoning. ChartBench includes 42\ncategories, 66.6k charts, and 600k question-answer pairs. Notably, many charts\nlack data point annotations, which requires MLLMs to derive values similar to\nhuman understanding by leveraging inherent chart elements such as color,\nlegends, and coordinate systems. We also design an enhanced evaluation metric,\nAcc+, to evaluate MLLMs without extensive manual or costly LLM-based\nevaluations. Furthermore, we propose two baselines based on the chain of\nthought and supervised fine-tuning to improve model performance on unannotated\ncharts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary\nMLLMs reveal their limitations in chart comprehension and offer valuable\ninsights for further research. Code and dataset are publicly available at\nhttps://chartbench.github.io.", "comment": null, "links": []}
{"entry_id": "2406.12736", "title": "Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning", "authors": ["Zhuohang Jiang", "Bingkui Tong", "Xia Du", "Ahmed Alhammadi", "Jizhe Zhou"], "published": "2024-06-18 15:58:22", "updated": "2024-06-18 15:58:22", "summary": "The Privacy-sensitive Object Identification (POI) task allocates bounding\nboxes for privacy-sensitive objects in a scene. The key to POI is settling an\nobject's privacy class (privacy-sensitive or non-privacy-sensitive). In\ncontrast to conventional object classes which are determined by the visual\nappearance of an object, one object's privacy class is derived from the scene\ncontexts and is subject to various implicit factors beyond its visual\nappearance. That is, visually similar objects may be totally opposite in their\nprivacy classes. To explicitly derive the objects' privacy class from the scene\ncontexts, in this paper, we interpret the POI task as a visual reasoning task\naimed at the privacy of each object in the scene. Following this\ninterpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard\ncontains three stages. i) Structuring: an unstructured image is first converted\ninto a structured, heterogeneous scene graph that embeds rich scene contexts.\nii) Data Augmentation: a contextual perturbation oversampling strategy is\nproposed to create slightly perturbed privacy-sensitive objects in a scene\ngraph, thereby balancing the skewed distribution of privacy classes. iii)\nHybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is\nthen transformed into a hybrid graph by endowing it with extra \"node-node\" and\n\"edge-edge\" homogeneous paths. These homogeneous paths allow direct message\npassing between nodes or edges, thereby accelerating reasoning and facilitating\nthe capturing of subtle context changes. Based on this hybrid graph... **For\nthe full abstract, see the original paper.**", "comment": "15 pages", "links": []}
{"entry_id": "2406.12479", "title": "RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding", "authors": ["Linrui Xu", "Ling Zhao", "Wang Guo", "Qiujun Li", "Kewang Long", "Kaiqi Zou", "Yuhan Wang", "Haifeng Li"], "published": "2024-06-18 10:34:28", "updated": "2024-06-18 10:34:28", "summary": "The remote sensing image intelligence understanding model is undergoing a new\nprofound paradigm shift which has been promoted by multi-modal large language\nmodel (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to\nparadigm learning a pre-trained general foundation model followed by an\nadaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets,\nwhich have led to advances in RSI intelligence understanding in the last\ndecade, are no longer suitable for fire-new tasks. We argued that a new dataset\nmust be designed to lighten tasks with the following features: 1)\nGeneralization: training model to learn shared knowledge among tasks and to\nadapt to different tasks; 2) Understanding complex scenes: training model to\nunderstand the fine-grained attribute of the objects of interest, and to be\nable to describe the scene with natural language; 3) Reasoning: training model\nto be able to realize high-level visual reasoning. In this paper, we designed a\nhigh-quality, diversified, and unified multimodal instruction-following dataset\nfor RSI understanding produced by GPT-4V and existing datasets, which we called\nRS-GPT4V. To achieve generalization, we used a (Question, Answer) which was\ndeduced from GPT-4V via instruction-following to unify the tasks such as\ncaptioning and localization; To achieve complex scene, we proposed a\nhierarchical instruction description with local strategy in which the\nfine-grained attributes of the objects and their spatial relationships are\ndescribed and global strategy in which all the local information are integrated\nto yield detailed instruction descript; To achieve reasoning, we designed\nmultiple-turn QA pair to provide the reasoning ability for a model. The\nempirical results show that the fine-tuned MLLMs by RS-GPT4V can describe\nfine-grained information. The dataset is available at:\nhttps://github.com/GeoX-Lab/RS-GPT4V.", "comment": "14 pages, 6 figures, 4 tables", "links": []}
{"entry_id": "2403.19322", "title": "Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models", "authors": ["Jiaxing Chen", "Yuxuan Liu", "Dehu Li", "Xiang An", "Weimo Deng", "Ziyong Feng", "Yongle Zhao", "Yin Xie"], "published": "2024-03-28 11:26:30", "updated": "2024-06-18 05:57:14", "summary": "The rise of Multimodal Large Language Models (MLLMs), renowned for their\nadvanced instruction-following and reasoning capabilities, has significantly\npropelled the field of visual reasoning. However, due to limitations in their\nimage tokenization processes, most MLLMs struggle to capture fine details of\ntext and objects in images, especially in high-resolution samples. To overcome\nthis limitation, we introduce P2G, a novel framework for plug-and-play\ngrounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ\nexpert agents for on-the-fly grounding of reasoning into critical visual and\ntextual elements in images, thereby enabling deliberate reasoning through\nmultimodal prompting. Additionally, we develop P2GB, a benchmark designed to\nevaluate MLLMs' proficiency in understanding inter-object relationships and\ntextual content in challenging high-resolution images. Extensive experiments on\nvisual reasoning tasks demonstrate the superiority of P2G, achieving\nperformance comparable to GPT-4V on P2GB with a 7B backbone. Our work\nunderscores the potential of grounding reasoning with external agents in MLLMs,\npresenting a promising alternative to mere model scaling.", "comment": "15 pages, 8 figures", "links": []}
{"entry_id": "2311.17851", "title": "Leveraging VLM-Based Pipelines to Annotate 3D Objects", "authors": ["Rishabh Kabra", "Loic Matthey", "Alexander Lerchner", "Niloy J. Mitra"], "published": "2023-11-29 17:54:22", "updated": "2024-06-17 17:27:19", "summary": "Pretrained vision language models (VLMs) present an opportunity to caption\nunlabeled 3D objects at scale. The leading approach to summarize VLM\ndescriptions from different views of an object (Luo et al., 2023) relies on a\nlanguage model (GPT4) to produce the final output. This text-based aggregation\nis susceptible to hallucinations as it merges potentially contradictory\ndescriptions. We propose an alternative algorithm to marginalize over factors\nsuch as the viewpoint that affect the VLM's response. Instead of merging\ntext-only responses, we utilize the VLM's joint image-text likelihoods. We show\nour probabilistic aggregation is not only more reliable and efficient, but sets\nthe SoTA on inferring object types with respect to human-verified labels. The\naggregated annotations are also useful for conditional inference; they improve\ndownstream predictions (e.g., of object material) when the object's type is\nspecified as an auxiliary text-based input. Such auxiliary inputs allow\nablating the contribution of visual reasoning over visionless reasoning in an\nunsupervised setting. With these supervised and unsupervised evaluations, we\nshow how a VLM-based pipeline can be leveraged to produce reliable annotations\nfor 764K objects from the Objaverse dataset.", "comment": null, "links": []}
{"entry_id": "2403.18252", "title": "Beyond Embeddings: The Promise of Visual Table in Visual Reasoning", "authors": ["Yiwu Zhong", "Zi-Yuan Hu", "Michael R. Lyu", "Liwei Wang"], "published": "2024-03-27 04:49:23", "updated": "2024-06-17 09:57:09", "summary": "Visual representation learning has been a cornerstone in computer vision,\ninvolving typical forms such as visual embeddings, structural symbols, and\ntext-based representations. Despite the success of CLIP-type visual embeddings,\nthey often lack access to world knowledge critical for visual reasoning. In\nthis work, we propose Visual Table, a novel form of visual representation\ntailored for visual reasoning. Visual tables are constructed as hierarchical\ndescriptions of visual scenes, featuring a scene description and multiple\nobject-centric descriptions covering categories, attributes, and knowledge.\nThanks to the structural and textual formats, visual tables offer unique\nadvantages over mere visual embeddings, such as interpretability and\ncontrollable editing. Furthermore, they deliver instance-level world knowledge\nand detailed attributes that are essential for visual reasoning. To create\nvisual tables, we develop a generator trained on the dataset with collected,\nsmall-scale annotations. Extensive results on 11 visual reasoning benchmarks\ndemonstrate that the generated visual tables significantly outperform previous\nstructural and text-based representations. Moreover, they consistently enhance\nstate-of-the-art multimodal large language models across diverse benchmarks,\nshowcasing their potential for advancing visual reasoning tasks. Our code is\navailable at https://github.com/LaVi-Lab/Visual-Table.", "comment": "Project page: https://github.com/LaVi-Lab/Visual-Table", "links": []}
{"entry_id": "2406.11327", "title": "ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding", "authors": ["Tianren Ma", "Lingxi Xie", "Yunjie Tian", "Boyu Yang", "Yuan Zhang", "David Doermann", "Qixiang Ye"], "published": "2024-06-17 08:39:16", "updated": "2024-06-17 08:39:16", "summary": "An essential topic for multimodal large language models (MLLMs) is aligning\nvision and language concepts at a finer level. In particular, we devote efforts\nto encoding visual referential information for tasks such as referring and\ngrounding. Existing methods, including proxy encoding and geometry encoding,\nincorporate additional syntax to encode the object's location, bringing extra\nburdens in training MLLMs to communicate between language and vision. This\nstudy presents ClawMachine, offering a new methodology that notates an entity\ndirectly using the visual tokens. It allows us to unify the prompt and answer\nof visual referential tasks without additional syntax. Upon a joint\nvision-language vocabulary, ClawMachine unifies visual referring and grounding\ninto an auto-regressive format and learns with a decoder-only architecture.\nExperiments validate that our model achieves competitive performance across\nvisual referring and grounding tasks with a reduced demand for training data.\nAdditionally, ClawMachine demonstrates a native ability to integrate\nmulti-source information for complex visual reasoning, which prior MLLMs can\nhardly perform without specific adaptions.", "comment": "Project page: https://github.com/martian422/ClawMachine", "links": []}
{"entry_id": "2406.11068", "title": "A Unified View of Abstract Visual Reasoning Problems", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2024-06-16 20:52:44", "updated": "2024-06-16 20:52:44", "summary": "The field of Abstract Visual Reasoning (AVR) encompasses a wide range of\nproblems, many of which are inspired by human IQ tests. The variety of AVR\ntasks has resulted in state-of-the-art AVR methods being task-specific\napproaches. Furthermore, contemporary methods consider each AVR problem\ninstance not as a whole, but in the form of a set of individual panels with\nparticular locations and roles (context vs. answer panels) pre-assigned\naccording to the task-specific arrangements. While these highly specialized\napproaches have recently led to significant progress in solving particular AVR\ntasks, considering each task in isolation hinders the development of universal\nlearning systems in this domain. In this paper, we introduce a unified view of\nAVR tasks, where each problem instance is rendered as a single image, with no a\npriori assumptions about the number of panels, their location, or role. The\nmain advantage of the proposed unified view is the ability to develop universal\nlearning models applicable to various AVR tasks. What is more, the proposed\napproach inherently facilitates transfer learning in the AVR domain, as various\ntypes of problems share a common representation. The experiments conducted on\nfour AVR datasets with Raven's Progressive Matrices and Visual Analogy\nProblems, and one real-world visual analogy dataset show that the proposed\nunified representation of AVR tasks poses a challenge to state-of-the-art Deep\nLearning (DL) AVR models and, more broadly, contemporary DL image recognition\nmethods. In order to address this challenge, we introduce the Unified Model for\nAbstract Visual Reasoning (UMAVR) capable of dealing with various types of AVR\nproblems in a unified manner. UMAVR outperforms existing AVR methods in\nselected single-task learning experiments, and demonstrates effective knowledge\nreuse in transfer learning and curriculum learning setups.", "comment": null, "links": []}
{"entry_id": "2406.11061", "title": "Generalization and Knowledge Transfer in Abstract Visual Reasoning Models", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2024-06-16 20:26:38", "updated": "2024-06-16 20:26:38", "summary": "We study generalization and knowledge reuse capabilities of deep neural\nnetworks in the domain of abstract visual reasoning (AVR), employing Raven's\nProgressive Matrices (RPMs), a recognized benchmark task for assessing AVR\nabilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset\nare investigated. Firstly, inspired by generalization assessment capabilities\nof the PGM dataset and popularity of I-RAVEN, we introduce\nAttributeless-I-RAVEN, a benchmark with four generalization regimes that allow\nto test generalization of abstract rules applied to held-out attributes.\nSecondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel\ncomponent structure comprising line-based patterns, facilitating assessment of\nprogressive knowledge acquisition in transfer learning setting. The developed\nbenchmarks reveal shortcomings of the contemporary deep learning models, which\nwe partly address with Pathways of Normalized Group Convolution (PoNG) model, a\nnovel neural architecture for solving AVR tasks. PoNG excels in both presented\nchallenges, as well as the standard I-RAVEN and PGM setups.", "comment": null, "links": []}
{"entry_id": "2406.10424", "title": "What is the Visual Cognition Gap between Humans and Multimodal LLMs?", "authors": ["Xu Cao", "Bolin Lai", "Wenqian Ye", "Yunsheng Ma", "Joerg Heintz", "Jintai Chen", "Jianguo Cao", "James M. Rehg"], "published": "2024-06-14 22:02:21", "updated": "2024-06-14 22:02:21", "summary": "Recently, Multimodal Large Language Models (MLLMs) have shown great promise\nin language-guided perceptual tasks such as recognition, segmentation, and\nobject detection. However, their effectiveness in addressing visual cognition\nproblems that require high-level reasoning is not well-established. One such\nchallenge is abstract visual reasoning (AVR) -- the cognitive ability to\ndiscern relationships among patterns in a set of images and extrapolate to\npredict subsequent patterns. This skill is crucial during the early\nneurodevelopmental stages of children. Inspired by the AVR tasks in Raven's\nProgressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC),\nwe propose a new dataset MaRs-VQA and a new benchmark VCog-Bench containing\nthree datasets to evaluate the zero-shot AVR capability of MLLMs and compare\ntheir performance with existing human intelligent investigation. Our\ncomparative experiments with different open-source and closed-source MLLMs on\nthe VCog-Bench revealed a gap between MLLMs and human intelligence,\nhighlighting the visual cognitive limitations of current MLLMs. We believe that\nthe public release of VCog-Bench, consisting of MaRs-VQA, and the inference\npipeline will drive progress toward the next generation of MLLMs with\nhuman-like visual cognition abilities.", "comment": "14 pages, 4 figures, the appendix will be updated soon", "links": []}
{"entry_id": "2305.17455", "title": "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", "authors": ["Dachuan Shi", "Chaofan Tao", "Anyi Rao", "Zhendong Yang", "Chun Yuan", "Jiaqi Wang"], "published": "2023-05-27 12:07:21", "updated": "2024-06-13 19:15:53", "summary": "Recent vision-language models have achieved tremendous advances. However,\ntheir computational costs are also escalating dramatically, making model\nacceleration exceedingly critical. To pursue more efficient vision-language\nTransformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET),\na general acceleration framework for vision-language Transformers. This\nframework adaptively combines tokens in real-time during inference,\nsignificantly reducing computational costs while maintaining high performance.\nCrossGET features two primary innovations: 1) Cross-Guided Matching and\nEnsemble. CrossGET leverages cross-modal guided token matching and ensemble to\neffectively utilize cross-modal information, achieving wider applicability\nacross both modality-independent models, e.g., CLIP, and modality-dependent\nones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an\nalgorithm for the token-matching mechanism, ensuring reliable matching results\nwhile facilitating parallelizability and high efficiency. Extensive experiments\nhave been conducted on various vision-language tasks, such as image-text\nretrieval, visual reasoning, image captioning, and visual question answering.\nThe performance on both classic multimodal architectures and emerging\nmultimodal LLMs demonstrates the framework's effectiveness and versatility. The\ncode is available at https://github.com/sdc17/CrossGET.", "comment": "ICML 2024. Code: https://github.com/sdc17/CrossGET", "links": []}
{"entry_id": "2406.09240", "title": "Comparison Visual Instruction Tuning", "authors": ["Wei Lin", "Muhammad Jehanzeb Mirza", "Sivan Doveh", "Rogerio Feris", "Raja Giryes", "Sepp Hochreiter", "Leonid Karlinsky"], "published": "2024-06-13 15:43:59", "updated": "2024-06-13 15:43:59", "summary": "Comparing two images in terms of Commonalities and Differences (CaD) is a\nfundamental human capability that forms the basis of advanced visual reasoning\nand interpretation. It is essential for the generation of detailed and\ncontextually relevant descriptions, performing comparative analysis, novelty\ndetection, and making informed decisions based on visual data. However,\nsurprisingly, little attention has been given to these fundamental concepts in\nthe best current mimic of human visual intelligence - Large Multimodal Models\n(LMMs). We develop and contribute a new two-phase approach CaD-VI for\ncollecting synthetic visual instructions, together with an\ninstruction-following dataset CaD-Inst containing 349K image pairs with CaD\ninstructions collected using CaD-VI. Our approach significantly improves the\nCaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of\nrelated tasks by up to 17.5%. It is also complementary to existing\ndifference-only instruction datasets, allowing automatic targeted refinement of\nthose resources increasing their effectiveness for CaD tuning by up to 10%.\nAdditionally, we propose an evaluation benchmark with 7.5K open-ended QAs to\nassess the CaD understanding abilities of LMMs.", "comment": "Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset\n  repo: https://huggingface.co/datasets/wlin21at/CaD-Inst", "links": []}
{"entry_id": "2406.09105", "title": "INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance", "authors": ["Chenwei Lin", "Hanjia Lyu", "Xian Xu", "Jiebo Luo"], "published": "2024-06-13 13:31:49", "updated": "2024-06-13 13:31:49", "summary": "Large Vision-Language Models (LVLMs) have demonstrated outstanding\nperformance in various general multimodal applications such as image\nrecognition and visual reasoning, and have also shown promising potential in\nspecialized domains. However, the application potential of LVLMs in the\ninsurance domain-characterized by rich application scenarios and abundant\nmultimodal data-has not been effectively explored. There is no systematic\nreview of multimodal tasks in the insurance domain, nor a benchmark\nspecifically designed to evaluate the capabilities of LVLMs in insurance. This\ngap hinders the development of LVLMs within the insurance domain. In this\npaper, we systematically review and distill multimodal tasks for four\nrepresentative types of insurance: auto insurance, property insurance, health\ninsurance, and agricultural insurance. We propose INS-MMBench, the first\ncomprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench\ncomprises a total of 2.2K thoroughly designed multiple-choice questions,\ncovering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate\nmultiple representative LVLMs, including closed-source models such as GPT-4o\nand open-source models like BLIP-2. This evaluation not only validates the\neffectiveness of our benchmark but also provides an in-depth performance\nanalysis of current LVLMs on various multimodal tasks in the insurance domain.\nWe hope that INS-MMBench will facilitate the further application of LVLMs in\nthe insurance domain and inspire interdisciplinary development. Our dataset and\nevaluation code are available at https://github.com/FDU-INS/INS-MMBench.", "comment": null, "links": []}
{"entry_id": "2403.03173", "title": "Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-03-05 18:08:29", "updated": "2024-06-13 09:41:55", "summary": "Visual abstract reasoning problems pose significant challenges to the\nperception and cognition abilities of artificial intelligence algorithms,\ndemanding deeper pattern recognition and inductive reasoning beyond mere\nidentification of explicit image features. Research advancements in this field\noften provide insights and technical support for other similar domains. In this\nstudy, we introduce PMoC, a deep-learning-based probabilistic model, achieving\nhigh reasoning accuracy in the Bongard-Logo, which stands as one of the most\nchallenging clustering reasoning tasks. PMoC is a novel approach for\nconstructing probabilistic models based on deep learning, which is distinctly\ndifferent from previous techniques. PMoC revitalizes the probabilistic\napproach, which has been relatively weak in visual abstract reasoning. As a\nbonus, we also designed Pose-Transformer for complex visual abstract reasoning\ntasks. Inspired by capsule networks, it focuses on positional relationships in\nimage data, boosting accuracy when combined with PMoC. Our Pose-Transformer\neffectively addresses reasoning difficulties associated with changes in the\nposition of entities, outperforming previous models on RAVEN dataset, and the\nPGM dataset. RAVEN and PGM represent two significant progressive pattern\nreasoning problems. Finally, considering the deployment difficulties of\nPose-Transformer, we introduced Straw-Pose-Transformer, a lightweight version.\nThis study contributes to enhancing the capabilities of artificial intelligence\nin abstract reasoning, cognitive pattern, and probabilistic modeling of complex\nsystems.", "comment": "14 pages, 17 figures, 4 tables", "links": []}
{"entry_id": "2406.07549", "title": "A3VLM: Actionable Articulation-Aware Vision Language Model", "authors": ["Siyuan Huang", "Haonan Chang", "Yuhan Liu", "Yimeng Zhu", "Hao Dong", "Peng Gao", "Abdeslam Boularias", "Hongsheng Li"], "published": "2024-06-11 17:59:55", "updated": "2024-06-13 08:16:05", "summary": "Vision Language Models (VLMs) have received significant attention in recent\nyears in the robotics community. VLMs are shown to be able to perform complex\nvisual reasoning and scene understanding tasks, which makes them regarded as a\npotential universal solution for general robotics problems such as manipulation\nand navigation. However, previous VLMs for robotics such as RT-1, RT-2, and\nManipLLM have focused on directly learning robot-centric actions. Such\napproaches require collecting a significant amount of robot interaction data,\nwhich is extremely costly in the real world. Thus, we propose A3VLM, an\nobject-centric, actionable, articulation-aware vision language model. A3VLM\nfocuses on the articulation structure and action affordances of objects. Its\nrepresentation is robot-agnostic and can be translated into robot actions using\nsimple action primitives. Extensive experiments in both simulation benchmarks\nand real-world settings demonstrate the effectiveness and stability of A3VLM.\nWe release our code and other materials at\nhttps://github.com/changhaonan/A3VLM.", "comment": null, "links": []}
{"entry_id": "2305.01928", "title": "Visual Transformation Telling", "authors": ["Wanqing Cui", "Xin Hong", "Yanyan Lan", "Liang Pang", "Jiafeng Guo", "Xueqi Cheng"], "published": "2023-05-03 07:02:57", "updated": "2024-06-11 08:49:25", "summary": "Humans can naturally reason from superficial state differences (e.g. ground\nwetness) to transformations descriptions (e.g. raining) according to their life\nexperience. In this paper, we propose a new visual reasoning task to test this\ntransformation reasoning ability in real-world scenarios, called\n\\textbf{V}isual \\textbf{T}ransformation \\textbf{T}elling (VTT). Given a series\nof states (i.e. images), VTT requires to describe the transformation occurring\nbetween every two adjacent states. Different from existing visual reasoning\ntasks that focus on surface state reasoning, the advantage of VTT is that it\ncaptures the underlying causes, e.g. actions or events, behind the differences\namong states. We collect a novel dataset to support the study of transformation\nreasoning from two existing instructional video datasets, CrossTask and COIN,\ncomprising 13,547 samples. Each sample involves the key state images along with\ntheir transformation descriptions. Our dataset covers diverse real-world\nactivities, providing a rich resource for training and evaluation. To construct\nan initial benchmark for VTT, we test several models, including traditional\nvisual storytelling methods (CST, GLACNet, Densecap) and advanced multimodal\nlarge language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini Pro Vision, GPT-4o,\nand GPT-4). Experimental results reveal that even state-of-the-art models still\nface challenges in VTT, highlighting substantial areas for improvement.", "comment": null, "links": []}
{"entry_id": "2406.06865", "title": "Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems", "authors": ["Mohammed Elhenawy", "Ahmed Abdelhay", "Taqwa I. Alhadidi", "Huthaifa I Ashqar", "Shadi Jaradat", "Ahmed Jaber", "Sebastien Glaser", "Andry Rakotonirainy"], "published": "2024-06-11 00:41:08", "updated": "2024-06-11 00:41:08", "summary": "Multimodal Large Language Models (MLLMs) have demonstrated proficiency in\nprocessing di-verse modalities, including text, images, and audio. These models\nleverage extensive pre-existing knowledge, enabling them to address complex\nproblems with minimal to no specific training examples, as evidenced in\nfew-shot and zero-shot in-context learning scenarios. This paper investigates\nthe use of MLLMs' visual capabilities to 'eyeball' solutions for the Traveling\nSalesman Problem (TSP) by analyzing images of point distributions on a\ntwo-dimensional plane. Our experiments aimed to validate the hypothesis that\nMLLMs can effectively 'eyeball' viable TSP routes. The results from zero-shot,\nfew-shot, self-ensemble, and self-refine zero-shot evaluations show promising\noutcomes. We anticipate that these findings will inspire further exploration\ninto MLLMs' visual reasoning abilities to tackle other combinatorial problems.", "comment": null, "links": []}
{"entry_id": "2406.05722", "title": "ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition", "authors": ["Sanjoy Kundu", "Shubham Trehan", "Sathyanarayanan N. Aakur"], "published": "2024-06-09 10:30:04", "updated": "2024-06-09 10:30:04", "summary": "Learning to infer labels in an open world, i.e., in an environment where the\ntarget \"labels\" are unknown, is an important characteristic for achieving\nautonomy. Foundation models pre-trained on enormous amounts of data have shown\nremarkable generalization skills through prompting, particularly in zero-shot\ninference. However, their performance is restricted to the correctness of the\ntarget label's search space. In an open world, this target search space can be\nunknown or exceptionally large, which severely restricts the performance of\nsuch models. To tackle this challenging problem, we propose a neuro-symbolic\nframework called ALGO - Action Learning with Grounded Object recognition that\nuses symbolic knowledge stored in large-scale knowledge bases to infer\nactivities in egocentric videos with limited supervision using two steps.\nFirst, we propose a neuro-symbolic prompting approach that uses object-centric\nvision-language models as a noisy oracle to ground objects in the video through\nevidence-based reasoning. Second, driven by prior commonsense knowledge, we\ndiscover plausible activities through an energy-based symbolic pattern theory\nframework and learn to ground knowledge-based action (verb) concepts in the\nvideo. Extensive experiments on four publicly available datasets\n(EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus) demonstrate its performance on\nopen-world activity inference.", "comment": "Extended abstract of arXiv:2305.16602 for CVPR EgoVis Workshop", "links": []}
{"entry_id": "2406.00977", "title": "Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model", "authors": ["Kezhen Chen", "Rahul Thapa", "Rahul Chalamala", "Ben Athiwaratkun", "Shuaiwen Leon Song", "James Zou"], "published": "2024-06-03 04:17:12", "updated": "2024-06-03 04:17:12", "summary": "Recent advances in large multimodal models (LMMs) suggest that higher image\nresolution enhances the fine-grained understanding of image details, crucial\nfor tasks such as visual commonsense reasoning and analyzing biomedical images.\nHowever, increasing input resolution poses two main challenges: 1) It extends\nthe context length required by the language model, leading to inefficiencies\nand hitting the model's context limit; 2) It increases the complexity of visual\nfeatures, necessitating more training data or more complex architecture. We\nintroduce Dragonfly, a new LMM architecture that enhances fine-grained visual\nunderstanding and reasoning about image regions to address these challenges.\nDragonfly employs two key strategies: multi-resolution visual encoding and\nzoom-in patch selection. These strategies allow the model to process\nhigh-resolution images efficiently while maintaining reasonable context length.\nOur experiments on eight popular benchmarks demonstrate that Dragonfly achieves\ncompetitive or better performance compared to other architectures, highlighting\nthe effectiveness of our design. Additionally, we finetuned Dragonfly on\nbiomedical instructions, achieving state-of-the-art results on multiple\nbiomedical tasks requiring fine-grained visual understanding, including 92.3%\naccuracy on the Path-VQA dataset (compared to 83.3% for Med-Gemini) and the\nhighest reported results on biomedical image captioning. To support model\ntraining, we curated a visual instruction-tuning dataset with 5.5 million\nimage-instruction samples in the general domain and 1.4 million samples in the\nbiomedical domain. We also conducted ablation studies to characterize the\nimpact of various architectural designs and image resolutions, providing\ninsights for future research on visual instruction alignment. The codebase and\nmodel are available at https://github.com/togethercomputer/Dragonfly.", "comment": null, "links": []}
{"entry_id": "2403.03458", "title": "Slot Abstractors: Toward Scalable Abstract Visual Reasoning", "authors": ["Shanka Subhra Mondal", "Jonathan D. Cohen", "Taylor W. Webb"], "published": "2024-03-06 04:49:02", "updated": "2024-06-02 23:04:43", "summary": "Abstract visual reasoning is a characteristically human ability, allowing the\nidentification of relational patterns that are abstracted away from object\nfeatures, and the systematic generalization of those patterns to unseen\nproblems. Recent work has demonstrated strong systematic generalization in\nvisual reasoning tasks involving multi-object inputs, through the integration\nof slot-based methods used for extracting object-centric representations\ncoupled with strong inductive biases for relational abstraction. However, this\napproach was limited to problems containing a single rule, and was not scalable\nto visual reasoning problems containing a large number of objects. Other recent\nwork proposed Abstractors, an extension of Transformers that incorporates\nstrong relational inductive biases, thereby inheriting the Transformer's\nscalability and multi-head architecture, but it has yet to be demonstrated how\nthis approach might be applied to multi-object visual inputs. Here we combine\nthe strengths of the above approaches and propose Slot Abstractors, an approach\nto abstract visual reasoning that can be scaled to problems involving a large\nnumber of objects and multiple relations among them. The approach displays\nstate-of-the-art performance across four abstract visual reasoning tasks, as\nwell as an abstract reasoning task involving real-world images.", "comment": "18 pages, 9 figures", "links": []}
{"entry_id": "2307.03601", "title": "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", "authors": ["Shilong Zhang", "Peize Sun", "Shoufa Chen", "Min Xiao", "Wenqi Shao", "Wenwei Zhang", "Yu Liu", "Kai Chen", "Ping Luo"], "published": "2023-07-07 13:43:44", "updated": "2024-06-01 08:50:14", "summary": "Visual instruction tuning large language model(LLM) on image-text pairs has\nachieved general-purpose vision-language abilities. However, the lack of\nregion-text pairs limits their advancements to fine-grained multimodal\nunderstanding. In this paper, we propose spatial instruction tuning, which\nintroduces the reference to the region-of-interest(RoI) in the instruction.\nBefore sending to LLM, the reference is replaced by RoI features and\ninterleaved with language embeddings as a sequence. Our model GPT4RoI, trained\non 7 region-text pair datasets, brings an unprecedented interactive and\nconversational experience compared to previous image-level models. (1)\nInteraction beyond language: Users can interact with our model by both language\nand drawing bounding boxes to flexibly adjust the referring granularity. (2)\nVersatile multimodal abilities: A variety of attribute information within each\nRoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc.\nFurthermore, it can reason about multiple RoIs based on common sense. On the\nVisual Commonsense Reasoning(VCR) dataset, GPT4RoI achieves a remarkable\naccuracy of 81.6%, surpassing all existing models by a significant margin (the\nsecond place is 75.6%) and almost reaching human-level performance of 85.0%.\nThe code, dataset, and demo can be found at\nhttps://github.com/jshilong/GPT4RoI.", "comment": "Code has been released at https://github.com/jshilong/GPT4RoI", "links": []}
{"entry_id": "2405.13872", "title": "Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models", "authors": ["Qiji Zhou", "Ruochen Zhou", "Zike Hu", "Panzhong Lu", "Siyang Gao", "Yue Zhang"], "published": "2024-05-22 17:56:51", "updated": "2024-05-29 02:24:36", "summary": "Recent advancements in Chain-of-Thought (CoT) and related rationale-based\nworks have significantly improved the performance of Large Language Models\n(LLMs) in complex reasoning tasks. With the evolution of Multimodal Large\nLanguage Models (MLLMs), enhancing their capability to tackle complex\nmultimodal reasoning problems is a crucial frontier. However, incorporating\nmultimodal rationales in CoT has yet to be thoroughly investigated. We propose\nthe Image-of-Thought (IoT) prompting method, which helps MLLMs to extract\nvisual rationales step-by-step. Specifically, IoT prompting can automatically\ndesign critical visual information extraction operations based on the input\nimages and questions. Each step of visual information refinement identifies\nspecific visual rationales that support answers to complex visual reasoning\nquestions. Beyond the textual CoT, IoT simultaneously utilizes visual and\ntextual rationales to help MLLMs understand complex multimodal information. IoT\nprompting has improved zero-shot visual reasoning performance across various\nvisual understanding tasks in different MLLMs. Moreover, the step-by-step\nvisual feature explanations generated by IoT prompting elucidate the visual\nreasoning process, aiding in analyzing the cognitive processes of large\nmultimodal models", "comment": "Correct the case title", "links": []}
{"entry_id": "2405.18358", "title": "MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning", "authors": ["Somnath Kumar", "Yash Gadhia", "Tanuja Ganu", "Akshay Nambi"], "published": "2024-05-28 16:55:41", "updated": "2024-05-28 16:55:41", "summary": "Recent advancements in Multi-modal Large Language Models (MLLMs) have\nsignificantly improved their performance in tasks combining vision and\nlanguage. However, challenges persist in detailed multi-modal understanding,\ncomprehension of complex tasks, and reasoning over multi-modal information.\nThis paper introduces MMCTAgent, a novel multi-modal critical thinking agent\nframework designed to address the inherent limitations of current MLLMs in\ncomplex visual reasoning tasks. Inspired by human cognitive processes and\ncritical thinking, MMCTAgent iteratively analyzes multi-modal information,\ndecomposes queries, plans strategies, and dynamically evolves its reasoning.\nAdditionally, MMCTAgent incorporates critical thinking elements such as\nverification of final answers and self-reflection through a novel approach that\ndefines a vision-based critic and identifies task-specific evaluation criteria,\nthereby enhancing its decision-making abilities. Through rigorous evaluations\nacross various image and video understanding benchmarks, we demonstrate that\nMMCTAgent (with and without the critic) outperforms both foundational MLLMs and\nother tool-augmented pipelines.", "comment": null, "links": []}
{"entry_id": "2405.16934", "title": "Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR", "authors": ["Zhenyang Li", "Yangyang Guo", "Kejie Wang", "Xiaolin Chen", "Liqiang Nie", "Mohan Kankanhalli"], "published": "2024-05-27 08:26:58", "updated": "2024-05-27 08:26:58", "summary": "Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind\nquestion answering over visual scenes. To achieve this goal, a model is\nrequired to provide an acceptable rationale as the reason for the predicted\nanswers. Progress on the benchmark dataset stems largely from the recent\nadvancement of Vision-Language Transformers (VL Transformers). These models are\nfirst pre-trained on some generic large-scale vision-text datasets, and then\nthe learned representations are transferred to the downstream VCR task. Despite\ntheir attractive performance, this paper posits that the VL Transformers do not\nexhibit visual commonsense, which is the key to VCR. In particular, our\nempirical results pinpoint several shortcomings of existing VL Transformers:\nsmall gains from pre-training, unexpected language bias, limited model\narchitecture for the two inseparable sub-tasks, and neglect of the important\nobject-tag correlation. With these findings, we tentatively suggest some future\ndirections from the aspect of dataset, evaluation metric, and training tricks.\nWe believe this work could make researchers revisit the intuition and goals of\nVCR, and thus help tackle the remaining challenges in visual reasoning.", "comment": null, "links": []}
{"entry_id": "2403.01031", "title": "Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks", "authors": ["Fakhraddin Alwajih", "El Moatez Billah Nagoudi", "Gagan Bhatia", "Abdelrahman Mohamed", "Muhammad Abdul-Mageed"], "published": "2024-03-01 23:38:02", "updated": "2024-05-24 20:24:36", "summary": "Multimodal large language models (MLLMs) have proven effective in a wide\nrange of tasks requiring complex reasoning and linguistic comprehension.\nHowever, due to a lack of high-quality multimodal resources in languages other\nthan English, success of MLLMs remains relatively limited to English-based\nsettings. This poses significant challenges in developing comparable models for\nother languages, including even those with large speaker populations such as\nArabic. To alleviate this challenge, we introduce a comprehensive family of\nArabic MLLMs, dubbed \\textit{Peacock}, with strong vision and language\ncapabilities. Through comprehensive qualitative and quantitative analysis, we\ndemonstrate the solid performance of our models on various visual reasoning\ntasks and further show their emerging dialectal potential. Additionally, we\nintroduce ~\\textit{Henna}, a new benchmark specifically designed for assessing\nMLLMs on aspects related to Arabic culture, setting the first stone for\nculturally-aware Arabic MLLMs.The GitHub repository for the \\textit{Peacock}\nproject is available at \\url{https://github.com/UBC-NLP/peacock}.", "comment": null, "links": []}
{"entry_id": "2403.15388", "title": "LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models", "authors": ["Yuzhang Shang", "Mu Cai", "Bingxin Xu", "Yong Jae Lee", "Yan Yan"], "published": "2024-03-22 17:59:52", "updated": "2024-05-22 20:50:37", "summary": "Large Multimodal Models (LMMs) have shown significant visual reasoning\ncapabilities by connecting a visual encoder and a large language model. LMMs\ntypically take in a fixed and large amount of visual tokens, such as the\npenultimate layer features in the CLIP visual encoder, as the prefix content.\nRecent LMMs incorporate more complex visual inputs, such as high-resolution\nimages and videos, which further increases the number of visual tokens\nsignificantly. However, due to the inherent design of the Transformer\narchitecture, the computational costs of these models tend to increase\nquadratically with the number of input tokens. To tackle this problem, we\nexplore a token reduction mechanism that identifies significant spatial\nredundancy among visual tokens. In response, we propose PruMerge, a novel\nadaptive visual token reduction strategy that significantly reduces the number\nof visual tokens without compromising the performance of LMMs. Specifically, to\nmetric the importance of each token, we exploit the sparsity observed in the\nvisual encoder, characterized by the sparse distribution of attention scores\nbetween the class token and visual tokens. This sparsity enables us to\ndynamically select the most crucial visual tokens to retain. Subsequently, we\ncluster the selected (unpruned) tokens based on their key similarity and merge\nthem with the unpruned tokens, effectively supplementing and enhancing their\ninformational content. Empirically, when applied to LLaVA-1.5, our approach can\ncompress the visual tokens by 14 times on average, and achieve comparable\nperformance across diverse visual question-answering and reasoning tasks. Code\nand checkpoints are at https://llava-prumerge.github.io/.", "comment": "Project page: https://llava-prumerge.github.io/", "links": []}
{"entry_id": "2402.04236", "title": "CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations", "authors": ["Ji Qi", "Ming Ding", "Weihan Wang", "Yushi Bai", "Qingsong Lv", "Wenyi Hong", "Bin Xu", "Lei Hou", "Juanzi Li", "Yuxiao Dong", "Jie Tang"], "published": "2024-02-06 18:43:48", "updated": "2024-05-22 17:10:29", "summary": "Vision-Language Models (VLMs) have demonstrated their broad effectiveness\nthanks to extensive training in aligning visual instructions to responses.\nHowever, such training of conclusive alignment leads models to ignore essential\nvisual reasoning, further resulting in failures in meticulous visual problems\nand unfaithful responses. Drawing inspiration from human cognition in solving\nvisual problems (e.g., marking, zoom in), this paper introduces Chain of\nManipulations, a mechanism that enables VLMs to solve problems step-by-step\nwith evidence. After training, models can solve various visual problems by\neliciting intrinsic manipulations (e.g., grounding, zoom in) with results\n(e.g., boxes, image) actively without involving external tools, while also\nallowing users to trace error causes. We study the roadmap to implement this\nmechanism, including (1) a flexible design of manipulations upon extensive\nanalysis, (2) an efficient automated data generation pipeline, (3) a compatible\nVLM architecture capable of multi-turn multi-image, and (4) a model training\nprocess for versatile capabilities. With the design, we also manually annotate\n6K high-quality samples for the challenging graphical mathematical problems.\nOur trained model, \\textbf{CogCoM}, equipped with this mechanism with 17B\nparameters achieves state-of-the-art performance across 9 benchmarks from 4\ncategories, demonstrating the effectiveness while preserving the\ninterpretability. Our code, model weights, and collected data are publicly\navailable at https://github.com/THUDM/CogCoM.", "comment": "19 pages, 9 figures", "links": []}
{"entry_id": "2310.05872", "title": "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models", "authors": ["Kaiwen Zhou", "Kwonjoon Lee", "Teruhisa Misu", "Xin Eric Wang"], "published": "2023-10-09 17:10:35", "updated": "2024-05-17 17:24:35", "summary": "In our work, we explore the synergistic capabilities of pre-trained\nvision-and-language models (VLMs) and large language models (LLMs) on visual\ncommonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision\npipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit\nstrong performance for problems involving understanding the literal visual\ncontent, which we noted as visual commonsense understanding (VCU). For problems\nwhere the goal is to infer conclusions beyond image content, which we noted as\nvisual commonsense inference (VCI), VLMs face difficulties, while LLMs, given\nsufficient visual evidence, can use commonsense to infer the answer well. We\nempirically validate this by letting LLMs classify VCR problems into these two\ncategories and show the significant difference between VLM and LLM with image\ncaption decision pipelines on two subproblems. Moreover, we identify a\nchallenge with VLMs' passive perception, which may miss crucial context\ninformation, leading to incorrect reasoning by LLMs. Based on these, we suggest\na collaborative approach, named ViCor, where pre-trained LLMs serve as problem\nclassifiers to analyze the problem category, then either use VLMs to answer the\nquestion directly or actively instruct VLMs to concentrate on and gather\nrelevant visual elements to support potential commonsense inferences. We\nevaluate our framework on two VCR benchmark datasets and outperform all other\nmethods that do not require in-domain fine-tuning.", "comment": null, "links": []}
{"entry_id": "2405.10316", "title": "Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model", "authors": ["Zheng Gu", "Shiyuan Yang", "Jing Liao", "Jing Huo", "Yang Gao"], "published": "2024-05-16 17:59:21", "updated": "2024-05-16 17:59:21", "summary": "Visual In-Context Learning (ICL) has emerged as a promising research area due\nto its capability to accomplish various tasks with limited example pairs\nthrough analogical reasoning. However, training-based visual ICL has\nlimitations in its ability to generalize to unseen tasks and requires the\ncollection of a diverse task dataset. On the other hand, existing methods in\nthe inference-based visual ICL category solely rely on textual prompts, which\nfail to capture fine-grained contextual information from given examples and can\nbe time-consuming when converting from images to text prompts. To address these\nchallenges, we propose Analogist, a novel inference-based visual ICL approach\nthat exploits both visual and textual prompting techniques using a\ntext-to-image diffusion model pretrained for image inpainting. For visual\nprompting, we propose a self-attention cloning (SAC) method to guide the\nfine-grained structural-level analogy between image examples. For textual\nprompting, we leverage GPT-4V's visual reasoning capability to efficiently\ngenerate text prompts and introduce a cross-attention masking (CAM) operation\nto enhance the accuracy of semantic-level analogy guided by text prompts. Our\nmethod is out-of-the-box and does not require fine-tuning or optimization. It\nis also generic and flexible, enabling a wide range of visual tasks to be\nperformed in an in-context manner. Extensive experiments demonstrate the\nsuperiority of our method over existing approaches, both qualitatively and\nquantitatively.", "comment": "Project page: https://analogist2d.github.io", "links": []}
{"entry_id": "2401.01974", "title": "Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers", "authors": ["Aleksandar Stanić", "Sergi Caelles", "Michael Tschannen"], "published": "2024-01-03 20:48:47", "updated": "2024-05-14 22:43:40", "summary": "Visual reasoning is dominated by end-to-end neural networks scaled to\nbillions of model parameters and training examples. However, even the largest\nmodels struggle with compositional reasoning, generalization, fine-grained\nspatial and temporal reasoning, and counting. Visual reasoning with large\nlanguage models (LLMs) as controllers can, in principle, address these\nlimitations by decomposing the task and solving subtasks by orchestrating a set\nof (visual) tools. Recently, these models achieved great performance on tasks\nsuch as compositional visual question answering, visual grounding, and video\ntemporal reasoning. Nevertheless, in their current form, these models heavily\nrely on human engineering of in-context examples in the prompt, which are often\ndataset- and task-specific and require significant labor by highly skilled\nprogrammers. In this work, we present a framework that mitigates these issues\nby introducing spatially and temporally abstract routines and by leveraging a\nsmall number of labeled examples to automatically generate in-context examples,\nthereby avoiding human-created in-context examples. On a number of visual\nreasoning tasks, we show that our framework leads to consistent gains in\nperformance, makes LLMs as controllers setup more robust, and removes the need\nfor human engineering of in-context examples.", "comment": null, "links": []}
{"entry_id": "2405.07451", "title": "CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering", "authors": ["Yuanyuan Jiang", "Jianqin Yin"], "published": "2024-05-13 03:25:15", "updated": "2024-05-13 03:25:15", "summary": "While vision-language pretrained models (VLMs) excel in various multimodal\nunderstanding tasks, their potential in fine-grained audio-visual reasoning,\nparticularly for audio-visual question answering (AVQA), remains largely\nunexplored. AVQA presents specific challenges for VLMs due to the requirement\nof visual understanding at the region level and seamless integration with audio\nmodality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder\nbut underutilized its knowledge, and mistreated audio and video as separate\nentities in a dual-stream framework as most AVQA methods. This paper proposes a\nnew CLIP-powered target-aware single-stream (TASS) network for AVQA using the\nimage-text matching knowledge of the pretrained model through the audio-visual\nmatching characteristic of nature. It consists of two key components: the\ntarget-aware spatial grounding module (TSG+) and the single-stream joint\ntemporal grounding module (JTG). Specifically, we propose a TSG+ module to\ntransfer the image-text matching knowledge from CLIP models to our region-text\nmatching process without corresponding ground-truth labels. Moreover, unlike\nprevious separate dual-stream networks that still required an additional\naudio-visual fusion module, JTG unifies audio-visual fusion and question-aware\ntemporal grounding in a simplified single-stream architecture. It treats audio\nand video as a cohesive entity and further extends the pretrained image-text\nknowledge to audio-text matching by preserving their temporal correlation with\nour proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted\non the MUSIC-AVQA benchmark verified the effectiveness of our proposed method\nover existing state-of-the-art methods.", "comment": "Submitted to the Journal on February 6, 2024", "links": []}
{"entry_id": "2305.16602", "title": "Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning", "authors": ["Sanjoy Kundu", "Shubham Trehan", "Sathyanarayanan N. Aakur"], "published": "2023-05-26 03:21:30", "updated": "2024-05-03 14:01:22", "summary": "Learning to infer labels in an open world, i.e., in an environment where the\ntarget ``labels'' are unknown, is an important characteristic for achieving\nautonomy. Foundation models, pre-trained on enormous amounts of data, have\nshown remarkable generalization skills through prompting, particularly in\nzero-shot inference. However, their performance is restricted to the\ncorrectness of the target label's search space, i.e., candidate labels provided\nin the prompt. This target search space can be unknown or exceptionally large\nin an open world, severely restricting their performance. To tackle this\nchallenging problem, we propose a two-step, neuro-symbolic framework called\nALGO - Action Learning with Grounded Object recognition that uses symbolic\nknowledge stored in large-scale knowledge bases to infer activities in\negocentric videos with limited supervision. First, we propose a neuro-symbolic\nprompting approach that uses object-centric vision-language models as a noisy\noracle to ground objects in the video through evidence-based reasoning. Second,\ndriven by prior commonsense knowledge, we discover plausible activities through\nan energy-based symbolic pattern theory framework and learn to ground\nknowledge-based action (verb) concepts in the video. Extensive experiments on\nfour publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and\nCharades-Ego) demonstrate its performance on open-world activity inference. We\nalso show that ALGO can be extended to zero-shot inference and demonstrate its\ncompetitive performance on the Charades-Ego dataset.", "comment": "25 Pages, 4 figures, 3 tables", "links": []}
{"entry_id": "2405.00646", "title": "Learning to Compose: Improving Object Centric Learning by Injecting Compositionality", "authors": ["Whie Jung", "Jaehoon Yoo", "Sungjin Ahn", "Seunghoon Hong"], "published": "2024-05-01 17:21:36", "updated": "2024-05-01 17:21:36", "summary": "Learning compositional representation is a key aspect of object-centric\nlearning as it enables flexible systematic generalization and supports complex\nvisual reasoning. However, most of the existing approaches rely on\nauto-encoding objective, while the compositionality is implicitly imposed by\nthe architectural or algorithmic bias in the encoder. This misalignment between\nauto-encoding objective and learning compositionality often results in failure\nof capturing meaningful object representations. In this study, we propose a\nnovel objective that explicitly encourages compositionality of the\nrepresentations. Built upon the existing object-centric learning framework\n(e.g., slot attention), our method incorporates additional constraints that an\narbitrary mixture of object representations from two images should be valid by\nmaximizing the likelihood of the composite data. We demonstrate that\nincorporating our objective to the existing framework consistently improves the\nobjective-centric learning and enhances the robustness to the architectural\nchoices.", "comment": null, "links": []}
{"entry_id": "2404.19696", "title": "Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners", "authors": ["Chun Feng", "Joy Hsu", "Weiyu Liu", "Jiajun Wu"], "published": "2024-04-30 16:44:18", "updated": "2024-04-30 16:44:18", "summary": "3D visual grounding is a challenging task that often requires direct and\ndense supervision, notably the semantic label for each object in the scene. In\nthis paper, we instead study the naturally supervised setting that learns from\nonly 3D scene and QA pairs, where prior works underperform. We propose the\nLanguage-Regularized Concept Learner (LARC), which uses constraints from\nlanguage as regularization to significantly improve the accuracy of\nneuro-symbolic concept learners in the naturally supervised setting. Our\napproach is based on two core insights: the first is that language constraints\n(e.g., a word's relation to another) can serve as effective regularization for\nstructured representations in neuro-symbolic models; the second is that we can\nquery large language models to distill such constraints from language\nproperties. We show that LARC improves performance of prior works in naturally\nsupervised 3D visual grounding, and demonstrates a wide range of 3D visual\nreasoning capabilities-from zero-shot composition, to data efficiency and\ntransferability. Our method represents a promising step towards regularizing\nstructured visual reasoning frameworks with language-based priors, for learning\nin settings without dense supervision.", "comment": "CVPR 2024. The first two authors contributed equally", "links": []}
{"entry_id": "2312.00784", "title": "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts", "authors": ["Mu Cai", "Haotian Liu", "Dennis Park", "Siva Karthik Mustikovela", "Gregory P. Meyer", "Yuning Chai", "Yong Jae Lee"], "published": "2023-12-01 18:59:56", "updated": "2024-04-27 01:53:39", "summary": "While existing large vision-language multimodal models focus on whole image\nunderstanding, there is a prominent gap in achieving region-specific\ncomprehension. Current approaches that use textual coordinates or spatial\nencodings often fail to provide a user-friendly interface for visual prompting.\nTo address this challenge, we introduce a novel multimodal model capable of\ndecoding arbitrary visual prompts. This allows users to intuitively mark images\nand interact with the model using natural cues like a \"red bounding box\" or\n\"pointed arrow\". Our simple design directly overlays visual markers onto the\nRGB image, eliminating the need for complex region encodings, yet achieves\nstate-of-the-art performance on region-understanding tasks like Visual7W,\nPointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present\nViP-Bench, a comprehensive benchmark to assess the capability of models in\nunderstanding visual prompts across multiple dimensions, enabling future\nresearch in this domain. Code, data, and model are publicly available.", "comment": "Accepted to CVPR2024. Project page: https://vip-llava.github.io/", "links": []}
{"entry_id": "2404.16375", "title": "List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs", "authors": ["An Yan", "Zhengyuan Yang", "Junda Wu", "Wanrong Zhu", "Jianwei Yang", "Linjie Li", "Kevin Lin", "Jianfeng Wang", "Julian McAuley", "Jianfeng Gao", "Lijuan Wang"], "published": "2024-04-25 07:29:17", "updated": "2024-04-25 07:29:17", "summary": "Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of\nGPT-4V, by enabling the model to associate visual objects with tags inserted on\nthe image. These tags, marked with alphanumerics, can be indexed via text\ntokens for easy reference. Despite the extraordinary performance from GPT-4V,\nwe observe that other Multimodal Large Language Models (MLLMs) struggle to\nunderstand these visual tags. To promote the learning of SoM prompting for\nopen-source models, we propose a new learning paradigm: \"list items one by\none,\" which asks the model to enumerate and describe all visual tags placed on\nthe image following the alphanumeric orders of tags. By integrating our curated\ndataset with other visual instruction tuning datasets, we are able to equip\nexisting MLLMs with the SoM prompting ability. Furthermore, we evaluate our\nfinetuned SoM models on five MLLM benchmarks. We find that this new dataset,\neven in a relatively small size (10k-30k images with tags), significantly\nenhances visual reasoning capabilities and reduces hallucinations for MLLMs.\nPerhaps surprisingly, these improvements persist even when the visual tags are\nomitted from input images during inference. This suggests the potential of\n\"list items one by one\" as a new paradigm for training MLLMs, which strengthens\nthe object-text alignment through the use of visual tags in the training stage.\nFinally, we conduct analyses by probing trained models to understand the\nworking mechanism of SoM. Our code and data are available at\n\\url{https://github.com/zzxslp/SoM-LLaVA}.", "comment": "Preprint", "links": []}
{"entry_id": "2404.13591", "title": "MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning", "authors": ["Yifan Jiang", "Jiarui Zhang", "Kexuan Sun", "Zhivar Sourati", "Kian Ahrabian", "Kaixin Ma", "Filip Ilievski", "Jay Pujara"], "published": "2024-04-21 09:15:02", "updated": "2024-04-24 22:32:10", "summary": "While multi-modal large language models (MLLMs) have shown significant\nprogress on many popular visual reasoning benchmarks, whether they possess\nabstract visual reasoning abilities remains an open question. Similar to the\nSudoku puzzles, abstract visual reasoning (AVR) problems require finding\nhigh-level patterns (e.g., repetition constraints) that control the input\nshapes (e.g., digits) in a specific task configuration (e.g., matrix). However,\nexisting AVR benchmarks only considered a limited set of patterns (addition,\nconjunction), input shapes (rectangle, square), and task configurations (3 by 3\nmatrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce\nMARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core\nknowledge patterns, geometric and abstract shapes, and five different task\nconfigurations. To inspect whether the model accuracy is grounded in perception\nand reasoning, MARVEL complements the general AVR question with perception\nquestions in a hierarchical evaluation framework. We conduct comprehensive\nexperiments on MARVEL with nine representative MLLMs in zero-shot and few-shot\nsettings. Our experiments reveal that all models show near-random performance\non the AVR question, with significant performance gaps (40%) compared to humans\nacross all patterns and task configurations. Further analysis of perception\nquestions reveals that MLLMs struggle to comprehend the visual features\n(near-random performance) and even count the panels in the puzzle ( <45%),\nhindering their ability for abstract reasoning. We release our entire code and\ndataset.", "comment": null, "links": []}
{"entry_id": "2404.16033", "title": "Cantor: Inspiring Multimodal Chain-of-Thought of MLLM", "authors": ["Timin Gao", "Peixian Chen", "Mengdan Zhang", "Chaoyou Fu", "Yunhang Shen", "Yan Zhang", "Shengchuan Zhang", "Xiawu Zheng", "Xing Sun", "Liujuan Cao", "Rongrong Ji"], "published": "2024-04-24 17:59:48", "updated": "2024-04-24 17:59:48", "summary": "With the advent of large language models(LLMs) enhanced by the\nchain-of-thought(CoT) methodology, visual reasoning problem is usually\ndecomposed into manageable sub-tasks and tackled sequentially with various\nexternal tools. However, such a paradigm faces the challenge of the potential\n\"determining hallucinations\" in decision-making due to insufficient visual\ninformation and the limitation of low-level perception tools that fail to\nprovide abstract summaries necessary for comprehensive reasoning. We argue that\nconverging visual context acquisition and logical reasoning is pivotal for\ntackling visual reasoning tasks. This paper delves into the realm of multimodal\nCoT to solve intricate visual reasoning tasks with multimodal large language\nmodels(MLLMs) and their cognitive capability. To this end, we propose an\ninnovative multimodal CoT framework, termed Cantor, characterized by a\nperception-decision architecture. Cantor first acts as a decision generator and\nintegrates visual inputs to analyze the image and problem, ensuring a closer\nalignment with the actual context. Furthermore, Cantor leverages the advanced\ncognitive functions of MLLMs to perform as multifaceted experts for deriving\nhigher-level information, enhancing the CoT generation process. Our extensive\nexperiments demonstrate the efficacy of the proposed framework, showing\nsignificant improvements in multimodal CoT performance across two complex\nvisual reasoning datasets, without necessitating fine-tuning or ground-truth\nrationales. Project Page: https://ggg0919.github.io/cantor/ .", "comment": "The project page is available at https://ggg0919.github.io/cantor/", "links": []}
{"entry_id": "2404.14705", "title": "Think-Program-reCtify: 3D Situated Reasoning with Large Language Models", "authors": ["Qingrong He", "Kejun Lin", "Shizhe Chen", "Anwen Hu", "Qin Jin"], "published": "2024-04-23 03:22:06", "updated": "2024-04-23 03:22:06", "summary": "This work addresses the 3D situated reasoning task which aims to answer\nquestions given egocentric observations in a 3D environment. The task remains\nchallenging as it requires comprehensive 3D perception and complex reasoning\nskills. End-to-end models trained on supervised data for 3D situated reasoning\nsuffer from data scarcity and generalization ability. Inspired by the recent\nsuccess of leveraging large language models (LLMs) for visual reasoning, we\npropose LLM-TPC, a novel framework that leverages the planning, tool usage, and\nreflection capabilities of LLMs through a ThinkProgram-reCtify loop. The Think\nphase first decomposes the compositional question into a sequence of steps, and\nthen the Program phase grounds each step to a piece of code and calls carefully\ndesigned 3D visual perception modules. Finally, the Rectify phase adjusts the\nplan and code if the program fails to execute. Experiments and analysis on the\nSQA3D benchmark demonstrate the effectiveness, interpretability and robustness\nof our method. Our code is publicly available at\nhttps://qingrongh.github.io/LLM-TPC/.", "comment": null, "links": []}
{"entry_id": "2404.13847", "title": "EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning", "authors": ["Mingjie Ma", "Zhihuan Yu", "Yichao Ma", "Guohui Li"], "published": "2024-04-22 03:05:32", "updated": "2024-04-22 03:05:32", "summary": "Visual Commonsense Reasoning (VCR) is a cognitive task, challenging models to\nanswer visual questions requiring human commonsense, and to provide rationales\nexplaining why the answers are correct. With emergence of Large Language Models\n(LLMs), it is natural and imperative to explore their applicability to VCR.\nHowever, VCR task demands more external knowledge to tackle its challenging\nquestions, necessitating special designs to activate LLMs' commonsense\nreasoning abilities. Also, most existing Multimodal LLMs adopted an abstraction\nof entire input image, which makes it difficult to comprehend VCR's unique\nco-reference tags between image regions and text, posing challenges for\nfine-grained alignment. To address these issues, we propose EventLens that\nleverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR.\nFirst, by emulating the cognitive process of human reasoning, an Event-Aware\nPretraining auxiliary task is introduced to better activate LLM's global\ncomprehension of intricate scenarios. Second, during fine-tuning, we further\nutilize reference tags to bridge RoI features with texts, while preserving both\nmodality semantics. Finally, we use instruct-style prompts to narrow the gap\nbetween pretraining and fine-tuning, and task-specific adapters to better\nintegrate LLM's inherent knowledge with new commonsense. Experimental results\nshow the effectiveness of our proposed auxiliary task and fine-grained linking\nstrategy.", "comment": null, "links": []}
{"entry_id": "2404.10595", "title": "Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases", "authors": ["Yanze Li", "Wenhua Zhang", "Kai Chen", "Yanxin Liu", "Pengxiang Li", "Ruiyuan Gao", "Lanqing Hong", "Meng Tian", "Xinhai Zhao", "Zhenguo Li", "Dit-Yan Yeung", "Huchuan Lu", "Xu Jia"], "published": "2024-04-16 14:20:55", "updated": "2024-04-16 14:20:55", "summary": "Large Vision-Language Models (LVLMs), due to the remarkable visual reasoning\nability to understand images and videos, have received widespread attention in\nthe autonomous driving domain, which significantly advances the development of\ninterpretable end-to-end autonomous driving. However, current evaluations of\nLVLMs primarily focus on the multi-faceted capabilities in common scenarios,\nlacking quantifiable and automated assessment in autonomous driving contexts,\nlet alone severe road corner cases that even the state-of-the-art autonomous\ndriving perception systems struggle to handle. In this paper, we propose\nCODA-LM, a novel vision-language benchmark for self-driving, which provides the\nfirst automatic and quantitative evaluation of LVLMs for interpretable\nautonomous driving including general perception, regional perception, and\ndriving suggestions. CODA-LM utilizes the texts to describe the road images,\nexploiting powerful text-only large language models (LLMs) without image inputs\nto assess the capabilities of LVLMs in autonomous driving scenarios, which\nreveals stronger alignment with human preferences than LVLM judges. Experiments\ndemonstrate that even the closed-sourced commercial LVLMs like GPT-4V cannot\ndeal with road corner cases well, suggesting that we are still far from a\nstrong LVLM-powered intelligent driving agent, and we hope our CODA-LM can\nbecome the catalyst to promote future development.", "comment": "Project Page: https://coda-dataset.github.io/coda-lm/", "links": []}
{"entry_id": "2401.09966", "title": "Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection", "authors": ["Fan Shi", "Bin Li", "Xiangyang Xue"], "published": "2024-01-18 13:28:44", "updated": "2024-04-14 10:53:43", "summary": "Endowing machines with abstract reasoning ability has been a long-term\nresearch topic in artificial intelligence. Raven's Progressive Matrix (RPM) is\nwidely used to probe abstract visual reasoning in machine intelligence, where\nmodels will analyze the underlying rules and select one image from candidates\nto complete the image matrix. Participators of RPM tests can show powerful\nreasoning ability by inferring and combining attribute-changing rules and\nimagining the missing images at arbitrary positions of a matrix. However,\nexisting solvers can hardly manifest such an ability in realistic RPM tests. In\nthis paper, we propose a deep latent variable model for answer generation\nproblems through Rule AbstractIon and SElection (RAISE). RAISE can encode image\nattributes into latent concepts and abstract atomic rules that act on the\nlatent concepts. When generating answers, RAISE selects one atomic rule out of\nthe global knowledge set for each latent concept to constitute the underlying\nrule of an RPM. In the experiments of bottom-right and arbitrary-position\nanswer generation, RAISE outperforms the compared solvers in most\nconfigurations of realistic RPM datasets. In the odd-one-out task and two\nheld-out configurations, RAISE can leverage acquired latent concepts and atomic\nrules to find the rule-breaking image in a matrix and handle problems with\nunseen combinations of rules and attributes.", "comment": null, "links": []}
{"entry_id": "2404.06405", "title": "Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry", "authors": ["Shiven Sinha", "Ameya Prabhu", "Ponnurangam Kumaraguru", "Siddharth Bhat", "Matthias Bethge"], "published": "2024-04-09 15:54:00", "updated": "2024-04-11 14:37:29", "summary": "Proving geometric theorems constitutes a hallmark of visual reasoning\ncombining both intuitive and logical skills. Therefore, automated theorem\nproving of Olympiad-level geometry problems is considered a notable milestone\nin human-level automated reasoning. The introduction of AlphaGeometry, a\nneuro-symbolic model trained with 100 million synthetic samples, marked a major\nbreakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO)\nproblems whereas the reported baseline based on Wu's method solved only ten. In\nthis note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry,\nand find that Wu's method is surprisingly strong. Wu's method alone can solve\n15 problems, and some of them are not solved by any of the other methods. This\nleads to two key findings: (i) Combining Wu's method with the classic synthetic\nmethods of deductive databases and angle, ratio, and distance chasing solves 21\nout of 30 methods by just using a CPU-only laptop with a time limit of 5\nminutes per problem. Essentially, this classic method solves just 4 problems\nless than AlphaGeometry and establishes the first fully symbolic baseline\nstrong enough to rival the performance of an IMO silver medalist. (ii) Wu's\nmethod even solves 2 of the 5 problems that AlphaGeometry failed to solve.\nThus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art\nfor automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the\nfirst AI method which outperforms an IMO gold medalist.", "comment": "Work in Progress. Released for wider feedback", "links": []}
{"entry_id": "2306.13549", "title": "A Survey on Multimodal Large Language Models", "authors": ["Shukang Yin", "Chaoyou Fu", "Sirui Zhao", "Ke Li", "Xing Sun", "Tong Xu", "Enhong Chen"], "published": "2023-06-23 15:21:52", "updated": "2024-04-01 17:51:54", "summary": "Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has\nbeen a new rising research hotspot, which uses powerful Large Language Models\n(LLMs) as a brain to perform multimodal tasks. The surprising emergent\ncapabilities of MLLM, such as writing stories based on images and OCR-free math\nreasoning, are rare in traditional multimodal methods, suggesting a potential\npath to artificial general intelligence. To this end, both academia and\nindustry have endeavored to develop MLLMs that can compete with or even better\nthan GPT-4V, pushing the limit of research at a surprising speed. In this\npaper, we aim to trace and summarize the recent progress of MLLMs. First of\nall, we present the basic formulation of MLLM and delineate its related\nconcepts, including architecture, training strategy and data, as well as\nevaluation. Then, we introduce research topics about how MLLMs can be extended\nto support more granularity, modalities, languages, and scenarios. We continue\nwith multimodal hallucination and extended techniques, including Multimodal ICL\n(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To\nconclude the paper, we discuss existing challenges and point out promising\nresearch directions. In light of the fact that the era of MLLM has only just\nbegun, we will keep updating this survey and hope it can inspire more research.\nAn associated GitHub link collecting the latest papers is available at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.", "comment": "Project\n  page:https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models", "links": []}
{"entry_id": "2311.17076", "title": "Compositional Chain-of-Thought Prompting for Large Multimodal Models", "authors": ["Chancharik Mitra", "Brandon Huang", "Trevor Darrell", "Roei Herzig"], "published": "2023-11-27 22:23:27", "updated": "2024-04-01 03:17:09", "summary": "The combination of strong visual backbones and Large Language Model (LLM)\nreasoning has led to Large Multimodal Models (LMMs) becoming the current\nstandard for a wide range of vision and language (VL) tasks. However, recent\nresearch has shown that even the most advanced LMMs still struggle to capture\naspects of compositional visual reasoning, such as attributes and relationships\nbetween objects. One solution is to utilize scene graphs (SGs)--a formalization\nof objects and their relations and attributes that has been extensively used as\na bridge between the visual and textual domains. Yet, scene graph data requires\nscene graph annotations, which are expensive to collect and thus not easily\nscalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic\nforgetting of the pretraining objective. To overcome this, inspired by\nchain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a\nnovel zero-shot Chain-of-Thought prompting method that utilizes SG\nrepresentations in order to extract compositional knowledge from an LMM.\nSpecifically, we first generate an SG using the LMM, and then use that SG in\nthe prompt to produce a response. Through extensive experiments, we find that\nthe proposed CCoT approach not only improves LMM performance on several vision\nand language VL compositional benchmarks but also improves the performance of\nseveral popular LMMs on general multimodal benchmarks, without the need for\nfine-tuning or annotated ground-truth SGs. Code:\nhttps://github.com/chancharikmitra/CCoT", "comment": null, "links": []}
{"entry_id": "2403.14743", "title": "VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding", "authors": ["Ahmad Mahmood", "Ashmal Vayani", "Muzammal Naseer", "Salman Khan", "Fahad Shahbaz Khan"], "published": "2024-03-21 18:00:00", "updated": "2024-03-25 01:18:37", "summary": "Recent studies have demonstrated the effectiveness of Large Language Models\n(LLMs) as reasoning modules that can deconstruct complex tasks into more\nmanageable sub-tasks, particularly when applied to visual reasoning tasks for\nimages. In contrast, this paper introduces a Video Understanding and Reasoning\nFramework (VURF) based on the reasoning power of LLMs. Ours is a novel approach\nto extend the utility of LLMs in the context of video tasks, leveraging their\ncapacity to generalize from minimal input and output demonstrations within a\ncontextual framework. By presenting LLMs with pairs of instructions and their\ncorresponding high-level programs, we harness their contextual learning\ncapabilities to generate executable visual programs for video understanding. To\nenhance program's accuracy and robustness, we implement two important\nstrategies. Firstly, we employ a feedback-generation approach, powered by\nGPT-3.5, to rectify errors in programs utilizing unsupported functions.\nSecondly, taking motivation from recent works on self refinement of LLM\noutputs, we introduce an iterative procedure for improving the quality of the\nin-context examples by aligning the initial outputs to the outputs that would\nhave been generated had the LLM not been bound by the structure of the\nin-context examples. Our results on several video-specific tasks, including\nvisual QA, video anticipation, pose estimation and multi-video QA illustrate\nthe efficacy of these enhancements in improving the performance of visual\nprogramming approaches for video tasks.", "comment": null, "links": []}
{"entry_id": "2403.13666", "title": "Grounding Spatial Relations in Text-Only Language Models", "authors": ["Gorka Azkune", "Ander Salaberria", "Eneko Agirre"], "published": "2024-03-20 15:20:30", "updated": "2024-03-20 15:20:30", "summary": "This paper shows that text-only Language Models (LM) can learn to ground\nspatial relations like \"left of\" or \"below\" if they are provided with explicit\nlocation information of objects and they are properly trained to leverage those\nlocations. We perform experiments on a verbalized version of the Visual Spatial\nReasoning (VSR) dataset, where images are coupled with textual statements which\ncontain real or fake spatial relations between two objects of the image. We\nverbalize the images using an off-the-shelf object detector, adding location\ntokens to every object label to represent their bounding boxes in textual form.\nGiven the small size of VSR, we do not observe any improvement when using\nlocations, but pretraining the LM over a synthetic dataset automatically\nderived by us improves results significantly when using location tokens. We\nthus show that locations allow LMs to ground spatial relations, with our\ntext-only LMs outperforming Vision-and-Language Models and setting the new\nstate-of-the-art for the VSR dataset. Our analysis show that our text-only LMs\ncan generalize beyond the relations seen in the synthetic dataset to some\nextent, learning also more useful information than that encoded in the spatial\nrules we used to create the synthetic dataset itself.", "comment": "Accepted in Neural Networks", "links": ["http://dx.doi.org/10.1016/j.neunet.2023.11.031"]}
{"entry_id": "2309.04461", "title": "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", "authors": ["Yangyi Chen", "Karan Sikka", "Michael Cogswell", "Heng Ji", "Ajay Divakaran"], "published": "2023-09-08 17:49:44", "updated": "2024-03-19 21:48:59", "summary": "Vision-language models (VLMs) have recently demonstrated strong efficacy as\nvisual assistants that can parse natural queries about the visual content and\ngenerate human-like outputs. In this work, we explore the ability of these\nmodels to demonstrate human-like reasoning based on the perceived information.\nTo address a crucial concern regarding the extent to which their reasoning\ncapabilities are fully consistent and grounded, we also measure the reasoning\nconsistency of these models. We achieve this by proposing a chain-of-thought\n(CoT) based consistency measure. However, such an evaluation requires a\nbenchmark that encompasses both high-level inference and detailed reasoning\nchains, which is costly. We tackle this challenge by proposing a\nLLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously\nensuring the generation of a high-quality dataset. Based on this pipeline and\nthe existing coarse-grained annotated dataset, we build the CURE benchmark to\nmeasure both the zero-shot reasoning performance and consistency of VLMs. We\nevaluate existing state-of-the-art VLMs, and find that even the best-performing\nmodel is unable to demonstrate strong visual reasoning capabilities and\nconsistency, indicating that substantial efforts are required to enable VLMs to\nperform visual reasoning as systematically and consistently as humans. As an\nearly step, we propose a two-stage training framework aimed at improving both\nthe reasoning performance and consistency of VLMs. The first stage involves\nemploying supervised fine-tuning of VLMs using step-by-step reasoning samples\nautomatically generated by LLMs. In the second stage, we further augment the\ntraining process by incorporating feedback provided by LLMs to produce\nreasoning chains that are highly consistent and grounded. We empirically\nhighlight the effectiveness of our framework in both reasoning performance and\nconsistency.", "comment": "NAACL 2024 Main Conference. The data is released at\n  https://github.com/Yangyi-Chen/CoTConsistency", "links": []}
{"entry_id": "2310.10207", "title": "Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World", "authors": ["Rujie Wu", "Xiaojian Ma", "Zhenliang Zhang", "Wei Wang", "Qing Li", "Song-Chun Zhu", "Yizhou Wang"], "published": "2023-10-16 09:19:18", "updated": "2024-03-18 09:05:12", "summary": "We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world\nfew-shot reasoning for machine vision. It originates from the classical Bongard\nProblems (BPs): Given two sets of images (positive and negative), the model\nneeds to identify the set that query images belong to by inducing the visual\nconcepts, which is exclusively depicted by images from the positive set. Our\nbenchmark inherits the few-shot concept induction of the original BPs while\nadding the two novel layers of challenge: 1) open-world free-form concepts, as\nthe visual concepts in Bongard-OpenWorld are unique compositions of terms from\nan open vocabulary, ranging from object categories to abstract visual\nattributes and commonsense factual knowledge; 2) real-world images, as opposed\nto the synthetic diagrams used by many counterparts. In our exploration,\nBongard-OpenWorld already imposes a significant challenge to current few-shot\nreasoning algorithms. We further investigate to which extent the recently\nintroduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can\nsolve our task, by directly probing VLMs, and combining VLMs and LLMs in an\ninteractive reasoning scheme. We even conceived a neuro-symbolic reasoning\napproach that reconciles LLMs & VLMs with logical reasoning to emulate the\nhuman problem-solving process for Bongard Problems. However, none of these\napproaches manage to close the human-machine gap, as the best learner achieves\n64% accuracy while human participants easily reach 91%. We hope\nBongard-OpenWorld can help us better understand the limitations of current\nvisual intelligence and facilitate future research on visual agents with\nstronger few-shot visual reasoning capabilities.", "comment": "Accepted to ICLR 2024", "links": []}
{"entry_id": "2403.11513", "title": "Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation", "authors": ["Joonhyung Lee", "Sangbeom Park", "Yongin Kwon", "Jemin Lee", "Minwook Ahn", "Sungjoon Choi"], "published": "2024-03-18 06:54:38", "updated": "2024-03-18 06:54:38", "summary": "In robotic object manipulation, human preferences can often be influenced by\nthe visual attributes of objects, such as color and shape. These properties\nplay a crucial role in operating a robot to interact with objects and align\nwith human intention. In this paper, we focus on the problem of inferring\nunderlying human preferences from a sequence of raw visual observations in\ntabletop manipulation environments with a variety of object types, named Visual\nPreference Inference (VPI). To facilitate visual reasoning in the context of\nmanipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR\nemploys a prompting mechanism that describes the difference between the\nconsecutive images (i.e., visual residuals) and incorporates such texts with a\nsequence of images to infer the user's preference. This approach significantly\nenhances the ability to understand and adapt to dynamic changes in its visual\nenvironment during manipulation tasks. Furthermore, we incorporate such texts\nalong with a sequence of images to infer the user's preferences. Our method\noutperforms baseline methods in terms of extracting human preferences from\nvisual sequences in both simulation and real-world environments. Code and\nvideos are available at:\n\\href{https://joonhyung-lee.github.io/vpi/}{https://joonhyung-lee.github.io/vpi/}", "comment": "8 pages", "links": []}
{"entry_id": "2403.06059", "title": "Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning", "authors": ["Yi Zhang", "Ce Zhang"], "published": "2024-03-10 01:34:45", "updated": "2024-03-10 01:34:45", "summary": "Vision-Language Pre-Trained (VLP) models, such as CLIP, have demonstrated\nremarkable effectiveness in learning generic visual representations. Several\napproaches aim to efficiently adapt VLP models to downstream tasks with limited\nsupervision, aiming to leverage the acquired knowledge from VLP models.\nHowever, these methods suffer from either introducing biased representations or\nrequiring high computational complexity, which hinders their effectiveness in\nfine-tuning the CLIP model. Moreover, when a model is trained on data specific\nto a particular domain, its ability to generalize to uncharted domains\ndiminishes. In this work, we propose Test-Time Distribution LearNing Adapter\n(TT-DNA) which directly works during the testing period. Specifically, we\nestimate Gaussian distributions to model visual features of the few-shot\nsupport images to capture the knowledge from the support set. The cosine\nsimilarity between query image and the feature distribution of support images\nis used as the prediction of visual adapter. Subsequently, the visual adapter's\nprediction merges with the original CLIP prediction via a residual connection,\nresulting in the final prediction. Our extensive experimental results on visual\nreasoning for human object interaction demonstrate that our proposed TT-DNA\noutperforms existing state-of-the-art methods by large margins.", "comment": "Accepted by ICASSP 2024", "links": []}
{"entry_id": "2308.06528", "title": "Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices", "authors": ["Jakub Kwiatkowski", "Krzysztof Krawiec"], "published": "2023-08-12 11:02:21", "updated": "2024-03-07 18:17:02", "summary": "Learning to perform abstract reasoning often requires decomposing the task in\nquestion into intermediate subgoals that are not specified upfront, but need to\nbe autonomously devised by the learner. In Raven Progressive Matrices (RPM),\nthe task is to choose one of the available answers given a context, where both\nthe context and answers are composite images featuring multiple objects in\nvarious spatial arrangements. As this high-level goal is the only guidance\navailable, learning to solve RPMs is challenging. In this study, we propose a\ndeep learning architecture based on the transformer blueprint which, rather\nthan directly making the above choice, addresses the subgoal of predicting the\nvisual properties of individual objects and their arrangements. The\nmultidimensional predictions obtained in this way are then directly juxtaposed\nto choose the answer. We consider a few ways in which the model parses the\nvisual input into tokens and several regimes of masking parts of the input in\nself-supervised training. In experimental assessment, the models not only\noutperform state-of-the-art methods but also provide interesting insights and\npartial explanations about the inference. The design of the method also makes\nit immune to biases that are known to be present in some RPM benchmarks.", "comment": "22 pages, 10 figures", "links": []}
{"entry_id": "2308.09778", "title": "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models", "authors": ["Navid Rajabi", "Jana Kosecka"], "published": "2023-08-18 18:58:54", "updated": "2024-03-06 00:38:04", "summary": "Large vision-and-language models (VLMs) trained to match images with text on\nlarge-scale datasets of image-text pairs have shown impressive generalization\nability on several vision and language tasks. Several recent works, however,\nshowed that these models lack fine-grained understanding, such as the ability\nto count and recognize verbs, attributes, or relationships. The focus of this\nwork is to study the understanding of spatial relations. This has been tackled\npreviously using image-text matching (e.g., Visual Spatial Reasoning benchmark)\nor visual question answering (e.g., GQA or VQAv2), both showing poor\nperformance and a large gap compared to human performance. In this work, we\nshow qualitatively (using explainability tools) and quantitatively (using\nobject detectors) that the poor object localization \"grounding\" ability of the\nmodels is a contributing factor to the poor image-text matching performance. We\npropose an alternative fine-grained, compositional approach for recognizing and\nranking spatial clauses that combines the evidence from grounding noun phrases\ncorresponding to objects and their locations to compute the final rank of the\nspatial clause. We demonstrate the approach on representative VLMs (such as\nLXMERT, GPV, and MDETR) and compare and highlight their abilities to reason\nabout spatial relationships.", "comment": "Accepted to DMLR @ ICLR 2024", "links": []}
{"entry_id": "2403.03170", "title": "SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection", "authors": ["Peng Qi", "Zehong Yan", "Wynne Hsu", "Mong Li Lee"], "published": "2024-03-05 18:04:59", "updated": "2024-03-05 18:04:59", "summary": "Misinformation is a prevalent societal issue due to its potential high risks.\nOut-of-context (OOC) misinformation, where authentic images are repurposed with\nfalse text, is one of the easiest and most effective ways to mislead audiences.\nCurrent methods focus on assessing image-text consistency but lack convincing\nexplanations for their judgments, which is essential for debunking\nmisinformation. While Multimodal Large Language Models (MLLMs) have rich\nknowledge and innate capability for visual reasoning and explanation\ngeneration, they still lack sophistication in understanding and discovering the\nsubtle crossmodal differences. In this paper, we introduce SNIFFER, a novel\nmultimodal large language model specifically engineered for OOC misinformation\ndetection and explanation. SNIFFER employs two-stage instruction tuning on\nInstructBLIP. The first stage refines the model's concept alignment of generic\nobjects with news-domain entities and the second stage leverages language-only\nGPT-4 generated OOC-specific instruction data to fine-tune the model's\ndiscriminatory powers. Enhanced by external tools and retrieval, SNIFFER not\nonly detects inconsistencies between text and image but also utilizes external\nknowledge for contextual verification. Our experiments show that SNIFFER\nsurpasses the original MLLM by over 40% and outperforms state-of-the-art\nmethods in detection accuracy. SNIFFER also provides accurate and persuasive\nexplanations as validated by quantitative and human evaluations.", "comment": "To appear in CVPR 2024", "links": []}
{"entry_id": "2402.14818", "title": "PALO: A Polyglot Large Multimodal Model for 5B People", "authors": ["Muhammad Maaz", "Hanoona Rasheed", "Abdelrahman Shaker", "Salman Khan", "Hisham Cholakal", "Rao M. Anwer", "Tim Baldwin", "Michael Felsberg", "Fahad S. Khan"], "published": "2024-02-22 18:59:58", "updated": "2024-03-05 11:22:07", "summary": "In pursuit of more inclusive Vision-Language Models (VLMs), this study\nintroduces a Large Multilingual Multimodal Model called PALO. PALO offers\nvisual reasoning capabilities in 10 major languages, including English,\nChinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese,\nthat span a total of ~5B people (65% of the world population). Our approach\ninvolves a semi-automated translation approach to adapt the multimodal\ninstruction dataset from English to the target languages using a fine-tuned\nLarge Language Model, thereby ensuring high linguistic fidelity while allowing\nscalability due to minimal manual effort. The incorporation of diverse\ninstruction sets helps us boost overall performance across multiple languages\nespecially those that are underrepresented like Hindi, Arabic, Bengali, and\nUrdu. The resulting models are trained across three scales (1.7B, 7B and 13B\nparameters) to show the generalization and scalability where we observe\nsubstantial improvements compared to strong baselines. We also propose the\nfirst multilingual multimodal benchmark for the forthcoming approaches to\nevaluate their vision-language reasoning capabilities across languages. Code:\nhttps://github.com/mbzuai-oryx/PALO.", "comment": "Technical Report of PALO", "links": []}
{"entry_id": "2403.01404", "title": "What Is Missing in Multilingual Visual Reasoning and How to Fix It", "authors": ["Yueqi Song", "Simran Khanuja", "Graham Neubig"], "published": "2024-03-03 05:45:27", "updated": "2024-03-03 05:45:27", "summary": "NLP models today strive for supporting multiple languages and modalities,\nimproving accessibility for diverse users. In this paper, we evaluate their\nmultilingual, multimodal capabilities by testing on a visual reasoning task. We\nobserve that proprietary systems like GPT-4V obtain the best performance on\nthis task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits\nsimilar performance between English and other languages, indicating the\npotential for equitable system development across languages. Our analysis on\nmodel failures reveals three key aspects that make this task challenging:\nmultilinguality, complex reasoning, and multimodality. To address these\nchallenges, we propose three targeted interventions including a translate-test\napproach to tackle multilinguality, a visual programming approach to break down\ncomplex reasoning, and a novel method that leverages image captioning to\naddress multimodality. Our interventions achieve the best open performance on\nthis task in a zero-shot setting, boosting open model LLaVA by 13.4%, while\nalso minorly improving GPT-4V's performance.", "comment": null, "links": []}
{"entry_id": "2302.00389", "title": "Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications", "authors": ["Muhammad Arslan Manzoor", "Sarah Albarri", "Ziting Xian", "Zaiqiao Meng", "Preslav Nakov", "Shangsong Liang"], "published": "2023-02-01 11:48:34", "updated": "2024-03-01 18:44:59", "summary": "Multimodality Representation Learning, as a technique of learning to embed\ninformation from different modalities and their correlations, has achieved\nremarkable success on a variety of applications, such as Visual Question\nAnswering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision\nLanguage Retrieval (VLR). Among these applications, cross-modal interaction and\ncomplementary information from different modalities are crucial for advanced\nmodels to perform any multimodal task, e.g., understand, recognize, retrieve,\nor generate optimally. Researchers have proposed diverse methods to address\nthese tasks. The different variants of transformer-based architectures\nperformed extraordinarily on multiple modalities. This survey presents the\ncomprehensive literature on the evolution and enhancement of deep learning\nmultimodal architectures to deal with textual, visual and audio features for\ndiverse cross-modal and modern multimodal tasks. This study summarizes the (i)\nrecent task-specific deep learning methodologies, (ii) the pretraining types\nand multimodal pretraining objectives, (iii) from state-of-the-art pretrained\nmultimodal approaches to unifying architectures, and (iv) multimodal task\ncategories and possible future improvements that can be devised for better\nmultimodal learning. Moreover, we prepare a dataset section for new researchers\nthat covers most of the benchmarks for pretraining and finetuning. Finally,\nmajor challenges, gaps, and potential research topics are explored. A\nconstantly-updated paperlist related to our survey is maintained at\nhttps://github.com/marslanm/multimodality-representation-learning.", "comment": null, "links": []}
{"entry_id": "2308.11971", "title": "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", "authors": ["Junyi Chen", "Longteng Guo", "Jia Sun", "Shuai Shao", "Zehuan Yuan", "Liang Lin", "Dongyu Zhang"], "published": "2023-08-23 07:36:30", "updated": "2024-03-01 11:22:54", "summary": "Building scalable vision-language models to learn from diverse, multimodal\ndata remains an open challenge. In this paper, we introduce an Efficient\nVision-languagE foundation model, namely EVE, which is one unified multimodal\nTransformer pre-trained solely by one unified pre-training task. Specifically,\nEVE encodes both vision and language within a shared Transformer network\nintegrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which\ncapture modality-specific information by selectively switching to different\nexperts. To unify pre-training tasks of vision and language, EVE performs\nmasked signal modeling on image-text pairs to reconstruct masked signals, i.e.,\nimage pixels and text tokens, given visible signals. This simple yet effective\npre-training objective accelerates training by 3.5x compared to the model\npre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing\nto the combination of the unified architecture and pre-training task, EVE is\neasy to scale up, enabling better downstream performance with fewer resources\nand faster training speed. Despite its simplicity, EVE achieves\nstate-of-the-art performance on various vision-language downstream tasks,\nincluding visual question answering, visual reasoning, and image-text\nretrieval.", "comment": "Accepted by AAAI 2024", "links": []}
{"entry_id": "2403.00352", "title": "Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning", "authors": ["Ruiqian Nai", "Zixin Wen", "Ji Li", "Yuanzhi Li", "Yang Gao"], "published": "2024-03-01 08:31:58", "updated": "2024-03-01 08:31:58", "summary": "In representation learning, a disentangled representation is highly desirable\nas it encodes generative factors of data in a separable and compact pattern.\nResearchers have advocated leveraging disentangled representations to complete\ndownstream tasks with encouraging empirical evidence. This paper further\ninvestigates the necessity of disentangled representation in downstream\napplications. Specifically, we show that dimension-wise disentangled\nrepresentations are unnecessary on a fundamental downstream task, abstract\nvisual reasoning. We provide extensive empirical evidence against the necessity\nof disentanglement, covering multiple datasets, representation learning\nmethods, and downstream network architectures. Furthermore, our findings\nsuggest that the informativeness of representations is a better indicator of\ndownstream performance than disentanglement. Finally, the positive correlation\nbetween informativeness and disentanglement explains the claimed usefulness of\ndisentangled representations in previous works. The source code is available at\nhttps://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git.", "comment": "Accepted to AAAI-2024", "links": []}
{"entry_id": "2308.10562", "title": "Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories", "authors": ["Delfina Sol Martinez Pandiani", "Valentina Presutti"], "published": "2023-08-21 08:37:04", "updated": "2024-02-29 16:18:45", "summary": "The field of Computer Vision (CV) is increasingly shifting towards\n``high-level'' visual sensemaking tasks, yet the exact nature of these tasks\nremains unclear and tacit. This survey paper addresses this ambiguity by\nsystematically reviewing research on high-level visual understanding, focusing\nparticularly on Abstract Concepts (ACs) in automatic image classification. Our\nsurvey contributes in three main ways: Firstly, it clarifies the tacit\nunderstanding of high-level semantics in CV through a multidisciplinary\nanalysis, and categorization into distinct clusters, including commonsense,\nemotional, aesthetic, and inductive interpretative semantics. Secondly, it\nidentifies and categorizes computer vision tasks associated with high-level\nvisual sensemaking, offering insights into the diverse research areas within\nthis domain. Lastly, it examines how abstract concepts such as values and\nideologies are handled in CV, revealing challenges and opportunities in\nAC-based image classification. Notably, our survey of AC image classification\ntasks highlights persistent challenges, such as the limited efficacy of massive\ndatasets and the importance of integrating supplementary information and\nmid-level features. We emphasize the growing relevance of hybrid AI systems in\naddressing the multifaceted nature of AC image classification tasks. Overall,\nthis survey enhances our understanding of high-level visual reasoning in CV and\nlays the groundwork for future research endeavors.", "comment": "Preprint", "links": []}
{"entry_id": "2403.10534", "title": "VISREAS: Complex Visual Reasoning with Unanswerable Questions", "authors": ["Syeda Nahida Akter", "Sangwu Lee", "Yingshan Chang", "Yonatan Bisk", "Eric Nyberg"], "published": "2024-02-23 00:12:10", "updated": "2024-02-23 00:12:10", "summary": "Verifying a question's validity before answering is crucial in real-world\napplications, where users may provide imperfect instructions. In this scenario,\nan ideal model should address the discrepancies in the query and convey them to\nthe users rather than generating the best possible answer. Addressing this\nrequirement, we introduce a new compositional visual question-answering\ndataset, VISREAS, that consists of answerable and unanswerable visual queries\nformulated by traversing and perturbing commonalities and differences among\nobjects, attributes, and relations. VISREAS contains 2.07M semantically diverse\nqueries generated automatically using Visual Genome scene graphs. The unique\nfeature of this task, validating question answerability with respect to an\nimage before answering, and the poor performance of state-of-the-art models\ninspired the design of a new modular baseline, LOGIC2VISION that reasons by\nproducing and executing pseudocode without any external modules to generate the\nanswer. LOGIC2VISION outperforms generative models in VISREAS (+4.82% over\nLLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in\nperformance against the classification models.", "comment": "18 pages, 14 figures, 5 tables", "links": []}
{"entry_id": "2402.12675", "title": "Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach", "authors": ["Guillermo Puebla", "Jeffrey S. Bowers"], "published": "2024-02-20 02:48:14", "updated": "2024-02-20 02:48:14", "summary": "Achieving visual reasoning is a long-term goal of artificial intelligence. In\nthe last decade, several studies have applied deep neural networks (DNNs) to\nthe task of learning visual relations from images, with modest results in terms\nof generalization of the relations learned. However, in recent years,\nobject-centric representation learning has been put forward as a way to achieve\nvisual reasoning within the deep learning framework. Object-centric models\nattempt to model input scenes as compositions of objects and relations between\nthem. To this end, these models use several kinds of attention mechanisms to\nsegregate the individual objects in a scene from the background and from other\nobjects. In this work we tested relation learning and generalization in several\nobject-centric models, as well as a ResNet-50 baseline. In contrast to previous\nresearch, which has focused heavily in the same-different task in order to\nasses relational reasoning in DNNs, we use a set of tasks -- with varying\ndegrees of difficulty -- derived from the comparative cognition literature. Our\nresults show that object-centric models are able to segregate the different\nobjects in a scene, even in many out-of-distribution cases. In our simpler\ntasks, this improves their capacity to learn and generalize visual relations in\ncomparison to the ResNet-50 baseline. However, object-centric models still\nstruggle in our more difficult tasks and conditions. We conclude that abstract\nvisual reasoning remains an open challenge for DNNs, including object-centric\nmodels.", "comment": "16 pages, 14 figures", "links": []}
{"entry_id": "2402.11574", "title": "Visual In-Context Learning for Large Vision-Language Models", "authors": ["Yucheng Zhou", "Xiang Li", "Qianning Wang", "Jianbing Shen"], "published": "2024-02-18 12:43:38", "updated": "2024-02-18 12:43:38", "summary": "In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning\n(ICL) remains limited by challenges in cross-modal interactions and\nrepresentation disparities. To overcome these challenges, we introduce a novel\nVisual In-Context Learning (VICL) method comprising Visual Demonstration\nRetrieval, Intent-Oriented Image Summarization, and Intent-Oriented\nDemonstration Composition. Our approach retrieves images via ''Retrieval &\nRerank'' paradigm, summarises images with task intent and task-specific visual\nparsing, and composes language-based demonstrations that reduce token count and\nalleviate cross-modal interaction problem. Experimental evaluations on five\nvisual reasoning datasets demonstrate the effectiveness of our method.\nMoreover, our extensive experiments leverage information flow analysis to\nelucidate the effectiveness of our method, and investigate the impact of length\nand position of demonstrations for LVLM. The use of in-context unlearning\nfurther shows promise in resetting specific model knowledge without retraining.", "comment": "13 pages, 7 figures", "links": []}
{"entry_id": "2402.03507", "title": "Neural networks for abstraction and reasoning: Towards broad generalization in machines", "authors": ["Mikel Bober-Irizar", "Soumya Banerjee"], "published": "2024-02-05 20:48:57", "updated": "2024-02-05 20:48:57", "summary": "For half a century, artificial intelligence research has attempted to\nreproduce the human qualities of abstraction and reasoning - creating computer\nsystems that can learn new concepts from a minimal set of examples, in settings\nwhere humans find this easy. While specific neural networks are able to solve\nan impressive range of problems, broad generalisation to situations outside\ntheir training data has proved elusive.In this work, we look at several novel\napproaches for solving the Abstraction & Reasoning Corpus (ARC), a dataset of\nabstract visual reasoning tasks introduced to test algorithms on broad\ngeneralization. Despite three international competitions with $100,000 in\nprizes, the best algorithms still fail to solve a majority of ARC tasks and\nrely on complex hand-crafted rules, without using machine learning at all. We\nrevisit whether recent advances in neural networks allow progress on this task.\n  First, we adapt the DreamCoder neurosymbolic reasoning solver to ARC.\nDreamCoder automatically writes programs in a bespoke domain-specific language\nto perform reasoning, using a neural network to mimic human intuition. We\npresent the Perceptual Abstraction and Reasoning Language (PeARL) language,\nwhich allows DreamCoder to solve ARC tasks, and propose a new recognition model\nthat allows us to significantly improve on the previous best implementation.We\nalso propose a new encoding and augmentation scheme that allows large language\nmodels (LLMs) to solve ARC tasks, and find that the largest models can solve\nsome ARC tasks. LLMs are able to solve a different group of problems to\nstate-of-the-art solvers, and provide an interesting way to complement other\napproaches. We perform an ensemble analysis, combining models to achieve better\nresults than any system alone. Finally, we publish the arckit Python library to\nmake future research on ARC easier.", "comment": "32 pages main text, 17 pages", "links": []}
{"entry_id": "2401.04181", "title": "Language-Conditioned Robotic Manipulation with Fast and Slow Thinking", "authors": ["Minjie Zhu", "Yichen Zhu", "Jinming Li", "Junjie Wen", "Zhiyuan Xu", "Zhengping Che", "Chaomin Shen", "Yaxin Peng", "Dong Liu", "Feifei Feng", "Jian Tang"], "published": "2024-01-08 19:00:32", "updated": "2024-02-01 08:32:33", "summary": "The language-conditioned robotic manipulation aims to transfer natural\nlanguage instructions into executable actions, from simple pick-and-place to\ntasks requiring intent recognition and visual reasoning. Inspired by the dual\nprocess theory in cognitive science, which suggests two parallel systems of\nfast and slow thinking in human decision-making, we introduce Robotics with\nFast and Slow Thinking (RFST), a framework that mimics human cognitive\narchitecture to classify tasks and makes decisions on two systems based on\ninstruction types. Our RFST consists of two key components: 1) an instruction\ndiscriminator to determine which system should be activated based on the\ncurrent user instruction, and 2) a slow-thinking system that is comprised of a\nfine-tuned vision language model aligned with the policy networks, which allows\nthe robot to recognize user intention or perform reasoning tasks. To assess our\nmethodology, we built a dataset featuring real-world trajectories, capturing\nactions ranging from spontaneous impulses to tasks requiring deliberate\ncontemplation. Our results, both in simulation and real-world scenarios,\nconfirm that our approach adeptly manages intricate tasks that demand intent\nrecognition and reasoning. The project is available at\nhttps://jlm-z.github.io/RSFT/", "comment": "accepted to ICRA2024", "links": []}
{"entry_id": "2308.08334", "title": "Learning logic programs by discovering higher-order abstractions", "authors": ["Céline Hocquette", "Sebastijan Dumančić", "Andrew Cropper"], "published": "2023-08-16 12:50:10", "updated": "2024-01-29 18:34:39", "summary": "We introduce the higher-order refactoring problem, where the goal is to\ncompress a logic program by discovering higher-order abstractions, such as map,\nfilter, and fold. We implement our approach in Stevie, which formulates the\nrefactoring problem as a constraint optimisation problem. Our experiments on\nmultiple domains, including program synthesis and visual reasoning, show that\nrefactoring can improve the learning performance of an inductive logic\nprogramming system, specifically improving predictive accuracies by 27% and\nreducing learning times by 47%. We also show that Stevie can discover\nabstractions that transfer to multiple domains.", "comment": null, "links": []}
{"entry_id": "2401.16024", "title": "Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures", "authors": ["Michael Hersche", "Francesco di Stefano", "Thomas Hofmann", "Abu Sebastian", "Abbas Rahimi"], "published": "2024-01-29 10:17:18", "updated": "2024-01-29 10:17:18", "summary": "Abstract reasoning is a cornerstone of human intelligence, and replicating it\nwith artificial intelligence (AI) presents an ongoing challenge. This study\nfocuses on efficiently solving Raven's progressive matrices (RPM), a visual\ntest for assessing abstract reasoning abilities, by using distributed\ncomputation and operators provided by vector-symbolic architectures (VSA).\nInstead of hard-coding the rule formulations associated with RPMs, our approach\ncan learn the VSA rule formulations (hence the name Learn-VRF) with just one\npass through the training data. Yet, our approach, with compact parameters,\nremains transparent and interpretable. Learn-VRF yields accurate predictions on\nI-RAVEN's in-distribution data, and exhibits strong out-of-distribution\ncapabilities concerning unseen attribute-rule pairs, significantly\noutperforming pure connectionist baselines including large language models. Our\ncode is available at\nhttps://github.com/IBM/learn-vector-symbolic-architectures-rule-formulations.", "comment": "Accepted in NeurIPS 2023 Workshop on MATH-AI", "links": []}
{"entry_id": "2306.17778", "title": "Look, Remember and Reason: Grounded reasoning in videos with language models", "authors": ["Apratim Bhattacharyya", "Sunny Panchal", "Mingu Lee", "Reza Pourreza", "Pulkit Madan", "Roland Memisevic"], "published": "2023-06-30 16:31:14", "updated": "2024-01-22 00:54:30", "summary": "Multi-modal language models (LM) have recently shown promising performance in\nhigh-level reasoning tasks on videos. However, existing methods still fall\nshort in tasks like causal or compositional spatiotemporal reasoning over\nactions, in which model predictions need to be grounded in fine-grained\nlow-level details, such as object motions and object interactions. In this\nwork, we propose training an LM end-to-end on low-level surrogate tasks,\nincluding object detection, re-identification, and tracking, to endow the model\nwith the required low-level visual capabilities. We show that a two-stream\nvideo encoder with spatiotemporal attention is effective at capturing the\nrequired static and motion-based cues in the video. By leveraging the LM's\nability to perform the low-level surrogate tasks, we can cast reasoning in\nvideos as the three-step process of Look, Remember, Reason wherein visual\ninformation is extracted using low-level visual skills step-by-step and then\nintegrated to arrive at a final answer. We demonstrate the effectiveness of our\nframework on diverse visual reasoning tasks from the ACRE, CATER,\nSomething-Else and STAR datasets. Our approach is trainable end-to-end and\nsurpasses state-of-the-art task-specific methods across these tasks by a large\nmargin.", "comment": "To appear at ICLR 2024", "links": []}
{"entry_id": "2309.01409", "title": "Implicit Neural Image Stitching", "authors": ["Minsu Kim", "Jaewon Lee", "Byeonghun Lee", "Sunghoon Im", "Kyong Hwan Jin"], "published": "2023-09-04 07:40:30", "updated": "2024-01-22 00:22:14", "summary": "Existing frameworks for image stitching often provide visually reasonable\nstitchings. However, they suffer from blurry artifacts and disparities in\nillumination, depth level, etc. Although the recent learning-based stitchings\nrelax such disparities, the required methods impose sacrifice of image\nqualities failing to capture high-frequency details for stitched images. To\naddress the problem, we propose a novel approach, implicit Neural Image\nStitching (NIS) that extends arbitrary-scale super-resolution. Our method\nestimates Fourier coefficients of images for quality-enhancing warps. Then, the\nsuggested model blends color mismatches and misalignment in the latent space\nand decodes the features into RGB values of stitched images. Our experiments\nshow that our approach achieves improvement in resolving the low-definition\nimaging of the previous deep image stitching with favorable accelerated\nimage-enhancing methods. Our source code is available at\nhttps://github.com/minshu-kim/NIS.", "comment": null, "links": []}
{"entry_id": "2401.11035", "title": "Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually", "authors": ["Mazal Bethany", "Brandon Wherry", "Nishant Vishwamitra", "Peyman Najafirad"], "published": "2024-01-19 21:38:18", "updated": "2024-01-19 21:38:18", "summary": "Social media platforms are being increasingly used by malicious actors to\nshare unsafe content, such as images depicting sexual activity, cyberbullying,\nand self-harm. Consequently, major platforms use artificial intelligence (AI)\nand human moderation to obfuscate such images to make them safer. Two critical\nneeds for obfuscating unsafe images is that an accurate rationale for\nobfuscating image regions must be provided, and the sensitive regions should be\nobfuscated (\\textit{e.g.} blurring) for users' safety. This process involves\naddressing two key problems: (1) the reason for obfuscating unsafe images\ndemands the platform to provide an accurate rationale that must be grounded in\nunsafe image-specific attributes, and (2) the unsafe regions in the image must\nbe minimally obfuscated while still depicting the safe regions. In this work,\nwe address these key issues by first performing visual reasoning by designing a\nvisual reasoning model (VLM) conditioned on pre-trained unsafe image\nclassifiers to provide an accurate rationale grounded in unsafe image\nattributes, and then proposing a counterfactual explanation algorithm that\nminimally identifies and obfuscates unsafe regions for safe viewing, by first\nutilizing an unsafe image classifier attribution matrix to guide segmentation\nfor a more optimal subregion segmentation followed by an informed greedy search\nto determine the minimum number of subregions required to modify the\nclassifier's output based on attribution score. Extensive experiments on\nuncurated data from social networks emphasize the efficacy of our proposed\nmethod. We make our code available at:\nhttps://github.com/SecureAIAutonomyLab/ConditionalVLM", "comment": null, "links": []}
{"entry_id": "2212.08044", "title": "Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift", "authors": ["Jielin Qiu", "Yi Zhu", "Xingjian Shi", "Florian Wenzel", "Zhiqiang Tang", "Ding Zhao", "Bo Li", "Mu Li"], "published": "2022-12-15 18:52:03", "updated": "2024-01-19 15:29:34", "summary": "Multimodal image-text models have shown remarkable performance in the past\nfew years. However, evaluating robustness against distribution shifts is\ncrucial before adopting them in real-world applications. In this work, we\ninvestigate the robustness of 12 popular open-sourced image-text models under\ncommon perturbations on five tasks (image-text retrieval, visual reasoning,\nvisual entailment, image captioning, and text-to-image generation). In\nparticular, we propose several new multimodal robustness benchmarks by applying\n17 image perturbation and 16 text perturbation techniques on top of existing\ndatasets. We observe that multimodal models are not robust to image and text\nperturbations, especially to image perturbations. Among the tested perturbation\nmethods, character-level perturbations constitute the most severe distribution\nshift for text, and zoom blur is the most severe shift for image data. We also\nintroduce two new robustness metrics (\\textbf{MMI} for MultiModal Impact score\nand \\textbf{MOR} for Missing Object Rate) for proper evaluations of multimodal\nmodels. We hope our extensive study sheds light on new directions for the\ndevelopment of robust multimodal models. More details can be found on the\nproject webpage: \\url{https://MMRobustness.github.io}.", "comment": "Accepted by Journal of Data-centric Machine Learning Research (DMLR)\n  2024", "links": []}
{"entry_id": "2401.08695", "title": "Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence", "authors": ["Zhengqing Fang", "Shuowen Zhou", "Zhouhang Yuan", "Yuxuan Si", "Mengze Li", "Jinxu Li", "Yesheng Xu", "Wenjia Xie", "Kun Kuang", "Yingming Li", "Fei Wu", "Yu-Feng Yao"], "published": "2024-01-14 02:10:54", "updated": "2024-01-14 02:10:54", "summary": "Although data-driven artificial intelligence (AI) in medical image diagnosis\nhas shown impressive performance in silico, the lack of interpretability makes\nit difficult to incorporate the \"black box\" into clinicians' workflows. To make\nthe diagnostic patterns learned from data understandable by clinicians, we\ndevelop an interpretable model, knowledge-guided diagnosis model (KGDM), that\nprovides a visualized reasoning process containing AI-based biomarkers and\nretrieved cases that with the same diagnostic patterns. It embraces clinicians'\nprompts into the interpreted reasoning through human-AI interaction, leading to\npotentially enhanced safety and more accurate predictions. This study\ninvestigates the performance, interpretability, and clinical utility of KGDM in\nthe diagnosis of infectious keratitis (IK), which is the leading cause of\ncorneal blindness. The classification performance of KGDM is evaluated on a\nprospective validation dataset, an external testing dataset, and an publicly\navailable testing dataset. The diagnostic odds ratios (DOR) of the interpreted\nAI-based biomarkers are effective, ranging from 3.011 to 35.233 and exhibit\nconsistent diagnostic patterns with clinic experience. Moreover, a human-AI\ncollaborative diagnosis test is conducted and the participants with\ncollaboration achieved a performance exceeding that of both humans and AI. By\nsynergistically integrating interpretability and interaction, this study\nfacilitates the convergence of clinicians' expertise and data-driven\nintelligence. The promotion of inexperienced ophthalmologists with the aid of\nAI-based biomarkers, as well as increased AI prediction by intervention from\nexperienced ones, demonstrate a promising diagnostic paradigm for infectious\nkeratitis using KGDM, which holds the potential for extension to other diseases\nwhere experienced medical practitioners are limited and the safety of AI is\nconcerned.", "comment": "33 pages", "links": []}
{"entry_id": "2301.13335", "title": "Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning", "authors": ["Jian Zhu", "Hanli Wang", "Miaojing Shi"], "published": "2023-01-30 23:43:28", "updated": "2023-12-25 12:59:02", "summary": "The visual commonsense reasoning (VCR) task is to choose an answer and\nprovide a justifying rationale based on the given image and textural question.\nRepresentative works first recognize objects in images and then associate them\nwith key words in texts. However, existing approaches do not consider exact\npositions of objects in a human-like three-dimensional (3D) manner, making them\nincompetent to accurately distinguish objects and understand visual relation.\nRecently, multi-modal large language models (MLLMs) have been used as powerful\ntools for several multi-modal tasks but not for VCR yet, which requires\nelaborate reasoning on specific visual objects referred by texts. In light of\nthe above, an MLLM enhanced pseudo 3D perception framework is designed for VCR.\nSpecifically, we first demonstrate that the relation between objects is\nrelevant to object depths in images, and hence introduce object depth into VCR\nframeworks to infer 3D positions of objects in images. Then, a depth-aware\nTransformer is proposed to encode depth differences between objects into the\nattention mechanism of Transformer to discriminatively associate objects with\nvisual scenes guided by depth. To further associate the answer with the depth\nof visual scene, each word in the answer is tagged with a pseudo depth to\nrealize depth-aware association between answer words and objects. On the other\nhand, BLIP-2 as an MLLM is employed to process images and texts, and the\nreferring expressions in texts involving specific visual objects are modified\nwith linguistic object labels to serve as comprehensible MLLM inputs. Finally,\na parameter optimization technique is devised to fully consider the quality of\ndata batches based on multi-level reasoning confidence. Experiments on the VCR\ndataset demonstrate the superiority of the proposed framework over\nstate-of-the-art approaches.", "comment": null, "links": []}
{"entry_id": "2312.14233", "title": "VCoder: Versatile Vision Encoders for Multimodal Large Language Models", "authors": ["Jitesh Jain", "Jianwei Yang", "Humphrey Shi"], "published": "2023-12-21 18:49:47", "updated": "2023-12-21 18:49:47", "summary": "Humans possess the remarkable skill of Visual Perception, the ability to see\nand understand the seen, helping them make sense of the visual world and, in\nturn, reason. Multimodal Large Language Models (MLLM) have recently achieved\nimpressive performance on vision-language tasks ranging from visual\nquestion-answering and image captioning to visual reasoning and image\ngeneration. However, when prompted to identify or count (perceive) the entities\nin a given image, existing MLLM systems fail. Working towards developing an\naccurate MLLM system for perception and reasoning, we propose using Versatile\nvision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the\nVCoder with perception modalities such as segmentation or depth maps, improving\nthe MLLM's perception abilities. Secondly, we leverage the images from COCO and\noutputs from off-the-shelf vision perception models to create our COCO\nSegmentation Text (COST) dataset for training and evaluating MLLMs on the\nobject perception task. Thirdly, we introduce metrics to assess the object\nperception abilities in MLLMs on our COST dataset. Lastly, we provide extensive\nexperimental evidence proving the VCoder's improved object-level perception\nskills over existing Multimodal LLMs, including GPT-4V. We open-source our\ndataset, code, and models to promote research. We open-source our code at\nhttps://github.com/SHI-Labs/VCoder", "comment": "Project Page: https://praeclarumjj3.github.io/vcoder/", "links": []}
{"entry_id": "2312.12436", "title": "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", "authors": ["Chaoyou Fu", "Renrui Zhang", "Zihan Wang", "Yubo Huang", "Zhengye Zhang", "Longtian Qiu", "Gaoxiang Ye", "Yunhang Shen", "Mengdan Zhang", "Peixian Chen", "Sirui Zhao", "Shaohui Lin", "Deqiang Jiang", "Di Yin", "Peng Gao", "Ke Li", "Hongsheng Li", "Xing Sun"], "published": "2023-12-19 18:59:22", "updated": "2023-12-20 12:40:47", "summary": "The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.", "comment": "Total 120 pages. See our project at\n  https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models", "links": []}
{"entry_id": "2308.09936", "title": "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions", "authors": ["Wenbo Hu", "Yifan Xu", "Yi Li", "Weiyue Li", "Zeyuan Chen", "Zhuowen Tu"], "published": "2023-08-19 07:53:43", "updated": "2023-12-18 04:33:17", "summary": "Vision Language Models (VLMs), which extend Large Language Models (LLM) by\nincorporating visual understanding capability, have demonstrated significant\nadvancements in addressing open-ended visual question-answering (VQA) tasks.\nHowever, these models cannot accurately interpret images infused with text, a\ncommon occurrence in real-world scenarios. Standard procedures for extracting\ninformation from images often involve learning a fixed set of query embeddings.\nThese embeddings are designed to encapsulate image contexts and are later used\nas soft prompt inputs in LLMs. Yet, this process is limited to the token count,\npotentially curtailing the recognition of scenes with text-rich context. To\nimprove upon them, the present study introduces BLIVA: an augmented version of\nInstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings\nfrom InstructBLIP and also directly projects encoded patch embeddings into the\nLLM, a technique inspired by LLaVA. This approach assists the model to capture\nintricate details potentially missed during the query decoding process.\nEmpirical evidence demonstrates that our model, BLIVA, significantly enhances\nperformance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA\nbenchmark) and in undertaking general (not particularly text-rich) VQA\nbenchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved\n17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME),\ncomparing to our baseline InstructBLIP. BLIVA demonstrates significant\ncapability in decoding real-world images, irrespective of text presence. To\ndemonstrate the broad industry applications enabled by BLIVA, we evaluate the\nmodel using a new dataset comprising YouTube thumbnails paired with\nquestion-answer sets across 11 diverse categories. Our code and models are\nfreely accessible at https://github.com/mlpc-ucsd/BLIVA.", "comment": "Accepted at AAAI Conference on Artificial Intelligence (AAAI-24)", "links": []}
{"entry_id": "2307.08506", "title": "Does Visual Pretraining Help End-to-End Reasoning?", "authors": ["Chen Sun", "Calvin Luo", "Xingyi Zhou", "Anurag Arnab", "Cordelia Schmid"], "published": "2023-07-17 14:08:38", "updated": "2023-12-16 00:05:07", "summary": "We aim to investigate whether end-to-end learning of visual reasoning can be\nachieved with general-purpose neural networks, with the help of visual\npretraining. A positive result would refute the common belief that explicit\nvisual abstraction (e.g. object detection) is essential for compositional\ngeneralization on visual reasoning, and confirm the feasibility of a neural\nnetwork \"generalist\" to solve visual recognition and reasoning tasks. We\npropose a simple and general self-supervised framework which \"compresses\" each\nvideo frame into a small set of tokens with a transformer network, and\nreconstructs the remaining frames based on the compressed temporal context. To\nminimize the reconstruction loss, the network must learn a compact\nrepresentation for each image, as well as capture temporal dynamics and object\npermanence from temporal context. We perform evaluation on two visual reasoning\nbenchmarks, CATER and ACRE. We observe that pretraining is essential to achieve\ncompositional generalization for end-to-end visual reasoning. Our proposed\nframework outperforms traditional supervised pretraining, including image\nclassification and explicit object detection, by large margins.", "comment": "NeurIPS 2023", "links": []}
{"entry_id": "2312.09997", "title": "One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2023-12-15 18:15:20", "updated": "2023-12-15 18:15:20", "summary": "Abstract Visual Reasoning (AVR) comprises a wide selection of various\nproblems similar to those used in human IQ tests. Recent years have brought\ndynamic progress in solving particular AVR tasks, however, in the contemporary\nliterature AVR problems are largely dealt with in isolation, leading to highly\nspecialized task-specific methods. With the aim of developing universal\nlearning systems in the AVR domain, we propose the unified model for solving\nSingle-Choice Abstract visual Reasoning tasks (SCAR), capable of solving\nvarious single-choice AVR tasks, without making any a priori assumptions about\nthe task structure, in particular the number and location of panels. The\nproposed model relies on a novel Structure-Aware dynamic Layer (SAL), which\nadapts its weights to the structure of the considered AVR problem. Experiments\nconducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One\nOut problems show that SCAR (SAL-based models, in general) effectively solves\ndiverse AVR tasks, and its performance is on par with the state-of-the-art\ntask-specific baselines. What is more, SCAR demonstrates effective knowledge\nreuse in multi-task and transfer learning settings. To our knowledge, this work\nis the first successful attempt to construct a general single-choice AVR solver\nrelying on self-configurable architecture and unified solving method. With this\nwork we aim to stimulate and foster progress on task-independent research paths\nin the AVR domain, with the long-term goal of development of a general AVR\nsolver.", "comment": "Accepted to The 38th Annual AAAI Conference on Artificial\n  Intelligence (AAAI 2024)", "links": []}
{"entry_id": "2107.01671", "title": "Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory", "authors": ["Xuejiao Tang", "Xin Huang", "Wenbin Zhang", "Travers B. Child", "Qiong Hu", "Zhen Liu", "Ji Zhang"], "published": "2021-07-04 15:58:31", "updated": "2023-12-07 23:22:52", "summary": "Visual Commonsense Reasoning (VCR) predicts an answer with corresponding\nrationale, given a question-image input. VCR is a recently introduced visual\nscene understanding task with a wide range of applications, including visual\nquestion answering, automated vehicle systems, and clinical decision support.\nPrevious approaches to solving the VCR task generally rely on pre-training or\nexploiting memory with long dependency relationship encoded models. However,\nthese approaches suffer from a lack of generalizability and prior knowledge. In\nthis paper we propose a dynamic working memory based cognitive VCR network,\nwhich stores accumulated commonsense between sentences to provide prior\nknowledge for inference. Extensive experiments show that the proposed model\nyields significant improvements over existing methods on the benchmark VCR\ndataset. Moreover, the proposed model provides intuitive interpretation into\nvisual commonsense reasoning. A Python implementation of our mechanism is\npublicly available at https://github.com/tanjatang/DMVCR", "comment": "DaWaK 2021", "links": []}
{"entry_id": "2108.02924", "title": "Interpretable Visual Understanding with Cognitive Attention Network", "authors": ["Xuejiao Tang", "Wenbin Zhang", "Yi Yu", "Kea Turner", "Tyler Derr", "Mengyu Wang", "Eirini Ntoutsi"], "published": "2021-08-06 02:57:43", "updated": "2023-12-07 23:09:57", "summary": "While image understanding on recognition-level has achieved remarkable\nadvancements, reliable visual scene understanding requires comprehensive image\nunderstanding on recognition-level but also cognition-level, which calls for\nexploiting the multi-source information as well as learning different levels of\nunderstanding and extensive commonsense knowledge. In this paper, we propose a\nnovel Cognitive Attention Network (CAN) for visual commonsense reasoning to\nachieve interpretable visual understanding. Specifically, we first introduce an\nimage-text fusion module to fuse information from images and text collectively.\nSecond, a novel inference module is designed to encode commonsense among image,\nquery and response. Extensive experiments on large-scale Visual Commonsense\nReasoning (VCR) benchmark dataset demonstrate the effectiveness of our\napproach. The implementation is publicly available at\nhttps://github.com/tanjatang/CAN", "comment": "ICANN21", "links": []}
{"entry_id": "2312.04314", "title": "GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives", "authors": ["Zuyao Chen", "Jinlin Wu", "Zhen Lei", "Zhaoxiang Zhang", "Changwen Chen"], "published": "2023-12-07 14:11:00", "updated": "2023-12-07 14:11:00", "summary": "Learning scene graphs from natural language descriptions has proven to be a\ncheap and promising scheme for Scene Graph Generation (SGG). However, such\nunstructured caption data and its processing are troubling the learning an\nacurrate and complete scene graph. This dilema can be summarized as three\npoints. First, traditional language parsers often fail to extract meaningful\nrelationship triplets from caption data. Second, grounding unlocalized objects\nin parsed triplets will meet ambiguity in visual-language alignment. Last,\ncaption data typically are sparse and exhibit bias to partial observations of\nimage content. These three issues make it hard for the model to generate\ncomprehensive and accurate scene graphs. To fill this gap, we propose a simple\nyet effective framework, GPT4SGG, to synthesize scene graphs from holistic and\nregion-specific narratives. The framework discards traditional language parser,\nand localize objects before obtaining relationship triplets. To obtain\nrelationship triplets, holistic and dense region-specific narratives are\ngenerated from the image. With such textual representation of image data and a\ntask-specific prompt, an LLM, particularly GPT-4, directly synthesizes a scene\ngraph as \"pseudo labels\". Experimental results showcase GPT4SGG significantly\nimproves the performance of SGG models trained on image-caption data. We\nbelieve this pioneering work can motivate further research into mining the\nvisual reasoning capabilities of LLMs.", "comment": null, "links": []}
{"entry_id": "2312.02896", "title": "BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models", "authors": ["Rizhao Cai", "Zirui Song", "Dayan Guan", "Zhenhao Chen", "Xing Luo", "Chenyu Yi", "Alex Kot"], "published": "2023-12-05 17:06:59", "updated": "2023-12-06 03:46:47", "summary": "Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable\ncapabilities in visual reasoning with common image styles. However, their\nrobustness against diverse style shifts, crucial for practical applications,\nremains largely unexplored. In this paper, we propose a new benchmark,\nBenchLMM, to assess the robustness of LMMs against three different styles:\nartistic image style, imaging sensor style, and application style, where each\nstyle has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate\nstate-of-the-art LMMs and reveal: 1) LMMs generally suffer performance\ndegradation when working with other styles; 2) An LMM performs better than\nanother model in common style does not guarantee its superior performance in\nother styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs\nto predict the style first, based on which we propose a versatile and\ntraining-free method for improving LMMs; 4) An intelligent LMM is expected to\ninterpret the causes of its errors when facing stylistic variations. We hope\nthat our benchmark and analysis can shed new light on developing more\nintelligent and versatile LMMs.", "comment": "Code is available at https://github.com/AIFEG/BenchLMM", "links": []}
{"entry_id": "2309.09809", "title": "A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks", "authors": ["Wentao Wan", "Nan Kang", "Zeqing Wang", "Zhuojie Yang", "Liang Lin", "Keze Wang"], "published": "2023-09-18 14:28:47", "updated": "2023-11-30 09:31:59", "summary": "Recently, the visual programming framework (VisProg) has emerged as a\nsignificant framework for executing compositional visual tasks due to its\ninterpretability and flexibility. However, the performance of VisProg on\nspecific Visual Reasoning (VR) tasks is markedly inferior compared to\nwell-trained task-specific models since its employed visual sub-modules have\nlimited generalization capabilities. Due to the non-differentiability of\nVisProg, it is quite challenging to improve these visual sub-modules within\nVisProg for the specific VR task while maintaining their generalizability on\nthe un-seen tasks. Attempt to overcome these difficulties, we propose CLVP, a\nContinuous Learning paradigm for VisProg across various visual reasoning tasks.\nSpecifically, our CLVP distills the capabilities of well-trained task-specific\nmodels into the visual sub-modules in a stepwise and anti-forgetting manner.\nThis can continually improve the performance of VisProg on multiple visual\ntasks while preserving the flexibility of VisProg. Extensive and comprehensive\nexperimental results demonstrate that our CLVP obtains significant performance\ngains on specific VR benchmarks, i.e., GQA (+1.4%) and NLVRv2 (+5.6%), compared\nto the VisProg baseline, and also maintains a promising generalizability for VR\non un-seen and previous learned tasks.", "comment": null, "links": []}
{"entry_id": "2311.16101", "title": "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs", "authors": ["Haoqin Tu", "Chenhang Cui", "Zijun Wang", "Yiyang Zhou", "Bingchen Zhao", "Junlin Han", "Wangchunshu Zhou", "Huaxiu Yao", "Cihang Xie"], "published": "2023-11-27 18:59:42", "updated": "2023-11-27 18:59:42", "summary": "This work focuses on the potential of Vision LLMs (VLLMs) in visual\nreasoning. Different from prior studies, we shift our focus from evaluating\nstandard performance to introducing a comprehensive safety evaluation suite,\ncovering both out-of-distribution (OOD) generalization and adversarial\nrobustness. For the OOD evaluation, we present two novel VQA datasets, each\nwith one variant, designed to test model performance under challenging\nconditions. In exploring adversarial robustness, we propose a straightforward\nattack strategy for misleading VLLMs to produce visual-unrelated responses.\nMoreover, we assess the efficacy of two jailbreaking strategies, targeting\neither the vision or language component of VLLMs. Our evaluation of 21 diverse\nmodels, ranging from open-source VLLMs to GPT-4V, yields interesting\nobservations: 1) Current VLLMs struggle with OOD texts but not images, unless\nthe visual information is limited; and 2) These VLLMs can be easily misled by\ndeceiving vision encoders only, and their vision-language training often\ncompromise safety protocols. We release this safety evaluation suite at\nhttps://github.com/UCSC-VLAA/vllm-safety-benchmark.", "comment": "H.T., C.C., and Z.W. contribute equally. Work done during H.T. and\n  Z.W.'s internship at UCSC, and C.C. and Y.Z.'s internship at UNC", "links": []}
{"entry_id": "2311.12391", "title": "From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation", "authors": ["Jiaxin Ge", "Sanjay Subramanian", "Trevor Darrell", "Boyi Li"], "published": "2023-11-21 07:02:32", "updated": "2023-11-21 07:02:32", "summary": "Addressing the challenge of adapting pre-trained vision-language models for\ngenerating insightful explanations for visual reasoning tasks with limited\nannotations, we present ReVisE: a $\\textbf{Re}$cursive $\\textbf{Vis}$ual\n$\\textbf{E}$xplanation algorithm. Our method iteratively computes visual\nfeatures (conditioned on the text input), an answer, and an explanation, to\nimprove the explanation quality step by step until the answer converges. We\nfind that this multi-step approach guides the model to correct its own answers\nand outperforms single-step explanation generation. Furthermore, explanations\ngenerated by ReVisE also serve as valuable annotations for few-shot\nself-training. Our approach outperforms previous methods while utilizing merely\n5% of the human-annotated explanations across 10 metrics, demonstrating up to a\n4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets,\nunderscoring the efficacy and data-efficiency of our method.", "comment": "EMNLP 2023 Main", "links": []}
{"entry_id": "2302.06494", "title": "Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection", "authors": ["Yanjun Liu", "Wenming Yang"], "published": "2023-02-13 16:19:54", "updated": "2023-11-20 08:44:23", "summary": "Indoor 3D object detection is an essential task in single image scene\nunderstanding, impacting spatial cognition fundamentally in visual reasoning.\nExisting works on 3D object detection from a single image either pursue this\ngoal through independent predictions of each object or implicitly reason over\nall possible objects, failing to harness relational geometric information\nbetween objects. To address this problem, we propose a dynamic sparse graph\npipeline named Explicit3D based on object geometry and semantics features.\nTaking the efficiency into consideration, we further define a relatedness score\nand design a novel dynamic pruning algorithm followed by a cluster sampling\nmethod for sparse scene graph generation and updating. Furthermore, our\nExplicit3D introduces homogeneous matrices and defines new relative loss and\ncorner loss to model the spatial difference between target pairs explicitly.\nInstead of using ground-truth labels as direct supervision, our relative and\ncorner loss are derived from the homogeneous transformation, which renders the\nmodel to learn the geometric consistency between objects. The experimental\nresults on the SUN RGB-D dataset demonstrate that our Explicit3D achieves\nbetter performance balance than the-state-of-the-art.", "comment": null, "links": []}
{"entry_id": "2311.08083", "title": "Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method", "authors": ["Luca H. Thoms", "Karel A. Veldkamp", "Hannes Rosenbusch", "Claire E. Stevenson"], "published": "2023-11-14 11:10:46", "updated": "2023-11-14 11:10:46", "summary": "Analogical reasoning derives information from known relations and generalizes\nthis information to similar yet unfamiliar situations. One of the first\ngeneralized ways in which deep learning models were able to solve verbal\nanalogies was through vector arithmetic of word embeddings, essentially\nrelating words that were mapped to a vector space (e.g., king - man + woman =\n__?). In comparison, most attempts to solve visual analogies are still\npredominantly task-specific and less generalizable. This project focuses on\nvisual analogical reasoning and applies the initial generalized mechanism used\nto solve verbal analogies to the visual realm. Taking the Abstraction and\nReasoning Corpus (ARC) as an example to investigate visual analogy solving, we\nuse a variational autoencoder (VAE) to transform ARC items into low-dimensional\nlatent vectors, analogous to the word embeddings used in the verbal approaches.\nThrough simple vector arithmetic, underlying rules of ARC items are discovered\nand used to solve them. Results indicate that the approach works well on simple\nitems with fewer dimensions (i.e., few colors used, uniform shapes), similar\ninput-to-output examples, and high reconstruction accuracy on the VAE.\nPredictions on more complex items showed stronger deviations from expected\noutputs, although, predictions still often approximated parts of the item's\nrule set. Error patterns indicated that the model works as intended. On the\nofficial ARC paradigm, the model achieved a score of 2% (cf. current world\nrecord is 21%) and on ConceptARC it scored 8.8%. Although the methodology\nproposed involves basic dimensionality reduction techniques and standard vector\narithmetic, this approach demonstrates promising outcomes on ARC and can easily\nbe generalized to other abstract visual reasoning tasks.", "comment": "Data and code can be found on\n  https://github.com/foger3/ARC_DeepLearning", "links": ["http://dx.doi.org/10.17605/OSF.IO/AKP86"]}
{"entry_id": "2311.06964", "title": "Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels", "authors": ["Vijay Veerabadran", "Srinivas Ravishankar", "Yuan Tang", "Ritik Raina", "Virginia R. de Sa"], "published": "2023-11-12 21:07:04", "updated": "2023-11-12 21:07:04", "summary": "Humans solving algorithmic (or) reasoning problems typically exhibit solution\ntimes that grow as a function of problem difficulty. Adaptive recurrent neural\nnetworks have been shown to exhibit this property for various\nlanguage-processing tasks. However, little work has been performed to assess\nwhether such adaptive computation can also enable vision models to extrapolate\nsolutions beyond their training distribution's difficulty level, with prior\nwork focusing on very simple tasks. In this study, we investigate a critical\nfunctional role of such adaptive processing using recurrent neural networks: to\ndynamically scale computational resources conditional on input requirements\nthat allow for zero-shot generalization to novel difficulty levels not seen\nduring training using two challenging visual reasoning tasks: PathFinder and\nMazes. We combine convolutional recurrent neural networks (ConvRNNs) with a\nlearnable halting mechanism based on Graves (2016). We explore various\nimplementations of such adaptive ConvRNNs (AdRNNs) ranging from tying weights\nacross layers to more sophisticated biologically inspired recurrent networks\nthat possess lateral connections and gating. We show that 1) AdRNNs learn to\ndynamically halt processing early (or late) to solve easier (or harder)\nproblems, 2) these RNNs zero-shot generalize to more difficult problem settings\nnot shown during training by dynamically increasing the number of recurrent\niterations at test time. Our study provides modeling evidence supporting the\nhypothesis that recurrent processing enables the functional advantage of\nadaptively allocating compute resources conditional on input requirements and\nhence allowing generalization to harder difficulty levels of a visual reasoning\nproblem without training.", "comment": "37th Conference on Neural Information Processing Systems (NeurIPS\n  2023)", "links": []}
{"entry_id": "2311.06553", "title": "Visual Commonsense based Heterogeneous Graph Contrastive Learning", "authors": ["Zongzhao Li", "Xiangyu Zhu", "Xi Zhang", "Zhaoxiang Zhang", "Zhen Lei"], "published": "2023-11-11 12:01:18", "updated": "2023-11-11 12:01:18", "summary": "How to select relevant key objects and reason about the complex relationships\ncross vision and linguistic domain are two key issues in many multi-modality\napplications such as visual question answering (VQA). In this work, we\nincorporate the visual commonsense information and propose a heterogeneous\ngraph contrastive learning method to better finish the visual reasoning task.\nOur method is designed as a plug-and-play way, so that it can be quickly and\neasily combined with a wide range of representative methods. Specifically, our\nmodel contains two key components: the Commonsense-based Contrastive Learning\nand the Graph Relation Network. Using contrastive learning, we guide the model\nconcentrate more on discriminative objects and relevant visual commonsense\nattributes. Besides, thanks to the introduction of the Graph Relation Network,\nthe model reasons about the correlations between homogeneous edges and the\nsimilarities between heterogeneous edges, which makes information transmission\nmore effective. Extensive experiments on four benchmarks show that our method\ngreatly improves seven representative VQA models, demonstrating its\neffectiveness and generalizability.", "comment": null, "links": []}
{"entry_id": "2306.02500", "title": "Systematic Visual Reasoning through Object-Centric Relational Abstraction", "authors": ["Taylor W. Webb", "Shanka Subhra Mondal", "Jonathan D. Cohen"], "published": "2023-06-04 22:47:17", "updated": "2023-11-10 22:22:44", "summary": "Human visual reasoning is characterized by an ability to identify abstract\npatterns from only a small number of examples, and to systematically generalize\nthose patterns to novel inputs. This capacity depends in large part on our\nability to represent complex visual inputs in terms of both objects and\nrelations. Recent work in computer vision has introduced models with the\ncapacity to extract object-centric representations, leading to the ability to\nprocess multi-object visual inputs, but falling short of the systematic\ngeneralization displayed by human reasoning. Other recent models have employed\ninductive biases for relational abstraction to achieve systematic\ngeneralization of learned abstract rules, but have generally assumed the\npresence of object-focused inputs. Here, we combine these two approaches,\nintroducing Object-Centric Relational Abstraction (OCRA), a model that extracts\nexplicit representations of both objects and abstract relations, and achieves\nstrong systematic generalization in tasks (including a novel dataset,\nCLEVR-ART, with greater visual complexity) involving complex visual displays.", "comment": null, "links": []}
{"entry_id": "2311.06386", "title": "Towards A Unified Neural Architecture for Visual Recognition and Reasoning", "authors": ["Calvin Luo", "Boqing Gong", "Ting Chen", "Chen Sun"], "published": "2023-11-10 20:27:43", "updated": "2023-11-10 20:27:43", "summary": "Recognition and reasoning are two pillars of visual understanding. However,\nthese tasks have an imbalance in focus; whereas recent advances in neural\nnetworks have shown strong empirical performance in visual recognition, there\nhas been comparably much less success in solving visual reasoning. Intuitively,\nunifying these two tasks under a singular framework is desirable, as they are\nmutually dependent and beneficial. Motivated by the recent success of\nmulti-task transformers for visual recognition and language understanding, we\npropose a unified neural architecture for visual recognition and reasoning with\na generic interface (e.g., tokens) for both. Our framework enables the\nprincipled investigation of how different visual recognition tasks, datasets,\nand inductive biases can help enable spatiotemporal reasoning capabilities.\nNoticeably, we find that object detection, which requires spatial localization\nof individual objects, is the most beneficial recognition task for reasoning.\nWe further demonstrate via probing that implicit object-centric representations\nemerge automatically inside our framework. Intriguingly, we discover that\ncertain architectural choices such as the backbone model of the visual encoder\nhave a significant impact on visual reasoning, but little on object detection.\nGiven the results of our experiments, we believe that visual reasoning should\nbe considered as a first-class citizen alongside visual recognition, as they\nare strongly correlated but benefit from potentially different design choices.", "comment": null, "links": []}
{"entry_id": "2311.05298", "title": "Improving Vision-and-Language Reasoning via Spatial Relations Modeling", "authors": ["Cheng Yang", "Rui Xu", "Ye Guo", "Peixiang Huang", "Yiru Chen", "Wenkui Ding", "Zhongyuan Wang", "Hong Zhou"], "published": "2023-11-09 11:54:55", "updated": "2023-11-09 11:54:55", "summary": "Visual commonsense reasoning (VCR) is a challenging multi-modal task, which\nrequires high-level cognition and commonsense reasoning ability about the real\nworld. In recent years, large-scale pre-training approaches have been developed\nand promoted the state-of-the-art performance of VCR. However, the existing\napproaches almost employ the BERT-like objectives to learn multi-modal\nrepresentations. These objectives motivated from the text-domain are\ninsufficient for the excavation on the complex scenario of visual modality.\nMost importantly, the spatial distribution of the visual objects is basically\nneglected. To address the above issue, we propose to construct the spatial\nrelation graph based on the given visual scenario. Further, we design two\npre-training tasks named object position regression (OPR) and spatial relation\nclassification (SRC) to learn to reconstruct the spatial relation graph\nrespectively. Quantitative analysis suggests that the proposed method can guide\nthe representations to maintain more spatial context and facilitate the\nattention on the essential visual regions for reasoning. We achieve the\nstate-of-the-art results on VCR and two other vision-and-language reasoning\ntasks VQA, and NLVR.", "comment": null, "links": []}
{"entry_id": "2311.04901", "title": "GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs", "authors": ["Zhenfang Chen", "Rui Sun", "Wenjun Liu", "Yining Hong", "Chuang Gan"], "published": "2023-11-08 18:59:05", "updated": "2023-11-08 18:59:05", "summary": "Recent works have shown that Large Language Models (LLMs) could empower\ntraditional neuro-symbolic models via programming capabilities to translate\nlanguage into module descriptions, thus achieving strong visual reasoning\nresults while maintaining the model's transparency and efficiency. However,\nthese models usually exhaustively generate the entire code snippet given each\nnew instance of a task, which is extremely ineffective. We propose generative\nneuro-symbolic visual reasoning by growing and reusing modules. Specifically,\nour model consists of three unique stages, module initialization, module\ngeneration, and module execution. First, given a vision-language task, we adopt\nLLMs to examine whether we could reuse and grow over established modules to\nhandle this new task. If not, we initialize a new module needed by the task and\nspecify the inputs and outputs of this new module. After that, the new module\nis created by querying LLMs to generate corresponding code snippets that match\nthe requirements. In order to get a better sense of the new module's ability,\nwe treat few-shot training examples as test cases to see if our new module\ncould pass these cases. If yes, the new module is added to the module library\nfor future reuse. Finally, we evaluate the performance of our model on the\ntesting set by executing the parsed programs with the newly made visual modules\nto get the results. We find the proposed model possesses several advantages.\nFirst, it performs competitively on standard tasks like visual question\nanswering and referring expression comprehension; Second, the modules learned\nfrom one task can be seamlessly transferred to new tasks; Last but not least,\nit is able to adapt to new visual reasoning tasks by observing a few training\nexamples and reusing modules.", "comment": null, "links": []}
{"entry_id": "2311.01487", "title": "What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning", "authors": ["Yifan Du", "Hangyu Guo", "Kun Zhou", "Wayne Xin Zhao", "Jinpeng Wang", "Chuyuan Wang", "Mingchen Cai", "Ruihua Song", "Ji-Rong Wen"], "published": "2023-11-02 15:36:12", "updated": "2023-11-02 15:36:12", "summary": "Visual instruction tuning is an essential approach to improving the zero-shot\ngeneralization capability of Multi-modal Large Language Models (MLLMs). A surge\nof visual instruction datasets with various focuses and characteristics have\nbeen proposed recently, enabling MLLMs to achieve surprising results on\nevaluation benchmarks. To develop more capable MLLMs, in this paper, we aim to\ninvestigate a more fundamental question: ``what makes for good visual\ninstructions?''. By conducting a comprehensive empirical study, we find that\ninstructions focused on complex visual reasoning tasks are particularly\neffective in improving the performance of MLLMs on evaluation benchmarks.\nBuilding upon this finding, we design a systematic approach to automatically\ncreating high-quality complex visual reasoning instructions. Our approach\nemploys a synthesis-complication-reformulation paradigm, leveraging multiple\nstages to gradually increase the complexity of the instructions while\nguaranteeing quality. Based on this approach, we create the synthetic visual\nreasoning instruction dataset consisting of 32K examples, namely ComVint, and\nfine-tune four MLLMs on it. Experimental results demonstrate that our dataset\nconsistently enhances the performance of all the compared MLLMs, e.g.,\nimproving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and\n28.8%, respectively. Our code and data are publicly available at the link:\nhttps://github.com/RUCAIBox/ComVint.", "comment": "Work in progress", "links": []}
{"entry_id": "2311.01161", "title": "Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering", "authors": ["Kang-il Lee", "Segwang Kim", "Kyomin Jung"], "published": "2023-11-02 11:45:40", "updated": "2023-11-02 11:45:40", "summary": "The problem of spurious programs is a longstanding challenge when training a\nsemantic parser from weak supervision. To eliminate such programs that have\nwrong semantics but correct denotation, existing methods focus on exploiting\nsimilarities between examples based on domain-specific knowledge. In this\npaper, we propose a domain-agnostic filtering mechanism based on program\nexecution results. Specifically, for each program obtained through the search\nprocess, we first construct a representation that captures the program's\nsemantics as execution results under various inputs. Then, we run a majority\nvote on these representations to identify and filter out programs with\nsignificantly different semantics from the other programs. In particular, our\nmethod is orthogonal to the program search process so that it can easily\naugment any of the existing weakly supervised semantic parsing frameworks.\nEmpirical evaluations on the Natural Language Visual Reasoning and\nWikiTableQuestions demonstrate that applying our method to the existing\nsemantic parsers induces significantly improved performances.", "comment": "EMNLP 2023", "links": []}
{"entry_id": "2303.12513", "title": "Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding", "authors": ["Morris Alper", "Michael Fiman", "Hadar Averbuch-Elor"], "published": "2023-03-21 17:30:40", "updated": "2023-11-02 10:38:49", "summary": "Most humans use visual imagination to understand and reason about language,\nbut models such as BERT reason about language using knowledge acquired during\ntext-only pretraining. In this work, we investigate whether vision-and-language\npretraining can improve performance on text-only tasks that involve implicit\nvisual reasoning, focusing primarily on zero-shot probing methods. We propose a\nsuite of visual language understanding (VLU) tasks for probing the visual\nreasoning abilities of text encoder models, as well as various non-visual\nnatural language understanding (NLU) tasks for comparison. We also contribute a\nnovel zero-shot knowledge probing method, Stroop probing, for applying models\nsuch as CLIP to text-only tasks without needing a prediction head such as the\nmasked language modelling head of models like BERT. We show that SOTA\nmultimodally trained text encoders outperform unimodally trained text encoders\non the VLU tasks while being underperformed by them on the NLU tasks, lending\nnew context to previously mixed results regarding the NLU capabilities of\nmultimodal models. We conclude that exposure to images during pretraining\naffords inherent visual reasoning knowledge that is reflected in language-only\ntasks that require implicit visual reasoning. Our findings bear importance in\nthe broader context of multimodal learning, providing principled guidelines for\nthe choice of text encoders used in such contexts.", "comment": "Accepted to CVPR 2023. Project webpage:\n  https://isbertblind.github.io/", "links": []}
{"entry_id": "2310.19070", "title": "Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection", "authors": ["Yuanze Li", "Haolin Wang", "Shihao Yuan", "Ming Liu", "Debin Zhao", "Yiwen Guo", "Chen Xu", "Guangming Shi", "Wangmeng Zuo"], "published": "2023-10-29 16:49:45", "updated": "2023-11-01 03:50:52", "summary": "Existing industrial anomaly detection (IAD) methods predict anomaly scores\nfor both anomaly detection and localization. However, they struggle to perform\na multi-turn dialog and detailed descriptions for anomaly regions, e.g., color,\nshape, and categories of industrial anomalies. Recently, large multimodal\n(i.e., vision and language) models (LMMs) have shown eminent perception\nabilities on multiple vision tasks such as image captioning, visual\nunderstanding, visual reasoning, etc., making it a competitive potential choice\nfor more comprehensible anomaly detection. However, the knowledge about anomaly\ndetection is absent in existing general LMMs, while training a specific LMM for\nanomaly detection requires a tremendous amount of annotated data and massive\ncomputation resources. In this paper, we propose a novel large multi-modal\nmodel by applying vision experts for industrial anomaly detection (dubbed\nMyriad), which leads to definite anomaly detection and high-quality anomaly\ndescription. Specifically, we adopt MiniGPT-4 as the base LMM and design an\nExpert Perception module to embed the prior knowledge from vision experts as\ntokens which are intelligible to Large Language Models (LLMs). To compensate\nfor the errors and confusions of vision experts, we introduce a domain adapter\nto bridge the visual representation gaps between generic and industrial images.\nFurthermore, we propose a Vision Expert Instructor, which enables the Q-Former\nto generate IAD domain vision-language tokens according to vision expert prior.\nExtensive experiments on MVTec-AD and VisA benchmarks demonstrate that our\nproposed method not only performs favorably against state-of-the-art methods\nunder the 1-class and few-shot settings, but also provide definite anomaly\nprediction along with detailed descriptions in IAD domain.", "comment": "8 pages, 7 figures", "links": []}
{"entry_id": "2307.09009", "title": "How is ChatGPT's behavior changing over time?", "authors": ["Lingjiao Chen", "Matei Zaharia", "James Zou"], "published": "2023-07-18 06:56:08", "updated": "2023-10-31 16:13:44", "summary": "GPT-3.5 and GPT-4 are the two most widely used large language model (LLM)\nservices. However, when and how these models are updated over time is opaque.\nHere, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on\nseveral diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3)\nopinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating\ncode, 6) US Medical License tests, and 7) visual reasoning. We find that the\nperformance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.\nFor example, GPT-4 (March 2023) was reasonable at identifying prime vs.\ncomposite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same\nquestions (51% accuracy). This is partly explained by a drop in GPT-4's amenity\nto follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in\nJune than in March in this task. GPT-4 became less willing to answer sensitive\nquestions and opinion survey questions in June than in March. GPT-4 performed\nbetter at multi-hop questions in June than in March, while GPT-3.5's\nperformance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting\nmistakes in code generation in June than in March. We provide evidence that\nGPT-4's ability to follow user instructions has decreased over time, which is\none common factor behind the many behavior drifts. Overall, our findings show\nthat the behavior of the \"same\" LLM service can change substantially in a\nrelatively short amount of time, highlighting the need for continuous\nmonitoring of LLMs.", "comment": "add more evaluations on instruction following", "links": []}
{"entry_id": "2206.09203", "title": "Interactive Visual Reasoning under Uncertainty", "authors": ["Manjie Xu", "Guangyuan Jiang", "Wei Liang", "Chi Zhang", "Yixin Zhu"], "published": "2022-06-18 13:32:41", "updated": "2023-10-29 05:04:32", "summary": "One of the fundamental cognitive abilities of humans is to quickly resolve\nuncertainty by generating hypotheses and testing them via active trials.\nEncountering a novel phenomenon accompanied by ambiguous cause-effect\nrelationships, humans make hypotheses against data, conduct inferences from\nobservation, test their theory via experimentation, and correct the proposition\nif inconsistency arises. These iterative processes persist until the underlying\nmechanism becomes clear. In this work, we devise the IVRE (pronounced as\n\"ivory\") environment for evaluating artificial agents' reasoning ability under\nuncertainty. IVRE is an interactive environment featuring rich scenarios\ncentered around Blicket detection. Agents in IVRE are placed into environments\nwith various ambiguous action-effect pairs and asked to determine each object's\nrole. They are encouraged to propose effective and efficient experiments to\nvalidate their hypotheses based on observations and actively gather new\ninformation. The game ends when all uncertainties are resolved or the maximum\nnumber of trials is consumed. By evaluating modern artificial agents in IVRE,\nwe notice a clear failure of today's learning methods compared to humans. Such\ninefficacy in interactive reasoning ability under uncertainty calls for future\nresearch in building human-like intelligence.", "comment": "Accepted at NeurIPS 2023 (Datasets and Benchmarks)", "links": []}
{"entry_id": "2310.18807", "title": "OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning", "authors": ["Rim Assouel", "Pau Rodriguez", "Perouz Taslakian", "David Vazquez", "Yoshua Bengio"], "published": "2023-10-28 20:12:58", "updated": "2023-10-28 20:12:58", "summary": "A key aspect of human intelligence is the ability to imagine -- composing\nlearned concepts in novel ways -- to make sense of new scenarios. Such capacity\nis not yet attained for machine learning systems. In this work, in the context\nof visual reasoning, we show how modularity can be leveraged to derive a\ncompositional data augmentation framework inspired by imagination. Our method,\ndenoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes\nvisual generative reasoning tasks into a series of primitives applied to\nobjects without using a domain-specific language. We show that our modular\narchitectural choices can be used to generate new training tasks that lead to\nbetter out-of-distribution generalization. We compare our model to existing and\nnew baselines in proposed visual reasoning benchmark that consists of applying\narithmetic operations to MNIST digits.", "comment": null, "links": []}
{"entry_id": "2310.18804", "title": "Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting", "authors": ["Hejie Cui", "Xinyu Fang", "Zihan Zhang", "Ran Xu", "Xuan Kan", "Xin Liu", "Yue Yu", "Manling Li", "Yangqiu Song", "Carl Yang"], "published": "2023-10-28 20:09:29", "updated": "2023-10-28 20:09:29", "summary": "Images contain rich relational knowledge that can help machines understand\nthe world. Existing methods on visual knowledge extraction often rely on the\npre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation\ntypes), restricting the expressiveness of the extracted knowledge. In this\nwork, we take a first exploration to a new paradigm of open visual knowledge\nextraction. To achieve this, we present OpenVik which consists of an open\nrelational region detector to detect regions potentially containing relational\nknowledge and a visual knowledge generator that generates format-free knowledge\nby prompting the large multimodality model with the detected region of\ninterest. We also explore two data enhancement techniques for diversifying the\ngenerated format-free visual knowledge. Extensive knowledge quality evaluations\nhighlight the correctness and uniqueness of the extracted open visual knowledge\nby OpenVik. Moreover, integrating our extracted knowledge across various visual\nreasoning applications shows consistent improvements, indicating the real-world\napplicability of OpenVik.", "comment": "Accepted to NeurIPS 2023", "links": []}
{"entry_id": "2310.18046", "title": "ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese", "authors": ["Khiem Vinh Tran", "Hao Phu Phan", "Kiet Van Nguyen", "Ngan Luu Thuy Nguyen"], "published": "2023-10-27 10:44:50", "updated": "2023-10-27 10:44:50", "summary": "In recent years, Visual Question Answering (VQA) has gained significant\nattention for its diverse applications, including intelligent car assistance,\naiding visually impaired individuals, and document image information retrieval\nusing natural language queries. VQA requires effective integration of\ninformation from questions and images to generate accurate answers. Neural\nmodels for VQA have made remarkable progress on large-scale datasets, with a\nprimary focus on resource-rich languages like English. To address this, we\nintroduce the ViCLEVR dataset, a pioneering collection for evaluating various\nvisual reasoning capabilities in Vietnamese while mitigating biases. The\ndataset comprises over 26,000 images and 30,000 question-answer pairs (QAs),\neach question annotated to specify the type of reasoning involved. Leveraging\nthis dataset, we conduct a comprehensive analysis of contemporary visual\nreasoning systems, offering valuable insights into their strengths and\nlimitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion\nthat identifies objects in images based on questions. The architecture\neffectively employs transformers to enable simultaneous reasoning over textual\nand visual data, merging both modalities at an early model stage. The\nexperimental findings demonstrate that our proposed model achieves\nstate-of-the-art performance across four evaluation metrics. The accompanying\ncode and dataset have been made publicly accessible at\n\\url{https://github.com/kvt0012/ViCLEVR}. This provision seeks to stimulate\nadvancements within the research community, fostering the development of more\nmultimodal fusion algorithms, specifically tailored to address the nuances of\nlow-resource languages, exemplified by Vietnamese.", "comment": "A pre-print version and submitted to journal", "links": []}
{"entry_id": "2303.02260", "title": "Learning to reason over visual objects", "authors": ["Shanka Subhra Mondal", "Taylor Webb", "Jonathan D. Cohen"], "published": "2023-03-03 23:19:42", "updated": "2023-10-26 21:24:47", "summary": "A core component of human intelligence is the ability to identify abstract\npatterns inherent in complex, high-dimensional perceptual data, as exemplified\nby visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated\nby the goal of designing AI systems with this capacity, recent work has focused\non evaluating whether neural networks can learn to solve RPM-like problems.\nPrevious work has generally found that strong performance on these problems\nrequires the incorporation of inductive biases that are specific to the RPM\nproblem format, raising the question of whether such models might be more\nbroadly useful. Here, we investigated the extent to which a general-purpose\nmechanism for processing visual scenes in terms of objects might help promote\nabstract visual reasoning. We found that a simple model, consisting only of an\nobject-centric encoder and a transformer reasoning module, achieved\nstate-of-the-art results on both of two challenging RPM-like benchmarks (PGM\nand I-RAVEN), as well as a novel benchmark with greater visual complexity\n(CLEVR-Matrices). These results suggest that an inductive bias for\nobject-centric processing may be a key component of abstract visual reasoning,\nobviating the need for problem-specific inductive biases.", "comment": "ICLR 2023", "links": []}
{"entry_id": "2310.13447", "title": "Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation", "authors": ["Siyu Zhang", "Yeming Chen", "Sirui Cheng", "Yaoru Sun", "Jun Yang", "Lizhi Bai"], "published": "2023-10-20 12:26:04", "updated": "2023-10-25 13:14:40", "summary": "Within the multimodal field, the key to integrating vision and language lies\nin establishing a good alignment strategy. Recently, benefiting from the\nsuccess of self-supervised learning, significant progress has been made in\nmultimodal semantic representation based on pre-trained models for vision and\nlanguage. However, there is still room for improvement in visual semantic\nrepresentation. The lack of spatial semantic coherence and vulnerability to\nnoise makes it challenging for current pixel or patch-based methods to\naccurately extract complex scene boundaries. To this end, this paper develops\nsuperpixel as a comprehensive compact representation of learnable image data,\nwhich effectively reduces the number of visual primitives for subsequent\nprocessing by clustering perceptually similar pixels. To mine more precise\ntopological relations, we propose a Multiscale Difference Graph Convolutional\nNetwork (MDGCN). It parses the entire image as a fine-to-coarse hierarchical\nstructure of constituent visual patterns, and captures multiscale features by\nprogressively merging adjacent superpixels as graph nodes. Moreover, we predict\nthe differences between adjacent nodes through the graph structure,\nfacilitating key information aggregation of graph nodes to reason actual\nsemantic relations. Afterward, we design a multi-level fusion rule in a\nbottom-up manner to avoid understanding deviation by learning complementary\nspatial information at different regional scales. Our proposed method can be\nwell applied to multiple downstream task learning. Extensive experiments\ndemonstrate that our method is competitive with other state-of-the-art methods\nin visual reasoning. Our code will be released upon publication.", "comment": null, "links": []}
{"entry_id": "2310.16035", "title": "What's Left? Concept Grounding with Logic-Enhanced Foundation Models", "authors": ["Joy Hsu", "Jiayuan Mao", "Joshua B. Tenenbaum", "Jiajun Wu"], "published": "2023-10-24 17:50:20", "updated": "2023-10-24 17:50:20", "summary": "Recent works such as VisProg and ViperGPT have smartly composed foundation\nmodels for visual reasoning-using large language models (LLMs) to produce\nprograms that can be executed by pre-trained vision-language models. However,\nthey operate in limited domains, such as 2D images, not fully exploiting the\ngeneralization of language: abstract concepts like \"left\" can also be grounded\nin 3D, temporal, and action data, as in moving to your left. This limited\ngeneralization stems from these inference-only methods' inability to learn or\nadapt pre-trained models to a new domain. We propose the Logic-Enhanced\nFoundation Model (LEFT), a unified framework that learns to ground and reason\nwith concepts across domains with a differentiable, domain-independent,\nfirst-order logic-based program executor. LEFT has an LLM interpreter that\noutputs a program represented in a general, logic-based reasoning language,\nwhich is shared across all domains and tasks. LEFT's executor then executes the\nprogram with trainable domain-specific grounding modules. We show that LEFT\nflexibly learns concepts in four domains: 2D images, 3D scenes, human motions,\nand robotic manipulation. It exhibits strong reasoning ability in a wide\nvariety of tasks, including those that are complex and not seen during\ntraining, and can be easily applied to new domains.", "comment": "NeurIPS 2023. First two authors contributed equally. Project page:\n  https://web.stanford.edu/~joycj/projects/left_neurips_2023", "links": []}
{"entry_id": "2310.15585", "title": "Multimodal Representations for Teacher-Guided Compositional Visual Reasoning", "authors": ["Wafa Aissa", "Marin Ferecatu", "Michel Crucianu"], "published": "2023-10-24 07:51:08", "updated": "2023-10-24 07:51:08", "summary": "Neural Module Networks (NMN) are a compelling method for visual question\nanswering, enabling the translation of a question into a program consisting of\na series of reasoning sub-tasks that are sequentially executed on the image to\nproduce an answer. NMNs provide enhanced explainability compared to integrated\nmodels, allowing for a better understanding of the underlying reasoning\nprocess. To improve the effectiveness of NMNs we propose to exploit features\nobtained by a large-scale cross-modal encoder. Also, the current training\napproach of NMNs relies on the propagation of module outputs to subsequent\nmodules, leading to the accumulation of prediction errors and the generation of\nfalse answers. To mitigate this, we introduce an NMN learning strategy\ninvolving scheduled teacher guidance. Initially, the model is fully guided by\nthe ground-truth intermediate outputs, but gradually transitions to an\nautonomous behavior as training progresses. This reduces error accumulation,\nthus improving training efficiency and final performance.We demonstrate that by\nincorporating cross-modal features and employing more effective training\ntechniques for NMN, we achieve a favorable balance between performance and\ntransparency in the reasoning process.", "comment": null, "links": []}
{"entry_id": "2310.15166", "title": "Large Language Models are Visual Reasoning Coordinators", "authors": ["Liangyu Chen", "Bo Li", "Sheng Shen", "Jingkang Yang", "Chunyuan Li", "Kurt Keutzer", "Trevor Darrell", "Ziwei Liu"], "published": "2023-10-23 17:59:31", "updated": "2023-10-23 17:59:31", "summary": "Visual reasoning requires multimodal perception and commonsense cognition of\nthe world. Recently, multiple vision-language models (VLMs) have been proposed\nwith excellent commonsense reasoning ability in various domains. However, how\nto harness the collective power of these complementary VLMs is rarely explored.\nExisting methods like ensemble still struggle to aggregate these models with\nthe desired higher-order communications. In this work, we propose Cola, a novel\nparadigm that coordinates multiple VLMs for visual reasoning. Our key insight\nis that a large language model (LLM) can efficiently coordinate multiple VLMs\nby facilitating natural language communication that leverages their distinct\nand complementary capabilities. Extensive experiments demonstrate that our\ninstruction tuning variant, Cola-FT, achieves state-of-the-art performance on\nvisual question answering (VQA), outside knowledge VQA, visual entailment, and\nvisual spatial reasoning tasks. Moreover, we show that our in-context learning\nvariant, Cola-Zero, exhibits competitive performance in zero and few-shot\nsettings, without finetuning. Through systematic ablation studies and\nvisualizations, we validate that a coordinator LLM indeed comprehends the\ninstruction prompts as well as the separate functionalities of VLMs; it then\ncoordinates them to enable impressive visual reasoning capabilities.", "comment": "Accepted at NeurIPS 2023", "links": []}
{"entry_id": "2309.10532", "title": "A Cognitively-Inspired Neural Architecture for Visual Abstract Reasoning Using Contrastive Perceptual and Conceptual Processing", "authors": ["Yuan Yang", "Deepayan Sanyal", "James Ainooson", "Joel Michelson", "Effat Farhana", "Maithilee Kunda"], "published": "2023-09-19 11:18:01", "updated": "2023-10-20 09:02:22", "summary": "We introduce a new neural architecture for solving visual abstract reasoning\ntasks inspired by human cognition, specifically by observations that human\nabstract reasoning often interleaves perceptual and conceptual processing as\npart of a flexible, iterative, and dynamic cognitive process. Inspired by this\nprinciple, our architecture models visual abstract reasoning as an iterative,\nself-contrasting learning process that pursues consistency between perceptual\nand conceptual processing of visual stimuli. We explain how this new\nContrastive Perceptual-Conceptual Network (CPCNet) works using matrix reasoning\nproblems in the style of the well-known Raven's Progressive Matrices\nintelligence test. Experiments on the machine learning dataset RAVEN show that\nCPCNet achieves higher accuracy than all previously published models while also\nusing the weakest inductive bias. We also point out a substantial and\npreviously unremarked class imbalance in the original RAVEN dataset, and we\npropose a new variant of RAVEN -- AB-RAVEN -- that is more balanced in terms of\nabstract concepts.", "comment": null, "links": []}
{"entry_id": "2310.10591", "title": "Interpreting and Controlling Vision Foundation Models via Text Explanations", "authors": ["Haozhe Chen", "Junfeng Yang", "Carl Vondrick", "Chengzhi Mao"], "published": "2023-10-16 17:12:06", "updated": "2023-10-16 17:12:06", "summary": "Large-scale pre-trained vision foundation models, such as CLIP, have become\nde facto backbones for various vision tasks. However, due to their black-box\nnature, understanding the underlying rules behind these models' predictions and\ncontrolling model behaviors have remained open challenges. We present a\nframework for interpreting vision transformer's latent tokens with natural\nlanguage. Given a latent token, our framework retains its semantic information\nto the final layer using transformer's local operations and retrieves the\nclosest text for explanation. Our approach enables understanding of model\nvisual reasoning procedure without needing additional model training or data\ncollection. Based on the obtained interpretations, our framework allows for\nmodel editing that controls model reasoning behaviors and improves model\nrobustness against biases and spurious correlations.", "comment": null, "links": []}
{"entry_id": "2309.16705", "title": "Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning", "authors": ["David Noever", "Samantha Elizabeth Miller Noever"], "published": "2023-08-17 03:14:00", "updated": "2023-10-14 19:53:39", "summary": "Addressing the gap in understanding visual comprehension in Large Language\nModels (LLMs), we designed a challenge-response study, subjecting Google Bard\nand GPT-Vision to 64 visual tasks, spanning categories like \"Visual Situational\nReasoning\" and \"Next Scene Prediction.\" Previous models, such as GPT4, leaned\nheavily on optical character recognition tools like Tesseract, whereas Bard and\nGPT-Vision, akin to Google Lens and Visual API, employ deep learning techniques\nfor visual text recognition. However, our findings spotlight both\nvision-language model's limitations: while proficient in solving visual\nCAPTCHAs that stump ChatGPT alone, it falters in recreating visual elements\nlike ASCII art or analyzing Tic Tac Toe grids, suggesting an over-reliance on\neducated visual guesses. The prediction problem based on visual inputs appears\nparticularly challenging with no common-sense guesses for next-scene\nforecasting based on current \"next-token\" multimodal models. This study\nprovides experimental insights into the current capacities and areas for\nimprovement in multimodal LLMs.", "comment": null, "links": []}
{"entry_id": "2309.06659", "title": "Beyond English: Centering Multilingualism in Data Visualization", "authors": ["Noëlle Rakotondravony", "Priya Dhawka", "Melanie Bancilhon"], "published": "2023-09-13 01:17:10", "updated": "2023-10-02 21:01:13", "summary": "Information visualization and natural language are intricately linked.\nHowever, the majority of research and relevant work in information and data\nvisualization (and human-computer interaction) involve English-speaking\npopulations as both researchers and participants, are published in English, and\nare presented predominantly at English-speaking venues. Although several\nsolutions can be proposed such as translating English texts in visualization to\nother languages, there is little research that looks at the intersection of\ndata visualization and different languages, and the implications that current\nvisualization practices have on non-English speaking communities. In this\nposition paper, we argue that linguistically diverse communities abound beyond\nthe English-speaking world and offer a richness of experiences for the\nvisualization research community to engage with. Through a case study of how\ntwo non-English languages interplay with data visualization reasoning in\nMadagascar, we describe how monolingualism in data visualization impacts the\nexperiences of underrepresented populations and emphasize potential harm to\nthese communities. Lastly, we raise several questions towards advocating for\nmore inclusive visualization practices that center the diverse experiences of\nlinguistically underrepresented populations.", "comment": "5 pages, 1 figure, Visualization for Social Good @VIS23", "links": []}
{"entry_id": "2308.16463", "title": "Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models", "authors": ["Yupan Huang", "Zaiqiao Meng", "Fangyu Liu", "Yixuan Su", "Nigel Collier", "Yutong Lu"], "published": "2023-08-31 05:15:27", "updated": "2023-10-02 03:31:17", "summary": "Large language models exhibit enhanced zero-shot performance on various tasks\nwhen fine-tuned with instruction-following data. Multimodal\ninstruction-following models extend these capabilities by integrating both text\nand images. However, existing models such as MiniGPT-4 face challenges in\nmaintaining dialogue coherence in scenarios involving multiple images. A\nprimary reason is the lack of a specialized dataset for this critical\napplication. To bridge these gaps, we present SparklesChat, a multimodal\ninstruction-following model for open-ended dialogues across multiple images. To\nsupport the training, we introduce SparklesDialogue, the first\nmachine-generated dialogue dataset tailored for word-level interleaved\nmulti-image and text interactions. Furthermore, we construct SparklesEval, a\nGPT-assisted benchmark for quantitatively assessing a model's conversational\ncompetence across multiple images and dialogue turns. Our experiments validate\nthe effectiveness of SparklesChat in understanding and reasoning across\nmultiple images and dialogue turns. Specifically, SparklesChat outperformed\nMiniGPT-4 on established vision-and-language benchmarks, including the BISON\nbinary image selection task and the NLVR2 visual reasoning task. Moreover,\nSparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding\nMiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative\nevaluations further demonstrate SparklesChat's generality in handling\nreal-world applications. All resources are available at\nhttps://github.com/HYPJUDY/Sparkles.", "comment": "Reduced main content to 9 pages; typos corrected", "links": []}
{"entry_id": "2305.10503", "title": "OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields", "authors": ["Youtan Yin", "Zhoujie Fu", "Fan Yang", "Guosheng Lin"], "published": "2023-05-17 18:18:05", "updated": "2023-09-29 02:36:03", "summary": "The emergence of Neural Radiance Fields (NeRF) for novel view synthesis has\nincreased interest in 3D scene editing. An essential task in editing is\nremoving objects from a scene while ensuring visual reasonability and multiview\nconsistency. However, current methods face challenges such as time-consuming\nobject labeling, limited capability to remove specific targets, and compromised\nrendering quality after removal. This paper proposes a novel object-removing\npipeline, named OR-NeRF, that can remove objects from 3D scenes with user-given\npoints or text prompts on a single view, achieving better performance in less\ntime than previous works. Our method spreads user annotations to all views\nthrough 3D geometry and sparse correspondence, ensuring 3D consistency with\nless processing burden. Then recent 2D segmentation model Segment-Anything\n(SAM) is applied to predict masks, and a 2D inpainting model is used to\ngenerate color supervision. Finally, our algorithm applies depth supervision\nand perceptual loss to maintain consistency in geometry and appearance after\nobject removal. Experimental results demonstrate that our method achieves\nbetter editing quality with less time than previous works, considering both\nquality and quantity.", "comment": "project site: https://ornerf.github.io/ (codes available)", "links": []}
{"entry_id": "2309.08587", "title": "Compositional Foundation Models for Hierarchical Planning", "authors": ["Anurag Ajay", "Seungwook Han", "Yilun Du", "Shuang Li", "Abhi Gupta", "Tommi Jaakkola", "Josh Tenenbaum", "Leslie Kaelbling", "Akash Srivastava", "Pulkit Agrawal"], "published": "2023-09-15 17:44:05", "updated": "2023-09-21 14:49:20", "summary": "To make effective decisions in novel environments with long-horizon goals, it\nis crucial to engage in hierarchical reasoning across spatial and temporal\nscales. This entails planning abstract subgoal sequences, visually reasoning\nabout the underlying plans, and executing actions in accordance with the\ndevised plan through visual-motor control. We propose Compositional Foundation\nModels for Hierarchical Planning (HiP), a foundation model which leverages\nmultiple expert foundation model trained on language, vision and action data\nindividually jointly together to solve long-horizon tasks. We use a large\nlanguage model to construct symbolic plans that are grounded in the environment\nthrough a large video diffusion model. Generated video plans are then grounded\nto visual-motor control, through an inverse dynamics model that infers actions\nfrom generated videos. To enable effective reasoning within this hierarchy, we\nenforce consistency between the models via iterative refinement. We illustrate\nthe efficacy and adaptability of our approach in three different long-horizon\ntable-top manipulation tasks.", "comment": "Website: https://hierarchical-planning-foundation-model.github.io/", "links": []}
{"entry_id": "2309.11080", "title": "Visual Question Answering in the Medical Domain", "authors": ["Louisa Canepa", "Sonit Singh", "Arcot Sowmya"], "published": "2023-09-20 06:06:10", "updated": "2023-09-20 06:06:10", "summary": "Medical visual question answering (Med-VQA) is a machine learning task that\naims to create a system that can answer natural language questions based on\ngiven medical images. Although there has been rapid progress on the general VQA\ntask, less progress has been made on Med-VQA due to the lack of large-scale\nannotated datasets. In this paper, we present domain-specific pre-training\nstrategies, including a novel contrastive learning pretraining method, to\nmitigate the problem of small datasets for the Med-VQA task. We find that the\nmodel benefits from components that use fewer parameters. We also evaluate and\ndiscuss the model's visual reasoning using evidence verification techniques.\nOur proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set,\ngiving comparable results to other state-of-the-art Med-VQA models.", "comment": "8 pages, 7 figures, Accepted to DICTA 2023 Conference", "links": []}
{"entry_id": "2308.09033", "title": "Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks", "authors": ["Fawaz Sammani", "Nikos Deligiannis"], "published": "2023-08-17 15:15:55", "updated": "2023-09-19 09:38:18", "summary": "Natural Language Explanations (NLE) aim at supplementing the prediction of a\nmodel with human-friendly natural text. Existing NLE approaches involve\ntraining separate models for each downstream task. In this work, we propose\nUni-NLX, a unified framework that consolidates all NLE tasks into a single and\ncompact multi-task model using a unified training objective of text generation.\nAdditionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of\n144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of\n123K samples for explaining the task of Visual Question Answering (VQA). Both\ndatasets are derived leveraging large language models (LLMs). By training on\nthe 1M combined NLE samples, our single unified framework is capable of\nsimultaneously performing seven NLE tasks including VQA, visual recognition and\nvisual reasoning tasks with 7X fewer parameters, demonstrating comparable\nperformance to the independent task-specific models in previous approaches, and\nin certain tasks even outperforming them. Code is at\nhttps://github.com/fawazsammani/uni-nlx", "comment": "Accepted to ICCVW 2023", "links": []}
{"entry_id": "2202.04053", "title": "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models", "authors": ["Jaemin Cho", "Abhay Zala", "Mohit Bansal"], "published": "2022-02-08 18:36:52", "updated": "2023-08-30 18:41:01", "summary": "Recently, DALL-E, a multimodal transformer language model, and its variants,\nincluding diffusion models, have shown high-quality text-to-image generation\ncapabilities. However, despite the realistic image generation results, there\nhas not been a detailed analysis of how to evaluate such models. In this work,\nwe investigate the visual reasoning capabilities and social biases of different\ntext-to-image models, covering both multimodal transformer language models and\ndiffusion models. First, we measure three visual reasoning skills: object\nrecognition, object counting, and spatial relation understanding. For this, we\npropose PaintSkills, a compositional diagnostic evaluation dataset that\nmeasures these skills. Despite the high-fidelity image generation capability, a\nlarge gap exists between the performance of recent models and the upper bound\naccuracy in object counting and spatial relation understanding skills. Second,\nwe assess the gender and skin tone biases by measuring the gender/skin tone\ndistribution of generated images across various professions and attributes. We\ndemonstrate that recent text-to-image generation models learn specific biases\nabout gender and skin tone from web image-text pairs. We hope our work will\nhelp guide future progress in improving text-to-image generation models on\nvisual reasoning skills and learning socially unbiased representations. Code\nand data: https://github.com/j-min/DallEval", "comment": "ICCV 2023 (34 pages; see appendix for version changelog)", "links": []}
{"entry_id": "2308.15887", "title": "On the Potential of CLIP for Compositional Logical Reasoning", "authors": ["Justin Brody"], "published": "2023-08-30 09:04:24", "updated": "2023-08-30 09:04:24", "summary": "In this paper we explore the possibility of using OpenAI's CLIP to perform\nlogically coherent grounded visual reasoning. To that end, we formalize our\nterms and give a geometric analysis of how embeddings in CLIP's latent space\nwould need to be configured in order for the system to be logically coherent.\nOur main conclusion is that, as usually configured, CLIP cannot perform such\nreasoning.", "comment": "In Proceedings ICLP 2023, arXiv:2308.14898", "links": ["http://dx.doi.org/10.4204/EPTCS.385.10"]}
{"entry_id": "2202.08806", "title": "Grammar-Based Grounded Lexicon Learning", "authors": ["Jiayuan Mao", "Haoyue Shi", "Jiajun Wu", "Roger P. Levy", "Joshua B. Tenenbaum"], "published": "2022-02-17 18:19:53", "updated": "2023-08-24 17:46:12", "summary": "We present Grammar-Based Grounded Lexicon Learning (G2L2), a lexicalist\napproach toward learning a compositional and grounded meaning representation of\nlanguage from grounded data, such as paired images and texts. At the core of\nG2L2 is a collection of lexicon entries, which map each word to a tuple of a\nsyntactic type and a neuro-symbolic semantic program. For example, the word\nshiny has a syntactic type of adjective; its neuro-symbolic semantic program\nhas the symbolic form {\\lambda}x. filter(x, SHINY), where the concept SHINY is\nassociated with a neural network embedding, which will be used to classify\nshiny objects. Given an input sentence, G2L2 first looks up the lexicon entries\nassociated with each token. It then derives the meaning of the sentence as an\nexecutable neuro-symbolic program by composing lexical meanings based on\nsyntax. The recovered meaning programs can be executed on grounded inputs. To\nfacilitate learning in an exponentially-growing compositional space, we\nintroduce a joint parsing and expected execution algorithm, which does local\nmarginalization over derivations to reduce the training time. We evaluate G2L2\non two domains: visual reasoning and language-driven navigation. Results show\nthat G2L2 can generalize from small amounts of data to novel compositions of\nwords.", "comment": "Minor typo fixes. NeurIPS 2021. Project page:\n  https://g2l2.csail.mit.edu/", "links": []}
{"entry_id": "2308.09658", "title": "Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning", "authors": ["Pengbo Hu", "Ji Qi", "Xingyu Li", "Hong Li", "Xinqi Wang", "Bing Quan", "Ruiyu Wang", "Yi Zhou"], "published": "2023-08-18 16:21:40", "updated": "2023-08-21 03:08:52", "summary": "There emerges a promising trend of using large language models (LLMs) to\ngenerate code-like plans for complex inference tasks such as visual reasoning.\nThis paradigm, known as LLM-based planning, provides flexibility in problem\nsolving and endows better interpretability. However, current research is mostly\nlimited to basic scenarios of simple questions that can be straightforward\nanswered in a few inference steps. Planning for the more challenging multi-hop\nvisual reasoning tasks remains under-explored. Specifically, under multi-hop\nreasoning situations, the trade-off between accuracy and the complexity of\nplan-searching becomes prominent. The prevailing algorithms either address the\nefficiency issue by employing the fast one-stop generation or adopt a complex\niterative generation method to improve accuracy. Both fail to balance the need\nfor efficiency and performance. Drawing inspiration from the dual system of\ncognition in the human brain, the fast and the slow think processes, we propose\na hierarchical plan-searching algorithm that integrates the one-stop reasoning\n(fast) and the Tree-of-thought (slow). Our approach succeeds in performance\nwhile significantly saving inference steps. Moreover, we repurpose the PTR and\nthe CLEVER datasets, developing a systematic framework for evaluating the\nperformance and efficiency of LLMs-based plan-search algorithms under reasoning\ntasks at different levels of difficulty. Extensive experiments demonstrate the\nsuperiority of our proposed algorithm in terms of performance and efficiency.\nThe dataset and code will be release soon.", "comment": "16 pages,1 figures, under review", "links": []}
{"entry_id": "2303.07274", "title": "Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images", "authors": ["Nitzan Bitton-Guetta", "Yonatan Bitton", "Jack Hessel", "Ludwig Schmidt", "Yuval Elovici", "Gabriel Stanovsky", "Roy Schwartz"], "published": "2023-03-13 16:49:43", "updated": "2023-08-12 22:37:31", "summary": "Weird, unusual, and uncanny images pique the curiosity of observers because\nthey challenge commonsense. For example, an image released during the 2022\nworld cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo\nplaying chess, which playfully violates our expectation that their competition\nshould occur on the football field. Humans can easily recognize and interpret\nthese unconventional images, but can AI models do the same? We introduce\nWHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is\ncomprised of purposefully commonsense-defying images created by designers using\npublicly-available image generation tools like Midjourney. We consider several\ntasks posed over the dataset. In addition to image captioning, cross-modal\nmatching, and visual question answering, we introduce a difficult explanation\ngeneration task, where models must identify and explain why a given image is\nunusual. Our results show that state-of-the-art models such as GPT3 and BLIP2\nstill lag behind human performance on WHOOPS!. We hope our dataset will inspire\nthe development of AI models with stronger visual commonsense reasoning\nabilities. Data, models and code are available at the project website:\nwhoops-benchmark.github.io", "comment": "Accepted to ICCV 2023. Website: whoops-benchmark.github.io", "links": []}
{"entry_id": "2307.16395", "title": "Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks", "authors": ["Kousik Rajesh", "Mrigank Raman", "Mohammed Asad Karim", "Pranit Chawla"], "published": "2023-07-31 03:57:31", "updated": "2023-07-31 03:57:31", "summary": "In recent times there has been a surge of multi-modal architectures based on\nLarge Language Models, which leverage the zero shot generation capabilities of\nLLMs and project image embeddings into the text space and then use the\nauto-regressive capacity to solve tasks such as VQA, captioning, and image\nretrieval. We name these architectures as \"bridge-architectures\" as they\nproject from the image space to the text space. These models deviate from the\ntraditional recipe of training transformer based multi-modal models, which\ninvolve using large-scale pre-training and complex multi-modal interactions\nthrough co or cross attention. However, the capabilities of bridge\narchitectures have not been tested on complex visual reasoning tasks which\nrequire fine grained analysis about the image. In this project, we investigate\nthe performance of these bridge-architectures on the NLVR2 dataset, and compare\nit to state-of-the-art transformer based architectures. We first extend the\ntraditional bridge architectures for the NLVR2 dataset, by adding object level\nfeatures to faciliate fine-grained object reasoning. Our analysis shows that\nadding object level features to bridge architectures does not help, and that\npre-training on multi-modal data is key for good performance on complex\nreasoning tasks such as NLVR2. We also demonstrate some initial results on a\nrecently bridge-architecture, LLaVA, in the zero shot setting and analyze its\nperformance.", "comment": null, "links": []}
{"entry_id": "2307.14142", "title": "LOIS: Looking Out of Instance Semantics for Visual Question Answering", "authors": ["Siyu Zhang", "Yeming Chen", "Yaoru Sun", "Fang Wang", "Haibo Shi", "Haoran Wang"], "published": "2023-07-26 12:13:00", "updated": "2023-07-26 12:13:00", "summary": "Visual question answering (VQA) has been intensively studied as a multimodal\ntask that requires effort in bridging vision and language to infer answers\ncorrectly. Recent attempts have developed various attention-based modules for\nsolving VQA tasks. However, the performance of model inference is largely\nbottlenecked by visual processing for semantics understanding. Most existing\ndetection methods rely on bounding boxes, remaining a serious challenge for VQA\nmodels to understand the causal nexus of object semantics in images and\ncorrectly infer contextual information. To this end, we propose a finer model\nframework without bounding boxes in this work, termed Looking Out of Instance\nSemantics (LOIS) to tackle this important issue. LOIS enables more fine-grained\nfeature descriptions to produce visual facts. Furthermore, to overcome the\nlabel ambiguity caused by instance masks, two types of relation attention\nmodules: 1) intra-modality and 2) inter-modality, are devised to infer the\ncorrect answers from the different multi-view features. Specifically, we\nimplement a mutual relation attention module to model sophisticated and deeper\nvisual semantic relations between instance objects and background information.\nIn addition, our proposed attention model can further analyze salient image\nregions by focusing on important word-related questions. Experimental results\non four benchmark VQA datasets prove that our proposed method has favorable\nperformance in improving visual reasoning capability.", "comment": null, "links": []}
{"entry_id": "2210.05335", "title": "MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model", "authors": ["Yatai Ji", "Junjie Wang", "Yuan Gong", "Lin Zhang", "Yanru Zhu", "Hongfa Wang", "Jiaxing Zhang", "Tetsuya Sakai", "Yujiu Yang"], "published": "2022-10-11 10:54:54", "updated": "2023-07-20 16:24:14", "summary": "Multimodal semantic understanding often has to deal with uncertainty, which\nmeans the obtained messages tend to refer to multiple targets. Such uncertainty\nis problematic for our interpretation, including inter- and intra-modal\nuncertainty. Little effort has studied the modeling of this uncertainty,\nparticularly in pre-training on unlabeled datasets and fine-tuning in\ntask-specific downstream datasets. In this paper, we project the\nrepresentations of all modalities as probabilistic distributions via a\nProbability Distribution Encoder (PDE) by utilizing sequence-level\ninteractions. Compared to the existing deterministic methods, such uncertainty\nmodeling can convey richer multimodal semantic information and more complex\nrelationships. Furthermore, we integrate uncertainty modeling with popular\npre-training frameworks and propose suitable pre-training tasks:\nDistribution-based Vision-Language Contrastive learning (D-VLC),\nDistribution-based Masked Language Modeling (D-MLM), and Distribution-based\nImage-Text Matching (D-ITM). The fine-tuned models are applied to challenging\ndownstream tasks, including image-text retrieval, visual question answering,\nvisual reasoning, and visual entailment, and achieve state-of-the-art results.", "comment": "CVPR 2023 Main Track Long Paper", "links": []}
{"entry_id": "2307.09437", "title": "Unsupervised Conditional Slot Attention for Object Centric Learning", "authors": ["Avinash Kori", "Francesco Locatello", "Francesca Toni", "Ben Glocker"], "published": "2023-07-18 17:11:55", "updated": "2023-07-18 17:11:55", "summary": "Extracting object-level representations for downstream reasoning tasks is an\nemerging area in AI. Learning object-centric representations in an unsupervised\nsetting presents multiple challenges, a key one being binding an arbitrary\nnumber of object instances to a specialized object slot. Recent object-centric\nrepresentation methods like Slot Attention utilize iterative attention to learn\ncomposable representations with dynamic inference level binding but fail to\nachieve specialized slot level binding. To address this, in this paper we\npropose Unsupervised Conditional Slot Attention using a novel Probabilistic\nSlot Dictionary (PSD). We define PSD with (i) abstract object-level property\nvectors as key and (ii) parametric Gaussian distribution as its corresponding\nvalue. We demonstrate the benefits of the learnt specific object-level\nconditioning distributions in multiple downstream tasks, namely object\ndiscovery, compositional scene generation, and compositional visual reasoning.\nWe show that our method provides scene composition capabilities and a\nsignificant boost in a few shot adaptability tasks of compositional visual\nreasoning, while performing similarly or better than slot attention in object\ndiscovery tasks", "comment": null, "links": []}
{"entry_id": "2307.07734", "title": "Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems", "authors": ["Fan Shi", "Bin Li", "Xiangyang Xue"], "published": "2023-07-15 07:16:38", "updated": "2023-07-15 07:16:38", "summary": "The abstract visual reasoning ability in human intelligence benefits\ndiscovering underlying rules in the novel environment. Raven's Progressive\nMatrix (RPM) is a classic test to realize such ability in machine intelligence\nby selecting from candidates. Recent studies suggest that solving RPM in an\nanswer-generation way boosts a more in-depth understanding of rules. However,\nexisting generative solvers cannot discover the global concept-changing rules\nwithout auxiliary supervision (e.g., rule annotations and distractors in\ncandidate sets). To this end, we propose a deep latent variable model for\nConcept-changing Rule ABstraction (CRAB) by learning interpretable concepts and\nparsing concept-changing rules in the latent space. With the iterative learning\nprocess, CRAB can automatically abstract global rules shared on the dataset on\neach concept and form the learnable prior knowledge of global rules. CRAB\noutperforms the baselines trained without auxiliary supervision in the\narbitrary-position answer generation task and achieves comparable and even\nhigher accuracy than the compared models trained with auxiliary supervision.\nFinally, we conduct experiments to illustrate the interpretability of CRAB in\nconcept learning, answer selection, and global rule abstraction.", "comment": null, "links": []}
{"entry_id": "2307.00928", "title": "Learning Differentiable Logic Programs for Abstract Visual Reasoning", "authors": ["Hikaru Shindo", "Viktor Pfanschilling", "Devendra Singh Dhami", "Kristian Kersting"], "published": "2023-07-03 11:02:40", "updated": "2023-07-03 11:02:40", "summary": "Visual reasoning is essential for building intelligent agents that understand\nthe world and perform problem-solving beyond perception. Differentiable forward\nreasoning has been developed to integrate reasoning with gradient-based machine\nlearning paradigms. However, due to the memory intensity, most existing\napproaches do not bring the best of the expressivity of first-order logic,\nexcluding a crucial ability to solve abstract visual reasoning, where agents\nneed to perform reasoning by using analogies on abstract concepts in different\nscenarios. To overcome this problem, we propose NEUro-symbolic Message-pAssiNg\nreasoNer (NEUMANN), which is a graph-based differentiable forward reasoner,\npassing messages in a memory-efficient manner and handling structured programs\nwith functors. Moreover, we propose a computationally-efficient structure\nlearning algorithm to perform explanatory program induction on complex visual\nscenes. To evaluate, in addition to conventional visual reasoning tasks, we\npropose a new task, visual reasoning behind-the-scenes, where agents need to\nlearn abstract programs and then answer queries by imagining scenes that are\nnot observed. We empirically demonstrate that NEUMANN solves visual reasoning\ntasks efficiently, outperforming neural, symbolic, and neuro-symbolic\nbaselines.", "comment": "under review", "links": []}
{"entry_id": "2306.07743", "title": "V-LoL: A Diagnostic Dataset for Visual Logical Learning", "authors": ["Lukas Helff", "Wolfgang Stammer", "Hikaru Shindo", "Devendra Singh Dhami", "Kristian Kersting"], "published": "2023-06-13 13:00:10", "updated": "2023-07-03 10:24:33", "summary": "Despite the successes of recent developments in visual AI, different\nshortcomings still exist; from missing exact logical reasoning, to abstract\ngeneralization abilities, to understanding complex and noisy scenes.\nUnfortunately, existing benchmarks, were not designed to capture more than a\nfew of these aspects. Whereas deep learning datasets focus on visually complex\ndata but simple visual reasoning tasks, inductive logic datasets involve\ncomplex logical learning tasks, however, lack the visual component. To address\nthis, we propose the visual logical learning dataset, V-LoL, that seamlessly\ncombines visual and logical challenges. Notably, we introduce the first\ninstantiation of V-LoL, V-LoL-Trains, -- a visual rendition of a classic\nbenchmark in symbolic AI, the Michalski train problem. By incorporating\nintricate visual scenes and flexible logical reasoning tasks within a versatile\nframework, V-LoL-Trains provides a platform for investigating a wide range of\nvisual logical learning challenges. We evaluate a variety of AI systems\nincluding traditional symbolic AI, neural AI, as well as neuro-symbolic AI. Our\nevaluations demonstrate that even state-of-the-art AI faces difficulties in\ndealing with visual logical learning challenges, highlighting unique advantages\nand limitations specific to each methodology. Overall, V-LoL opens up new\navenues for understanding and enhancing current abilities in visual logical\nlearning for AI systems.", "comment": null, "links": []}
{"entry_id": "2306.16774", "title": "Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages", "authors": ["Yasmine Karoui", "Rémi Lebret", "Negar Foroutan", "Karl Aberer"], "published": "2023-06-29 08:20:57", "updated": "2023-06-29 08:20:57", "summary": "Vision-Language Pre-training (VLP) has advanced the performance of many\nvision-language tasks, such as image-text retrieval, visual entailment, and\nvisual reasoning. The pre-training mostly utilizes lexical databases and image\nqueries in English. Previous work has demonstrated that the pre-training in\nEnglish does not transfer well to other languages in a zero-shot setting.\nHowever, multilingual pre-trained language models (MPLM) have excelled at a\nvariety of single-modal language tasks. In this paper, we propose a simple yet\nefficient approach to adapt VLP to unseen languages using MPLM. We utilize a\ncross-lingual contextualized token embeddings alignment approach to train text\nencoders for non-English languages. Our approach does not require image input\nand primarily uses machine translation, eliminating the need for target\nlanguage data. Our evaluation across three distinct tasks (image-text\nretrieval, visual entailment, and natural language visual reasoning)\ndemonstrates that this approach outperforms the state-of-the-art multilingual\nvision-language models without requiring large parallel corpora. Our code is\navailable at https://github.com/Yasminekaroui/CliCoTea.", "comment": "Accepted to ACL 2023 as short paper", "links": []}
{"entry_id": "2306.14650", "title": "PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture", "authors": ["Mohit Vaishnav"], "published": "2023-06-26 12:40:12", "updated": "2023-06-28 08:22:14", "summary": "We investigate the role of attention and memory in complex reasoning tasks.\nWe analyze Transformer-based self-attention as a model and extend it with\nmemory. By studying a synthetic visual reasoning test, we refine the taxonomy\nof reasoning tasks. Incorporating self-attention with ResNet50, we enhance\nfeature maps using feature-based and spatial attention, achieving efficient\nsolving of challenging visual reasoning tasks. Our findings contribute to\nunderstanding the attentional needs of SVRT tasks. Additionally, we propose\nGAMR, a cognitive architecture combining attention and memory, inspired by\nactive vision theory. GAMR outperforms other architectures in sample\nefficiency, robustness, and compositionality, and shows zero-shot\ngeneralization on new reasoning tasks.", "comment": "PhD Thesis, 152 pages, 32 figures, 6 tables", "links": []}
{"entry_id": "2303.04091", "title": "Abstract Visual Reasoning Enabled by Language", "authors": ["Giacomo Camposampiero", "Loic Houmard", "Benjamin Estermann", "Joël Mathys", "Roger Wattenhofer"], "published": "2023-03-07 17:52:46", "updated": "2023-06-22 10:41:41", "summary": "While artificial intelligence (AI) models have achieved human or even\nsuperhuman performance in many well-defined applications, they still struggle\nto show signs of broad and flexible intelligence. The Abstraction and Reasoning\nCorpus (ARC), a visual intelligence benchmark introduced by Fran\\c{c}ois\nChollet, aims to assess how close AI systems are to human-like cognitive\nabilities. Most current approaches rely on carefully handcrafted\ndomain-specific program searches to brute-force solutions for the tasks present\nin ARC. In this work, we propose a general learning-based framework for solving\nARC. It is centered on transforming tasks from the vision to the language\ndomain. This composition of language and vision allows for pre-trained models\nto be leveraged at each stage, enabling a shift from handcrafted priors towards\nthe learned priors of the models. While not yet beating state-of-the-art models\non ARC, we demonstrate the potential of our approach, for instance, by solving\nsome ARC tasks that have not been solved previously.", "comment": "The first two authors have contributed equally to this work. Accepted\n  as regular paper at CVPR 2023 Workshop and Challenges for New Frontiers in\n  Visual Language Reasoning: Compositionality, Prompts and Causality (NFVLR)", "links": []}
{"entry_id": "2210.04183", "title": "MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning", "authors": ["Zijia Zhao", "Longteng Guo", "Xingjian He", "Shuai Shao", "Zehuan Yuan", "Jing Liu"], "published": "2022-10-09 06:31:15", "updated": "2023-06-14 07:26:20", "summary": "Multimodal representation learning has shown promising improvements on\nvarious vision-language tasks. Most existing methods excel at building\nglobal-level alignment between vision and language while lacking effective\nfine-grained image-text interaction. In this paper, we propose a jointly masked\nmultimodal modeling method to learn fine-grained multimodal representations.\nOur method performs joint masking on image-text input and integrates both\nimplicit and explicit targets for the masked signals to recover. The implicit\ntarget provides a unified and debiased objective for vision and language, where\nthe model predicts latent multimodal representations of the unmasked input. The\nexplicit target further enriches the multimodal representations by recovering\nhigh-level and semantically meaningful information: momentum visual features of\nimage patches and concepts of word tokens. Through such a masked modeling\nprocess, our model not only learns fine-grained multimodal interaction, but\nalso avoids the semantic gap between high-level representations and low- or\nmid-level prediction targets (e.g. image pixels), thus producing semantically\nrich multimodal representations that perform well on both zero-shot and\nfine-tuned settings. Our pre-trained model (named MAMO) achieves\nstate-of-the-art performance on various downstream vision-language tasks,\nincluding image-text retrieval, visual question answering, visual reasoning,\nand weakly-supervised visual grounding.", "comment": "SIGIR 2023, 10 pages", "links": ["http://dx.doi.org/10.1145/3539618.3591721"]}
{"entry_id": "2306.06272", "title": "A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds", "authors": ["Shiwali Mohan", "Wiktor Piotrowski", "Roni Stern", "Sachin Grover", "Sookyung Kim", "Jacob Le", "Johan De Kleer"], "published": "2023-06-09 21:54:13", "updated": "2023-06-09 21:54:13", "summary": "Model-based reasoning agents are ill-equipped to act in novel situations in\nwhich their model of the environment no longer sufficiently represents the\nworld. We propose HYDRA - a framework for designing model-based agents\noperating in mixed discrete-continuous worlds, that can autonomously detect\nwhen the environment has evolved from its canonical setup, understand how it\nhas evolved, and adapt the agents' models to perform effectively. HYDRA is\nbased upon PDDL+, a rich modeling language for planning in mixed,\ndiscrete-continuous environments. It augments the planning module with visual\nreasoning, task selection, and action execution modules for closed-loop\ninteraction with complex environments. HYDRA implements a novel meta-reasoning\nprocess that enables the agent to monitor its own behavior from a variety of\naspects. The process employs a diverse set of computational methods to maintain\nexpectations about the agent's own behavior in an environment. Divergences from\nthose expectations are useful in detecting when the environment has evolved and\nidentifying opportunities to adapt the underlying models. HYDRA builds upon\nideas from diagnosis and repair and uses a heuristics-guided search over model\nchanges such that they become competent in novel conditions. The HYDRA\nframework has been used to implement novelty-aware agents for three diverse\ndomains - CartPole++ (a higher dimension variant of a classic control problem),\nScience Birds (an IJCAI competition problem), and PogoStick (a specific problem\ndomain in Minecraft). We report empirical observations from these domains to\ndemonstrate the efficacy of various components in the novelty meta-reasoning\nprocess.", "comment": "Under review in Artificial Intelligence Journal - Open World Learning\n  track", "links": []}
{"entry_id": "2212.09737", "title": "Position-guided Text Prompt for Vision-Language Pre-training", "authors": ["Alex Jinpeng Wang", "Pan Zhou", "Mike Zheng Shou", "Shuicheng Yan"], "published": "2022-12-19 18:55:43", "updated": "2023-06-07 06:28:18", "summary": "Vision-Language Pre-Training (VLP) has shown promising capabilities to align\nimage and text pairs, facilitating a broad variety of cross-modal learning\ntasks. However, we observe that VLP models often lack the visual\ngrounding/localization capability which is critical for many downstream tasks\nsuch as visual reasoning. In this work, we propose a novel Position-guided Text\nPrompt (PTP) paradigm to enhance the visual grounding ability of cross-modal\nmodels trained with VLP. Specifically, in the VLP phase, PTP divides the image\ninto $N\\times N$ blocks, and identifies the objects in each block through the\nwidely used object detector in VLP. It then reformulates the visual grounding\ntask into a fill-in-the-blank problem given a PTP by encouraging the model to\npredict the objects in the given blocks or regress the blocks of a given\nobject, e.g. filling `P\" or ``O\" in aPTP ``The block P has a O\". This mechanism\nimproves the visual grounding capability of VLP models and thus helps them\nbetter handle various downstream tasks. By introducing PTP into several\nstate-of-the-art VLP frameworks, we observe consistently significant\nimprovements across representative cross-modal learning model architectures and\nseveral benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average\nrecall@1) for ViLT \\cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr)\nfor SOTA BLIP \\cite{blip} baseline. Moreover, PTP achieves comparable results\nwith object-detector based methods, and much faster inference speed since PTP\ndiscards its object detector for inference while the later cannot. Our code and\npre-trained weight will be released at \\url{https://github.com/sail-sg/ptp}.", "comment": "Camera-ready version, code is in https://github.com/sail-sg/ptp", "links": []}
{"entry_id": "2212.00259", "title": "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning", "authors": ["Zhuowan Li", "Xingrui Wang", "Elias Stengel-Eskin", "Adam Kortylewski", "Wufei Ma", "Benjamin Van Durme", "Alan Yuille"], "published": "2022-12-01 03:53:24", "updated": "2023-06-01 03:57:12", "summary": "Visual Question Answering (VQA) models often perform poorly on\nout-of-distribution data and struggle on domain generalization. Due to the\nmulti-modal nature of this task, multiple factors of variation are intertwined,\nmaking generalization difficult to analyze. This motivates us to introduce a\nvirtual benchmark, Super-CLEVR, where different factors in VQA domain shifts\ncan be isolated in order that their effects can be studied independently. Four\nfactors are considered: visual complexity, question redundancy, concept\ndistribution and concept compositionality. With controllably generated data,\nSuper-CLEVR enables us to test VQA methods in situations where the test data\ndiffers from the training data along each of these axes. We study four existing\nmethods, including two neural symbolic methods NSCL and NSVQA, and two\nnon-symbolic methods FiLM and mDETR; and our proposed method, probabilistic\nNSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA\noutperforms other methods on three of the four domain shift factors. Our\nresults suggest that disentangling reasoning and perception, combined with\nprobabilistic uncertainty, form a strong VQA model that is more robust to\ndomain shifts. The dataset and code are released at\nhttps://github.com/Lizw14/Super-CLEVR.", "comment": "Published in CVPR 2023 as Highlight. Data and code are released at\n  https://github.com/Lizw14/Super-CLEVR", "links": []}
{"entry_id": "2211.01994", "title": "lilGym: Natural Language Visual Reasoning with Reinforcement Learning", "authors": ["Anne Wu", "Kianté Brantley", "Noriyuki Kojima", "Yoav Artzi"], "published": "2022-11-03 17:08:26", "updated": "2023-05-29 15:44:36", "summary": "We present lilGym, a new benchmark for language-conditioned reinforcement\nlearning in visual environments. lilGym is based on 2,661 highly-compositional\nhuman-written natural language statements grounded in an interactive visual\nenvironment. We introduce a new approach for exact reward computation in every\npossible world state by annotating all statements with executable Python\nprograms. Each statement is paired with multiple start states and reward\nfunctions to form thousands of distinct Markov Decision Processes of varying\ndifficulty. We experiment with lilGym with different models and learning\nregimes. Our results and analysis show that while existing methods are able to\nachieve non-trivial performance, lilGym forms a challenging open problem.\nlilGym is available at https://lil.nlp.cornell.edu/lilgym/.", "comment": "ACL 2023 Long Paper", "links": []}
{"entry_id": "2305.14676", "title": "GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions", "authors": ["Woojeong Jin", "Subhabrata Mukherjee", "Yu Cheng", "Yelong Shen", "Weizhu Chen", "Ahmed Hassan Awadallah", "Damien Jose", "Xiang Ren"], "published": "2023-05-24 03:33:21", "updated": "2023-05-24 03:33:21", "summary": "Generalization to unseen tasks is an important ability for few-shot learners\nto achieve better zero-/few-shot performance on diverse tasks. However, such\ngeneralization to vision-language tasks including grounding and generation\ntasks has been under-explored; existing few-shot VL models struggle to handle\ntasks that involve object grounding and multiple images such as visual\ncommonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded\nvIsion Language aLigning, a novel VL model that can be generalized to diverse\ntasks including visual question answering, captioning, and grounding tasks with\nno or very few training instances. Specifically, GRILL learns object grounding\nand localization by exploiting object-text alignments, which enables it to\ntransfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model\non various zero-/few-shot VL tasks and show that it consistently surpasses the\nstate-of-the-art few-shot methods.", "comment": "Preprint", "links": []}
{"entry_id": "2210.00858", "title": "Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach", "authors": ["Georgios Tziafas", "Hamidreza Kasaei"], "published": "2022-10-03 12:21:45", "updated": "2023-05-07 17:06:49", "summary": "In this paper we present a neurosymbolic architecture for coupling\nlanguage-guided visual reasoning with robot manipulation. A non-expert human\nuser can prompt the robot using unconstrained natural language, providing a\nreferring expression (REF), a question (VQA), or a grasp action instruction.\nThe system tackles all cases in a task-agnostic fashion through the utilization\nof a shared library of primitive skills. Each primitive handles an independent\nsub-task, such as reasoning about visual attributes, spatial relation\ncomprehension, logic and enumeration, as well as arm control. A language parser\nmaps the input query to an executable program composed of such primitives,\ndepending on the context. While some primitives are purely symbolic operations\n(e.g. counting), others are trainable neural functions (e.g. visual grounding),\ntherefore marrying the interpretability and systematic generalization benefits\nof discrete symbolic approaches with the scalability and representational power\nof deep networks. We generate a 3D vision-and-language synthetic dataset of\ntabletop scenes in a simulation environment to train our approach and perform\nextensive evaluations in both synthetic and real-world scenes. Results showcase\nthe benefits of our approach in terms of accuracy, sample-efficiency, and\nrobustness to the user's vocabulary, while being transferable to real-world\nscenes with few-shot visual fine-tuning. Finally, we integrate our method with\na robot framework and demonstrate how it can serve as an interpretable solution\nfor an interactive object-picking task, both in simulation and with a real\nrobot. We make our datasets available in\nhttps://gtziafas.github.io/neurosymbolic-manipulation.", "comment": "Submitted T-RO", "links": []}
{"entry_id": "2305.01668", "title": "Visual Reasoning: from State to Transformation", "authors": ["Xin Hong", "Yanyan Lan", "Liang Pang", "Jiafeng Guo", "Xueqi Cheng"], "published": "2023-05-02 14:24:12", "updated": "2023-05-02 14:24:12", "summary": "Most existing visual reasoning tasks, such as CLEVR in VQA, ignore an\nimportant factor, i.e.~transformation. They are solely defined to test how well\nmachines understand concepts and relations within static settings, like one\nimage. Such \\textbf{state driven} visual reasoning has limitations in\nreflecting the ability to infer the dynamics between different states, which\nhas shown to be equally important for human cognition in Piaget's theory. To\ntackle this problem, we propose a novel \\textbf{transformation driven} visual\nreasoning (TVR) task. Given both the initial and final states, the target\nbecomes to infer the corresponding intermediate transformation. Following this\ndefinition, a new synthetic dataset namely TRANCE is first constructed on the\nbasis of CLEVR, including three levels of settings, i.e.~Basic (single-step\ntransformation), Event (multi-step transformation), and View (multi-step\ntransformation with variant views). Next, we build another real dataset called\nTRANCO based on COIN, to cover the loss of transformation diversity on TRANCE.\nInspired by human reasoning, we propose a three-staged reasoning framework\ncalled TranNet, including observing, analyzing, and concluding, to test how\nrecent advanced techniques perform on TVR. Experimental results show that the\nstate-of-the-art visual reasoning models perform well on Basic, but are still\nfar from human-level intelligence on Event, View, and TRANCO. We believe the\nproposed new paradigm will boost the development of machine visual reasoning.\nMore advanced methods and new problems need to be investigated in this\ndirection. The resource of TVR is available at\n\\url{https://hongxin2019.github.io/TVR/}.", "comment": "Accepted by TPAMI. arXiv admin note: substantial text overlap with\n  arXiv:2011.13160", "links": ["http://dx.doi.org/10.1109/TPAMI.2023.3268093"]}
{"entry_id": "2210.01338", "title": "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", "authors": ["Xu Yang", "Hanwang Zhang", "Chongyang Gao", "Jianfei Cai"], "published": "2022-10-04 03:09:50", "updated": "2023-04-24 02:27:07", "summary": "Humans tend to decompose a sentence into different parts like \\textsc{sth do\nsth at someplace} and then fill each part with certain content. Inspired by\nthis, we follow the \\textit{principle of modular design} to propose a novel\nimage captioner: learning to Collocate Visual-Linguistic Neural Modules\n(CVLNM). Unlike the \\re{widely used} neural module networks in VQA, where the\nlanguage (\\ie, question) is fully observable, \\re{the task of collocating\nvisual-linguistic modules is more challenging.} This is because the language is\nonly partially observable, for which we need to dynamically collocate the\nmodules during the process of image captioning. To sum up, we make the\nfollowing technical contributions to design and train our CVLNM: 1)\n\\textit{distinguishable module design} -- \\re{four modules in the encoder}\nincluding one linguistic module for function words and three visual modules for\ndifferent content words (\\ie, noun, adjective, and verb) and another linguistic\none in the decoder for commonsense reasoning, 2) a self-attention based\n\\textit{module controller} for robustifying the visual reasoning, 3) a\npart-of-speech based \\textit{syntax loss} imposed on the module controller for\nfurther regularizing the training of our CVLNM. Extensive experiments on the\nMS-COCO dataset show that our CVLNM is more effective, \\eg, achieving a new\nstate-of-the-art 129.5 CIDEr-D, and more robust, \\eg, being less likely to\noverfit to dataset bias and suffering less when fewer training samples are\navailable. Codes are available at \\url{https://github.com/GCYZSL/CVLMN}", "comment": "Accepted to IJCV. Codes are available at\n  https://github.com/GCYZSL/CVLMN", "links": []}
{"entry_id": "2304.07091", "title": "The role of object-centric representations, guided attention, and external memory on generalizing visual relations", "authors": ["Guillermo Puebla", "Jeffrey S. Bowers"], "published": "2023-04-14 12:22:52", "updated": "2023-04-14 12:22:52", "summary": "Visual reasoning is a long-term goal of vision research. In the last decade,\nseveral works have attempted to apply deep neural networks (DNNs) to the task\nof learning visual relations from images, with modest results in terms of the\ngeneralization of the relations learned. In recent years, several innovations\nin DNNs have been developed in order to enable learning abstract relation from\nimages. In this work, we systematically evaluate a series of DNNs that\nintegrate mechanism such as slot attention, recurrently guided attention, and\nexternal memory, in the simplest possible visual reasoning task: deciding\nwhether two objects are the same or different. We found that, although some\nmodels performed better than others in generalizing the same-different relation\nto specific types of images, no model was able to generalize this relation\nacross the board. We conclude that abstract visual reasoning remains largely an\nunresolved challenge for DNNs.", "comment": null, "links": []}
{"entry_id": "2205.13803", "title": "Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions", "authors": ["Huaizu Jiang", "Xiaojian Ma", "Weili Nie", "Zhiding Yu", "Yuke Zhu", "Song-Chun Zhu", "Anima Anandkumar"], "published": "2022-05-27 07:36:29", "updated": "2023-04-13 07:29:12", "summary": "A significant gap remains between today's visual pattern recognition models\nand human-level visual cognition especially when it comes to few-shot learning\nand compositional reasoning of novel concepts. We introduce Bongard-HOI, a new\nvisual reasoning benchmark that focuses on compositional learning of\nhuman-object interactions (HOIs) from natural images. It is inspired by two\ndesirable characteristics from the classical Bongard problems (BPs): 1)\nfew-shot concept learning, and 2) context-dependent reasoning. We carefully\ncurate the few-shot instances with hard negatives, where positive and negative\nimages only disagree on action labels, making mere recognition of object\ncategories insufficient to complete our benchmarks. We also design multiple\ntest sets to systematically study the generalization of visual learning models,\nwhere we vary the overlap of the HOI concepts between the training and test\nsets of few-shot instances, from partial to no overlaps. Bongard-HOI presents a\nsubstantial challenge to today's visual recognition models. The\nstate-of-the-art HOI detection model achieves only 62% accuracy on few-shot\nbinary prediction while even amateur human testers on MTurk have 91% accuracy.\nWith the Bongard-HOI benchmark, we hope to further advance research efforts in\nvisual reasoning, especially in holistic perception-reasoning systems and\nbetter representation learning.", "comment": "CVPR 2022 (oral); First two authors contributed equally; Code:\n  https://github.com/NVlabs/Bongard-HOI", "links": []}
{"entry_id": "2304.05402", "title": "Boosting Cross-task Transferability of Adversarial Patches with Visual Relations", "authors": ["Tony Ma", "Songze Li", "Yisong Xiao", "Shunchang Liu"], "published": "2023-04-11 11:43:57", "updated": "2023-04-11 11:43:57", "summary": "The transferability of adversarial examples is a crucial aspect of evaluating\nthe robustness of deep learning systems, particularly in black-box scenarios.\nAlthough several methods have been proposed to enhance cross-model\ntransferability, little attention has been paid to the transferability of\nadversarial examples across different tasks. This issue has become increasingly\nrelevant with the emergence of foundational multi-task AI systems such as\nVisual ChatGPT, rendering the utility of adversarial samples generated by a\nsingle task relatively limited. Furthermore, these systems often entail\ninferential functions beyond mere recognition-like tasks. To address this gap,\nwe propose a novel Visual Relation-based cross-task Adversarial Patch\ngeneration method called VRAP, which aims to evaluate the robustness of various\nvisual tasks, especially those involving visual reasoning, such as Visual\nQuestion Answering and Image Captioning. VRAP employs scene graphs to combine\nobject recognition-based deception with predicate-based relations elimination,\nthereby disrupting the visual reasoning information shared among inferential\ntasks. Our extensive experiments demonstrate that VRAP significantly surpasses\nprevious methods in terms of black-box transferability across diverse visual\nreasoning tasks.", "comment": null, "links": []}
{"entry_id": "2211.12817", "title": "Reason from Context with Self-supervised Learning", "authors": ["Xiao Liu", "Ankur Sikarwar", "Gabriel Kreiman", "Zenglin Shi", "Mengmi Zhang"], "published": "2022-11-23 10:02:05", "updated": "2023-04-11 07:17:38", "summary": "Self-supervised learning (SSL) learns to capture discriminative visual\nfeatures useful for knowledge transfers. To better accommodate the\nobject-centric nature of current downstream tasks such as object recognition\nand detection, various methods have been proposed to suppress contextual biases\nor disentangle objects from contexts. Nevertheless, these methods may prove\ninadequate in situations where object identity needs to be reasoned from\nassociated context, such as recognizing or inferring tiny or obscured objects.\nAs an initial effort in the SSL literature, we investigate whether and how\ncontextual associations can be enhanced for visual reasoning within SSL\nregimes, by (a) proposing a new Self-supervised method with external memories\nfor Context Reasoning (SeCo), and (b) introducing two new downstream tasks,\nlift-the-flap and object priming, addressing the problems of \"what\" and \"where\"\nin context reasoning. In both tasks, SeCo outperformed all state-of-the-art\n(SOTA) SSL methods by a significant margin. Our network analysis revealed that\nthe proposed external memory in SeCo learns to store prior contextual\nknowledge, facilitating target identity inference in the lift-the-flap task.\nMoreover, we conducted psychophysics experiments and introduced a Human\nbenchmark in Object Priming dataset (HOP). Our results demonstrate that SeCo\nexhibits human-like behaviors.", "comment": null, "links": []}
{"entry_id": "2304.04399", "title": "CAVL: Learning Contrastive and Adaptive Representations of Vision and Language", "authors": ["Shentong Mo", "Jingfei Xia", "Ihor Markevych"], "published": "2023-04-10 05:54:03", "updated": "2023-04-10 05:54:03", "summary": "Visual and linguistic pre-training aims to learn vision and language\nrepresentations together, which can be transferred to visual-linguistic\ndownstream tasks. However, there exists semantic confusion between language and\nvision during the pre-training stage. Moreover, current pre-trained models tend\nto take lots of computation resources for fine-tuning when transferred to\ndownstream tasks. In this work, we present a simple but effective approach for\nlearning Contrastive and Adaptive representations of Vision and Language,\nnamely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn\nalignments between the whole sentence and each image in the same batch during\nthe pre-training process. At the fine-tuning stage, we introduce two\nlightweight adaptation networks to reduce model parameters and increase\ntraining speed for saving computation resources. We evaluate our CAVL on six\nmain downstream tasks, including Visual Question Answering (VQA), Visual\nCommonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR),\nRegion-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot\nText-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior\nperformance and reduce the fine-tuning time by a large margin (in particular,\n76.17%). Extensive experiments and ablation studies demonstrate the efficiency\nof contrastive pre-training and adaptive fine-tuning proposed in our CAVL.", "comment": null, "links": []}
{"entry_id": "2304.03318", "title": "Explainable AI And Visual Reasoning: Insights From Radiology", "authors": ["Robert Kaufman", "David Kirsh"], "published": "2023-04-06 18:30:27", "updated": "2023-04-06 18:30:27", "summary": "Why do explainable AI (XAI) explanations in radiology, despite their promise\nof transparency, still fail to gain human trust? Current XAI approaches provide\njustification for predictions, however, these do not meet practitioners' needs.\nThese XAI explanations lack intuitive coverage of the evidentiary basis for a\ngiven classification, posing a significant barrier to adoption. We posit that\nXAI explanations that mirror human processes of reasoning and justification\nwith evidence may be more useful and trustworthy than traditional visual\nexplanations like heat maps. Using a radiology case study, we demonstrate how\nradiology practitioners get other practitioners to see a diagnostic\nconclusion's validity. Machine-learned classifications lack this evidentiary\ngrounding and consequently fail to elicit trust and adoption by potential\nusers. Insights from this study may generalize to guiding principles for\nhuman-centered explanation design based on human reasoning and justification of\nevidence.", "comment": "Accepted to 2023 Conference on Computer-Human Interaction (CHI)\n  Human-Centered Explainable AI Workshop, 8 pages", "links": []}
{"entry_id": "2304.01192", "title": "Navigating to Objects Specified by Images", "authors": ["Jacob Krantz", "Theophile Gervet", "Karmesh Yadav", "Austin Wang", "Chris Paxton", "Roozbeh Mottaghi", "Dhruv Batra", "Jitendra Malik", "Stefan Lee", "Devendra Singh Chaplot"], "published": "2023-04-03 17:58:00", "updated": "2023-04-03 17:58:00", "summary": "Images are a convenient way to specify which particular object instance an\nembodied agent should navigate to. Solving this task requires semantic visual\nreasoning and exploration of unknown environments. We present a system that can\nperform this task in both simulation and the real world. Our modular method\nsolves sub-tasks of exploration, goal instance re-identification, goal\nlocalization, and local navigation. We re-identify the goal instance in\negocentric vision using feature-matching and localize the goal instance by\nprojecting matched features to a map. Each sub-task is solved using\noff-the-shelf components requiring zero fine-tuning. On the HM3D\nInstanceImageNav benchmark, this system outperforms a baseline end-to-end RL\npolicy 7x and a state-of-the-art ImageNav model 2.3x (56% vs 25% success). We\ndeploy this system to a mobile robot platform and demonstrate effective\nreal-world performance, achieving an 88% success rate across a home and an\noffice environment.", "comment": null, "links": []}
{"entry_id": "2303.15006", "title": "Curriculum Learning for Compositional Visual Reasoning", "authors": ["Wafa Aissa", "Marin Ferecatu", "Michel Crucianu"], "published": "2023-03-27 08:47:18", "updated": "2023-03-27 08:47:18", "summary": "Visual Question Answering (VQA) is a complex task requiring large datasets\nand expensive training. Neural Module Networks (NMN) first translate the\nquestion to a reasoning path, then follow that path to analyze the image and\nprovide an answer. We propose an NMN method that relies on predefined\ncross-modal embeddings to ``warm start'' learning on the GQA dataset, then\nfocus on Curriculum Learning (CL) as a way to improve training and make a\nbetter use of the data. Several difficulty criteria are employed for defining\nCL methods. We show that by an appropriate selection of the CL method the cost\nof training and the amount of training data can be greatly reduced, with a\nlimited impact on the final VQA accuracy. Furthermore, we introduce\nintermediate losses during training and find that this allows to simplify the\nCL strategy.", "comment": null, "links": ["http://dx.doi.org/10.5220/0011895400003417"]}
{"entry_id": "2303.13483", "title": "NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations", "authors": ["Joy Hsu", "Jiayuan Mao", "Jiajun Wu"], "published": "2023-03-23 17:50:40", "updated": "2023-03-23 17:50:40", "summary": "Grounding object properties and relations in 3D scenes is a prerequisite for\na wide range of artificial intelligence tasks, such as visually grounded\ndialogues and embodied manipulation. However, the variability of the 3D domain\ninduces two fundamental challenges: 1) the expense of labeling and 2) the\ncomplexity of 3D grounded language. Hence, essential desiderata for models are\nto be data-efficient, generalize to different data distributions and tasks with\nunseen semantic forms, as well as ground complex language semantics (e.g.,\nview-point anchoring and multi-object reference). To address these challenges,\nwe propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates\nlanguage into programs with hierarchical structures by leveraging large\nlanguage-to-code models. Different functional modules in the programs are\nimplemented as neural networks. Notably, NS3D extends prior neuro-symbolic\nvisual reasoning methods by introducing functional modules that effectively\nreason about high-arity relations (i.e., relations among more than two\nobjects), key in disambiguating objects in complex 3D scenes. Modular and\ncompositional architecture enables NS3D to achieve state-of-the-art results on\nthe ReferIt3D view-dependence task, a 3D referring expression comprehension\nbenchmark. Importantly, NS3D shows significantly improved performance on\nsettings of data-efficiency and generalization, and demonstrate zero-shot\ntransfer to an unseen 3D question-answering task.", "comment": "In CVPR 2023", "links": []}
{"entry_id": "2205.00363", "title": "Visual Spatial Reasoning", "authors": ["Fangyu Liu", "Guy Emerson", "Nigel Collier"], "published": "2022-04-30 23:03:49", "updated": "2023-03-22 15:42:50", "summary": "Spatial relations are a basic part of human cognition. However, they are\nexpressed in natural language in a variety of ways, and previous work has\nsuggested that current vision-and-language models (VLMs) struggle to capture\nrelational information. In this paper, we present Visual Spatial Reasoning\n(VSR), a dataset containing more than 10k natural text-image pairs with 66\ntypes of spatial relations in English (such as: under, in front of, and\nfacing). While using a seemingly simple annotation format, we show how the\ndataset includes challenging linguistic phenomena, such as varying reference\nframes. We demonstrate a large gap between human and model performance: the\nhuman ceiling is above 95%, while state-of-the-art models only achieve around\n70%. We observe that VLMs' by-relation performances have little correlation\nwith the number of training examples and the tested models are in general\nincapable of recognising relations concerning the orientations of objects.", "comment": "TACL camera-ready version; code and data available at\n  https://github.com/cambridgeltl/visual-spatial-reasoning", "links": []}
{"entry_id": "2206.04928", "title": "GAMR: A Guided Attention Model for (visual) Reasoning", "authors": ["Mohit Vaishnav", "Thomas Serre"], "published": "2022-06-10 07:52:06", "updated": "2023-03-21 15:35:50", "summary": "Humans continue to outperform modern AI systems in their ability to flexibly\nparse and understand complex visual scenes. Here, we present a novel module for\nvisual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR),\nwhich instantiates an active vision theory -- positing that the brain solves\ncomplex visual reasoning problems dynamically -- via sequences of attention\nshifts to select and route task-relevant visual information into memory.\nExperiments on an array of visual reasoning tasks and datasets demonstrate\nGAMR's ability to learn visual routines in a robust and sample-efficient\nmanner. In addition, GAMR is shown to be capable of zero-shot generalization on\ncompletely novel reasoning tasks. Overall, our work provides computational\nsupport for cognitive theories that postulate the need for a critical interplay\nbetween attention and memory to dynamically maintain and manipulate\ntask-relevant visual information to solve complex visual reasoning tasks.", "comment": null, "links": []}
{"entry_id": "2303.11730", "title": "Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices", "authors": ["Jingyi Xu", "Tushar Vaidya", "Yufei Wu", "Saket Chandra", "Zhangsheng Lai", "Kai Fong Ernest Chong"], "published": "2023-03-21 10:34:39", "updated": "2023-03-21 10:34:39", "summary": "We introduce algebraic machine reasoning, a new reasoning framework that is\nwell-suited for abstract reasoning. Effectively, algebraic machine reasoning\nreduces the difficult process of novel problem-solving to routine algebraic\ncomputation. The fundamental algebraic objects of interest are the ideals of\nsome suitably initialized polynomial ring. We shall explain how solving Raven's\nProgressive Matrices (RPMs) can be realized as computational problems in\nalgebra, which combine various well-known algebraic subroutines that include:\nComputing the Gr\\\"obner basis of an ideal, checking for ideal containment, etc.\nCrucially, the additional algebraic structure satisfied by ideals allows for\nmore operations on ideals beyond set-theoretic operations.\n  Our algebraic machine reasoning framework is not only able to select the\ncorrect answer from a given answer set, but also able to generate the correct\nanswer with only the question matrix given. Experiments on the I-RAVEN dataset\nyield an overall $93.2\\%$ accuracy, which significantly outperforms the current\nstate-of-the-art accuracy of $77.0\\%$ and exceeds human performance at $84.4\\%$\naccuracy.", "comment": "Accepted at IEEE/CVF Conference on Computer Vision and Pattern\n  Recognition (CVPR) 2023. 30 pages, 7 figures (including supplementary\n  material). First three authors contributed equally. Code is available at:\n  https://github.com/Xu-Jingyi/AlgebraicMR", "links": []}
{"entry_id": "2303.11327", "title": "3D Concept Learning and Reasoning from Multi-View Images", "authors": ["Yining Hong", "Chunru Lin", "Yilun Du", "Zhenfang Chen", "Joshua B. Tenenbaum", "Chuang Gan"], "published": "2023-03-20 17:59:49", "updated": "2023-03-20 17:59:49", "summary": "Humans are able to accurately reason in 3D by gathering multi-view\nobservations of the surrounding world. Inspired by this insight, we introduce a\nnew large-scale benchmark for 3D multi-view visual question answering\n(3DMV-VQA). This dataset is collected by an embodied agent actively moving and\ncapturing RGB images in an environment using the Habitat simulator. In total,\nit consists of approximately 5k scenes, 600k images, paired with 50k questions.\nWe evaluate various state-of-the-art models for visual reasoning on our\nbenchmark and find that they all perform poorly. We suggest that a principled\napproach for 3D reasoning from multi-view images should be to infer a compact\n3D representation of the world from the multi-view images, which is further\ngrounded on open-vocabulary semantic concepts, and then to execute reasoning on\nthese 3D representations. As the first step towards this approach, we propose a\nnovel 3D concept learning and reasoning (3D-CLR) framework that seamlessly\ncombines these components via neural fields, 2D pre-trained vision-language\nmodels, and neural reasoning operators. Experimental results suggest that our\nframework outperforms baseline models by a large margin, but the challenge\nremains largely unsolved. We further perform an in-depth analysis of the\nchallenges and highlight potential future directions.", "comment": "CVPR 2023. Project page: https://vis-www.cs.umass.edu/3d-clr/", "links": []}
{"entry_id": "2303.10482", "title": "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", "authors": ["Shi Chen", "Qi Zhao"], "published": "2023-03-18 19:37:28", "updated": "2023-03-18 19:37:28", "summary": "Humans have the innate capability to answer diverse questions, which is\nrooted in the natural ability to correlate different concepts based on their\nsemantic relationships and decompose difficult problems into sub-tasks. On the\ncontrary, existing visual reasoning methods assume training samples that\ncapture every possible object and reasoning problem, and rely on black-boxed\nmodels that commonly exploit statistical priors. They have yet to develop the\ncapability to address novel objects or spurious biases in real-world scenarios,\nand also fall short of interpreting the rationales behind their decisions.\nInspired by humans' reasoning of the visual world, we tackle the aforementioned\nchallenges from a compositional perspective, and propose an integral framework\nconsisting of a principled object factorization method and a novel neural\nmodule network. Our factorization method decomposes objects based on their key\ncharacteristics, and automatically derives prototypes that represent a wide\nrange of objects. With these prototypes encoding important semantics, the\nproposed network then correlates objects by measuring their similarity on a\ncommon semantic space and makes decisions with a compositional reasoning\nprocess. It is capable of answering questions with diverse objects regardless\nof their availability during training, and overcoming the issues of biased\nquestion-answer distributions. In addition to the enhanced generalizability,\nour framework also provides an interpretable interface for understanding the\ndecision-making process of models. Our code is available at\nhttps://github.com/szzexpoi/POEM.", "comment": "To appear in CVPR 2023", "links": []}
{"entry_id": "2303.01046", "title": "Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos", "authors": ["Daizong Liu", "Pan Zhou"], "published": "2023-03-02 08:00:22", "updated": "2023-03-15 03:10:39", "summary": "Temporal sentence localization in videos (TSLV) aims to retrieve the most\ninterested segment in an untrimmed video according to a given sentence query.\nHowever, almost of existing TSLV approaches suffer from the same limitations:\n(1) They only focus on either frame-level or object-level visual representation\nlearning and corresponding correlation reasoning, but fail to integrate them\nboth; (2) They neglect to leverage the rich semantic contexts to further\nbenefit the query reasoning. To address these issues, in this paper, we propose\na novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN),\nwhich enables both visual- and semantic-aware query reasoning from object-level\nto frame-level. Specifically, we present a new graph memory mechanism to\nperform visual-semantic query reasoning: For visual reasoning, we design a\nvisual graph memory to leverage visual information of video; For semantic\nreasoning, a semantic graph memory is also introduced to explicitly leverage\nsemantic knowledge contained in the classes and attributes of video objects,\nand perform correlation reasoning in the semantic space. Experiments on three\ndatasets demonstrate that our HVSARN achieves a new state-of-the-art\nperformance.", "comment": "Accepted by ICASSP2023", "links": []}
{"entry_id": "2303.05983", "title": "New Benchmarks for Accountable Text-based Visual Re-creation", "authors": ["Zhiwei Zhang", "Yuliang Liu"], "published": "2023-03-10 15:35:11", "updated": "2023-03-10 15:35:11", "summary": "Given a command, humans can directly execute the action after thinking or\nchoose to reject it, with reasonable feedback at the same time. However, the\nbehavior of existing text-to-image generation methods are uncontrollable and\nirresponsible. In this paper, we construct extensive experiments to verify\nwhether they can be accountable (say no and explain why) for those prohibited\ninstructions. To this end, we define a novel text-based visual re-creation task\nand construct new synthetic CLEVR-NOT dataset (620K) and manually pictured\nFruit-NOT dataset (50K). In our method, one text-image pair as the query is fed\ninto the machine, and the model gives a yes or no answer after visual and\ntextual reasoning. If the answer is yes, the image auto-encoder and\nauto-regressive transformer must complete the visual re-creation under the\npremise of ensuring image quality, otherwise the system needs to explain why\nthe commands cannot be completed or prohibited. We provide a detailed analysis\nof experimental results in image quality, answer accuracy, and model behavior\nin the face of uncertainty and imperfect user queries. Our results demonstrate\nthe difficulty of a single model for both textual and visual reasoning. We also\nhope our explorations and findings can bring valuable insights about the\naccountability of text-based image generation models. Code and datasets can be\nfound at https://matrix-alpha.github.io.", "comment": "13 pages, 9 figures", "links": []}
{"entry_id": "2303.05952", "title": "Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning", "authors": ["Qian Jiang", "Changyou Chen", "Han Zhao", "Liqun Chen", "Qing Ping", "Son Dinh Tran", "Yi Xu", "Belinda Zeng", "Trishul Chilimbi"], "published": "2023-03-10 14:38:49", "updated": "2023-03-10 14:38:49", "summary": "Contrastive loss has been increasingly used in learning representations from\nmultiple modalities. In the limit, the nature of the contrastive loss\nencourages modalities to exactly match each other in the latent space. Yet it\nremains an open question how the modality alignment affects the downstream task\nperformance. In this paper, based on an information-theoretic argument, we\nfirst prove that exact modality alignment is sub-optimal in general for\ndownstream prediction tasks. Hence we advocate that the key of better\nperformance lies in meaningful latent modality structures instead of perfect\nmodality alignment. To this end, we propose three general approaches to\nconstruct latent modality structures. Specifically, we design 1) a deep feature\nseparation loss for intra-modality regularization; 2) a Brownian-bridge loss\nfor inter-modality regularization; and 3) a geometric consistency loss for both\nintra- and inter-modality regularization. Extensive experiments are conducted\non two popular multi-modal representation learning frameworks: the CLIP-based\ntwo-tower model and the ALBEF-based fusion model. We test our model on a\nvariety of tasks including zero/few-shot image classification, image-text\nretrieval, visual question answering, visual reasoning, and visual entailment.\nOur method achieves consistent improvements over existing methods,\ndemonstrating the effectiveness and generalizability of our proposed approach\non latent modality structure regularization.", "comment": "14 pages, 8 figure, CVPR 2023 accepted", "links": []}
{"entry_id": "2303.02814", "title": "Visual Analytics of Neuron Vulnerability to Adversarial Attacks on Convolutional Neural Networks", "authors": ["Yiran Li", "Junpeng Wang", "Takanori Fujiwara", "Kwan-Liu Ma"], "published": "2023-03-06 01:01:56", "updated": "2023-03-06 01:01:56", "summary": "Adversarial attacks on a convolutional neural network (CNN) -- injecting\nhuman-imperceptible perturbations into an input image -- could fool a\nhigh-performance CNN into making incorrect predictions. The success of\nadversarial attacks raises serious concerns about the robustness of CNNs, and\nprevents them from being used in safety-critical applications, such as medical\ndiagnosis and autonomous driving. Our work introduces a visual analytics\napproach to understanding adversarial attacks by answering two questions: (1)\nwhich neurons are more vulnerable to attacks and (2) which image features do\nthese vulnerable neurons capture during the prediction? For the first question,\nwe introduce multiple perturbation-based measures to break down the attacking\nmagnitude into individual CNN neurons and rank the neurons by their\nvulnerability levels. For the second, we identify image features (e.g., cat\nears) that highly stimulate a user-selected neuron to augment and validate the\nneuron's responsibility. Furthermore, we support an interactive exploration of\na large number of neurons by aiding with hierarchical clustering based on the\nneurons' roles in the prediction. To this end, a visual analytics system is\ndesigned to incorporate visual reasoning for interpreting adversarial attacks.\nWe validate the effectiveness of our system through multiple case studies as\nwell as feedback from domain experts.", "comment": "Accepted by the Special Issue on Human-Centered Explainable AI, ACM\n  Transactions on Interactive Intelligent Systems", "links": ["http://dx.doi.org/10.1145/3587470"]}
{"entry_id": "2209.09115", "title": "Compositional Law Parsing with Latent Random Functions", "authors": ["Fan Shi", "Bin Li", "Xiangyang Xue"], "published": "2022-09-15 06:57:23", "updated": "2023-02-25 08:26:16", "summary": "Human cognition has compositionality. We understand a scene by decomposing\nthe scene into different concepts (e.g., shape and position of an object) and\nlearning the respective laws of these concepts, which may be either natural\n(e.g., laws of motion) or man-made (e.g., laws of a game). The automatic\nparsing of these laws indicates the model's ability to understand the scene,\nwhich makes law parsing play a central role in many visual tasks. This paper\nproposes a deep latent variable model for Compositional LAw Parsing (CLAP),\nwhich achieves the human-like compositionality ability through an\nencoding-decoding architecture to represent concepts in the scene as latent\nvariables. CLAP employs concept-specific latent random functions instantiated\nwith Neural Processes to capture the law of concepts. Our experimental results\ndemonstrate that CLAP outperforms the baseline methods in multiple visual tasks\nsuch as intuitive physics, abstract visual reasoning, and scene representation.\nThe law manipulation experiments illustrate CLAP's interpretability by\nmodifying specific latent random functions on samples. For example, CLAP learns\nthe laws of position-changing and appearance constancy from the moving balls in\na scene, making it possible to exchange laws between samples or compose\nexisting laws into novel laws.", "comment": null, "links": []}
{"entry_id": "2302.02117", "title": "Learning to Agree on Vision Attention for Visual Commonsense Reasoning", "authors": ["Zhenyang Li", "Yangyang Guo", "Kejie Wang", "Fan Liu", "Liqiang Nie", "Mohan Kankanhalli"], "published": "2023-02-04 07:02:29", "updated": "2023-02-19 06:44:39", "summary": "Visual Commonsense Reasoning (VCR) remains a significant yet challenging\nresearch problem in the realm of visual reasoning. A VCR model generally aims\nat answering a textual question regarding an image, followed by the rationale\nprediction for the preceding answering process. Though these two processes are\nsequential and intertwined, existing methods always consider them as two\nindependent matching-based instances. They, therefore, ignore the pivotal\nrelationship between the two processes, leading to sub-optimal model\nperformance. This paper presents a novel visual attention alignment method to\nefficaciously handle these two processes in a unified framework. To achieve\nthis, we first design a re-attention module for aggregating the vision\nattention map produced in each process. Thereafter, the resultant two sets of\nattention maps are carefully aligned to guide the two processes to make\ndecisions based on the same image regions. We apply this method to both\nconventional attention and the recent Transformer models and carry out\nextensive experiments on the VCR benchmark dataset. The results demonstrate\nthat with the attention alignment module, our method achieves a considerable\nimprovement over the baseline methods, evidently revealing the feasibility of\nthe coupling of the two processes as well as the effectiveness of the proposed\nmethod.", "comment": null, "links": []}
{"entry_id": "2103.12045", "title": "Raven's Progressive Matrices Completion with Latent Gaussian Process Priors", "authors": ["Fan Shi", "Bin Li", "Xiangyang Xue"], "published": "2021-03-22 17:48:44", "updated": "2023-02-17 13:57:22", "summary": "Abstract reasoning ability is fundamental to human intelligence. It enables\nhumans to uncover relations among abstract concepts and further deduce implicit\nrules from the relations. As a well-known abstract visual reasoning task,\nRaven's Progressive Matrices (RPM) are widely used in human IQ tests. Although\nextensive research has been conducted on RPM solvers with machine intelligence,\nfew studies have considered further advancing the standard answer-selection\n(classification) problem to a more challenging answer-painting (generating)\nproblem, which can verify whether the model has indeed understood the implicit\nrules. In this paper we aim to solve the latter one by proposing a deep latent\nvariable model, in which multiple Gaussian processes are employed as priors of\nlatent variables to separately learn underlying abstract concepts from RPMs;\nthus the proposed model is interpretable in terms of concept-specific latent\nvariables. The latent Gaussian process also provides an effective way of\nextrapolation for answer painting based on the learned concept-changing rules.\nWe evaluate the proposed model on RPM-like datasets with multiple\ncontinuously-changing visual concepts. Experimental results demonstrate that\nour model requires only few training samples to paint high-quality answers,\ngenerate novel RPM panels, and achieve interpretability through\nconcept-specific latent variables.", "comment": null, "links": []}
{"entry_id": "2302.05608", "title": "Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis", "authors": ["Zhu Wang", "Sourav Medya", "Sathya N. Ravi"], "published": "2023-02-11 05:46:21", "updated": "2023-02-11 05:46:21", "summary": "Often, deep network models are purely inductive during training and while\nperforming inference on unseen data. Thus, when such models are used for\npredictions, it is well known that they often fail to capture the semantic\ninformation and implicit dependencies that exist among objects (or concepts) on\na population level. Moreover, it is still unclear how domain or prior modal\nknowledge can be specified in a backpropagation friendly manner, especially in\nlarge-scale and noisy settings. In this work, we propose an end-to-end vision\nand language model incorporating explicit knowledge graphs. We also introduce\nan interactive out-of-distribution (OOD) layer using implicit network operator.\nThe layer is used to filter noise that is brought by external knowledge base.\nIn practice, we apply our model on several vision and language downstream tasks\nincluding visual question answering, visual reasoning, and image-text retrieval\non different datasets. Our experiments show that it is possible to design\nmodels that perform similarly to state-of-art results but with significantly\nfewer samples and training time.", "comment": null, "links": []}
{"entry_id": "2302.07137", "title": "Deep Non-Monotonic Reasoning for Visual Abstract Reasoning Tasks", "authors": ["Yuan Yang", "Deepayan Sanyal", "Joel Michelson", "James Ainooson", "Maithilee Kunda"], "published": "2023-02-08 16:35:05", "updated": "2023-02-08 16:35:05", "summary": "While achieving unmatched performance on many well-defined tasks, deep\nlearning models have also been used to solve visual abstract reasoning tasks,\nwhich are relatively less well-defined, and have been widely used to measure\nhuman intelligence. However, current deep models struggle to match human\nabilities to solve such tasks with minimum data but maximum generalization. One\nlimitation is that current deep learning models work in a monotonic way, i.e.,\ntreating different parts of the input in essentially fixed orderings, whereas\npeople repeatedly observe and reason about the different parts of the visual\nstimuli until the reasoning process converges to a consistent conclusion, i.e.,\nnon-monotonic reasoning. This paper proposes a non-monotonic computational\napproach to solve visual abstract reasoning tasks. In particular, we\nimplemented a deep learning model using this approach and tested it on the\nRAVEN dataset -- a dataset inspired by the Raven's Progressive Matrices test.\nResults show that the proposed approach is more effective than existing\nmonotonic deep learning models, under strict experimental settings that\nrepresent a difficult variant of the RAVEN dataset problem.", "comment": null, "links": []}
{"entry_id": "2301.13741", "title": "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", "authors": ["Dachuan Shi", "Chaofan Tao", "Ying Jin", "Zhendong Yang", "Chun Yuan", "Jiaqi Wang"], "published": "2023-01-31 16:18:52", "updated": "2023-01-31 16:18:52", "summary": "Real-world data contains a vast amount of multimodal information, among which\nvision and language are the two most representative modalities. Moreover,\nincreasingly heavier models, e.g., Transformers, have attracted the attention\nof researchers to model compression. However, how to compress multimodal\nmodels, especially vison-language Transformers, is still under-explored. This\npaper proposes the \\textbf{U}nified and \\textbf{P}r\\textbf{o}gressive\n\\textbf{P}runing (UPop) as a universal vison-language Transformer compression\nframework, which incorporates 1) unifiedly searching multimodal subnets in a\ncontinuous optimization space from the original model, which enables automatic\nassignment of pruning ratios among compressible modalities and structures; 2)\nprogressively searching and retraining the subnet, which maintains convergence\nbetween the search and retrain to attain higher compression ratios. Experiments\non multiple generative and discriminative vision-language tasks, including\nVisual Reasoning, Image Caption, Visual Question Answer, Image-Text Retrieval,\nText-Image Retrieval, and Image Classification, demonstrate the effectiveness\nand versatility of the proposed UPop framework.", "comment": "16 pages, 5 figures, 13 tables", "links": []}
{"entry_id": "2301.05226", "title": "See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning", "authors": ["Zhenfang Chen", "Qinhong Zhou", "Yikang Shen", "Yining Hong", "Hao Zhang", "Chuang Gan"], "published": "2023-01-12 18:59:50", "updated": "2023-01-12 18:59:50", "summary": "Large pre-trained vision and language models have demonstrated remarkable\ncapacities for various tasks. However, solving the knowledge-based visual\nreasoning tasks remains challenging, which requires a model to comprehensively\nunderstand image content, connect the external world knowledge, and perform\nstep-by-step reasoning to answer the questions correctly. To this end, we\npropose a novel framework named Interactive Prompting Visual Reasoner (IPVR)\nfor few-shot knowledge-based visual reasoning. IPVR contains three stages, see,\nthink and confirm. The see stage scans the image and grounds the visual concept\ncandidates with a visual perception model. The think stage adopts a pre-trained\nlarge language model (LLM) to attend to the key concepts from candidates\nadaptively. It then transforms them into text context for prompting with a\nvisual captioning model and adopts the LLM to generate the answer. The confirm\nstage further uses the LLM to generate the supporting rationale to the answer,\nverify the generated rationale with a cross-modality classifier and ensure that\nthe rationale can infer the predicted output consistently. We conduct\nexperiments on a range of knowledge-based visual reasoning datasets. We found\nour IPVR enjoys several benefits, 1). it achieves better performance than the\nprevious few-shot learning baselines; 2). it enjoys the total transparency and\ntrustworthiness of the whole reasoning process by providing rationales for each\nreasoning step; 3). it is computation-efficient compared with other fine-tuning\nbaselines.", "comment": "The first two authors contributed equally to this work", "links": []}
{"entry_id": "2202.12626", "title": "Joint Answering and Explanation for Visual Commonsense Reasoning", "authors": ["Zhenyang Li", "Yangyang Guo", "Kejie Wang", "Yinwei Wei", "Liqiang Nie", "Mohan Kankanhalli"], "published": "2022-02-25 11:26:52", "updated": "2023-01-12 13:47:43", "summary": "Visual Commonsense Reasoning (VCR), deemed as one challenging extension of\nthe Visual Question Answering (VQA), endeavors to pursue a more high-level\nvisual comprehension. It is composed of two indispensable processes: question\nanswering over a given image and rationale inference for answer explanation.\nOver the years, a variety of methods tackling VCR have advanced the performance\non the benchmark dataset. Despite significant as these methods are, they often\ntreat the two processes in a separate manner and hence decompose the VCR into\ntwo irrelevant VQA instances. As a result, the pivotal connection between\nquestion answering and rationale inference is interrupted, rendering existing\nefforts less faithful on visual reasoning. To empirically study this issue, we\nperform some in-depth explorations in terms of both language shortcuts and\ngeneralization capability to verify the pitfalls of this treatment. Based on\nour findings, in this paper, we present a plug-and-play knowledge distillation\nenhanced framework to couple the question answering and rationale inference\nprocesses. The key contribution is the introduction of a novel branch, which\nserves as the bridge to conduct processes connecting. Given that our framework\nis model-agnostic, we apply it to the existing popular baselines and validate\nits effectiveness on the benchmark dataset. As detailed in the experimental\nresults, when equipped with our framework, these baselines achieve consistent\nand significant performance improvements, demonstrating the viability of\nprocesses coupling, as well as the superiority of the proposed framework.", "comment": null, "links": []}
{"entry_id": "2206.08358", "title": "MixGen: A New Multi-Modal Data Augmentation", "authors": ["Xiaoshuai Hao", "Yi Zhu", "Srikar Appalaraju", "Aston Zhang", "Wanqian Zhang", "Bo Li", "Mu Li"], "published": "2022-06-16 17:58:09", "updated": "2023-01-09 22:26:06", "summary": "Data augmentation is a necessity to enhance data efficiency in deep learning.\nFor vision-language pre-training, data is only augmented either for images or\nfor text in previous works. In this paper, we present MixGen: a joint data\naugmentation for vision-language representation learning to further improve\ndata efficiency. It generates new image-text pairs with semantic relationships\npreserved by interpolating images and concatenating text. It's simple, and can\nbe plug-and-played into existing pipelines. We evaluate MixGen on four\narchitectures, including CLIP, ViLT, ALBEF and TCL, across five downstream\nvision-language tasks to show its versatility and effectiveness. For example,\nadding MixGen in ALBEF pre-training leads to absolute performance improvements\non downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3%\non Flicker30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual\nreasoning (+$0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and\nvisual entailment (+0.4% on SNLI-VE).", "comment": "First three authors contributed equally. Code are available at\n  https://github.com/amazon-research/mix-generation. Oral presentation at WACV\n  2023 Pretraining Large Vision and Multimodal Models Workshop", "links": []}
{"entry_id": "2301.03094", "title": "A Divide-Align-Conquer Strategy for Program Synthesis", "authors": ["Jonas Witt", "Stef Rasing", "Sebastijan Dumančić", "Tias Guns", "Claus-Christian Carbon"], "published": "2023-01-08 19:10:55", "updated": "2023-01-08 19:10:55", "summary": "A major bottleneck in search-based program synthesis is the exponentially\ngrowing search space which makes learning large programs intractable. Humans\nmitigate this problem by leveraging the compositional nature of the real world:\nIn structured domains, a logical specification can often be decomposed into\nsmaller, complementary solution programs. We show that compositional\nsegmentation can be applied in the programming by examples setting to divide\nthe search for large programs across multiple smaller program synthesis\nproblems. For each example, we search for a decomposition into smaller units\nwhich maximizes the reconstruction accuracy in the output under a latent task\nprogram. A structural alignment of the constituent parts in the input and\noutput leads to pairwise correspondences used to guide the program synthesis\nsearch. In order to align the input/output structures, we make use of the\nStructure-Mapping Theory (SMT), a formal model of human analogical reasoning\nwhich originated in the cognitive sciences. We show that decomposition-driven\nprogram synthesis with structural alignment outperforms Inductive Logic\nProgramming (ILP) baselines on string transformation tasks even with minimal\nknowledge priors. Unlike existing methods, the predictive accuracy of our agent\nmonotonically increases for additional examples and achieves an average time\ncomplexity of $\\mathcal{O}(m)$ in the number $m$ of partial programs for highly\nstructured domains such as strings. We extend this method to the complex\nsetting of visual reasoning in the Abstraction and Reasoning Corpus (ARC) for\nwhich ILP methods were previously infeasible.", "comment": "11 pages, 9 figures", "links": []}
{"entry_id": "2201.05729", "title": "CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks", "authors": ["Zhecan Wang", "Noel Codella", "Yen-Chun Chen", "Luowei Zhou", "Jianwei Yang", "Xiyang Dai", "Bin Xiao", "Haoxuan You", "Shih-Fu Chang", "Lu Yuan"], "published": "2022-01-15 01:54:01", "updated": "2022-12-28 20:07:58", "summary": "Contrastive language-image pretraining (CLIP) links vision and language\nmodalities into a unified embedding space, yielding the tremendous potential\nfor vision-language (VL) tasks. While early concurrent works have begun to\nstudy this potential on a subset of tasks, important questions remain: 1) What\nis the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in\nlow-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches\nwithout impacting inference or pretraining complexity? In this work, we seek to\nanswer these questions through two key contributions. First, we introduce an\nevaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual\nEntailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of\ndata availability constraints and conditions of domain shift. Second, we\npropose an approach, named CLIP Targeted Distillation (CLIP-TD), to\nintelligently distill knowledge from CLIP into existing architectures using a\ndynamically weighted objective applied to adaptively selected tokens per\ninstance. Experiments demonstrate that our proposed CLIP-TD leads to\nexceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to\n71.3%) conditions of VCR, while simultaneously improving performance under\nstandard fully-supervised conditions (up to 2%), achieving state-of-art\nperformance on VCR compared to other single models that are pretrained with\nimage-text data only. On SNLI-VE, CLIP-TD produces significant gains in\nlow-shot conditions (up to 6.6%) as well as fully supervised (up to 3%). On\nVQA, CLIP-TD provides improvement in low-shot (up to 9%), and in\nfully-supervised (up to 1.3%). Finally, CLIP-TD outperforms concurrent works\nutilizing CLIP for finetuning, as well as baseline naive distillation\napproaches. Code will be made available.", "comment": "This paper is greatly modified and updated to be re-submitted to\n  another conference. The new paper is under the name \"Multimodal Adaptive\n  Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks\",\n  https://doi.org/10.48550/arXiv.2204.10496", "links": []}
{"entry_id": "2301.13007", "title": "EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry", "authors": ["Man Fai Wong", "Xintong Qi", "Chee Wei Tan"], "published": "2022-12-27 18:32:40", "updated": "2022-12-27 18:32:40", "summary": "In this paper, we present a deep learning-based framework for solving\ngeometric construction problems through visual reasoning, which is useful for\nautomated geometry theorem proving. Constructible problems in geometry often\nask for the sequence of straightedge-and-compass constructions to construct a\ngiven goal given some initial setup. Our EuclidNet framework leverages the\nneural network architecture Mask R-CNN to extract the visual features from the\ninitial setup and goal configuration with extra points of intersection, and\nthen generate possible construction steps as intermediary data models that are\nused as feedback in the training process for further refinement of the\nconstruction step sequence. This process is repeated recursively until either a\nsolution is found, in which case we backtrack the path for a step-by-step\nconstruction guide, or the problem is identified as unsolvable. Our EuclidNet\nframework is validated on complex Japanese Sangaku geometry problems,\ndemonstrating its capacity to leverage backtracking for deep visual reasoning\nof challenging problems.", "comment": "Accepted by 2nd MATH-AI Workshop at NeurIPS'22", "links": []}
{"entry_id": "2212.13296", "title": "VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges", "authors": ["Rufai Yusuf Zakari", "Jim Wilson Owusu", "Hailin Wang", "Ke Qin", "Zaharaddeen Karami Lawal", "Yuezhou Dong"], "published": "2022-12-26 20:56:01", "updated": "2022-12-26 20:56:01", "summary": "Artificial Intelligence (AI) and its applications have sparked extraordinary\ninterest in recent years. This achievement can be ascribed in part to advances\nin AI subfields including Machine Learning (ML), Computer Vision (CV), and\nNatural Language Processing (NLP). Deep learning, a sub-field of machine\nlearning that employs artificial neural network concepts, has enabled the most\nrapid growth in these domains. The integration of vision and language has\nsparked a lot of attention as a result of this. The tasks have been created in\nsuch a way that they properly exemplify the concepts of deep learning. In this\nreview paper, we provide a thorough and an extensive review of the state of the\narts approaches, key models design principles and discuss existing datasets,\nmethods, their problem formulation and evaluation measures for VQA and Visual\nreasoning tasks to understand vision and language representation learning. We\nalso present some potential future paths in this field of research, with the\nhope that our study may generate new ideas and novel approaches to handle\nexisting difficulties and develop new applications.", "comment": null, "links": []}
{"entry_id": "2212.10292", "title": "Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?", "authors": ["Monika Wysoczańska", "Tom Monnier", "Tomasz Trzciński", "David Picard"], "published": "2022-12-20 14:36:45", "updated": "2022-12-20 14:36:45", "summary": "Recent advances in visual representation learning allowed to build an\nabundance of powerful off-the-shelf features that are ready-to-use for numerous\ndownstream tasks. This work aims to assess how well these features preserve\ninformation about the objects, such as their spatial location, their visual\nproperties and their relative relationships. We propose to do so by evaluating\nthem in the context of visual reasoning, where multiple objects with complex\nrelationships and different attributes are at play. More specifically, we\nintroduce a protocol to evaluate visual representations for the task of Visual\nQuestion Answering. In order to decouple visual feature extraction from\nreasoning, we design a specific attention-based reasoning module which is\ntrained on the frozen visual representations to be evaluated, in a spirit\nsimilar to standard feature evaluations relying on shallow networks. We compare\ntwo types of visual representations, densely extracted local features and\nobject-centric ones, against the performances of a perfect image representation\nusing ground truth. Our main findings are two-fold. First, despite excellent\nperformances on classical proxy tasks, such representations fall short for\nsolving complex reasoning problem. Second, object-centric features better\npreserve the critical information necessary to perform visual reasoning. In our\nproposed framework we show how to methodologically approach this evaluation.", "comment": null, "links": []}
{"entry_id": "2212.09522", "title": "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", "authors": ["Difei Gao", "Luowei Zhou", "Lei Ji", "Linchao Zhu", "Yi Yang", "Mike Zheng Shou"], "published": "2022-12-19 15:05:40", "updated": "2022-12-19 15:05:40", "summary": "To build Video Question Answering (VideoQA) systems capable of assisting\nhumans in daily activities, seeking answers from long-form videos with diverse\nand complex events is a must. Existing multi-modal VQA models achieve promising\nperformance on images or short video clips, especially with the recent success\nof large-scale multi-modal pre-training. However, when extending these methods\nto long-form videos, new challenges arise. On the one hand, using a dense video\nsampling strategy is computationally prohibitive. On the other hand, methods\nrelying on sparse sampling struggle in scenarios where multi-event and\nmulti-granularity visual reasoning are required. In this work, we introduce a\nnew model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to\nbetter adapt pre-trained models for long-form VideoQA. Specifically, MIST\ndecomposes traditional dense spatial-temporal self-attention into cascaded\nsegment and region selection modules that adaptively select frames and image\nregions that are closely relevant to the question itself. Visual concepts at\ndifferent granularities are then processed efficiently through an attention\nmodule. In addition, MIST iteratively conducts selection and attention over\nmultiple layers to support reasoning over multiple events. The experimental\nresults on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA,\nshow that MIST achieves state-of-the-art performance and is superior at\ncomputation efficiency and interpretability.", "comment": null, "links": []}
{"entry_id": "2212.01639", "title": "Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests", "authors": ["Christopher Beckham", "Martin Weiss", "Florian Golemo", "Sina Honari", "Derek Nowrouzezahrai", "Christopher Pal"], "published": "2022-12-03 16:02:48", "updated": "2022-12-03 16:02:48", "summary": "Different types of mental rotation tests have been used extensively in\npsychology to understand human visual reasoning and perception. Understanding\nwhat an object or visual scene would look like from another viewpoint is a\nchallenging problem that is made even harder if it must be performed from a\nsingle image. We explore a controlled setting whereby questions are posed about\nthe properties of a scene if that scene was observed from another viewpoint. To\ndo this we have created a new version of the CLEVR dataset that we call CLEVR\nMental Rotation Tests (CLEVR-MRT). Using CLEVR-MRT we examine standard methods,\nshow how they fall short, then explore novel neural architectures that involve\ninferring volumetric representations of a scene. These volumes can be\nmanipulated via camera-conditioned transformations to answer the question. We\nexamine the efficacy of different model variants through rigorous ablations and\ndemonstrate the efficacy of volumetric representations.", "comment": "Accepted for publication to Pattern Recognition journal", "links": []}
{"entry_id": "2205.00949", "title": "Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering", "authors": ["AJ Piergiovanni", "Wei Li", "Weicheng Kuo", "Mohammad Saffar", "Fred Bertsch", "Anelia Angelova"], "published": "2022-05-02 14:53:13", "updated": "2022-11-30 21:57:52", "summary": "We present Answer-Me, a task-aware multi-task framework which unifies a\nvariety of question answering tasks, such as, visual question answering, visual\nentailment, visual reasoning. In contrast to previous works using contrastive\nor generative captioning training, we propose a novel and simple recipe to\npre-train a vision-language joint model, which is multi-task as well. The\npre-training uses only noisy image captioning data, and is formulated to use\nthe entire architecture end-to-end with both a strong language encoder and\ndecoder. Our results show state-of-the-art performance, zero-shot\ngeneralization, robustness to forgetting, and competitive single-task results\nacross a variety of question answering tasks. Our multi-task mixture training\nlearns from tasks of various question intents and thus generalizes better,\nincluding on zero-shot vision-language tasks. We conduct experiments in the\nchallenging multi-task and open-vocabulary settings and across a variety of\ndatasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, GQA. We observe that the\nproposed approach is able to generalize to unseen tasks and that more diverse\nmixtures lead to higher accuracy in both known and novel tasks.", "comment": null, "links": []}
{"entry_id": "2211.16492", "title": "Abstract Visual Reasoning with Tangram Shapes", "authors": ["Anya Ji", "Noriyuki Kojima", "Noah Rush", "Alane Suhr", "Wai Keen Vong", "Robert D. Hawkins", "Yoav Artzi"], "published": "2022-11-29 18:57:06", "updated": "2022-11-29 18:57:06", "summary": "We introduce KiloGram, a resource for studying abstract visual reasoning in\nhumans and machines. Drawing on the history of tangram puzzles as stimuli in\ncognitive science, we build a richly annotated dataset that, with >1k distinct\nstimuli, is orders of magnitude larger and more diverse than prior resources.\nIt is both visually and linguistically richer, moving beyond whole shape\ndescriptions to include segmentation maps and part labels. We use this resource\nto evaluate the abstract visual reasoning capacities of recent multi-modal\nmodels. We observe that pre-trained weights demonstrate limited abstract\nreasoning, which dramatically improves with fine-tuning. We also observe that\nexplicitly describing parts aids abstract reasoning for both humans and models,\nespecially when jointly encoding the linguistic and visual inputs. KiloGram is\navailable at https://lil.nlp.cornell.edu/kilogram .", "comment": "EMNLP 2022 long paper", "links": []}
{"entry_id": "2211.15402", "title": "Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation", "authors": ["Jiangyong Huang", "William Yicheng Zhu", "Baoxiong Jia", "Zan Wang", "Xiaojian Ma", "Qing Li", "Siyuan Huang"], "published": "2022-11-28 15:06:07", "updated": "2022-11-28 15:06:07", "summary": "Current computer vision models, unlike the human visual system, cannot yet\nachieve general-purpose visual understanding. Existing efforts to create a\ngeneral vision model are limited in the scope of assessed tasks and offer no\noverarching framework to perform them holistically. We present a new\ncomprehensive benchmark, General-purpose Visual Understanding Evaluation\n(G-VUE), covering the full spectrum of visual cognitive abilities with four\nfunctional domains $\\unicode{x2014}$ Perceive, Ground, Reason, and Act. The\nfour domains are embodied in 11 carefully curated tasks, from 3D reconstruction\nto visual reasoning and manipulation. Along with the benchmark, we provide a\ngeneral encoder-decoder framework to allow for the evaluation of arbitrary\nvisual representation on all 11 tasks. We evaluate various pre-trained visual\nrepresentations with our framework and observe that (1) Transformer-based\nvisual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual\nrepresentations from vision-language pre-training are superior to those with\nvision-only pre-training across visual tasks. With G-VUE, we provide a holistic\nevaluation standard to motivate research toward building general-purpose visual\nsystems via obtaining more general-purpose visual representations.", "comment": null, "links": []}
{"entry_id": "2205.11169", "title": "PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models", "authors": ["Yuan Yao", "Qianyu Chen", "Ao Zhang", "Wei Ji", "Zhiyuan Liu", "Tat-Seng Chua", "Maosong Sun"], "published": "2022-05-23 10:17:53", "updated": "2022-11-22 06:59:30", "summary": "Vision-language pre-training (VLP) has shown impressive performance on a wide\nrange of cross-modal tasks, where VLP models without reliance on object\ndetectors are becoming the mainstream due to their superior computation\nefficiency and competitive performance. However, the removal of object\ndetectors also deprives the capability of VLP models in explicit object\nmodeling, which is essential to various position-sensitive vision-language (VL)\ntasks, such as referring expression comprehension and visual commonsense\nreasoning. To address the challenge, we introduce PEVL that enhances the\npre-training and prompt tuning of VLP models with explicit object position\nmodeling. Specifically, PEVL reformulates discretized object positions and\nlanguage in a unified language modeling framework, which facilitates explicit\nVL alignment during pre-training, and also enables flexible prompt tuning for\nvarious downstream tasks. We show that PEVL enables state-of-the-art\nperformance of detector-free VLP models on position-sensitive tasks such as\nreferring expression comprehension and phrase grounding, and also improves the\nperformance on position-insensitive tasks with grounded inputs. We make the\ndata and code for this paper publicly available at\nhttps://github.com/thunlp/PEVL.", "comment": "Accepted by EMNLP 2022", "links": []}
{"entry_id": "2211.11153", "title": "Unifying Vision-Language Representation Space with Single-tower Transformer", "authors": ["Jiho Jang", "Chaerin Kong", "Donghyeon Jeon", "Seonhoon Kim", "Nojun Kwak"], "published": "2022-11-21 02:34:21", "updated": "2022-11-21 02:34:21", "summary": "Contrastive learning is a form of distance learning that aims to learn\ninvariant features from two related representations. In this paper, we explore\nthe bold hypothesis that an image and its caption can be simply regarded as two\ndifferent views of the underlying mutual information, and train a model to\nlearn a unified vision-language representation space that encodes both\nmodalities at once in a modality-agnostic manner. We first identify\ndifficulties in learning a generic one-tower model for vision-language\npretraining (VLP), and propose OneR as a simple yet effective framework for our\ngoal. We discover intriguing properties that distinguish OneR from the previous\nworks that learn modality-specific representation spaces such as zero-shot\nobject localization, text-guided visual reasoning and multi-modal retrieval,\nand present analyses to provide insights into this new form of multi-modal\nrepresentation learning. Thorough evaluations demonstrate the potential of a\nunified modality-agnostic VLP framework.", "comment": "AAAI 2023, 11 pages", "links": []}
{"entry_id": "2211.11559", "title": "Visual Programming: Compositional visual reasoning without training", "authors": ["Tanmay Gupta", "Aniruddha Kembhavi"], "published": "2022-11-18 18:50:09", "updated": "2022-11-18 18:50:09", "summary": "We present VISPROG, a neuro-symbolic approach to solving complex and\ncompositional visual tasks given natural language instructions. VISPROG avoids\nthe need for any task-specific training. Instead, it uses the in-context\nlearning ability of large language models to generate python-like modular\nprograms, which are then executed to get both the solution and a comprehensive\nand interpretable rationale. Each line of the generated program may invoke one\nof several off-the-shelf computer vision models, image processing routines, or\npython functions to produce intermediate outputs that may be consumed by\nsubsequent parts of the program. We demonstrate the flexibility of VISPROG on 4\ndiverse tasks - compositional visual question answering, zero-shot reasoning on\nimage pairs, factual knowledge object tagging, and language-guided image\nediting. We believe neuro-symbolic approaches like VISPROG are an exciting\navenue to easily and effectively expand the scope of AI systems to serve the\nlong tail of complex tasks that people may wish to perform.", "comment": null, "links": []}
{"entry_id": "2210.00220", "title": "A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering", "authors": ["Xiaofei Huang", "Hongfang Gong"], "published": "2022-10-01 08:32:40", "updated": "2022-11-12 01:21:33", "summary": "Research in medical visual question answering (MVQA) can contribute to the\ndevelopment of computeraided diagnosis. MVQA is a task that aims to predict\naccurate and convincing answers based on given medical images and associated\nnatural language questions. This task requires extracting medical\nknowledge-rich feature content and making fine-grained understandings of them.\nTherefore, constructing an effective feature extraction and understanding\nscheme are keys to modeling. Existing MVQA question extraction schemes mainly\nfocus on word information, ignoring medical information in the text. Meanwhile,\nsome visual and textual feature understanding schemes cannot effectively\ncapture the correlation between regions and keywords for reasonable visual\nreasoning. In this study, a dual-attention learning network with word and\nsentence embedding (WSDAN) is proposed. We design a module, transformer with\nsentence embedding (TSE), to extract a double embedding representation of\nquestions containing keywords and medical information. A dualattention learning\n(DAL) module consisting of self-attention and guided attention is proposed to\nmodel intensive intramodal and intermodal interactions. With multiple DAL\nmodules (DALs), learning visual and textual co-attention can increase the\ngranularity of understanding and improve visual reasoning. Experimental results\non the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate\nthat our proposed method outperforms previous state-of-the-art methods.\nAccording to the ablation studies and Grad-CAM maps, WSDAN can extract rich\ntextual information and has strong visual reasoning ability.", "comment": null, "links": []}
{"entry_id": "2208.00361", "title": "One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning", "authors": ["Zhipeng Zhang", "Zhimin Wei", "Zhongzhen Huang", "Rui Niu", "Peng Wang"], "published": "2022-07-31 04:51:27", "updated": "2022-10-27 11:30:23", "summary": "Referring Expression Comprehension (REC) is one of the most important tasks\nin visual reasoning that requires a model to detect the target object referred\nby a natural language expression. Among the proposed pipelines, the one-stage\nReferring Expression Comprehension (OSREC) has become the dominant trend since\nit merges the region proposal and selection stages. Many state-of-the-art OSREC\nmodels adopt a multi-hop reasoning strategy because a sequence of objects is\nfrequently mentioned in a single expression which needs multi-hop reasoning to\nanalyze the semantic relation. However, one unsolved issue of these models is\nthat the number of reasoning steps needs to be pre-defined and fixed before\ninference, ignoring the varying complexity of expressions. In this paper, we\npropose a Dynamic Multi-step Reasoning Network, which allows the reasoning\nsteps to be dynamically adjusted based on the reasoning state and expression\ncomplexity. Specifically, we adopt a Transformer module to memorize & process\nthe reasoning state and a Reinforcement Learning strategy to dynamically infer\nthe reasoning steps. The work achieves the state-of-the-art performance or\nsignificant improvements on several REC datasets, ranging from RefCOCO (+, g)\nwith short expressions, to Ref-Reasoning, a dataset with long and complex\ncompositional expressions.", "comment": "27 pages, 6 figures", "links": []}
{"entry_id": "2206.11212", "title": "VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives", "authors": ["Zhuofan Ying", "Peter Hase", "Mohit Bansal"], "published": "2022-06-22 17:02:01", "updated": "2022-10-25 19:25:54", "summary": "Many past works aim to improve visual reasoning in models by supervising\nfeature importance (estimated by model explanation techniques) with human\nannotations such as highlights of important image regions. However, recent work\nhas shown that performance gains from feature importance (FI) supervision for\nVisual Question Answering (VQA) tasks persist even with random supervision,\nsuggesting that these methods do not meaningfully align model FI with human FI.\nIn this paper, we show that model FI supervision can meaningfully improve VQA\nmodel accuracy as well as performance on several Right-for-the-Right-Reason\n(RRR) metrics by optimizing for four key model objectives: (1) accurate\npredictions given limited but sufficient information (Sufficiency); (2)\nmax-entropy predictions given no important information (Uncertainty); (3)\ninvariance of predictions to changes in unimportant features (Invariance); and\n(4) alignment between model FI explanations and human FI explanations\n(Plausibility). Our best performing method, Visual Feature Importance\nSupervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in\nterms of both in-distribution and out-of-distribution accuracy. While past work\nsuggests that the mechanism for improved accuracy is through improved\nexplanation plausibility, we show that this relationship depends crucially on\nexplanation faithfulness (whether explanations truly represent the model's\ninternal reasoning). Predictions are more accurate when explanations are\nplausible and faithful, and not when they are plausible but not faithful.\nLastly, we show that, surprisingly, RRR metrics are not predictive of\nout-of-distribution model accuracy when controlling for a model's\nin-distribution accuracy, which calls into question the value of these metrics\nfor evaluating model reasoning. All supporting code is available at\nhttps://github.com/zfying/visfis", "comment": "NeurIPS 2022 (first two authors contributed equally)", "links": []}
{"entry_id": "2205.11686", "title": "On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization", "authors": ["Shruti Palaskar", "Akshita Bhagia", "Yonatan Bisk", "Florian Metze", "Alan W Black", "Ana Marasović"], "published": "2022-05-24 00:52:40", "updated": "2022-10-22 19:54:28", "summary": "Combining the visual modality with pretrained language models has been\nsurprisingly effective for simple descriptive tasks such as image captioning.\nMore general text generation however remains elusive. We take a step back and\nask: How do these models work for more complex generative tasks, i.e.\nconditioning on both text and images? Are multimodal models simply visually\nadapted language models, or do they combine they reason jointly over\nmodalities?\n  We investigate these questions in the context of self-rationalization\n(jointly generating task labels/answers and free-text explanations) of three\ntasks: (i) visual question answering in VQA-X, (ii) visual commonsense\nreasoning in VCR, and (iii) visual-textual entailment in e-SNLI-VE. We show\nthat recent unimodal advances, CLIP image representations and scaling of\nlanguage models, do not consistently improve self-rationalization in multimodal\ntasks. We find that no single model type works universally best across tasks,\ndatasets, and finetuning data sizes. Our findings motivate the need for novel\ngeneral backbones approach that move text generation from images and text\nbeyond image captioning.", "comment": "v2: EMNLP Findings 2022 accepted paper camera-ready version. 9 pages\n  main, 2 pages appendix", "links": []}
{"entry_id": "2112.08723", "title": "Distilled Dual-Encoder Model for Vision-Language Understanding", "authors": ["Zekun Wang", "Wenhui Wang", "Haichao Zhu", "Ming Liu", "Bing Qin", "Furu Wei"], "published": "2021-12-16 09:21:18", "updated": "2022-10-17 16:27:09", "summary": "We propose a cross-modal attention distillation framework to train a\ndual-encoder model for vision-language understanding tasks, such as visual\nreasoning and visual question answering. Dual-encoder models have a faster\ninference speed than fusion-encoder models and enable the pre-computation of\nimages and text during inference. However, the shallow interaction module used\nin dual-encoder models is insufficient to handle complex vision-language\nunderstanding tasks. In order to learn deep interactions of images and text, we\nintroduce cross-modal attention distillation, which uses the image-to-text and\ntext-to-image attention distributions of a fusion-encoder model to guide the\ntraining of our dual-encoder model. In addition, we show that applying the\ncross-modal attention distillation for both pre-training and fine-tuning stages\nachieves further improvements. Experimental results demonstrate that the\ndistilled dual-encoder model achieves competitive performance for visual\nreasoning, visual entailment and visual question answering tasks while enjoying\na much faster inference speed than fusion-encoder models. Our code and models\nwill be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.", "comment": "EMNLP 2022", "links": []}
{"entry_id": "2208.13628", "title": "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", "authors": ["Mustafa Shukor", "Guillaume Couairon", "Matthieu Cord"], "published": "2022-08-29 14:24:08", "updated": "2022-10-05 11:35:46", "summary": "Vision and Language Pretraining has become the prevalent approach for\ntackling multimodal downstream tasks. The current trend is to move towards ever\nlarger models and pretraining datasets. This computational headlong rush does\nnot seem reasonable in the long term to move toward sustainable solutions, and\nde facto excludes academic laboratories with limited resources. In this work,\nwe propose a new framework, dubbed ViCHA, that efficiently exploits the input\ndata to boost the learning by: (a) a new hierarchical cross-modal alignment\nloss, (b) new self-supervised scheme based on masked image modeling, (c)\nleveraging image-level annotations, called Visual Concepts, obtained with\nexisting foundation models such as CLIP to boost the performance of the image\nencoder. Although pretrained on four times less data, our ViCHA strategy\noutperforms other approaches on several downstream tasks such as Image-Text\nRetrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding. The\ncode will be made publicly available here: https://github.com/mshukor/ViCHA", "comment": "BMVC 2022", "links": []}
{"entry_id": "2209.15087", "title": "Zero-shot visual reasoning through probabilistic analogical mapping", "authors": ["Taylor W. Webb", "Shuhao Fu", "Trevor Bihl", "Keith J. Holyoak", "Hongjing Lu"], "published": "2022-09-29 20:29:26", "updated": "2022-09-29 20:29:26", "summary": "Human reasoning is grounded in an ability to identify highly abstract\ncommonalities governing superficially dissimilar visual inputs. Recent efforts\nto develop algorithms with this capacity have largely focused on approaches\nthat require extensive direct training on visual reasoning tasks, and yield\nlimited generalization to problems with novel content. In contrast, a long\ntradition of research in cognitive science has focused on elucidating the\ncomputational principles underlying human analogical reasoning; however, this\nwork has generally relied on manually constructed representations. Here we\npresent visiPAM (visual Probabilistic Analogical Mapping), a model of visual\nreasoning that synthesizes these two approaches. VisiPAM employs learned\nrepresentations derived directly from naturalistic visual inputs, coupled with\na similarity-based mapping operation derived from cognitive theories of human\nreasoning. We show that without any direct training, visiPAM outperforms a\nstate-of-the-art deep learning model on an analogical mapping task. In\naddition, visiPAM closely matches the pattern of human performance on a novel\ntask involving mapping of 3D objects across disparate categories.", "comment": null, "links": []}
{"entry_id": "2209.11990", "title": "Deep Neural Networks for Visual Reasoning", "authors": ["Thao Minh Le"], "published": "2022-09-24 12:11:00", "updated": "2022-09-24 12:11:00", "summary": "Visual perception and language understanding are - fundamental components of\nhuman intelligence, enabling them to understand and reason about objects and\ntheir interactions. It is crucial for machines to have this capacity to reason\nusing these two modalities to invent new robot-human collaborative systems.\nRecent advances in deep learning have built separate sophisticated\nrepresentations of both visual scenes and languages. However, understanding the\nassociations between the two modalities in a shared context for multimodal\nreasoning remains a challenge. Focusing on language and vision modalities, this\nthesis advances the understanding of how to exploit and use pivotal aspects of\nvision-and-language tasks with neural networks to support reasoning. We derive\nthese understandings from a series of works, making a two-fold contribution:\n(i) effective mechanisms for content selection and construction of temporal\nrelations from dynamic visual scenes in response to a linguistic query and\npreparing adequate knowledge for the reasoning process (ii) new frameworks to\nperform reasoning with neural networks by exploiting visual-linguistic\nassociations, deduced either directly from data or guided by external priors.", "comment": "PhD thesis", "links": []}
{"entry_id": "2209.07000", "title": "VIPHY: Probing \"Visible\" Physical Commonsense Knowledge", "authors": ["Shikhar Singh", "Ehsan Qasemi", "Muhao Chen"], "published": "2022-09-15 02:06:25", "updated": "2022-09-15 02:06:25", "summary": "In recent years, vision-language models (VLMs) have shown remarkable\nperformance on visual reasoning tasks (e.g. attributes, location). While such\ntasks measure the requisite knowledge to ground and reason over a given visual\ninstance, they do not, however, measure the ability of VLMs to retain and\ngeneralize such knowledge. In this work, we evaluate their ability to acquire\n\"visible\" physical knowledge -- the information that is easily accessible from\nimages of static scenes, particularly across the dimensions of object color,\nsize and space. We build an automatic pipeline to derive a comprehensive\nknowledge resource for calibrating and probing these models. Our results\nindicate a severe gap between model and human performance across all three\ntasks. Furthermore, our caption pretrained baseline (CapBERT) significantly\noutperforms VLMs on both size and spatial tasks -- highlighting that despite\nsufficient access to ground language with visual modality, they struggle to\nretain such knowledge. The dataset and code are available at\nhttps://github.com/Axe--/ViPhy .", "comment": "In Progress (under review)", "links": []}
{"entry_id": "2206.01127", "title": "VL-BEiT: Generative Vision-Language Pretraining", "authors": ["Hangbo Bao", "Wenhui Wang", "Li Dong", "Furu Wei"], "published": "2022-06-02 16:14:19", "updated": "2022-09-03 14:18:55", "summary": "We introduce a vision-language foundation model called VL-BEiT, which is a\nbidirectional multimodal Transformer learned by generative pretraining. Our\nminimalist solution conducts masked prediction on both monomodal and multimodal\ndata with a shared Transformer. Specifically, we perform masked vision-language\nmodeling on image-text pairs, masked language modeling on texts, and masked\nimage modeling on images. VL-BEiT is learned from scratch with one unified\npretraining task, one shared backbone, and one-stage training. Our method is\nconceptually simple and empirically effective. Experimental results show that\nVL-BEiT obtains strong results on various vision-language benchmarks, such as\nvisual question answering, visual reasoning, and image-text retrieval.\nMoreover, our method learns transferable visual features, achieving competitive\nperformance on image classification, and semantic segmentation.", "comment": null, "links": []}
{"entry_id": "2209.01319", "title": "Kinova Gemini: Interactive Robot Grasping with Visual Reasoning and Conversational AI", "authors": ["Hanxiao Chen", "Jiankun Wang", "Max Q. -H. Meng"], "published": "2022-09-03 03:52:07", "updated": "2022-09-03 03:52:07", "summary": "To facilitate recent advances in robotics and AI for delicate collaboration\nbetween humans and machines, we propose the Kinova Gemini, an original robotic\nsystem that integrates conversational AI dialogue and visual reasoning to make\nthe Kinova Gen3 lite robot help people retrieve objects or complete\nperception-based pick-and-place tasks. When a person walks up to Kinova Gen3\nlite, our Kinova Gemini is able to fulfill the user's requests in three\ndifferent applications: (1) It can start a natural dialogue with people to\ninteract and assist humans to retrieve objects and hand them to the user one by\none. (2) It detects diverse objects with YOLO v3 and recognize color attributes\nof the item to ask people if they want to grasp it via the dialogue or enable\nthe user to choose which specific one is required. (3) It applies YOLO v3 to\nrecognize multiple objects and let you choose two items for perception-based\npick-and-place tasks such as \"Put the banana into the bowl\" with visual\nreasoning and conversational interaction.", "comment": null, "links": []}
{"entry_id": "2208.10442", "title": "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", "authors": ["Wenhui Wang", "Hangbo Bao", "Li Dong", "Johan Bjorck", "Zhiliang Peng", "Qiang Liu", "Kriti Aggarwal", "Owais Khan Mohammed", "Saksham Singhal", "Subhojit Som", "Furu Wei"], "published": "2022-08-22 16:55:04", "updated": "2022-08-31 02:26:45", "summary": "A big convergence of language, vision, and multimodal pretraining is\nemerging. In this work, we introduce a general-purpose multimodal foundation\nmodel BEiT-3, which achieves state-of-the-art transfer performance on both\nvision and vision-language tasks. Specifically, we advance the big convergence\nfrom three aspects: backbone architecture, pretraining task, and model scaling\nup. We introduce Multiway Transformers for general-purpose modeling, where the\nmodular architecture enables both deep fusion and modality-specific encoding.\nBased on the shared backbone, we perform masked \"language\" modeling on images\n(Imglish), texts (English), and image-text pairs (\"parallel sentences\") in a\nunified manner. Experimental results show that BEiT-3 obtains state-of-the-art\nperformance on object detection (COCO), semantic segmentation (ADE20K), image\nclassification (ImageNet), visual reasoning (NLVR2), visual question answering\n(VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).", "comment": "18 pages", "links": []}
{"entry_id": "2203.17247", "title": "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", "authors": ["Estelle Aflalo", "Meng Du", "Shao-Yen Tseng", "Yongfei Liu", "Chenfei Wu", "Nan Duan", "Vasudev Lal"], "published": "2022-03-30 05:25:35", "updated": "2022-08-22 22:25:59", "summary": "Breakthroughs in transformer-based models have revolutionized not only the\nNLP field, but also vision and multimodal systems. However, although\nvisualization and interpretability tools have become available for NLP models,\ninternal mechanisms of vision and multimodal transformers remain largely\nopaque. With the success of these transformers, it is increasingly critical to\nunderstand their inner workings, as unraveling these black-boxes will lead to\nmore capable and trustworthy models. To contribute to this quest, we propose\nVL-InterpreT, which provides novel interactive visualizations for interpreting\nthe attentions and hidden representations in multimodal transformers.\nVL-InterpreT is a task agnostic and integrated tool that (1) tracks a variety\nof statistics in attention heads throughout all layers for both vision and\nlanguage components, (2) visualizes cross-modal and intra-modal attentions\nthrough easily readable heatmaps, and (3) plots the hidden representations of\nvision and language tokens as they pass through the transformer layers. In this\npaper, we demonstrate the functionalities of VL-InterpreT through the analysis\nof KD-VLP, an end-to-end pretraining vision-language multimodal\ntransformer-based model, in the tasks of Visual Commonsense Reasoning (VCR) and\nWebQA, two visual question answering benchmarks. Furthermore, we also present a\nfew interesting findings about multimodal transformer behaviors that were\nlearned through our tool.", "comment": "Best Demo Award at CVPR 2022", "links": []}
{"entry_id": "2202.04800", "title": "The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning", "authors": ["Jack Hessel", "Jena D. Hwang", "Jae Sung Park", "Rowan Zellers", "Chandra Bhagavatula", "Anna Rohrbach", "Kate Saenko", "Yejin Choi"], "published": "2022-02-10 02:26:45", "updated": "2022-07-25 17:26:06", "summary": "Humans have remarkable capacity to reason abductively and hypothesize about\nwhat lies beyond the literal content of an image. By identifying concrete\nvisual clues scattered throughout a scene, we almost can't help but draw\nprobable inferences beyond the literal scene based on our everyday experience\nand knowledge about the world. For example, if we see a \"20 mph\" sign alongside\na road, we might assume the street sits in a residential area (rather than on a\nhighway), even if no houses are pictured. Can machines perform similar visual\nreasoning?\n  We present Sherlock, an annotated corpus of 103K images for testing machine\ncapacity for abductive reasoning beyond literal image contents. We adopt a\nfree-viewing paradigm: participants first observe and identify salient clues\nwithin images (e.g., objects, actions) and then provide a plausible inference\nabout the scene, given the clue. In total, we collect 363K (clue, inference)\npairs, which form a first-of-its-kind abductive visual reasoning dataset. Using\nour corpus, we test three complementary axes of abductive reasoning. We\nevaluate the capacity of models to: i) retrieve relevant inferences from a\nlarge candidate corpus; ii) localize evidence for inferences via bounding\nboxes, and iii) compare plausible inferences to match human judgments on a\nnewly-collected diagnostic corpus of 19K Likert-scale judgments. While we find\nthat fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong\nbaselines, significant headroom exists between model performance and human\nagreement. Data, models, and leaderboard available at\nhttp://visualabduction.com/", "comment": "code, data, models at http://visualabduction.com/", "links": []}
{"entry_id": "2207.06403", "title": "3D Concept Grounding on Neural Fields", "authors": ["Yining Hong", "Yilun Du", "Chunru Lin", "Joshua B. Tenenbaum", "Chuang Gan"], "published": "2022-07-13 17:59:33", "updated": "2022-07-13 17:59:33", "summary": "In this paper, we address the challenging problem of 3D concept grounding\n(i.e. segmenting and learning visual concepts) by looking at RGBD images and\nreasoning about paired questions and answers. Existing visual reasoning\napproaches typically utilize supervised methods to extract 2D segmentation\nmasks on which concepts are grounded. In contrast, humans are capable of\ngrounding concepts on the underlying 3D representation of images. However,\ntraditionally inferred 3D representations (e.g., point clouds, voxelgrids, and\nmeshes) cannot capture continuous 3D features flexibly, thus making it\nchallenging to ground concepts to 3D regions based on the language description\nof the object being referred to. To address both issues, we propose to leverage\nthe continuous, differentiable nature of neural fields to segment and learn\nconcepts. Specifically, each 3D coordinate in a scene is represented as a\nhigh-dimensional descriptor. Concept grounding can then be performed by\ncomputing the similarity between the descriptor vector of a 3D coordinate and\nthe vector embedding of a language concept, which enables segmentations and\nconcept learning to be jointly learned on neural fields in a differentiable\nfashion. As a result, both 3D semantic and instance segmentations can emerge\ndirectly from question answering supervision using a set of defined neural\noperators on top of neural fields (e.g., filtering and counting). Experimental\nresults show that our proposed framework outperforms\nunsupervised/language-mediated segmentation models on semantic and instance\nsegmentation tasks, as well as outperforms existing models on the challenging\n3D aware visual reasoning tasks. Furthermore, our framework can generalize well\nto unseen shape categories and real scans.", "comment": "Project page: http://3d-cg.csail.mit.edu", "links": []}
{"entry_id": "2109.13156", "title": "DAReN: A Collaborative Approach Towards Reasoning And Disentangling", "authors": ["Pritish Sahu", "Kalliopi Basioti", "Vladimir Pavlovic"], "published": "2021-09-27 16:10:30", "updated": "2022-06-30 01:14:40", "summary": "Computational learning approaches to solving visual reasoning tests, such as\nRaven's Progressive Matrices (RPM), critically depend on the ability to\nidentify the visual concepts used in the test (i.e., the representation) as\nwell as the latent rules based on those concepts (i.e., the reasoning).\nHowever, learning of representation and reasoning is a challenging and\nill-posed task, often approached in a stage-wise manner (first representation,\nthen reasoning). In this work, we propose an end-to-end joint\nrepresentation-reasoning learning framework, which leverages a weak form of\ninductive bias to improve both tasks together. Specifically, we introduce a\ngeneral generative graphical model for RPMs, GM-RPM, and apply it to solve the\nreasoning test. We accomplish this using a novel learning framework\nDisentangling based Abstract Reasoning Network (DAReN) based on the principles\nof GM-RPM. We perform an empirical evaluation of DAReN over several benchmark\ndatasets. DAReN shows consistent improvement over state-of-the-art (SOTA)\nmodels on both the reasoning and the disentanglement tasks. This demonstrates\nthe strong correlation between disentangled latent representation and the\nability to solve abstract visual reasoning tasks.", "comment": null, "links": []}
{"entry_id": "2206.12533", "title": "From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering", "authors": ["Zihao Zhu"], "published": "2022-06-25 02:20:02", "updated": "2022-06-25 02:20:02", "summary": "In order to achieve a general visual question answering (VQA) system, it is\nessential to learn to answer deeper questions that require compositional\nreasoning on the image and external knowledge. Meanwhile, the reasoning process\nshould be explicit and explainable to understand the working mechanism of the\nmodel. It is effortless for human but challenging for machines. In this paper,\nwe propose a Hierarchical Graph Neural Module Network (HGNMN) that reasons over\nmulti-layer graphs with neural modules to address the above issues.\nSpecifically, we first encode the image by multi-layer graphs from the visual,\nsemantic and commonsense views since the clues that support the answer may\nexist in different modalities. Our model consists of several well-designed\nneural modules that perform specific functions over graphs, which can be used\nto conduct multi-step reasoning within and between different graphs. Compared\nto existing modular networks, we extend visual reasoning from one graph to more\ngraphs. We can explicitly trace the reasoning process according to module\nweights and graph attentions. Experiments show that our model not only achieves\nstate-of-the-art performance on the CRIC dataset but also obtains explicit and\nexplainable reasoning procedures.", "comment": null, "links": []}
{"entry_id": "2206.09265", "title": "SAViR-T: Spatially Attentive Visual Reasoning with Transformers", "authors": ["Pritish Sahu", "Kalliopi Basioti", "Vladimir Pavlovic"], "published": "2022-06-18 18:26:20", "updated": "2022-06-22 02:00:11", "summary": "We present a novel computational model, \"SAViR-T\", for the family of visual\nreasoning problems embodied in the Raven's Progressive Matrices (RPM). Our\nmodel considers explicit spatial semantics of visual elements within each image\nin the puzzle, encoded as spatio-visual tokens, and learns the intra-image as\nwell as the inter-image token dependencies, highly relevant for the visual\nreasoning task. Token-wise relationship, modeled through a transformer-based\nSAViR-T architecture, extract group (row or column) driven representations by\nleveraging the group-rule coherence and use this as the inductive bias to\nextract the underlying rule representations in the top two row (or column) per\ntoken in the RPM. We use this relation representations to locate the correct\nchoice image that completes the last row or column for the RPM. Extensive\nexperiments across both synthetic RPM benchmarks, including RAVEN, I-RAVEN,\nRAVEN-FAIR, and PGM, and the natural image-based \"V-PROM\" demonstrate that\nSAViR-T sets a new state-of-the-art for visual reasoning, exceeding prior\nmodels' performance by a considerable margin.", "comment": null, "links": []}
{"entry_id": "2204.11167", "title": "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", "authors": ["Xiaojian Ma", "Weili Nie", "Zhiding Yu", "Huaizu Jiang", "Chaowei Xiao", "Yuke Zhu", "Song-Chun Zhu", "Anima Anandkumar"], "published": "2022-04-24 02:46:43", "updated": "2022-06-11 13:42:27", "summary": "Reasoning about visual relationships is central to how humans interpret the\nvisual world. This task remains challenging for current deep learning\nalgorithms since it requires addressing three key technical problems jointly:\n1) identifying object entities and their properties, 2) inferring semantic\nrelations between pairs of entities, and 3) generalizing to novel\nobject-relation combinations, i.e., systematic generalization. In this work, we\nuse vision transformers (ViTs) as our base model for visual reasoning and make\nbetter use of concepts defined as object entities and their relations to\nimprove the reasoning ability of ViTs. Specifically, we introduce a novel\nconcept-feature dictionary to allow flexible image feature retrieval at\ntraining time with concept keys. This dictionary enables two new concept-guided\nauxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a\nlocal task for facilitating semantic object-centric correspondence learning. To\nexamine the systematic generalization of visual reasoning models, we introduce\nsystematic splits for the standard HICO and GQA benchmarks. We show the\nresulting model, Concept-guided Vision Transformer (or RelViT for short)\nsignificantly outperforms prior approaches on HICO and GQA by 16% and 13% in\nthe original split, and by 43% and 18% in the systematic split. Our ablation\nanalyses also reveal our model's compatibility with multiple ViT variants and\nrobustness to hyper-parameters.", "comment": "ICLR 2022; Code: https://github.com/NVlabs/RelViT", "links": []}
{"entry_id": "2206.05379", "title": "A Benchmark for Compositional Visual Reasoning", "authors": ["Aimen Zerroug", "Mohit Vaishnav", "Julien Colin", "Sebastian Musslick", "Thomas Serre"], "published": "2022-06-11 00:04:49", "updated": "2022-06-11 00:04:49", "summary": "A fundamental component of human vision is our ability to parse complex\nvisual scenes and judge the relations between their constituent objects. AI\nbenchmarks for visual reasoning have driven rapid progress in recent years with\nstate-of-the-art systems now reaching human accuracy on some of these\nbenchmarks. Yet, a major gap remains in terms of the sample efficiency with\nwhich humans and AI systems learn new visual reasoning tasks. Humans'\nremarkable efficiency at learning has been at least partially attributed to\ntheir ability to harness compositionality -- such that they can efficiently\ntake advantage of previously gained knowledge when learning new tasks. Here, we\nintroduce a novel visual reasoning benchmark, Compositional Visual Relations\n(CVR), to drive progress towards the development of more data-efficient\nlearning algorithms. We take inspiration from fluidic intelligence and\nnon-verbal reasoning tests and describe a novel method for creating\ncompositions of abstract rules and associated image datasets at scale. Our\nproposed benchmark includes measures of sample efficiency, generalization and\ntransfer across task rules, as well as the ability to leverage\ncompositionality. We systematically evaluate modern neural architectures and\nfind that, surprisingly, convolutional architectures surpass transformer-based\narchitectures across all performance measures in most data regimes. However,\nall computational models are a lot less data efficient compared to humans even\nafter learning informative visual representations using self-supervision.\nOverall, we hope that our challenge will spur interest in the development of\nneural architectures that can learn to harness compositionality toward more\nefficient learning.", "comment": null, "links": []}
{"entry_id": "2002.06838", "title": "Stratified Rule-Aware Network for Abstract Visual Reasoning", "authors": ["Sheng Hu", "Yuqing Ma", "Xianglong Liu", "Yanlu Wei", "Shihao Bai"], "published": "2020-02-17 08:44:05", "updated": "2022-06-07 11:49:44", "summary": "Abstract reasoning refers to the ability to analyze information, discover\nrules at an intangible level, and solve problems in innovative ways. Raven's\nProgressive Matrices (RPM) test is typically used to examine the capability of\nabstract reasoning. The subject is asked to identify the correct choice from\nthe answer set to fill the missing panel at the bottom right of RPM (e.g., a\n3$\\times$3 matrix), following the underlying rules inside the matrix. Recent\nstudies, taking advantage of Convolutional Neural Networks (CNNs), have\nachieved encouraging progress to accomplish the RPM test. However, they partly\nignore necessary inductive biases of RPM solver, such as order sensitivity\nwithin each row/column and incremental rule induction. To address this problem,\nin this paper we propose a Stratified Rule-Aware Network (SRAN) to generate the\nrule embeddings for two input sequences. Our SRAN learns multiple granularity\nrule embeddings at different levels, and incrementally integrates the\nstratified embedding flows through a gated fusion module. With the help of\nembeddings, a rule similarity metric is applied to guarantee that SRAN can not\nonly be trained using a tuplet loss but also infer the best answer efficiently.\nWe further point out the severe defects existing in the popular RAVEN dataset\nfor RPM test, which prevent from the fair evaluation of the abstract reasoning\nability. To fix the defects, we propose an answer set generation algorithm\ncalled Attribute Bisection Tree (ABT), forming an improved dataset named\nImpartial-RAVEN (I-RAVEN for short). Extensive experiments are conducted on\nboth PGM and I-RAVEN datasets, showing that our SRAN outperforms the\nstate-of-the-art models by a considerable margin.", "comment": "AAAI 2021 paper. Code: https://github.com/husheng12345/SRAN", "links": []}
{"entry_id": "2103.15022", "title": "'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks", "authors": ["Man Luo", "Shailaja Keyur Sampat", "Riley Tallman", "Yankai Zeng", "Manuha Vancha", "Akarshan Sajja", "Chitta Baral"], "published": "2021-03-28 00:07:08", "updated": "2022-05-31 18:05:49", "summary": "GQA~\\citep{hudson2019gqa} is a dataset for real-world visual reasoning and\ncompositional question answering. We found that many answers predicted by the\nbest vision-language models on the GQA dataset do not match the ground-truth\nanswer but still are semantically meaningful and correct in the given context.\nIn fact, this is the case with most existing visual question answering (VQA)\ndatasets where they assume only one ground-truth answer for each question. We\npropose Alternative Answer Sets (AAS) of ground-truth answers to address this\nlimitation, which is created automatically using off-the-shelf NLP tools. We\nintroduce a semantic metric based on AAS and modify top VQA solvers to support\nmultiple plausible answers for a question. We implement this approach on the\nGQA dataset and show the performance improvements. Code and data are available\nin this link \\url{https://github.com/luomancs/alternative_answer_set.git}.", "comment": "accepted to EACL 2021", "links": []}
{"entry_id": "2205.14288", "title": "Few-shot Subgoal Planning with Language Models", "authors": ["Lajanugen Logeswaran", "Yao Fu", "Moontae Lee", "Honglak Lee"], "published": "2022-05-28 01:03:30", "updated": "2022-05-28 01:03:30", "summary": "Pre-trained large language models have shown successful progress in many\nlanguage understanding benchmarks. This work explores the capability of these\nmodels to predict actionable plans in real-world environments. Given a text\ninstruction, we show that language priors encoded in pre-trained language\nmodels allow us to infer fine-grained subgoal sequences. In contrast to recent\nmethods which make strong assumptions about subgoal supervision, our\nexperiments show that language models can infer detailed subgoal sequences from\nfew training sequences without any fine-tuning. We further propose a simple\nstrategy to re-rank language model predictions based on interaction and\nfeedback from the environment. Combined with pre-trained navigation and visual\nreasoning components, our approach demonstrates competitive performance on\nsubgoal prediction and task completion in the ALFRED benchmark compared to\nprior methods that assume more subgoal supervision.", "comment": "NAACL 2022", "links": []}
{"entry_id": "2205.12616", "title": "Guiding Visual Question Answering with Attention Priors", "authors": ["Thao Minh Le", "Vuong Le", "Sunil Gupta", "Svetha Venkatesh", "Truyen Tran"], "published": "2022-05-25 09:53:47", "updated": "2022-05-25 09:53:47", "summary": "The current success of modern visual reasoning systems is arguably attributed\nto cross-modality attention mechanisms. However, in deliberative reasoning such\nas in VQA, attention is unconstrained at each step, and thus may serve as a\nstatistical pooling mechanism rather than a semantic operation intended to\nselect information relevant to inference. This is because at training time,\nattention is only guided by a very sparse signal (i.e. the answer label) at the\nend of the inference chain. This causes the cross-modality attention weights to\ndeviate from the desired visual-language bindings. To rectify this deviation,\nwe propose to guide the attention mechanism using explicit linguistic-visual\ngrounding. This grounding is derived by connecting structured linguistic\nconcepts in the query to their referents among the visual objects. Here we\nlearn the grounding from the pairing of questions and images alone, without the\nneed for answer annotation or external grounding supervision. This grounding\nguides the attention mechanism inside VQA models through a duality of\nmechanisms: pre-training attention weight calculation and directly guiding the\nweights at inference time on a case-by-case basis. The resultant algorithm is\ncapable of probing attention-based reasoning models, injecting relevant\nassociative knowledge, and regulating the core reasoning process. This scalable\nenhancement improves the performance of VQA models, fortifies their robustness\nto limited access to supervised data, and increases interpretability.", "comment": "Preprint, 10 pages", "links": []}
{"entry_id": "2205.08013", "title": "Continual learning on 3D point clouds with random compressed rehearsal", "authors": ["Maciej Zamorski", "Michał Stypułkowski", "Konrad Karanowski", "Tomasz Trzciński", "Maciej Zięba"], "published": "2022-05-16 22:59:52", "updated": "2022-05-20 12:09:47", "summary": "Contemporary deep neural networks offer state-of-the-art results when applied\nto visual reasoning, e.g., in the context of 3D point cloud data. Point clouds\nare important datatype for precise modeling of three-dimensional environments,\nbut effective processing of this type of data proves to be challenging. In the\nworld of large, heavily-parameterized network architectures and\ncontinuously-streamed data, there is an increasing need for machine learning\nmodels that can be trained on additional data. Unfortunately, currently\navailable models cannot fully leverage training on additional data without\nlosing their past knowledge. Combating this phenomenon, called catastrophic\nforgetting, is one of the main objectives of continual learning. Continual\nlearning for deep neural networks has been an active field of research,\nprimarily in 2D computer vision, natural language processing, reinforcement\nlearning, and robotics. However, in 3D computer vision, there are hardly any\ncontinual learning solutions specifically designed to take advantage of point\ncloud structure. This work proposes a novel neural network architecture capable\nof continual learning on 3D point cloud data. We utilize point cloud structure\nproperties for preserving a heavily compressed set of past data. By using\nrehearsal and reconstruction as regularization methods of the learning process,\nour approach achieves a significant decrease of catastrophic forgetting\ncompared to the existing solutions on several most popular point cloud datasets\nconsidering two continual learning settings: when a task is known beforehand,\nand in the challenging scenario of when task information is unknown to the\nmodel.", "comment": "10 pages, 3 figures", "links": []}
{"entry_id": "2201.02639", "title": "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", "authors": ["Rowan Zellers", "Jiasen Lu", "Ximing Lu", "Youngjae Yu", "Yanpeng Zhao", "Mohammadreza Salehi", "Aditya Kusupati", "Jack Hessel", "Ali Farhadi", "Yejin Choi"], "published": "2022-01-07 19:00:21", "updated": "2022-05-13 14:25:04", "summary": "As humans, we navigate a multimodal world, building a holistic understanding\nfrom all our senses. We introduce MERLOT Reserve, a model that represents\nvideos jointly over time -- through a new training objective that learns from\naudio, subtitles, and video frames. Given a video, we replace snippets of text\nand audio with a MASK token; the model learns by choosing the correct\nmasked-out snippet. Our objective learns faster than alternatives, and performs\nwell at scale: we pretrain on 20 million YouTube videos.\n  Empirical results show that MERLOT Reserve learns strong multimodal\nrepresentations. When finetuned, it sets state-of-the-art on Visual Commonsense\nReasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%,\nand 1.5% respectively. Ablations show that these tasks benefit from audio\npretraining -- even VCR, a QA task centered around images (without sound).\nMoreover, our objective enables out-of-the-box prediction, revealing strong\nmultimodal commonsense understanding. In a fully zero-shot setting, our model\nobtains competitive results on four video tasks, even outperforming supervised\napproaches on the recently proposed Situated Reasoning (STAR) benchmark.\n  We analyze why audio enables better vision-language representations,\nsuggesting significant opportunities for future research. We conclude by\ndiscussing ethical and societal implications of multimodal pretraining.", "comment": "CVPR 2022. Project page at https://rowanzellers.com/merlotreserve", "links": []}
{"entry_id": "2205.04061", "title": "Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering", "authors": ["Min Peng", "Chongyang Wang", "Yuan Gao", "Yu Shi", "Xiang-Dong Zhou"], "published": "2022-05-09 06:28:56", "updated": "2022-05-09 06:28:56", "summary": "Video question answering (VideoQA) is challenging given its multimodal\ncombination of visual understanding and natural language processing. While most\nexisting approaches ignore the visual appearance-motion information at\ndifferent temporal scales, it is unknown how to incorporate the multilevel\nprocessing capacity of a deep learning model with such multiscale information.\nTargeting these issues, this paper proposes a novel Multilevel Hierarchical\nNetwork (MHN) with multiscale sampling for VideoQA. MHN comprises two modules,\nnamely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning\n(PVR). With a multiscale sampling, RMI iterates the interaction of\nappearance-motion information at each scale and the question embeddings to\nbuild the multilevel question-guided visual representations. Thereon, with a\nshared transformer encoder, PVR infers the visual cues at each level in\nparallel to fit with answering different question types that may rely on the\nvisual information at relevant levels. Through extensive experiments on three\nVideoQA datasets, we demonstrate improved performances than previous\nstate-of-the-arts and justify the effectiveness of each part of our method.", "comment": "Accepted by IJCAI 2022. arXiv admin note: text overlap with\n  arXiv:2109.04735", "links": []}
{"entry_id": "2205.03854", "title": "Introduction to Soar", "authors": ["John E. Laird"], "published": "2022-05-08 12:44:51", "updated": "2022-05-08 12:44:51", "summary": "This paper is the recommended initial reading for a functional overview of\nSoar, version 9.6. It includes an abstract overview of the architectural\nstructure of Soar including its processing, memories, learning modules, their\ninterfaces, and the representations of knowledge used by those modules. From\nthere it describes the processing supported by those modules, including\ndecision making, impasses and substates, procedure learning via chunking,\nreinforcement learning, semantic memory, episodic memory, and spatial-visual\nreasoning. It then reviews the levels of decision making and variety of\nlearning in Soar, and analysis of Soar as an architecture supporting general\nhuman-level AI. Following the references is an appendix that contains short\ndescriptions of recent Soar agents and a glossary of the terminology we use in\ndescribing Soar.", "comment": "29 pages", "links": []}
{"entry_id": "2205.03075", "title": "QLEVR: A Diagnostic Dataset for Quantificational Language and Elementary Visual Reasoning", "authors": ["Zechen Li", "Anders Søgaard"], "published": "2022-05-06 08:51:13", "updated": "2022-05-06 08:51:13", "summary": "Synthetic datasets have successfully been used to probe visual\nquestion-answering datasets for their reasoning abilities. CLEVR\n(johnson2017clevr), for example, tests a range of visual reasoning abilities.\nThe questions in CLEVR focus on comparisons of shapes, colors, and sizes,\nnumerical reasoning, and existence claims. This paper introduces a minimally\nbiased, diagnostic visual question-answering dataset, QLEVR, that goes beyond\nexistential and numerical quantification and focus on more complex quantifiers\nand their combinations, e.g., asking whether there are more than two red balls\nthat are smaller than at least three blue balls in an image. We describe how\nthe dataset was created and present a first evaluation of state-of-the-art\nvisual question-answering models, showing that QLEVR presents a formidable\nchallenge to our current models. Code and Dataset are available at\nhttps://github.com/zechenli03/QLEVR", "comment": "To appear at Findings of NAACL 2022", "links": []}
{"entry_id": "2204.10496", "title": "Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks", "authors": ["Zhecan Wang", "Noel Codella", "Yen-Chun Chen", "Luowei Zhou", "Xiyang Dai", "Bin Xiao", "Jianwei Yang", "Haoxuan You", "Kai-Wei Chang", "Shih-fu Chang", "Lu Yuan"], "published": "2022-04-22 04:41:04", "updated": "2022-04-28 17:43:36", "summary": "Cross-modal encoders for vision-language (VL) tasks are often pretrained with\ncarefully curated vision-language datasets. While these datasets reach an order\nof 10 million samples, the labor cost is prohibitive to scale further.\nConversely, unimodal encoders are pretrained with simpler annotations that are\nless cost-prohibitive, achieving scales of hundreds of millions to billions. As\na result, unimodal encoders have achieved state-of-art (SOTA) on many\ndownstream tasks. However, challenges remain when applying to VL tasks. The\npretraining data is not optimal for cross-modal architectures and requires\nheavy computational resources. In addition, unimodal architectures lack\ncross-modal interactions that have demonstrated significant benefits for VL\ntasks. Therefore, how to best leverage pretrained unimodal encoders for VL\ntasks is still an area of active research. In this work, we propose a method to\nleverage unimodal vision and text encoders for VL tasks that augment existing\nVL approaches while conserving computational complexity. Specifically, we\npropose Multimodal Adaptive Distillation (MAD), which adaptively distills\nuseful knowledge from pretrained encoders to cross-modal VL encoders. Second,\nto better capture nuanced impacts on VL task performance, we introduce an\nevaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual\nEntailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of\ndata constraints and conditions of domain shift. Experiments demonstrate that\nMAD leads to consistent gains in the low-shot, domain-shifted, and\nfully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA\nperformance on VCR compared to other single models pretrained with image-text\ndata. Finally, MAD outperforms concurrent works utilizing pretrained vision\nencoder from CLIP. Code will be made available.", "comment": "arXiv admin note: substantial text overlap with arXiv:2201.05729", "links": []}
{"entry_id": "2204.11922", "title": "Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks", "authors": ["Navid Rezaei", "Marek Z. Reformat"], "published": "2022-04-25 18:56:55", "updated": "2022-04-25 18:56:55", "summary": "Pre-trained language models have shown excellent results in few-shot learning\nscenarios using in-context learning. Although it is impressive, the size of\nlanguage models can be prohibitive to make them usable in on-device\napplications, such as sensors or smartphones. With smaller language models,\ntask-specific data annotation is needed to fine-tune the language model for a\nspecific purpose. However, data annotation can have a substantial financial and\ntime burden for small research groups, startups, and even companies. In this\npaper, we analyze different prompt-based fine-tuning techniques to improve\nresults on both language and multimodal causal transformer models. To evaluate\nour results, we use a dataset focusing on visual commonsense reasoning in time.\nOur results show that by simple model-agnostic prompt-based fine-tuning,\ncomparable results can be reached by only using 35%-40% of the fine-tuning\ntraining dataset. The proposed approaches result in significant time and\nfinancial savings. As the proposed methods make minimal architectural\nassumptions, other researchers can use the results in their transformer models\nwith minimal adaptations. We plan to release the source code freely to make it\neasier for the community to use and contribute to our work.", "comment": null, "links": []}
{"entry_id": "2201.12382", "title": "Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2022-01-28 19:24:30", "updated": "2022-04-19 18:41:45", "summary": "Abstract visual reasoning (AVR) domain encompasses problems solving which\nrequires the ability to reason about relations among entities present in a\ngiven scene. While humans, generally, solve AVR tasks in a \"natural\" way, even\nwithout prior experience, this type of problems has proven difficult for\ncurrent machine learning systems. The paper summarises recent progress in\napplying deep learning methods to solving AVR problems, as a proxy for studying\nmachine intelligence. We focus on the most common type of AVR tasks -- the\nRaven's Progressive Matrices (RPMs) -- and provide a comprehensive review of\nthe learning methods and deep neural models applied to solve RPMs, as well as,\nthe RPM benchmark sets. Performance analysis of the state-of-the-art approaches\nto solving RPMs leads to formulation of certain insights and remarks on the\ncurrent and future trends in this area. We conclude the paper by demonstrating\nhow real-world problems can benefit from the discoveries of RPM studies.", "comment": null, "links": []}
{"entry_id": "2204.08027", "title": "Attention Mechanism based Cognition-level Scene Understanding", "authors": ["Xuejiao Tang", "Tai Le Quy", "Eirini Ntoutsi", "Kea Turner", "Vasile Palade", "Israat Haque", "Peng Xu", "Chris Brown", "Wenbin Zhang"], "published": "2022-04-17 15:04:44", "updated": "2022-04-19 02:40:42", "summary": "Given a question-image input, the Visual Commonsense Reasoning (VCR) model\ncan predict an answer with the corresponding rationale, which requires\ninference ability from the real world. The VCR task, which calls for exploiting\nthe multi-source information as well as learning different levels of\nunderstanding and extensive commonsense knowledge, is a cognition-level scene\nunderstanding task. The VCR task has aroused researchers' interest due to its\nwide range of applications, including visual question answering, automated\nvehicle systems, and clinical decision support. Previous approaches to solving\nthe VCR task generally rely on pre-training or exploiting memory with long\ndependency relationship encoded models. However, these approaches suffer from a\nlack of generalizability and losing information in long sequences. In this\npaper, we propose a parallel attention-based cognitive VCR network PAVCR, which\nfuses visual-textual information efficiently and encodes semantic information\nin parallel to enable the model to capture rich information for cognition-level\ninference. Extensive experiments show that the proposed model yields\nsignificant improvements over existing methods on the benchmark VCR dataset.\nMoreover, the proposed model provides intuitive interpretation into visual\ncommonsense reasoning.", "comment": "arXiv admin note: text overlap with arXiv:2108.02924,\n  arXiv:2107.01671", "links": []}
{"entry_id": "2204.05543", "title": "Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance", "authors": ["Lei Zhang", "Kang Liao", "Chunyu Lin", "Yao Zhao"], "published": "2022-04-12 06:06:50", "updated": "2022-04-12 06:06:50", "summary": "Image outpainting technology generates visually reasonable content regardless\nof authenticity, making it unreliable to serve for practical applications even\nthough introducing additional modalities eg. the sketch. Since sparse depth\nmaps are widely captured in robotics and autonomous systems, together with RGB\nimages, we combine the sparse depth in the image outpainting task to provide\nmore reliable performance. Concretely, we propose a Depth-Guided Outpainting\nNetwork (DGONet) to model the feature representations of different modalities\ndifferentially and learn the structure-aware cross-modal fusion. To this end,\ntwo components are designed to implement: 1) The Multimodal Learning Module\nproduces unique depth and RGB feature representations from the perspectives of\ndifferent modal characteristics. 2) The Depth Guidance Fusion Module leverages\nthe complete depth modality to guide the establishment of RGB contents by\nprogressive multimodal feature fusion. Furthermore, we specially design an\nadditional constraint strategy consisting of Cross-modal Loss and Edge Loss to\nenhance ambiguous contours and expedite reliable content generation. Extensive\nexperiments on KITTI demonstrate our superiority over the state-of-the-art\nmethods with more reliable content generation.", "comment": null, "links": []}
{"entry_id": "2204.02380", "title": "CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations", "authors": ["Leonard Salewski", "A. Sophia Koepke", "Hendrik P. A. Lensch", "Zeynep Akata"], "published": "2022-04-05 17:38:04", "updated": "2022-04-05 17:38:04", "summary": "Providing explanations in the context of Visual Question Answering (VQA)\npresents a fundamental problem in machine learning. To obtain detailed insights\ninto the process of generating natural language explanations for VQA, we\nintroduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with\nnatural language explanations. For each image-question pair in the CLEVR\ndataset, CLEVR-X contains multiple structured textual explanations which are\nderived from the original scene graphs. By construction, the CLEVR-X\nexplanations are correct and describe the reasoning and visual information that\nis necessary to answer a given question. We conducted a user study to confirm\nthat the ground-truth explanations in our proposed dataset are indeed complete\nand relevant. We present baseline results for generating natural language\nexplanations in the context of VQA using two state-of-the-art frameworks on the\nCLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation\ngeneration quality for different question and answer types. Additionally, we\nstudy the influence of using different numbers of ground-truth explanations on\nthe convergence of natural language generation (NLG) metrics. The CLEVR-X\ndataset is publicly available at\n\\url{https://explainableml.github.io/CLEVR-X/}.", "comment": null, "links": ["http://dx.doi.org/10.1007/978-3-031-04083-2_5"]}
{"entry_id": "2204.00879", "title": "Co-VQA : Answering by Interactive Sub Question Sequence", "authors": ["Ruonan Wang", "Yuxi Qian", "Fangxiang Feng", "Xiaojie Wang", "Huixing Jiang"], "published": "2022-04-02 15:09:16", "updated": "2022-04-02 15:09:16", "summary": "Most existing approaches to Visual Question Answering (VQA) answer questions\ndirectly, however, people usually decompose a complex question into a sequence\nof simple sub questions and finally obtain the answer to the original question\nafter answering the sub question sequence(SQS). By simulating the process, this\npaper proposes a conversation-based VQA (Co-VQA) framework, which consists of\nthree components: Questioner, Oracle, and Answerer. Questioner raises the sub\nquestions using an extending HRED model, and Oracle answers them one-by-one. An\nAdaptive Chain Visual Reasoning Model (ACVRM) for Answerer is also proposed,\nwhere the question-answer pair is used to update the visual representation\nsequentially. To perform supervised learning for each model, we introduce a\nwell-designed method to build a SQS for each question on VQA 2.0 and VQA-CP v2\ndatasets. Experimental results show that our method achieves state-of-the-art\non VQA-CP v2. Further analyses show that SQSs help build direct semantic\nconnections between questions and images, provide question-adaptive\nvariable-length reasoning chains, and with explicit interpretability as well as\nerror traceability.", "comment": "Accepted by Findings of ACL 2022", "links": []}
{"entry_id": "2203.14040", "title": "Visual Abductive Reasoning", "authors": ["Chen Liang", "Wenguan Wang", "Tianfei Zhou", "Yi Yang"], "published": "2022-03-26 10:17:03", "updated": "2022-03-26 10:17:03", "summary": "Abductive reasoning seeks the likeliest possible explanation for partial\nobservations. Although abduction is frequently employed in human daily\nreasoning, it is rarely explored in computer vision literature. In this paper,\nwe propose a new task and dataset, Visual Abductive Reasoning (VAR), for\nexamining abductive reasoning ability of machine intelligence in everyday\nvisual situations. Given an incomplete set of visual events, AI systems are\nrequired to not only describe what is observed, but also infer the hypothesis\nthat can best explain the visual premise. Based on our large-scale VAR dataset,\nwe devise a strong baseline model, Reasoner (causal-and-cascaded reasoning\nTransformer). First, to capture the causal structure of the observations, a\ncontextualized directional position embedding strategy is adopted in the\nencoder, that yields discriminative representations for the premise and\nhypothesis. Then, multiple decoders are cascaded to generate and progressively\nrefine the premise and hypothesis sentences. The prediction scores of the\nsentences are used to guide cross-sentence information flow in the cascaded\nreasoning procedure. Our VAR benchmarking results show that Reasoner surpasses\nmany famous video-language models, while still being far behind human\nperformance. This work is expected to foster future efforts in the\nreasoning-beyond-observation paradigm.", "comment": "CVPR2022; Code, data: https://github.com/leonnnop/VAR", "links": []}
{"entry_id": "2105.07122", "title": "Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues", "authors": ["Qingxiu Dong", "Ziwei Qin", "Heming Xia", "Tian Feng", "Shoujie Tong", "Haoran Meng", "Lin Xu", "Weidong Zhan", "Sujian Li", "Zhongyu Wei", "Tianyu Liu", "Zuifang Sui"], "published": "2021-05-15 03:25:42", "updated": "2022-03-17 04:11:58", "summary": "It is a common practice for recent works in vision language cross-modal\nreasoning to adopt a binary or multi-choice classification formulation taking\nas input a set of source image(s) and textual query. In this work, we take a\nsober look at such an unconditional formulation in the sense that no prior\nknowledge is specified with respect to the source image(s). Inspired by the\ndesigns of both visual commonsense reasoning and natural language inference\ntasks, we propose a new task termed Premise-based Multi-modal Reasoning(PMR)\nwhere a textual premise is the background presumption on each source image. The\nPMR dataset contains 15,360 manually annotated samples which are created by a\nmulti-phase crowd-sourcing process. With selected high-quality movie\nscreenshots and human-curated premise templates from 6 pre-defined categories,\nwe ask crowd-source workers to write one true hypothesis and three distractors\n(4 choices) given the premise and image through a cross-check procedure.\nBesides, we generate adversarial samples to alleviate the annotation artifacts\nand double the size of PMR. We benchmark various state-of-the-art (pretrained)\nmulti-modal inference models on PMR and conduct comprehensive experimental\nanalyses to showcase the utility of our dataset.", "comment": "ACL 2022 Main conference (Long Paper)", "links": []}
{"entry_id": "2203.07303", "title": "All in One: Exploring Unified Video-Language Pre-training", "authors": ["Alex Jinpeng Wang", "Yixiao Ge", "Rui Yan", "Yuying Ge", "Xudong Lin", "Guanyu Cai", "Jianping Wu", "Ying Shan", "Xiaohu Qie", "Mike Zheng Shou"], "published": "2022-03-14 17:06:30", "updated": "2022-03-14 17:06:30", "summary": "Mainstream Video-Language Pre-training models \\cite{actbert,clipbert,violet}\nconsist of three parts, a video encoder, a text encoder, and a video-text\nfusion Transformer. They pursue better performance via utilizing heavier\nunimodal encoders or multimodal fusion Transformers, resulting in increased\nparameters with lower efficiency in downstream tasks. In this work, we for the\nfirst time introduce an end-to-end video-language model, namely\n\\textit{all-in-one Transformer}, that embeds raw video and textual signals into\njoint representations using a unified backbone architecture. We argue that the\nunique temporal information of video data turns out to be a key barrier\nhindering the design of a modality-agnostic Transformer. To overcome the\nchallenge, we introduce a novel and effective token rolling operation to encode\ntemporal representations from video clips in a non-parametric manner. The\ncareful design enables the representation learning of both video-text\nmultimodal inputs and unimodal inputs using a unified backbone model. Our\npre-trained all-in-one Transformer is transferred to various downstream\nvideo-text tasks after fine-tuning, including text-video retrieval,\nvideo-question answering, multiple choice and visual commonsense reasoning.\nState-of-the-art performances with the minimal model FLOPs on nine datasets\ndemonstrate the superiority of our method compared to the competitive\ncounterparts. The code and pretrained model have been released in\nhttps://github.com/showlab/all-in-one.", "comment": "18 pages. 11 figures. Code: https://github.com/showlab/all-in-one", "links": []}
{"entry_id": "2203.06107", "title": "REX: Reasoning-aware and Grounded Explanation", "authors": ["Shi Chen", "Qi Zhao"], "published": "2022-03-11 17:28:42", "updated": "2022-03-11 17:28:42", "summary": "Effectiveness and interpretability are two essential properties for\ntrustworthy AI systems. Most recent studies in visual reasoning are dedicated\nto improving the accuracy of predicted answers, and less attention is paid to\nexplaining the rationales behind the decisions. As a result, they commonly take\nadvantage of spurious biases instead of actually reasoning on the\nvisual-textual data, and have yet developed the capability to explain their\ndecision making by considering key information from both modalities. This paper\naims to close the gap from three distinct perspectives: first, we define a new\ntype of multi-modal explanations that explain the decisions by progressively\ntraversing the reasoning process and grounding keywords in the images. We\ndevelop a functional program to sequentially execute different reasoning steps\nand construct a new dataset with 1,040,830 multi-modal explanations. Second, we\nidentify the critical need to tightly couple important components across the\nvisual and textual modalities for explaining the decisions, and propose a novel\nexplanation generation method that explicitly models the pairwise\ncorrespondence between words and regions of interest. It improves the visual\ngrounding capability by a considerable margin, resulting in enhanced\ninterpretability and reasoning performance. Finally, with our new data and\nmethod, we perform extensive analyses to study the effectiveness of our\nexplanation under different settings, including multi-task learning and\ntransfer learning. Our code and data are available at\nhttps://github.com/szzexpoi/rex.", "comment": "To appear in CVPR2022", "links": []}
{"entry_id": "2202.10284", "title": "A Review of Emerging Research Directions in Abstract Visual Reasoning", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2022-02-21 14:58:02", "updated": "2022-03-07 09:56:09", "summary": "Abstract Visual Reasoning (AVR) problems are commonly used to approximate\nhuman intelligence. They test the ability of applying previously gained\nknowledge, experience and skills in a completely new setting, which makes them\nparticularly well-suited for this task. Recently, the AVR problems have become\npopular as a proxy to study machine intelligence, which has led to emergence of\nnew distinct types of problems and multiple benchmark sets. In this work we\nreview this emerging AVR research and propose a taxonomy to categorise the AVR\ntasks along 5 dimensions: input shapes, hidden rules, target task, cognitive\nfunction, and main challenge. The perspective taken in this survey allows to\ncharacterise AVR problems with respect to their shared and distinct properties,\nprovides a unified view on the existing approaches for solving AVR tasks, shows\nhow the AVR problems relate to practical applications, and outlines promising\ndirections for future work. One of them refers to the observation that in the\nmachine learning literature different tasks are considered in isolation, which\nis in the stark contrast with the way the AVR tasks are used to measure human\nintelligence, where multiple types of problems are combined within a single IQ\ntest.", "comment": null, "links": []}
{"entry_id": "2202.12162", "title": "Measuring CLEVRness: Blackbox testing of Visual Reasoning Models", "authors": ["Spyridon Mouselinos", "Henryk Michalewski", "Mateusz Malinowski"], "published": "2022-02-24 15:59:29", "updated": "2022-02-28 14:02:08", "summary": "How can we measure the reasoning capabilities of intelligence systems? Visual\nquestion answering provides a convenient framework for testing the model's\nabilities by interrogating the model through questions about the scene.\nHowever, despite scores of various visual QA datasets and architectures, which\nsometimes yield even a super-human performance, the question of whether those\narchitectures can actually reason remains open to debate. To answer this, we\nextend the visual question answering framework and propose the following\nbehavioral test in the form of a two-player game. We consider black-box neural\nmodels of CLEVR. These models are trained on a diagnostic dataset benchmarking\nreasoning. Next, we train an adversarial player that re-configures the scene to\nfool the CLEVR model. We show that CLEVR models, which otherwise could perform\nat a human level, can easily be fooled by our agent. Our results put in doubt\nwhether data-driven approaches can do reasoning without exploiting the numerous\nbiases that are often present in those datasets. Finally, we also propose a\ncontrolled experiment measuring the efficiency of such models to learn and\nperform reasoning.", "comment": "ICLR 2022", "links": []}
{"entry_id": "2202.13115", "title": "Analysis of Visual Reasoning on One-Stage Object Detection", "authors": ["Tolga Aksoy", "Ugur Halici"], "published": "2022-02-26 11:11:59", "updated": "2022-02-26 11:11:59", "summary": "Current state-of-the-art one-stage object detectors are limited by treating\neach image region separately without considering possible relations of the\nobjects. This causes dependency solely on high-quality convolutional feature\nrepresentations for detecting objects successfully. However, this may not be\npossible sometimes due to some challenging conditions. In this paper, the usage\nof reasoning features on one-stage object detection is analyzed. We attempted\ndifferent architectures that reason the relations of the image regions by using\nself-attention. YOLOv3-Reasoner2 model spatially and semantically enhances\nfeatures in the reasoning layer and fuses them with the original convolutional\nfeatures to improve performance. The YOLOv3-Reasoner2 model achieves around\n2.5% absolute improvement with respect to baseline YOLOv3 on COCO in terms of\nmAP while still running in real-time.", "comment": "Submitted to IEEE International Conference on Image Processing (ICIP)\n  2022", "links": []}
{"entry_id": "2202.01334", "title": "Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization", "authors": ["Dianbo Liu", "Alex Lamb", "Xu Ji", "Pascal Notsawo", "Mike Mozer", "Yoshua Bengio", "Kenji Kawaguchi"], "published": "2022-02-02 23:54:26", "updated": "2022-02-02 23:54:26", "summary": "Vector Quantization (VQ) is a method for discretizing latent representations\nand has become a major part of the deep learning toolkit. It has been\ntheoretically and empirically shown that discretization of representations\nleads to improved generalization, including in reinforcement learning where\ndiscretization can be used to bottleneck multi-agent communication to promote\nagent specialization and robustness. The discretization tightness of most\nVQ-based methods is defined by the number of discrete codes in the\nrepresentation vector and the codebook size, which are fixed as\nhyperparameters. In this work, we propose learning to dynamically select\ndiscretization tightness conditioned on inputs, based on the hypothesis that\ndata naturally contains variations in complexity that call for different levels\nof representational coarseness. We show that dynamically varying tightness in\ncommunication bottlenecks can improve model performance on visual reasoning and\nreinforcement learning tasks.", "comment": null, "links": []}
{"entry_id": "2101.01169", "title": "Transformers in Vision: A Survey", "authors": ["Salman Khan", "Muzammal Naseer", "Munawar Hayat", "Syed Waqas Zamir", "Fahad Shahbaz Khan", "Mubarak Shah"], "published": "2021-01-04 18:57:24", "updated": "2022-01-19 05:49:50", "summary": "Astounding results from Transformer models on natural language tasks have\nintrigued the vision community to study their application to computer vision\nproblems. Among their salient benefits, Transformers enable modeling long\ndependencies between input sequence elements and support parallel processing of\nsequence as compared to recurrent networks e.g., Long short-term memory (LSTM).\nDifferent from convolutional networks, Transformers require minimal inductive\nbiases for their design and are naturally suited as set-functions. Furthermore,\nthe straightforward design of Transformers allows processing multiple\nmodalities (e.g., images, videos, text and speech) using similar processing\nblocks and demonstrates excellent scalability to very large capacity networks\nand huge datasets. These strengths have led to exciting progress on a number of\nvision tasks using Transformer networks. This survey aims to provide a\ncomprehensive overview of the Transformer models in the computer vision\ndiscipline. We start with an introduction to fundamental concepts behind the\nsuccess of Transformers i.e., self-attention, large-scale pre-training, and\nbidirectional encoding. We then cover extensive applications of transformers in\nvision including popular recognition tasks (e.g., image classification, object\ndetection, action recognition, and segmentation), generative modeling,\nmulti-modal tasks (e.g., visual-question answering, visual reasoning, and\nvisual grounding), video processing (e.g., activity recognition, video\nforecasting), low-level vision (e.g., image super-resolution, image\nenhancement, and colorization) and 3D analysis (e.g., point cloud\nclassification and segmentation). We compare the respective advantages and\nlimitations of popular techniques both in terms of architectural design and\ntheir experimental value. Finally, we provide an analysis on open research\ndirections and possible future works.", "comment": "30 pages (Accepted in ACM Computing Surveys December 2021)", "links": ["http://dx.doi.org/10.1145/3505244"]}
{"entry_id": "2111.12301", "title": "Two-stage Rule-induction Visual Reasoning on RPMs with an Application to Video Prediction", "authors": ["Wentao He", "Jianfeng Ren", "Ruibin Bai", "Xudong Jiang"], "published": "2021-11-24 06:51:38", "updated": "2022-01-05 04:40:43", "summary": "Raven's Progressive Matrices (RPMs) are frequently used in evaluating human's\nvisual reasoning ability. Researchers have made considerable efforts in\ndeveloping systems to automatically solve the RPM problem, often through a\nblack-box end-to-end convolutional neural network for both visual recognition\nand logical reasoning tasks. Based on the two intrinsic natures of RPM problem,\nvisual recognition and logical reasoning, we propose a Two-stage Rule-Induction\nVisual Reasoner (TRIVR), which consists of a perception module and a reasoning\nmodule, to tackle the challenges of real-world visual recognition and\nsubsequent logical reasoning tasks, respectively. For the reasoning module, we\nfurther propose a \"2+1\" formulation that models human's thinking in solving\nRPMs and significantly reduces the model complexity. It derives a reasoning\nrule from each RPM sample, which is not feasible for existing methods. As a\nresult, the proposed reasoning module is capable of yielding a set of reasoning\nrules modeling human in solving the RPM problems. To validate the proposed\nmethod on real-world applications, an RPM-like Video Prediction (RVP) dataset\nis constructed, where visual reasoning is conducted on RPMs constructed using\nreal-world video frames. Experimental results on various RPM-like datasets\ndemonstrate that the proposed TRIVR achieves a significant and consistent\nperformance gain compared with the state-of-the-art models.", "comment": "Under review", "links": []}
{"entry_id": "2112.11691", "title": "CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes", "authors": ["Xu Yan", "Zhihao Yuan", "Yuhao Du", "Yinghong Liao", "Yao Guo", "Zhen Li", "Shuguang Cui"], "published": "2021-12-22 06:43:21", "updated": "2021-12-31 09:13:52", "summary": "3D scene understanding is a relatively emerging research field. In this\npaper, we introduce the Visual Question Answering task in 3D real-world scenes\n(VQA-3D), which aims to answer all possible questions given a 3D scene. To\ntackle this problem, the first VQA-3D dataset, namely CLEVR3D, is proposed,\nwhich contains 60K questions in 1,129 real-world scenes. Specifically, we\ndevelop a question engine leveraging 3D scene graph structures to generate\ndiverse reasoning questions, covering the questions of objects' attributes\n(i.e., size, color, and material) and their spatial relationships. Built upon\nthis dataset, we further design the first VQA-3D baseline model, TransVQA3D.\nThe TransVQA3D model adopts well-designed Transformer architectures to achieve\nsuperior VQA-3D performance, compared with the pure language baseline and\nprevious 3D reasoning methods directly applied to 3D scenarios. Experimental\nresults verify that taking VQA-3D as an auxiliary task can boost the\nperformance of 3D scene understanding, including scene graph analysis for the\nnode-wise classification and whole-graph recognition.", "comment": null, "links": []}
{"entry_id": "2112.15324", "title": "Deconfounded Visual Grounding", "authors": ["Jianqiang Huang", "Yu Qin", "Jiaxin Qi", "Qianru Sun", "Hanwang Zhang"], "published": "2021-12-31 07:14:59", "updated": "2021-12-31 07:14:59", "summary": "We focus on the confounding bias between language and location in the visual\ngrounding pipeline, where we find that the bias is the major visual reasoning\nbottleneck. For example, the grounding process is usually a trivial\nlanguage-location association without visual reasoning, e.g., grounding any\nlanguage query containing sheep to the nearly central regions, due to that most\nqueries about sheep have ground-truth locations at the image center. First, we\nframe the visual grounding pipeline into a causal graph, which shows the\ncausalities among image, query, target location and underlying confounder.\nThrough the causal graph, we know how to break the grounding bottleneck:\ndeconfounded visual grounding. Second, to tackle the challenge that the\nconfounder is unobserved in general, we propose a confounder-agnostic approach\ncalled: Referring Expression Deconfounder (RED), to remove the confounding\nbias. Third, we implement RED as a simple language attention, which can be\napplied in any grounding method. On popular benchmarks, RED improves various\nstate-of-the-art grounding methods by a significant margin. Code will soon be\navailable at: https://github.com/JianqiangH/Deconfounded_VG.", "comment": "AAAI 2022 Accepted", "links": []}
{"entry_id": "2112.08587", "title": "SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning", "authors": ["Zhecan Wang", "Haoxuan You", "Liunian Harold Li", "Alireza Zareian", "Suji Park", "Yiqing Liang", "Kai-Wei Chang", "Shih-Fu Chang"], "published": "2021-12-16 03:16:30", "updated": "2021-12-16 03:16:30", "summary": "Answering complex questions about images is an ambitious goal for machine\nintelligence, which requires a joint understanding of images, text, and\ncommonsense knowledge, as well as a strong reasoning ability. Recently,\nmultimodal Transformers have made great progress in the task of Visual\nCommonsense Reasoning (VCR), by jointly understanding visual objects and text\ntokens through layers of cross-modality attention. However, these approaches do\nnot utilize the rich structure of the scene and the interactions between\nobjects which are essential in answering complex commonsense questions. We\npropose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to\nincorporate visual scene graphs in commonsense reasoning. To exploit the scene\ngraph structure, at the model structure level, we propose a multihop graph\ntransformer for regularizing attention interaction among hops. As for\npre-training, a scene-graph-aware pre-training method is proposed to leverage\nstructure knowledge extracted in the visual scene graph. Moreover, we introduce\na method to train and generate domain-relevant visual scene graphs using\ntextual annotations in a weakly-supervised manner. Extensive experiments on VCR\nand other tasks show a significant performance boost compared with the\nstate-of-the-art methods and prove the efficacy of each proposed component.", "comment": "AAAI 2022", "links": []}
{"entry_id": "2104.11832", "title": "Playing Lottery Tickets with Vision and Language", "authors": ["Zhe Gan", "Yen-Chun Chen", "Linjie Li", "Tianlong Chen", "Yu Cheng", "Shuohang Wang", "Jingjing Liu", "Lijuan Wang", "Zicheng Liu"], "published": "2021-04-23 22:24:33", "updated": "2021-12-14 23:04:45", "summary": "Large-scale pre-training has recently revolutionized vision-and-language (VL)\nresearch. Models such as LXMERT and UNITER have significantly lifted the state\nof the art over a wide range of VL tasks. However, the large number of\nparameters in such models hinders their application in practice. In parallel,\nwork on the lottery ticket hypothesis (LTH) has shown that deep neural networks\ncontain small matching subnetworks that can achieve on par or even better\nperformance than the dense networks when trained in isolation. In this work, we\nperform the first empirical study to assess whether such trainable subnetworks\nalso exist in pre-trained VL models. We use UNITER as the main testbed (also\ntest on LXMERT and ViLT), and consolidate 7 representative VL tasks for\nexperiments, including visual question answering, visual commonsense reasoning,\nvisual entailment, referring expression comprehension, image-text retrieval,\nGQA, and NLVR$^2$. Through comprehensive analysis, we summarize our main\nfindings as follows. ($i$) It is difficult to find subnetworks that strictly\nmatch the performance of the full model. However, we can find \"relaxed\" winning\ntickets at 50%-70% sparsity that maintain 99% of the full accuracy. ($ii$)\nSubnetworks found by task-specific pruning transfer reasonably well to the\nother tasks, while those found on the pre-training tasks at 60%/70% sparsity\ntransfer universally, matching 98%/96% of the full accuracy on average over all\nthe tasks. ($iii$) Besides UNITER, other models such as LXMERT and ViLT can\nalso play lottery tickets. However, the highest sparsity we can achieve for\nViLT is far lower than LXMERT and UNITER (30% vs. 70%). ($iv$) LTH also remains\nrelevant when using other training methods (e.g., adversarial training).", "comment": "Accepted to AAAI 2022", "links": []}
{"entry_id": "2112.07236", "title": "Logics in fungal mycelium networks", "authors": ["Andrew Adamatzky", "Phil Ayres", "Alexander E. Beasley", "Nic Roberts", "Martin Tegelaar", "Michail-Antisthenis Tsompanas", "Han A. B. Wösten"], "published": "2021-12-14 08:58:40", "updated": "2021-12-14 08:58:40", "summary": "The living mycelium networks are capable of efficient sensorial fusion over\nvery large areas and distributed decision making. The information processing in\nthe mycelium networks is implemented via propagation of electrical and chemical\nsignals en pair with morphological changes in the mycelium structure. These\ninformation processing mechanisms are manifested in experimental laboratory\nfindings that show that the mycelium networks exhibit rich dynamics of\nneuron-like spiking behaviour and a wide range of non-linear electrical\nproperties. On an example of a single real colony of \\emph{Aspergillus niger},\nwe demonstrate that the non-linear transformation of electrical signals and\ntrains of extracellular voltage spikes can be used to implement logical gates\nand circuits. The approaches adopted include numerical modelling of excitation\npropagation on the mycelium network, representation of the mycelium network as\na resistive and capacitive (RC) network and an experimental laboratory study on\nmining logical circuits in mycelium bound composites.", "comment": "To be published in special issue of Logica Universalis --- \"Logic,\n  Spatial Algorithms and Visual Reasoning\", edited by Andrew Schumann and Jerzy\n  Kr\\'{o}l, 2022", "links": []}
{"entry_id": "2112.05136", "title": "PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning", "authors": ["Yining Hong", "Li Yi", "Joshua B. Tenenbaum", "Antonio Torralba", "Chuang Gan"], "published": "2021-12-09 18:59:34", "updated": "2021-12-09 18:59:34", "summary": "A critical aspect of human visual perception is the ability to parse visual\nscenes into individual objects and further into object parts, forming\npart-whole hierarchies. Such composite structures could induce a rich set of\nsemantic concepts and relations, thus playing an important role in the\ninterpretation and organization of visual signals as well as for the\ngeneralization of visual perception and reasoning. However, existing visual\nreasoning benchmarks mostly focus on objects rather than parts. Visual\nreasoning based on the full part-whole hierarchy is much more challenging than\nobject-centric reasoning due to finer-grained concepts, richer geometry\nrelations, and more complex physics. Therefore, to better serve for part-based\nconceptual, relational and physical reasoning, we introduce a new large-scale\ndiagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD\nsynthetic images with ground truth object and part level annotations regarding\nsemantic instance segmentation, color attributes, spatial and geometric\nrelationships, and certain physical properties such as stability. These images\nare paired with 700k machine-generated questions covering various types of\nreasoning types, making them a good testbed for visual reasoning models. We\nexamine several state-of-the-art visual reasoning models on this dataset and\nobserve that they still make many surprising mistakes in situations where\nhumans can easily infer the correct answer. We believe this dataset will open\nup new opportunities for part-based reasoning.", "comment": "NeurIPS 2021. Project page: http://ptr.csail.mit.edu/", "links": []}
{"entry_id": "2108.03603", "title": "Understanding the computational demands underlying visual reasoning", "authors": ["Mohit Vaishnav", "Remi Cadene", "Andrea Alamia", "Drew Linsley", "Rufin VanRullen", "Thomas Serre"], "published": "2021-08-08 10:46:53", "updated": "2021-12-09 04:57:02", "summary": "Visual understanding requires comprehending complex visual relations between\nobjects within a scene. Here, we seek to characterize the computational demands\nfor abstract visual reasoning. We do this by systematically assessing the\nability of modern deep convolutional neural networks (CNNs) to learn to solve\nthe \"Synthetic Visual Reasoning Test\" (SVRT) challenge, a collection of\ntwenty-three visual reasoning problems. Our analysis reveals a novel taxonomy\nof visual reasoning tasks, which can be primarily explained by both the type of\nrelations (same-different vs. spatial-relation judgments) and the number of\nrelations used to compose the underlying rules. Prior cognitive neuroscience\nwork suggests that attention plays a key role in humans' visual reasoning\nability. To test this hypothesis, we extended the CNNs with spatial and\nfeature-based attention mechanisms. In a second series of experiments, we\nevaluated the ability of these attention networks to learn to solve the SVRT\nchallenge and found the resulting architectures to be much more efficient at\nsolving the hardest of these visual reasoning tasks. Most importantly, the\ncorresponding improvements on individual tasks partially explained our novel\ntaxonomy. Overall, this work provides an granular computational account of\nvisual reasoning and yields testable neuroscience predictions regarding the\ndifferential need for feature-based vs. spatial attention depending on the type\nof visual reasoning problem.", "comment": "26 pages, 16 figures", "links": ["http://dx.doi.org/10.1162/neco_a_01485"]}
{"entry_id": "2111.14666", "title": "An in-depth experimental study of sensor usage and visual reasoning of robots navigating in real environments", "authors": ["Assem Sadek", "Guillaume Bono", "Boris Chidlovskii", "Christian Wolf"], "published": "2021-11-29 16:27:29", "updated": "2021-11-29 16:27:29", "summary": "Visual navigation by mobile robots is classically tackled through SLAM plus\noptimal planning, and more recently through end-to-end training of policies\nimplemented as deep networks. While the former are often limited to waypoint\nplanning, but have proven their efficiency even on real physical environments,\nthe latter solutions are most frequently employed in simulation, but have been\nshown to be able learn more complex visual reasoning, involving complex\nsemantical regularities. Navigation by real robots in physical environments is\nstill an open problem. End-to-end training approaches have been thoroughly\ntested in simulation only, with experiments involving real robots being\nrestricted to rare performance evaluations in simplified laboratory conditions.\nIn this work we present an in-depth study of the performance and reasoning\ncapacities of real physical agents, trained in simulation and deployed to two\ndifferent physical environments. Beyond benchmarking, we provide insights into\nthe generalization capabilities of different agents training in different\nconditions. We visualize sensor usage and the importance of the different types\nof signals. We show, that for the PointGoal task, an agent pre-trained on wide\nvariety of tasks and fine-tuned on a simulated version of the target\nenvironment can reach competitive performance without modelling any sim2real\ntransfer, i.e. by deploying the trained agent directly from simulation to a\nreal physical robot.", "comment": null, "links": []}
{"entry_id": "2111.14576", "title": "Recurrent Vision Transformer for Solving Visual Reasoning Problems", "authors": ["Nicola Messina", "Giuseppe Amato", "Fabio Carrara", "Claudio Gennaro", "Fabrizio Falchi"], "published": "2021-11-29 15:01:09", "updated": "2021-11-29 15:01:09", "summary": "Although convolutional neural networks (CNNs) showed remarkable results in\nmany vision tasks, they are still strained by simple yet challenging visual\nreasoning problems. Inspired by the recent success of the Transformer network\nin computer vision, in this paper, we introduce the Recurrent Vision\nTransformer (RViT) model. Thanks to the impact of recurrent connections and\nspatial attention in reasoning tasks, this network achieves competitive results\non the same-different visual reasoning problems from the SVRT dataset. The\nweight-sharing both in spatial and depth dimensions regularizes the model,\nallowing it to learn using far fewer free parameters, using only 28k training\nsamples. A comprehensive ablation study confirms the importance of a hybrid CNN\n+ Transformer architecture and the role of the feedback connections, which\niteratively refine the internal representation until a stable prediction is\nobtained. In the end, this study can lay the basis for a deeper understanding\nof the role of attention and recurrent connections for solving visual abstract\nreasoning tasks.", "comment": null, "links": []}
{"entry_id": "2006.00753", "title": "Structured Multimodal Attentions for TextVQA", "authors": ["Chenyu Gao", "Qi Zhu", "Peng Wang", "Hui Li", "Yuliang Liu", "Anton van den Hengel", "Qi Wu"], "published": "2020-06-01 07:07:36", "updated": "2021-11-26 03:00:58", "summary": "In this paper, we propose an end-to-end structured multimodal attention (SMA)\nneural network to mainly solve the first two issues above. SMA first uses a\nstructural graph representation to encode the object-object, object-text and\ntext-text relationships appearing in the image, and then designs a multimodal\ngraph attention network to reason over it. Finally, the outputs from the above\nmodules are processed by a global-local attentional answering module to produce\nan answer splicing together tokens from both OCR and general vocabulary\niteratively by following M4C. Our proposed model outperforms the SoTA models on\nTextVQA dataset and two tasks of ST-VQA dataset among all models except\npre-training based TAP. Demonstrating strong reasoning ability, it also won\nfirst place in TextVQA Challenge 2020. We extensively test different OCR\nmethods on several reasoning models and investigate the impact of gradually\nincreased OCR performance on TextVQA benchmark. With better OCR results,\ndifferent models share dramatic improvement over the VQA accuracy, but our\nmodel benefits most blessed by strong textual-visual reasoning ability. To\ngrant our method an upper bound and make a fair testing base available for\nfurther works, we also provide human-annotated ground-truth OCR annotations for\nthe TextVQA dataset, which were not given in the original release. The code and\nground-truth OCR annotations for the TextVQA dataset are available at\nhttps://github.com/ChenyuGAO-CS/SMA", "comment": "winner of TextVQA Challenge 2020, Accepted by IEEE Transactions on\n  Pattern Analysis and Machine Intelligence", "links": []}
{"entry_id": "2103.05222", "title": "Data augmentation by morphological mixup for solving Raven's Progressive Matrices", "authors": ["Wentao He", "Jianfeng Ren", "Ruibin Bai"], "published": "2021-03-09 04:50:32", "updated": "2021-11-19 07:37:38", "summary": "Raven's Progressive Matrices (RPMs) are frequently used in testing human's\nvisual reasoning ability. Recent advances of RPM-like datasets and solution\nmodels partially address the challenges of visually understanding the RPM\nquestions and logically reasoning the missing answers. In view of the poor\ngeneralization performance due to insufficient samples in RPM datasets, we\npropose an effective scheme, namely Candidate Answer Morphological Mixup\n(CAM-Mix). CAM-Mix serves as a data augmentation strategy by gray-scale image\nmorphological mixup, which regularizes various solution methods and overcomes\nthe model overfitting problem. By creating new negative candidate answers\nsemantically similar to the correct answers, a more accurate decision boundary\ncould be defined. By applying the proposed data augmentation method, a\nsignificant and consistent performance improvement is achieved on various\nRPM-like datasets compared with the state-of-the-art models.", "comment": "Under review", "links": []}
{"entry_id": "2009.03979", "title": "A Distance-preserving Matrix Sketch", "authors": ["Leland Wilkinson", "Hengrui Luo"], "published": "2020-09-08 20:15:14", "updated": "2021-11-19 06:39:11", "summary": "Visualizing very large matrices involves many formidable problems. Various\npopular solutions to these problems involve sampling, clustering, projection,\nor feature selection to reduce the size and complexity of the original task. An\nimportant aspect of these methods is how to preserve relative distances between\npoints in the higher-dimensional space after reducing rows and columns to fit\nin a lower dimensional space. This aspect is important because conclusions\nbased on faulty visual reasoning can be harmful. Judging dissimilar points as\nsimilar or similar points as dissimilar on the basis of a visualization can\nlead to false conclusions. To ameliorate this bias and to make visualizations\nof very large datasets feasible, we introduce two new algorithms that\nrespectively select a subset of rows and columns of a rectangular matrix. This\nselection is designed to preserve relative distances as closely as possible. We\ncompare our matrix sketch to more traditional alternatives on a variety of\nartificial and real datasets.", "comment": "38 pages, 13 figures", "links": ["http://dx.doi.org/10.1080/10618600.2022.2050246"]}
{"entry_id": "2106.13488", "title": "Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training", "authors": ["Hongwei Xue", "Yupan Huang", "Bei Liu", "Houwen Peng", "Jianlong Fu", "Houqiang Li", "Jiebo Luo"], "published": "2021-06-25 08:04:25", "updated": "2021-11-09 06:27:44", "summary": "Vision-Language Pre-training (VLP) aims to learn multi-modal representations\nfrom image-text pairs and serves for downstream vision-language tasks in a\nfine-tuning fashion. The dominant VLP models adopt a CNN-Transformer\narchitecture, which embeds images with a CNN, and then aligns images and text\nwith a Transformer. Visual relationship between visual contents plays an\nimportant role in image understanding and is the basic for inter-modal\nalignment learning. However, CNNs have limitations in visual relation learning\ndue to local receptive field's weakness in modeling long-range dependencies.\nThus the two objectives of learning visual relation and inter-modal alignment\nare encapsulated in the same Transformer network. Such design might restrict\nthe inter-modal alignment learning in the Transformer by ignoring the\nspecialized characteristic of each objective. To tackle this, we propose a\nfully Transformer visual embedding for VLP to better learn visual relation and\nfurther promote inter-modal alignment. Specifically, we propose a metric named\nInter-Modality Flow (IMF) to measure the interaction between vision and\nlanguage modalities (i.e., inter-modality). We also design a novel masking\noptimization mechanism named Masked Feature Regression (MFR) in Transformer to\nfurther promote the inter-modality learning. To the best of our knowledge, this\nis the first study to explore the benefit of Transformer for visual feature\nlearning in VLP. We verify our method on a wide range of vision-language tasks,\nincluding Image-Text Retrieval, Visual Question Answering (VQA), Visual\nEntailment and Visual Reasoning. Our approach not only outperforms the\nstate-of-the-art VLP performance, but also shows benefits on the IMF metric.", "comment": "Accepted by NeurIPS 2021", "links": []}
{"entry_id": "2102.01916", "title": "Answer Questions with Right Image Regions: A Visual Attention Regularization Approach", "authors": ["Yibing Liu", "Yangyang Guo", "Jianhua Yin", "Xuemeng Song", "Weifeng Liu", "Liqiang Nie"], "published": "2021-02-03 07:33:30", "updated": "2021-11-08 08:28:36", "summary": "Visual attention in Visual Question Answering (VQA) targets at locating the\nright image regions regarding the answer prediction, offering a powerful\ntechnique to promote multi-modal understanding. However, recent studies have\npointed out that the highlighted image regions from the visual attention are\noften irrelevant to the given question and answer, leading to model confusion\nfor correct visual reasoning. To tackle this problem, existing methods mostly\nresort to aligning the visual attention weights with human attentions.\nNevertheless, gathering such human data is laborious and expensive, making it\nburdensome to adapt well-developed models across datasets. To address this\nissue, in this paper, we devise a novel visual attention regularization\napproach, namely AttReg, for better visual grounding in VQA. Specifically,\nAttReg firstly identifies the image regions which are essential for question\nanswering yet unexpectedly ignored (i.e., assigned with low attention weights)\nby the backbone model. And then a mask-guided learning scheme is leveraged to\nregularize the visual attention to focus more on these ignored key regions. The\nproposed method is very flexible and model-agnostic, which can be integrated\ninto most visual attention-based VQA models and require no human attention\nsupervision. Extensive experiments over three benchmark datasets, i.e., VQA-CP\nv2, VQA-CP v1, and VQA v2, have been conducted to evaluate the effectiveness of\nAttReg. As a by-product, when incorporating AttReg into the strong baseline\nLMH, our approach can achieve a new state-of-the-art accuracy of 60.00% with an\nabsolute performance gain of 7.01% on the VQA-CP v2 benchmark dataset...", "comment": "ACM TOMM 2021", "links": []}
{"entry_id": "2110.15358", "title": "Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language", "authors": ["Mingyu Ding", "Zhenfang Chen", "Tao Du", "Ping Luo", "Joshua B. Tenenbaum", "Chuang Gan"], "published": "2021-10-28 17:59:13", "updated": "2021-10-28 17:59:13", "summary": "In this work, we propose a unified framework, called Visual Reasoning with\nDiffer-entiable Physics (VRDP), that can jointly learn visual concepts and\ninfer physics models of objects and their interactions from videos and\nlanguage. This is achieved by seamlessly integrating three components: a visual\nperception module, a concept learner, and a differentiable physics engine. The\nvisual perception module parses each video frame into object-centric\ntrajectories and represents them as latent scene representations. The concept\nlearner grounds visual concepts (e.g., color, shape, and material) from these\nobject-centric representations based on the language, thus providing prior\nknowledge for the physics engine. The differentiable physics model, implemented\nas an impulse-based differentiable rigid-body simulator, performs\ndifferentiable physical simulation based on the grounded concepts to infer\nphysical properties, such as mass, restitution, and velocity, by fitting the\nsimulated trajectories into the video observations. Consequently, these learned\nconcepts and physical models can explain what we have seen and imagine what is\nabout to happen in future and counterfactual scenarios. Integrating\ndifferentiable physics into the dynamic reasoning framework offers several\nappealing benefits. More accurate dynamics prediction in learned physics models\nenables state-of-the-art performance on both synthetic and real-world\nbenchmarks while still maintaining high transparency and interpretability; most\nnotably, VRDP improves the accuracy of predictive and counterfactual questions\nby 4.5% and 11.5% compared to its best counterpart. VRDP is also highly\ndata-efficient: physical parameters can be optimized from very few videos, and\neven a single video can be sufficient. Finally, with all physical parameters\ninferred, VRDP can quickly learn new concepts from a few examples.", "comment": "NeurIPS 2021. Project page: http://vrdp.csail.mit.edu/", "links": []}
{"entry_id": "2012.08508", "title": "Attention over learned object embeddings enables complex visual reasoning", "authors": ["David Ding", "Felix Hill", "Adam Santoro", "Malcolm Reynolds", "Matt Botvinick"], "published": "2020-12-15 18:57:40", "updated": "2021-10-26 15:55:56", "summary": "Neural networks have achieved success in a wide array of perceptual tasks but\noften fail at tasks involving both perception and higher-level reasoning. On\nthese more challenging tasks, bespoke approaches (such as modular symbolic\ncomponents, independent dynamics models or semantic parsers) targeted towards\nthat specific type of task have typically performed better. The downside to\nthese targeted approaches, however, is that they can be more brittle than\ngeneral-purpose neural networks, requiring significant modification or even\nredesign according to the particular task at hand. Here, we propose a more\ngeneral neural-network-based approach to dynamic visual reasoning problems that\nobtains state-of-the-art performance on three different domains, in each case\noutperforming bespoke modular approaches tailored specifically to the task. Our\nmethod relies on learned object-centric representations, self-attention and\nself-supervised dynamics learning, and all three elements together are required\nfor strong performance to emerge. The success of this combination suggests that\nthere may be no need to trade off flexibility for performance on problems\ninvolving spatio-temporal or causal-style reasoning. With the right soft biases\nand learning objectives in a neural network we may be able to attain the best\nof both worlds.", "comment": "22 pages, 5 figures", "links": []}
{"entry_id": "2110.11536", "title": "Neural-guided, Bidirectional Program Search for Abstraction and Reasoning", "authors": ["Simon Alford", "Anshula Gandhi", "Akshay Rangamani", "Andrzej Banburski", "Tony Wang", "Sylee Dandekar", "John Chin", "Tomaso Poggio", "Peter Chin"], "published": "2021-10-22 00:41:47", "updated": "2021-10-26 15:26:31", "summary": "One of the challenges facing artificial intelligence research today is\ndesigning systems capable of utilizing systematic reasoning to generalize to\nnew tasks. The Abstraction and Reasoning Corpus (ARC) measures such a\ncapability through a set of visual reasoning tasks. In this paper we report\nincremental progress on ARC and lay the foundations for two approaches to\nabstraction and reasoning not based in brute-force search. We first apply an\nexisting program synthesis system called DreamCoder to create symbolic\nabstractions out of tasks solved so far, and show how it enables solving of\nprogressively more challenging ARC tasks. Second, we design a reasoning\nalgorithm motivated by the way humans approach ARC. Our algorithm constructs a\nsearch graph and reasons over this graph structure to discover task solutions.\nMore specifically, we extend existing execution-guided program synthesis\napproaches with deductive reasoning based on function inverse semantics to\nenable a neural-guided bidirectional search algorithm. We demonstrate the\neffectiveness of the algorithm on three domains: ARC, 24-Game tasks, and a\n'double-and-add' arithmetic puzzle.", "comment": "Published as a conference paper at Complex Networks 2021", "links": []}
{"entry_id": "2106.02636", "title": "MERLOT: Multimodal Neural Script Knowledge Models", "authors": ["Rowan Zellers", "Ximing Lu", "Jack Hessel", "Youngjae Yu", "Jae Sung Park", "Jize Cao", "Ali Farhadi", "Yejin Choi"], "published": "2021-06-04 17:57:39", "updated": "2021-10-21 23:24:26", "summary": "As humans, we understand events in the visual world contextually, performing\nmultimodal reasoning across time to make inferences about the past, present,\nand future. We introduce MERLOT, a model that learns multimodal script\nknowledge by watching millions of YouTube videos with transcribed speech -- in\nan entirely label-free, self-supervised manner. By pretraining with a mix of\nboth frame-level (spatial) and video-level (temporal) objectives, our model not\nonly learns to match images to temporally corresponding words, but also to\ncontextualize what is happening globally over time. As a result, MERLOT\nexhibits strong out-of-the-box representations of temporal commonsense, and\nachieves state-of-the-art performance on 12 different video QA datasets when\nfinetuned. It also transfers well to the world of static images, allowing\nmodels to reason about the dynamic context behind visual scenes. On Visual\nCommonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy,\noutperforming state-of-the-art models of similar size by over 3%, even those\nthat make heavy use of auxiliary supervised data (like object bounding boxes).\n  Ablation analyses demonstrate the complementary importance of: 1) training on\nvideos versus static images; 2) scaling the magnitude and diversity of the\npretraining video corpus; and 3) using diverse objectives that encourage\nfull-stack multimodal reasoning, from the recognition to cognition level.", "comment": "project page at https://rowanzellers.com/merlot; NeurIPS 2021 camera\n  ready", "links": []}
{"entry_id": "2110.00804", "title": "ProTo: Program-Guided Transformer for Program-Guided Tasks", "authors": ["Zelin Zhao", "Karan Samel", "Binghong Chen", "Le Song"], "published": "2021-10-02 13:46:32", "updated": "2021-10-16 02:14:06", "summary": "Programs, consisting of semantic and structural information, play an\nimportant role in the communication between humans and agents. Towards learning\ngeneral program executors to unify perception, reasoning, and decision making,\nwe formulate program-guided tasks which require learning to execute a given\nprogram on the observed task specification. Furthermore, we propose the\nProgram-guided Transformer (ProTo), which integrates both semantic and\nstructural guidance of a program by leveraging cross-attention and masked\nself-attention to pass messages between the specification and routines in the\nprogram. ProTo executes a program in a learned latent space and enjoys stronger\nrepresentation ability than previous neural-symbolic approaches. We demonstrate\nthat ProTo significantly outperforms the previous state-of-the-art methods on\nGQA visual reasoning and 2D Minecraft policy learning datasets. Additionally,\nProTo demonstrates better generalization to unseen, complex, and human-written\nprograms.", "comment": "Accepted in NeurIPS 2021", "links": []}
{"entry_id": "2110.06399", "title": "Dynamic Inference with Neural Interpreters", "authors": ["Nasim Rahaman", "Muhammad Waleed Gondal", "Shruti Joshi", "Peter Gehler", "Yoshua Bengio", "Francesco Locatello", "Bernhard Schölkopf"], "published": "2021-10-12 23:22:45", "updated": "2021-10-12 23:22:45", "summary": "Modern neural network architectures can leverage large amounts of data to\ngeneralize well within the training distribution. However, they are less\ncapable of systematic generalization to data drawn from unseen but related\ndistributions, a feat that is hypothesized to require compositional reasoning\nand reuse of knowledge. In this work, we present Neural Interpreters, an\narchitecture that factorizes inference in a self-attention network as a system\nof modules, which we call \\emph{functions}. Inputs to the model are routed\nthrough a sequence of functions in a way that is end-to-end learned. The\nproposed architecture can flexibly compose computation along width and depth,\nand lends itself well to capacity extension after training. To demonstrate the\nversatility of Neural Interpreters, we evaluate it in two distinct settings:\nimage classification and visual abstract reasoning on Raven Progressive\nMatrices. In the former, we show that Neural Interpreters perform on par with\nthe vision transformer using fewer parameters, while being transferrable to a\nnew task in a sample efficient manner. In the latter, we find that Neural\nInterpreters are competitive with respect to the state-of-the-art in terms of\nsystematic generalization", "comment": "NeurIPS 2021", "links": []}
{"entry_id": "2012.04932", "title": "Semantically Robust Unpaired Image Translation for Data with Unmatched Semantics Statistics", "authors": ["Zhiwei Jia", "Bodi Yuan", "Kangkang Wang", "Hong Wu", "David Clifford", "Zhiqiang Yuan", "Hao Su"], "published": "2020-12-09 09:28:53", "updated": "2021-10-06 05:27:10", "summary": "Many applications of unpaired image-to-image translation require the input\ncontents to be preserved semantically during translations. Unaware of the\ninherently unmatched semantics distributions between source and target domains,\nexisting distribution matching methods (i.e., GAN-based) can give undesired\nsolutions. In particular, although producing visually reasonable outputs, the\nlearned models usually flip the semantics of the inputs. To tackle this without\nusing extra supervision, we propose to enforce the translated outputs to be\nsemantically invariant w.r.t. small perceptual variations of the inputs, a\nproperty we call \"semantic robustness\". By optimizing a robustness loss w.r.t.\nmulti-scale feature space perturbations of the inputs, our method effectively\nreduces semantics flipping and produces translations that outperform existing\nmethods both quantitatively and qualitatively.", "comment": "Accepted to ICCV 2021", "links": []}
{"entry_id": "2109.06860", "title": "Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning", "authors": ["Da Yin", "Liunian Harold Li", "Ziniu Hu", "Nanyun Peng", "Kai-Wei Chang"], "published": "2021-09-14 17:52:55", "updated": "2021-09-14 17:52:55", "summary": "Commonsense is defined as the knowledge that is shared by everyone. However,\ncertain types of commonsense knowledge are correlated with culture and\ngeographic locations and they are only shared locally. For example, the\nscenarios of wedding ceremonies vary across regions due to different customs\ninfluenced by historical and religious factors. Such regional characteristics,\nhowever, are generally omitted in prior work. In this paper, we construct a\nGeo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test\nvision-and-language models' ability to understand cultural and\ngeo-location-specific commonsense. In particular, we study two state-of-the-art\nVision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard\nmultimodal commonsense benchmark with images primarily from Western regions. We\nthen evaluate how well the trained models can generalize to answering the\nquestions in GD-VCR. We find that the performance of both models for\nnon-Western regions including East Asia, South Asia, and Africa is\nsignificantly lower than that for Western region. We analyze the reasons behind\nthe performance disparity and find that the performance gap is larger on QA\npairs that: 1) are concerned with culture-related scenarios, e.g., weddings,\nreligious activities, and festivals; 2) require high-level geo-diverse\ncommonsense reasoning rather than low-order perception and recognition. Dataset\nand code are released at https://github.com/WadeYin9712/GD-VCR.", "comment": "EMNLP 2021. Code and data are available at\n  https://github.com/WadeYin9712/GD-VCR", "links": []}
{"entry_id": "2109.01934", "title": "Weakly Supervised Relative Spatial Reasoning for Visual Question Answering", "authors": ["Pratyay Banerjee", "Tejas Gokhale", "Yezhou Yang", "Chitta Baral"], "published": "2021-09-04 21:29:06", "updated": "2021-09-04 21:29:06", "summary": "Vision-and-language (V\\&L) reasoning necessitates perception of visual\nconcepts such as objects and actions, understanding semantics and language\ngrounding, and reasoning about the interplay between the two modalities. One\ncrucial aspect of visual reasoning is spatial understanding, which involves\nunderstanding relative locations of objects, i.e.\\ implicitly learning the\ngeometry of the scene. In this work, we evaluate the faithfulness of V\\&L\nmodels to such geometric understanding, by formulating the prediction of\npair-wise relative locations of objects as a classification as well as a\nregression task. Our findings suggest that state-of-the-art transformer-based\nV\\&L models lack sufficient abilities to excel at this task. Motivated by this,\nwe design two objectives as proxies for 3D spatial reasoning (SR) -- object\ncentroid estimation, and relative position estimation, and train V\\&L with weak\nsupervision from off-the-shelf depth estimators. This leads to considerable\nimprovements in accuracy for the \"GQA\" visual question answering challenge (in\nfully supervised, few-shot, and O.O.D settings) as well as improvements in\nrelative spatial reasoning. Code and data will be released\n\\href{https://github.com/pratyay-banerjee/weak_sup_vqa}{here}.", "comment": "Accepted to ICCV 2021. PaperId : ICCV2021-10857 Copyright transferred\n  to IEEE ICCV. DOI will be updated later", "links": []}
{"entry_id": "2106.08503", "title": "Understanding and Evaluating Racial Biases in Image Captioning", "authors": ["Dora Zhao", "Angelina Wang", "Olga Russakovsky"], "published": "2021-06-16 01:07:24", "updated": "2021-08-30 15:07:38", "summary": "Image captioning is an important task for benchmarking visual reasoning and\nfor enabling accessibility for people with vision impairments. However, as in\nmany machine learning settings, social biases can influence image captioning in\nundesirable ways. In this work, we study bias propagation pathways within image\ncaptioning, focusing specifically on the COCO dataset. Prior work has analyzed\ngender bias in captions using automatically-derived gender labels; here we\nexamine racial and intersectional biases using manual annotations. Our first\ncontribution is in annotating the perceived gender and skin color of 28,315 of\nthe depicted people after obtaining IRB approval. Using these annotations, we\ncompare racial biases present in both manual and automatically-generated image\ncaptions. We demonstrate differences in caption performance, sentiment, and\nword choice between images of lighter versus darker-skinned people. Further, we\nfind the magnitude of these differences to be greater in modern captioning\nsystems compared to older ones, thus leading to concerns that without proper\nconsideration and mitigation these differences will only become increasingly\nprevalent. Code and data is available at\nhttps://princetonvisualai.github.io/imagecaptioning-bias .", "comment": "ICCV 2021", "links": []}
{"entry_id": "2011.11603", "title": "Interpretable Visual Reasoning via Induced Symbolic Space", "authors": ["Zhonghao Wang", "Kai Wang", "Mo Yu", "Jinjun Xiong", "Wen-mei Hwu", "Mark Hasegawa-Johnson", "Humphrey Shi"], "published": "2020-11-23 18:21:49", "updated": "2021-08-24 13:55:14", "summary": "We study the problem of concept induction in visual reasoning, i.e.,\nidentifying concepts and their hierarchical relationships from question-answer\npairs associated with images; and achieve an interpretable model via working on\nthe induced symbolic concept space. To this end, we first design a new\nframework named object-centric compositional attention model (OCCAM) to perform\nthe visual reasoning task with object-level visual features. Then, we come up\nwith a method to induce concepts of objects and relations using clues from the\nattention patterns between objects' visual features and question words.\nFinally, we achieve a higher level of interpretability by imposing OCCAM on the\nobjects represented in the induced symbolic concept space. Our model design\nmakes this an easy adaption via first predicting the concepts of objects and\nrelations and then projecting the predicted concepts back to the visual feature\nspace so the compositional reasoning module can process normally. Experiments\non the CLEVR and GQA datasets demonstrate: 1) our OCCAM achieves a new state of\nthe art without human-annotated functional programs; 2) our induced concepts\nare both accurate and sufficient as OCCAM achieves an on-par performance on\nobjects represented either in visual features or in the induced symbolic\nconcept space.", "comment": "ICCV 2021", "links": []}
{"entry_id": "2108.08217", "title": "X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics", "authors": ["Yehao Li", "Yingwei Pan", "Jingwen Chen", "Ting Yao", "Tao Mei"], "published": "2021-08-18 16:05:30", "updated": "2021-08-18 16:05:30", "summary": "With the rise and development of deep learning over the past decade, there\nhas been a steady momentum of innovation and breakthroughs that convincingly\npush the state-of-the-art of cross-modal analytics between vision and language\nin multimedia field. Nevertheless, there has not been an open-source codebase\nin support of training and deploying numerous neural network models for\ncross-modal analytics in a unified and modular fashion. In this work, we\npropose X-modaler -- a versatile and high-performance codebase that\nencapsulates the state-of-the-art cross-modal analytics into several\ngeneral-purpose stages (e.g., pre-processing, encoder, cross-modal interaction,\ndecoder, and decode strategy). Each stage is empowered with the functionality\nthat covers a series of modules widely adopted in state-of-the-arts and allows\nseamless switching in between. This way naturally enables a flexible\nimplementation of state-of-the-art algorithms for image captioning, video\ncaptioning, and vision-language pre-training, aiming to facilitate the rapid\ndevelopment of research community. Meanwhile, since the effective modular\ndesigns in several stages (e.g., cross-modal interaction) are shared across\ndifferent vision-language tasks, X-modaler can be simply extended to power\nstartup prototypes for other tasks in cross-modal analytics, including visual\nquestion answering, visual commonsense reasoning, and cross-modal retrieval.\nX-modaler is an Apache-licensed codebase, and its source codes, sample projects\nand pre-trained models are available on-line:\nhttps://github.com/YehLi/xmodaler.", "comment": "Accepted by 2021 ACMMM Open Source Software Competition. Source code:\n  https://github.com/YehLi/xmodaler", "links": []}
{"entry_id": "2108.04024", "title": "Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models", "authors": ["Zheyuan Liu", "Cristian Rodriguez-Opazo", "Damien Teney", "Stephen Gould"], "published": "2021-08-09 13:25:06", "updated": "2021-08-09 13:25:06", "summary": "We extend the task of composed image retrieval, where an input query consists\nof an image and short textual description of how to modify the image. Existing\nmethods have only been applied to non-complex images within narrow domains,\nsuch as fashion products, thereby limiting the scope of study on in-depth\nvisual reasoning in rich image and language contexts. To address this issue, we\ncollect the Compose Image Retrieval on Real-life images (CIRR) dataset, which\nconsists of over 36,000 pairs of crowd-sourced, open-domain images with\nhuman-generated modifying text. To extend current methods to the open-domain,\nwe propose CIRPLANT, a transformer based model that leverages rich pre-trained\nvision-and-language (V&L) knowledge for modifying visual features conditioned\non natural language. Retrieval is then done by nearest neighbor lookup on the\nmodified features. We demonstrate that with a relatively simple architecture,\nCIRPLANT outperforms existing methods on open-domain images, while matching\nstate-of-the-art accuracy on the existing narrow datasets, such as fashion.\nTogether with the release of CIRR, we believe this work will inspire further\nresearch on composed image retrieval.", "comment": "ICCV 2021. Dataset, code, and pre-trained models are released at\n  https://cuberick-orion.github.io/CIRR/", "links": []}
{"entry_id": "2107.02334", "title": "Effect of uncertainty visualizations on myopic loss aversion and equity premium puzzle in retirement investment decisions", "authors": ["Ryan Wesslen", "Alireza Karduni", "Douglas Markant", "Wenwen Dou"], "published": "2021-07-06 01:06:27", "updated": "2021-07-27 13:37:46", "summary": "For many households, investing for retirement is one of the most significant\ndecisions and is fraught with uncertainty. In a classic study in behavioral\neconomics, Benartzi and Thaler (1999) found evidence using bar charts that\ninvestors exhibit myopic loss aversion in retirement decisions: Investors\noverly focus on the potential for short-term losses, leading them to invest\nless in riskier assets and miss out on higher long-term returns. Recently,\nadvances in uncertainty visualizations have shown improvements in\ndecision-making under uncertainty in a variety of tasks. In this paper, we\nconduct a controlled and incentivized crowdsourced experiment replicating\nBenartzi and Thaler (1999) and extending it to measure the effect of different\nuncertainty representations on myopic loss aversion. Consistent with the\noriginal study, we find evidence of myopic loss aversion with bar charts and\nfind that participants make better investment decisions with longer evaluation\nperiods. We also find that common uncertainty representations such as interval\nplots and bar charts achieve the highest mean expected returns while other\nuncertainty visualizations lead to poorer long-term performance and strong\neffects on the equity premium. Qualitative feedback further suggests that\ndifferent uncertainty representations lead to visual reasoning heuristics that\ncan either mitigate or encourage a focus on potential short-term losses. We\ndiscuss implications of our results on using uncertainty visualizations for\nretirement decisions in practice and possible extensions for future work.", "comment": "To be published in TVCG Special Issue on the 2021 IEEE Visualization\n  Conference (VIS)", "links": []}
{"entry_id": "2106.03089", "title": "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", "authors": ["Muchen Li", "Leonid Sigal"], "published": "2021-06-06 10:53:39", "updated": "2021-07-14 12:22:08", "summary": "As an important step towards visual reasoning, visual grounding (e.g., phrase\nlocalization, referring expression comprehension/segmentation) has been widely\nexplored Previous approaches to referring expression comprehension (REC) or\nsegmentation (RES) either suffer from limited performance, due to a two-stage\nsetup, or require the designing of complex task-specific one-stage\narchitectures. In this paper, we propose a simple one-stage multi-task\nframework for visual grounding tasks. Specifically, we leverage a transformer\narchitecture, where two modalities are fused in a visual-lingual encoder. In\nthe decoder, the model learns to generate contextualized lingual queries which\nare then decoded and used to directly regress the bounding box and produce a\nsegmentation mask for the corresponding referred regions. With this simple but\nhighly contextualized model, we outperform state-of-the-arts methods by a large\nmargin on both REC and RES tasks. We also show that a simple pre-training\nschedule (on an external dataset) further improves the performance. Extensive\nexperiments and ablations illustrate that our model benefits greatly from\ncontextualized information and multi-task training.", "comment": null, "links": []}
{"entry_id": "2107.05833", "title": "Enforcing Consistency in Weakly Supervised Semantic Parsing", "authors": ["Nitish Gupta", "Sameer Singh", "Matt Gardner"], "published": "2021-07-13 03:48:04", "updated": "2021-07-13 03:48:04", "summary": "The predominant challenge in weakly supervised semantic parsing is that of\nspurious programs that evaluate to correct answers for the wrong reasons. Prior\nwork uses elaborate search strategies to mitigate the prevalence of spurious\nprograms; however, they typically consider only one input at a time. In this\nwork we explore the use of consistency between the output programs for related\ninputs to reduce the impact of spurious programs. We bias the program search\n(and thus the model's training signal) towards programs that map the same\nphrase in related inputs to the same sub-parts in their respective programs.\nAdditionally, we study the importance of designing logical formalisms that\nfacilitate this kind of consAistency-based training. We find that a more\nconsistent formalism leads to improved model performance even without\nconsistency-based training. When combined together, these two insights lead to\na 10% absolute improvement over the best prior result on the Natural Language\nVisual Reasoning dataset.", "comment": "Published in ACL 2021", "links": []}
{"entry_id": "2106.11072", "title": "Techniques for Symbol Grounding with SATNet", "authors": ["Sever Topan", "David Rolnick", "Xujie Si"], "published": "2021-06-16 18:42:12", "updated": "2021-06-16 18:42:12", "summary": "Many experts argue that the future of artificial intelligence is limited by\nthe field's ability to integrate symbolic logical reasoning into deep learning\narchitectures. The recently proposed differentiable MAXSAT solver, SATNet, was\na breakthrough in its capacity to integrate with a traditional neural network\nand solve visual reasoning problems. For instance, it can learn the rules of\nSudoku purely from image examples. Despite its success, SATNet was shown to\nsuccumb to a key challenge in neurosymbolic systems known as the Symbol\nGrounding Problem: the inability to map visual inputs to symbolic variables\nwithout explicit supervision (\"label leakage\"). In this work, we present a\nself-supervised pre-training pipeline that enables SATNet to overcome this\nlimitation, thus broadening the class of problems that SATNet architectures can\nsolve to include datasets where no intermediary labels are available at all. We\ndemonstrate that our method allows SATNet to attain full accuracy even with a\nharder problem setup that prevents any label leakage. We additionally introduce\na proofreading method that further improves the performance of SATNet\narchitectures, beating the state-of-the-art on Visual Sudoku.", "comment": "Code available at https://github.com/SeverTopan/SATNet", "links": []}
{"entry_id": "2102.02779", "title": "Unifying Vision-and-Language Tasks via Text Generation", "authors": ["Jaemin Cho", "Jie Lei", "Hao Tan", "Mohit Bansal"], "published": "2021-02-04 17:59:30", "updated": "2021-05-23 23:12:46", "summary": "Existing methods for vision-and-language learning typically require designing\ntask-specific architectures and objectives for each task. For example, a\nmulti-label answer classifier for visual question answering, a region scorer\nfor referring expression comprehension, and a language decoder for image\ncaptioning, etc. To alleviate these hassles, in this work, we propose a unified\nframework that learns different tasks in a single architecture with the same\nlanguage modeling objective, i.e., multimodal conditional text generation,\nwhere our models learn to generate labels in text based on the visual and\ntextual inputs. On 7 popular vision-and-language benchmarks, including visual\nquestion answering, referring expression comprehension, visual commonsense\nreasoning, most of which have been previously modeled as discriminative tasks,\nour generative approach (with a single unified architecture) reaches comparable\nperformance to recent task-specific state-of-the-art vision-and-language\nmodels. Moreover, our generative approach shows better generalization ability\non questions that have rare answers. Also, we show that our framework allows\nmulti-task learning in a single architecture with a single set of parameters,\nachieving similar performance to separately optimized single-task models. Our\ncode is publicly available at: https://github.com/j-min/VL-T5", "comment": "ICML 2021 (15 pages, 4 figures, 14 tables)", "links": []}
{"entry_id": "2105.02061", "title": "Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention", "authors": ["Wei Suo", "Mengyang Sun", "Peng Wang", "Qi Wu"], "published": "2021-05-05 13:53:53", "updated": "2021-05-05 13:53:53", "summary": "Referring Expression Comprehension (REC) has become one of the most important\ntasks in visual reasoning, since it is an essential step for many\nvision-and-language tasks such as visual question answering. However, it has\nnot been widely used in many downstream tasks because it suffers 1) two-stage\nmethods exist heavy computation cost and inevitable error accumulation, and 2)\none-stage methods have to depend on lots of hyper-parameters (such as anchors)\nto generate bounding box. In this paper, we present a proposal-free one-stage\n(PFOS) model that is able to regress the region-of-interest from the image,\nbased on a textual query, in an end-to-end manner. Instead of using the\ndominant anchor proposal fashion, we directly take the dense-grid of an image\nas input for a cross-attention transformer that learns grid-word\ncorrespondences. The final bounding box is predicted directly from the image\nwithout the time-consuming anchor selection process that previous methods\nsuffer. Our model achieves the state-of-the-art performance on four referring\nexpression datasets with higher efficiency, comparing to previous best\none-stage and two-stage methods.", "comment": "To be published in the 30th International Joint Conference on\n  Artificial Intelligence (IJCAI-2021)", "links": []}
{"entry_id": "2104.14741", "title": "Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads", "authors": ["Chenyu Gao", "Qi Zhu", "Peng Wang", "Qi Wu"], "published": "2021-04-30 03:32:02", "updated": "2021-04-30 03:32:02", "summary": "Vision-and-Language (VL) pre-training has shown great potential on many\nrelated downstream tasks, such as Visual Question Answering (VQA), one of the\nmost popular problems in the VL field. All of these pre-trained models (such as\nVisualBERT, ViLBERT, LXMERT and UNITER) are built with Transformer, which\nextends the classical attention mechanism to multiple layers and heads. To\ninvestigate why and how these models work on VQA so well, in this paper we\nexplore the roles of individual heads and layers in Transformer models when\nhandling $12$ different types of questions. Specifically, we manually remove\n(chop) heads (or layers) from a pre-trained VisualBERT model at a time, and\ntest it on different levels of questions to record its performance. As shown in\nthe interesting echelon shape of the result matrices, experiments reveal\ndifferent heads and layers are responsible for different question types, with\nhigher-level layers activated by higher-level visual reasoning questions. Based\non this observation, we design a dynamic chopping module that can automatically\nremove heads and layers of the VisualBERT at an instance level when dealing\nwith different questions. Our dynamic chopping module can effectively reduce\nthe parameters of the original model by 50%, while only damaging the accuracy\nby less than 1% on the VQA task.", "comment": "14 pages", "links": []}
{"entry_id": "2104.14102", "title": "Comparing Visual Reasoning in Humans and AI", "authors": ["Shravan Murlidaran", "William Yang Wang", "Miguel P. Eckstein"], "published": "2021-04-29 04:44:13", "updated": "2021-04-29 04:44:13", "summary": "Recent advances in natural language processing and computer vision have led\nto AI models that interpret simple scenes at human levels. Yet, we do not have\na complete understanding of how humans and AI models differ in their\ninterpretation of more complex scenes. We created a dataset of complex scenes\nthat contained human behaviors and social interactions. AI and humans had to\ndescribe the scenes with a sentence. We used a quantitative metric of\nsimilarity between scene descriptions of the AI/human and ground truth of five\nother human descriptions of each scene. Results show that the machine/human\nagreement scene descriptions are much lower than human/human agreement for our\ncomplex scenes. Using an experimental manipulation that occludes different\nspatial regions of the scenes, we assessed how machines and humans vary in\nutilizing regions of images to understand the scenes. Together, our results are\na first step toward understanding how machines fall short of human visual\nreasoning with complex scenes depicting human behaviors.", "comment": null, "links": []}
{"entry_id": "2004.09406", "title": "Five Points to Check when Comparing Visual Perception in Humans and Machines", "authors": ["Christina M. Funke", "Judy Borowski", "Karolina Stosio", "Wieland Brendel", "Thomas S. A. Wallis", "Matthias Bethge"], "published": "2020-04-20 16:05:36", "updated": "2021-04-13 16:03:20", "summary": "With the rise of machines to human-level performance in complex recognition\ntasks, a growing amount of work is directed towards comparing information\nprocessing in humans and machines. These studies are an exciting chance to\nlearn about one system by studying the other. Here, we propose ideas on how to\ndesign, conduct and interpret experiments such that they adequately support the\ninvestigation of mechanisms when comparing human and machine perception. We\ndemonstrate and apply these ideas through three case studies. The first case\nstudy shows how human bias can affect how we interpret results, and that\nseveral analytic tools can help to overcome this human reference point. In the\nsecond case study, we highlight the difference between necessary and sufficient\nmechanisms in visual reasoning tasks. Thereby, we show that contrary to\nprevious suggestions, feedback mechanisms might not be necessary for the tasks\nin question. The third case study highlights the importance of aligning\nexperimental conditions. We find that a previously-observed difference in\nobject recognition does not hold when adapting the experiment to make\nconditions more equitable between humans and machines. In presenting a\nchecklist for comparative studies of visual reasoning in humans and machines,\nwe hope to highlight how to overcome potential pitfalls in design or inference.", "comment": "V3: minor changes like in published JOV version\n  (https://doi.org/10.1167/jov.21.3.16) V2: New title; added general section\n  (checklist); manuscript restructured such that each case study is one\n  chapter; adversarial examples in first study replaced by different analysis", "links": ["http://dx.doi.org/10.1167/jov.21.3.16"]}
{"entry_id": "2011.13160", "title": "Transformation Driven Visual Reasoning", "authors": ["Xin Hong", "Yanyan Lan", "Liang Pang", "Jiafeng Guo", "Xueqi Cheng"], "published": "2020-11-26 07:11:31", "updated": "2021-04-02 06:25:46", "summary": "This paper defines a new visual reasoning paradigm by introducing an\nimportant factor, i.e.~transformation. The motivation comes from the fact that\nmost existing visual reasoning tasks, such as CLEVR in VQA, are solely defined\nto test how well the machine understands the concepts and relations within\nstatic settings, like one image. We argue that this kind of \\textbf{state\ndriven visual reasoning} approach has limitations in reflecting whether the\nmachine has the ability to infer the dynamics between different states, which\nhas been shown as important as state-level reasoning for human cognition in\nPiaget's theory. To tackle this problem, we propose a novel\n\\textbf{transformation driven visual reasoning} task. Given both the initial\nand final states, the target is to infer the corresponding single-step or\nmulti-step transformation, represented as a triplet (object, attribute, value)\nor a sequence of triplets, respectively. Following this definition, a new\ndataset namely TRANCE is constructed on the basis of CLEVR, including three\nlevels of settings, i.e.~Basic (single-step transformation), Event (multi-step\ntransformation), and View (multi-step transformation with variant views).\nExperimental results show that the state-of-the-art visual reasoning models\nperform well on Basic, but are still far from human-level intelligence on Event\nand View. We believe the proposed new paradigm will boost the development of\nmachine visual reasoning. More advanced methods and real data need to be\ninvestigated in this direction. The resource of TVR is available at\nhttps://hongxin2019.github.io/TVR.", "comment": "Accepted to CVPR 2021. Resources including the TRANCE dataset and the\n  code can be found at our homepage https://hongxin2019.github.io/TVR", "links": []}
{"entry_id": "2103.16564", "title": "Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning", "authors": ["Zhenfang Chen", "Jiayuan Mao", "Jiajun Wu", "Kwan-Yee Kenneth Wong", "Joshua B. Tenenbaum", "Chuang Gan"], "published": "2021-03-30 17:59:48", "updated": "2021-03-30 17:59:48", "summary": "We study the problem of dynamic visual reasoning on raw videos. This is a\nchallenging problem; currently, state-of-the-art models often require dense\nsupervision on physical object properties and events from simulation, which are\nimpractical to obtain in real life. In this paper, we present the Dynamic\nConcept Learner (DCL), a unified framework that grounds physical objects and\nevents from video and language. DCL first adopts a trajectory extractor to\ntrack each object over time and to represent it as a latent, object-centric\nfeature vector. Building upon this object-centric representation, DCL learns to\napproximate the dynamic interaction among objects using graph networks. DCL\nfurther incorporates a semantic parser to parse questions into semantic\nprograms and, finally, a program executor to run the program to answer the\nquestion, levering the learned dynamics model. After training, DCL can detect\nand associate objects across the frames, ground visual properties, and physical\nevents, understand the causal relationship between events, make future and\ncounterfactual predictions, and leverage these extracted presentations for\nanswering queries. DCL achieves state-of-the-art performance on CLEVRER, a\nchallenging causal video reasoning dataset, even without using ground-truth\nattributes and collision labels from simulations for training. We further test\nDCL on a newly proposed video-retrieval and event localization dataset derived\nfrom CLEVRER, showing its strong generalization capacity.", "comment": "ICLR 2021. Project page: http://dcl.csail.mit.edu/", "links": []}
{"entry_id": "2103.16002", "title": "AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning", "authors": ["Madeleine Grunde-McLaughlin", "Ranjay Krishna", "Maneesh Agrawala"], "published": "2021-03-30 00:24:01", "updated": "2021-03-30 00:24:01", "summary": "Visual events are a composition of temporal actions involving actors\nspatially interacting with objects. When developing computer vision models that\ncan reason about compositional spatio-temporal events, we need benchmarks that\ncan analyze progress and uncover shortcomings. Existing video question\nanswering benchmarks are useful, but they often conflate multiple sources of\nerror into one accuracy metric and have strong biases that models can exploit,\nmaking it difficult to pinpoint model weaknesses. We present Action Genome\nQuestion Answering (AGQA), a new benchmark for compositional spatio-temporal\nreasoning. AGQA contains $192M$ unbalanced question answer pairs for $9.6K$\nvideos. We also provide a balanced subset of $3.9M$ question answer pairs, $3$\norders of magnitude larger than existing benchmarks, that minimizes bias by\nbalancing the answer distributions and types of question structures. Although\nhuman evaluators marked $86.02\\%$ of our question-answer pairs as correct, the\nbest model achieves only $47.74\\%$ accuracy. In addition, AGQA introduces\nmultiple training/test splits to test for various reasoning abilities,\nincluding generalization to novel compositions, to indirect references, and to\nmore compositional steps. Using AGQA, we evaluate modern visual reasoning\nsystems, demonstrating that the best models barely perform better than\nnon-visual baselines exploiting linguistic biases and that none of the existing\nmodels generalize to novel compositions unseen during training.", "comment": "8 pages, 15 pages supplementary, 12 figures. To be published in CVPR\n  2021", "links": []}
{"entry_id": "2103.14232", "title": "ACRE: Abstract Causal REasoning Beyond Covariation", "authors": ["Chi Zhang", "Baoxiong Jia", "Mark Edmonds", "Song-Chun Zhu", "Yixin Zhu"], "published": "2021-03-26 02:42:38", "updated": "2021-03-26 02:42:38", "summary": "Causal induction, i.e., identifying unobservable mechanisms that lead to the\nobservable relations among variables, has played a pivotal role in modern\nscientific discovery, especially in scenarios with only sparse and limited\ndata. Humans, even young toddlers, can induce causal relationships surprisingly\nwell in various settings despite its notorious difficulty. However, in contrast\nto the commonplace trait of human cognition is the lack of a diagnostic\nbenchmark to measure causal induction for modern Artificial Intelligence (AI)\nsystems. Therefore, in this work, we introduce the Abstract Causal REasoning\n(ACRE) dataset for systematic evaluation of current vision systems in causal\ninduction. Motivated by the stream of research on causal discovery in Blicket\nexperiments, we query a visual reasoning system with the following four types\nof questions in either an independent scenario or an interventional scenario:\ndirect, indirect, screening-off, and backward-blocking, intentionally going\nbeyond the simple strategy of inducing causal relationships by covariation. By\nanalyzing visual reasoning architectures on this testbed, we notice that pure\nneural models tend towards an associative strategy under their chance-level\nperformance, whereas neuro-symbolic combinations struggle in backward-blocking\nreasoning. These deficiencies call for future research in models with a more\ncomprehensive capability of causal induction.", "comment": "CVPR 2021 paper. Supplementary:\n  http://wellyzhang.github.io/attach/cvpr21zhang_acre_supp.pdf Project:\n  http://wellyzhang.github.io/project/acre.html", "links": []}
{"entry_id": "2004.14603", "title": "Dynamic Language Binding in Relational Visual Reasoning", "authors": ["Thao Minh Le", "Vuong Le", "Svetha Venkatesh", "Truyen Tran"], "published": "2020-04-30 06:26:20", "updated": "2021-02-18 03:35:24", "summary": "We present Language-binding Object Graph Network, the first neural reasoning\nmethod with dynamic relational structures across both visual and textual\ndomains with applications in visual question answering. Relaxing the common\nassumption made by current models that the object predicates pre-exist and stay\nstatic, passive to the reasoning process, we propose that these dynamic\npredicates expand across the domain borders to include pair-wise\nvisual-linguistic object binding. In our method, these contextualized object\nlinks are actively found within each recurrent reasoning step without relying\non external predicative priors. These dynamic structures reflect the\nconditional dual-domain object dependency given the evolving context of the\nreasoning through co-attention. Such discovered dynamic graphs facilitate\nmulti-step knowledge combination and refinements that iteratively deduce the\ncompact representation of the final answer. The effectiveness of this model is\ndemonstrated on image question answering demonstrating favorable performance on\nmajor VQA datasets. Our method outperforms other methods in sophisticated\nquestion-answering tasks wherein multiple object relations are involved. The\ngraph structure effectively assists the progress of training, and therefore the\nnetwork learns efficiently compared to other reasoning models.", "comment": "Early version accepted by IJCAI20, Code available at\n  https://github.com/thaolmk54/LOGNet-VQA", "links": []}
{"entry_id": "2101.06013", "title": "Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge", "authors": ["Violetta Shevchenko", "Damien Teney", "Anthony Dick", "Anton van den Hengel"], "published": "2021-01-15 08:37:55", "updated": "2021-01-15 08:37:55", "summary": "The limits of applicability of vision-and-language models are defined by the\ncoverage of their training data. Tasks like vision question answering (VQA)\noften require commonsense and factual information beyond what can be learned\nfrom task-specific datasets. This paper investigates the injection of knowledge\nfrom general-purpose knowledge bases (KBs) into vision-and-language\ntransformers. We use an auxiliary training objective that encourages the\nlearned representations to align with graph embeddings of matching entities in\na KB. We empirically study the relevance of various KBs to multiple tasks and\nbenchmarks. The technique brings clear benefits to knowledge-demanding question\nanswering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge\nabsent from existing models. More surprisingly, the technique also benefits\nvisual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and\nshow that the injection of additional knowledge regularizes the space of\nembeddings, which improves the representation of lexical and semantic\nsimilarities. The technique is model-agnostic and can expand the applicability\nof any vision-and-language transformer with minimal computational overhead.", "comment": null, "links": []}
{"entry_id": "2010.00763", "title": "Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning", "authors": ["Weili Nie", "Zhiding Yu", "Lei Mao", "Ankit B. Patel", "Yuke Zhu", "Animashree Anandkumar"], "published": "2020-10-02 03:19:46", "updated": "2021-01-04 21:50:06", "summary": "Humans have an inherent ability to learn novel concepts from only a few\nsamples and generalize these concepts to different situations. Even though\ntoday's machine learning models excel with a plethora of training data on\nstandard recognition tasks, a considerable gap exists between machine-level\npattern recognition and human-level concept learning. To narrow this gap, the\nBongard problems (BPs) were introduced as an inspirational challenge for visual\ncognition in intelligent systems. Despite new advances in representation\nlearning and learning to learn, BPs remain a daunting challenge for modern AI.\nInspired by the original one hundred BPs, we propose a new benchmark\nBongard-LOGO for human-level concept learning and reasoning. We develop a\nprogram-guided generation technique to produce a large set of\nhuman-interpretable visual cognition problems in action-oriented LOGO language.\nOur benchmark captures three core properties of human cognition: 1)\ncontext-dependent perception, in which the same object may have disparate\ninterpretations given different contexts; 2) analogy-making perception, in\nwhich some meaningful concepts are traded off for other meaningful concepts;\nand 3) perception with a few samples but infinite vocabulary. In experiments,\nwe show that the state-of-the-art deep learning methods perform substantially\nworse than human subjects, implying that they fail to capture core human\ncognition properties. Finally, we discuss research directions towards a general\narchitecture for visual reasoning to tackle this benchmark.", "comment": "22 pages, NeurIPS 2020", "links": []}
{"entry_id": "2011.13406", "title": "Learning from Lexical Perturbations for Consistent Visual Question Answering", "authors": ["Spencer Whitehead", "Hui Wu", "Yi Ren Fung", "Heng Ji", "Rogerio Feris", "Kate Saenko"], "published": "2020-11-26 17:38:03", "updated": "2020-12-23 00:29:27", "summary": "Existing Visual Question Answering (VQA) models are often fragile and\nsensitive to input variations. In this paper, we propose a novel approach to\naddress this issue based on modular networks, which creates two questions\nrelated by linguistic perturbations and regularizes the visual reasoning\nprocess between them to be consistent during training. We show that our\nframework markedly improves consistency and generalization ability,\ndemonstrating the value of controlled linguistic perturbations as a useful and\ncurrently underutilized training and regularization tool for VQA models. We\nalso present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and\naugmentation pipeline to create controllable linguistic variations of VQA\nquestions. Our benchmark uniquely draws from large-scale linguistic resources,\navoiding human annotation effort while maintaining data quality compared to\ngenerative approaches. We benchmark existing VQA models using VQA P2 and\nprovide robustness analysis on each type of linguistic variation.", "comment": "14 pages, 8 figures", "links": []}
{"entry_id": "2012.11587", "title": "Object-Centric Diagnosis of Visual Reasoning", "authors": ["Jianwei Yang", "Jiayuan Mao", "Jiajun Wu", "Devi Parikh", "David D. Cox", "Joshua B. Tenenbaum", "Chuang Gan"], "published": "2020-12-21 18:59:28", "updated": "2020-12-21 18:59:28", "summary": "When answering questions about an image, it not only needs knowing what --\nunderstanding the fine-grained contents (e.g., objects, relationships) in the\nimage, but also telling why -- reasoning over grounding visual cues to derive\nthe answer for a question. Over the last few years, we have seen significant\nprogress on visual question answering. Though impressive as the accuracy grows,\nit still lags behind to get knowing whether these models are undertaking\ngrounding visual reasoning or just leveraging spurious correlations in the\ntraining data. Recently, a number of works have attempted to answer this\nquestion from perspectives such as grounding and robustness. However, most of\nthem are either focusing on the language side or coarsely studying the\npixel-level attention maps. In this paper, by leveraging the step-wise object\ngrounding annotations provided in the GQA dataset, we first present a\nsystematical object-centric diagnosis of visual reasoning on grounding and\nrobustness, particularly on the vision side. According to the extensive\ncomparisons across different models, we find that even models with high\naccuracy are not good at grounding objects precisely, nor robust to visual\ncontent perturbations. In contrast, symbolic and modular models have a\nrelatively better grounding and robustness, though at the cost of accuracy. To\nreconcile these different aspects, we further develop a diagnostic model,\nnamely Graph Reasoning Machine. Our model replaces purely symbolic visual\nrepresentation with probabilistic scene graph and then applies teacher-forcing\ntraining for the visual reasoning module. The designed model improves the\nperformance on all three metrics over the vanilla neural-symbolic model while\ninheriting the transparency. Further ablation studies suggest that this\nimprovement is mainly due to more accurate image understanding and proper\nintermediate reasoning supervisions.", "comment": null, "links": []}
{"entry_id": "2012.07966", "title": "Odd-One-Out Representation Learning", "authors": ["Salman Mohammadi", "Anders Kirk Uhrenholt", "Bjørn Sand Jensen"], "published": "2020-12-14 22:01:15", "updated": "2020-12-14 22:01:15", "summary": "The effective application of representation learning to real-world problems\nrequires both techniques for learning useful representations, and also robust\nways to evaluate properties of representations. Recent work in disentangled\nrepresentation learning has shown that unsupervised representation learning\napproaches rely on fully supervised disentanglement metrics, which assume\naccess to labels for ground-truth factors of variation. In many real-world\ncases ground-truth factors are expensive to collect, or difficult to model,\nsuch as for perception. Here we empirically show that a weakly-supervised\ndownstream task based on odd-one-out observations is suitable for model\nselection by observing high correlation on a difficult downstream abstract\nvisual reasoning task. We also show that a bespoke metric-learning VAE model\nwhich performs highly on this task also out-performs other standard\nunsupervised and a weakly-supervised disentanglement model across several\nmetrics.", "comment": null, "links": []}
{"entry_id": "2012.07000", "title": "KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning", "authors": ["Dandan Song", "Siyi Ma", "Zhanchen Sun", "Sicheng Yang", "Lejian Liao"], "published": "2020-12-13 08:22:33", "updated": "2020-12-13 08:22:33", "summary": "Reasoning is a critical ability towards complete visual understanding. To\ndevelop machine with cognition-level visual understanding and reasoning\nabilities, the visual commonsense reasoning (VCR) task has been introduced. In\nVCR, given a challenging question about an image, a machine must answer\ncorrectly and then provide a rationale justifying its answer. The methods\nadopting the powerful BERT model as the backbone for learning joint\nrepresentation of image content and natural language have shown promising\nimprovements on VCR. However, none of the existing methods have utilized\ncommonsense knowledge in visual commonsense reasoning, which we believe will be\ngreatly helpful in this task. With the support of commonsense knowledge,\ncomplex questions even if the required information is not depicted in the image\ncan be answered with cognitive reasoning. Therefore, we incorporate commonsense\nknowledge into the cross-modal BERT, and propose a novel Knowledge Enhanced\nVisual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual\nand linguistic contents as input, external commonsense knowledge extracted from\nConceptNet is integrated into the multi-layer Transformer. In order to reserve\nthe structural information and semantic representation of the original\nsentence, we propose using relative position embedding and mask-self-attention\nto weaken the effect between the injected commonsense knowledge and other\nunrelated components in the input sequence. Compared to other task-specific\nmodels and general task-agnostic pre-training models, our KVL-BERT outperforms\nthem by a large margin.", "comment": null, "links": []}
{"entry_id": "2012.01944", "title": "Multi-Label Contrastive Learning for Abstract Visual Reasoning", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2020-12-03 14:18:15", "updated": "2020-12-03 14:18:15", "summary": "For a long time the ability to solve abstract reasoning tasks was considered\none of the hallmarks of human intelligence. Recent advances in application of\ndeep learning (DL) methods led, as in many other domains, to surpassing human\nabstract reasoning performance, specifically in the most popular type of such\nproblems - the Raven's Progressive Matrices (RPMs). While the efficacy of DL\nsystems is indeed impressive, the way they approach the RPMs is very different\nfrom that of humans. State-of-the-art systems solving RPMs rely on massive\npattern-based training and sometimes on exploiting biases in the dataset,\nwhereas humans concentrate on identification of the rules / concepts underlying\nthe RPM (or generally a visual reasoning task) to be solved. Motivated by this\ncognitive difference, this work aims at combining DL with human way of solving\nRPMs and getting the best of both worlds. Specifically, we cast the problem of\nsolving RPMs into multi-label classification framework where each RPM is viewed\nas a multi-label data point, with labels determined by the set of abstract\nrules underlying the RPM. For efficient training of the system we introduce a\ngeneralisation of the Noise Contrastive Estimation algorithm to the case of\nmulti-label samples. Furthermore, we propose a new sparse rule encoding scheme\nfor RPMs which, besides the new training algorithm, is the key factor\ncontributing to the state-of-the-art performance. The proposed approach is\nevaluated on two most popular benchmark datasets (Balanced-RAVEN and PGM) and\non both of them demonstrates an advantage over the current state-of-the-art\nresults. Contrary to applications of contrastive learning methods reported in\nother domains, the state-of-the-art performance reported in the paper is\nachieved with no need for large batch sizes or strong data augmentation.", "comment": null, "links": ["http://dx.doi.org/10.1109/TNNLS.2022.3185949"]}
{"entry_id": "2011.04006", "title": "Long Range Arena: A Benchmark for Efficient Transformers", "authors": ["Yi Tay", "Mostafa Dehghani", "Samira Abnar", "Yikang Shen", "Dara Bahri", "Philip Pham", "Jinfeng Rao", "Liu Yang", "Sebastian Ruder", "Donald Metzler"], "published": "2020-11-08 15:53:56", "updated": "2020-11-08 15:53:56", "summary": "Transformers do not scale very well to long sequence lengths largely because\nof quadratic self-attention complexity. In the recent months, a wide spectrum\nof efficient, fast Transformers have been proposed to tackle this problem, more\noften than not claiming superior or comparable model quality to vanilla\nTransformer models. To this date, there is no well-established consensus on how\nto evaluate this class of models. Moreover, inconsistent benchmarking on a wide\nspectrum of tasks and datasets makes it difficult to assess relative model\nquality amongst many models. This paper proposes a systematic and unified\nbenchmark, LRA, specifically focused on evaluating model quality under\nlong-context scenarios. Our benchmark is a suite of tasks consisting of\nsequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data\ntypes and modalities such as text, natural, synthetic images, and mathematical\nexpressions requiring similarity, structural, and visual-spatial reasoning. We\nsystematically evaluate ten well-established long-range Transformer models\n(Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers,\nSynthesizers, Sparse Transformers, and Longformers) on our newly proposed\nbenchmark suite. LRA paves the way towards better understanding this class of\nefficient Transformer models, facilitates more research in this direction, and\npresents new challenging tasks to tackle. Our benchmark code will be released\nat https://github.com/google-research/long-range-arena.", "comment": null, "links": []}
{"entry_id": "1910.03230", "title": "Meta Module Network for Compositional Visual Reasoning", "authors": ["Wenhu Chen", "Zhe Gan", "Linjie Li", "Yu Cheng", "William Wang", "Jingjing Liu"], "published": "2019-10-08 06:28:24", "updated": "2020-11-08 02:52:51", "summary": "Neural Module Network (NMN) exhibits strong interpretability and\ncompositionality thanks to its handcrafted neural modules with explicit\nmulti-hop reasoning capability. However, most NMNs suffer from two critical\ndrawbacks: 1) scalability: customized module for specific function renders it\nimpractical when scaling up to a larger set of functions in complex tasks; 2)\ngeneralizability: rigid pre-defined module inventory makes it difficult to\ngeneralize to unseen functions in new tasks/domains. To design a more powerful\nNMN architecture for practical use, we propose Meta Module Network (MMN)\ncentered on a novel meta module, which can take in function recipes and morph\ninto diverse instance modules dynamically. The instance modules are then woven\ninto an execution graph for complex visual reasoning, inheriting the strong\nexplainability and compositionality of NMN. With such a flexible instantiation\nmechanism, the parameters of instance modules are inherited from the central\nmeta module, retaining the same model complexity as the function set grows,\nwhich promises better scalability. Meanwhile, as functions are encoded into the\nembedding space, unseen functions can be readily represented based on its\nstructural similarity with previously observed ones, which ensures better\ngeneralizability. Experiments on GQA and CLEVR datasets validate the\nsuperiority of MMN over state-of-the-art NMN designs. Synthetic experiments on\nheld-out unseen functions from GQA dataset also demonstrate the strong\ngeneralizability of MMN. Our code and model are released in Github\nhttps://github.com/wenhuchen/Meta-Module-Network.", "comment": "Accepted to WACV 21 (Oral)", "links": []}
{"entry_id": "2006.06195", "title": "Large-Scale Adversarial Training for Vision-and-Language Representation Learning", "authors": ["Zhe Gan", "Yen-Chun Chen", "Linjie Li", "Chen Zhu", "Yu Cheng", "Jingjing Liu"], "published": "2020-06-11 05:14:35", "updated": "2020-10-22 18:12:53", "summary": "We present VILLA, the first known effort on large-scale adversarial training\nfor vision-and-language (V+L) representation learning. VILLA consists of two\ntraining stages: (i) task-agnostic adversarial pre-training; followed by (ii)\ntask-specific adversarial finetuning. Instead of adding adversarial\nperturbations on image pixels and textual tokens, we propose to perform\nadversarial training in the embedding space of each modality. To enable\nlarge-scale training, we adopt the \"free\" adversarial training strategy, and\ncombine it with KL-divergence-based regularization to promote higher invariance\nin the embedding space. We apply VILLA to current best-performing V+L models,\nand achieve new state of the art on a wide range of tasks, including Visual\nQuestion Answering, Visual Commonsense Reasoning, Image-Text Retrieval,\nReferring Expression Comprehension, Visual Entailment, and NLVR2.", "comment": "NeurIPS 2020 Spotlight paper", "links": []}
{"entry_id": "2010.07526", "title": "Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs", "authors": ["Ana Marasović", "Chandra Bhagavatula", "Jae Sung Park", "Ronan Le Bras", "Noah A. Smith", "Yejin Choi"], "published": "2020-10-15 05:08:56", "updated": "2020-10-15 05:08:56", "summary": "Natural language rationales could provide intuitive, higher-level\nexplanations that are easily understandable by humans, complementing the more\nbroadly studied lower-level explanations based on gradients or attention\nweights. We present the first study focused on generating natural language\nrationales across several complex visual reasoning tasks: visual commonsense\nreasoning, visual-textual entailment, and visual question answering. The key\nchallenge of accurate rationalization is comprehensive image understanding at\nall levels: not just their explicit content at the pixel level, but their\ncontextual contents at the semantic and pragmatic levels. We present\nRationale^VT Transformer, an integrated model that learns to generate free-text\nrationales by combining pretrained language models with object recognition,\ngrounded visual semantic frames, and visual commonsense graphs. Our experiments\nshow that the base pretrained language model benefits from visual adaptation\nand that free-text rationalization is a promising research direction to\ncomplement model interpretability for complex visual-textual reasoning tasks.", "comment": "Accepted to Findings of EMNLP", "links": []}
{"entry_id": "2010.05633", "title": "Contextual Modulation for Relation-Level Metaphor Identification", "authors": ["Omnia Zayed", "John P. McCrae", "Paul Buitelaar"], "published": "2020-10-12 12:07:02", "updated": "2020-10-12 12:07:02", "summary": "Identifying metaphors in text is very challenging and requires comprehending\nthe underlying comparison. The automation of this cognitive process has gained\nwide attention lately. However, the majority of existing approaches concentrate\non word-level identification by treating the task as either single-word\nclassification or sequential labelling without explicitly modelling the\ninteraction between the metaphor components. On the other hand, while existing\nrelation-level approaches implicitly model this interaction, they ignore the\ncontext where the metaphor occurs. In this work, we address these limitations\nby introducing a novel architecture for identifying relation-level metaphoric\nexpressions of certain grammatical relations based on contextual modulation. In\na methodology inspired by works in visual reasoning, our approach is based on\nconditioning the neural network computation on the deep contextualised features\nof the candidate expressions using feature-wise linear modulation. We\ndemonstrate that the proposed architecture achieves state-of-the-art results on\nbenchmark datasets. The proposed methodology is generic and could be applied to\nother textual classification problems that benefit from contextual interaction.", "comment": "accepted at Findings of EMNLP 2020", "links": []}
{"entry_id": "2009.09154", "title": "CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes", "authors": ["Raeid Saqur", "Ameet Deshpande"], "published": "2020-09-19 03:32:37", "updated": "2020-10-01 22:56:35", "summary": "The CLEVR dataset has been used extensively in language grounded visual\nreasoning in Machine Learning (ML) and Natural Language Processing (NLP)\ndomains. We present a graph parser library for CLEVR, that provides\nfunctionalities for object-centric attributes and relationships extraction, and\nconstruction of structural graph representations for dual modalities.\nStructural order-invariant representations enable geometric learning and can\naid in downstream tasks like language grounding to vision, robotics,\ncompositionality, interpretability, and computational grammar construction. We\nprovide three extensible main components - parser, embedder, and visualizer\nthat can be tailored to suit specific learning setups. We also provide\nout-of-the-box functionality for seamless integration with popular deep graph\nneural network (GNN) libraries. Additionally, we discuss downstream usage and\napplications of the library, and how it accelerates research for the NLP\nresearch community.", "comment": "Accepted at NLP-OSS, EMNLP 2020 (2nd Workshop for Natural Language\n  Processing Open Source Software)", "links": []}
{"entry_id": "2007.14516", "title": "Visual Reasoning Strategies for Effect Size Judgments and Decisions", "authors": ["Alex Kale", "Matthew Kay", "Jessica Hullman"], "published": "2020-07-28 22:56:32", "updated": "2020-09-12 20:21:47", "summary": "Uncertainty visualizations often emphasize point estimates to support\nmagnitude estimates or decisions through visual comparison. However, when\ndesign choices emphasize means, users may overlook uncertainty information and\nmisinterpret visual distance as a proxy for effect size. We present findings\nfrom a mixed design experiment on Mechanical Turk which tests eight uncertainty\nvisualization designs: 95% containment intervals, hypothetical outcome plots,\ndensities, and quantile dotplots, each with and without means added. We find\nthat adding means to uncertainty visualizations has small biasing effects on\nboth magnitude estimation and decision-making, consistent with discounting\nuncertainty. We also see that visualization designs that support the least\nbiased effect size estimation do not support the best decision-making,\nsuggesting that a chart user's sense of effect size may not necessarily be\nidentical when they use the same information for different tasks. In a\nqualitative analysis of users' strategy descriptions, we find that many users\nswitch strategies and do not employ an optimal strategy when one exists.\nUncertainty visualizations which are optimally designed in theory may not be\nthe most effective in practice because of the ways that users satisfice with\nheuristics, suggesting opportunities to better understand visualization\neffectiveness by modeling sets of potential strategies.", "comment": "Accepted for publication at IEEE VIS 2020", "links": []}
{"entry_id": "2009.05678", "title": "To Root Artificial Intelligence Deeply in Basic Science for a New Generation of AI", "authors": ["Jingan Yang", "Yang Peng"], "published": "2020-09-11 22:38:38", "updated": "2020-09-11 22:38:38", "summary": "One of the ambitions of artificial intelligence is to root artificial\nintelligence deeply in basic science while developing brain-inspired artificial\nintelligence platforms that will promote new scientific discoveries. The\nchallenges are essential to push artificial intelligence theory and applied\ntechnologies research forward. This paper presents the grand challenges of\nartificial intelligence research for the next 20 years which include:~(i) to\nexplore the working mechanism of the human brain on the basis of understanding\nbrain science, neuroscience, cognitive science, psychology and data science;\n(ii) how is the electrical signal transmitted by the human brain? What is the\ncoordination mechanism between brain neural electrical signals and human\nactivities? (iii)~to root brain-computer interface~(BCI) and brain-muscle\ninterface~(BMI) technologies deeply in science on human behaviour; (iv)~making\nresearch on knowledge-driven visual commonsense reasoning~(VCR), develop a new\ninference engine for cognitive network recognition~(CNR); (v)~to develop\nhigh-precision, multi-modal intelligent perceptrons; (vi)~investigating\nintelligent reasoning and fast decision-making systems based on knowledge\ngraph~(KG). We believe that the frontier theory innovation of AI,\nknowledge-driven modeling methodologies for commonsense reasoning,\nrevolutionary innovation and breakthroughs of the novel algorithms and new\ntechnologies in AI, and developing responsible AI should be the main research\nstrategies of AI scientists in the future.", "comment": "13 pages; 7 figures; 23 references", "links": []}
{"entry_id": "1904.08324", "title": "Question Guided Modular Routing Networks for Visual Question Answering", "authors": ["Yanze Wu", "Qiang Sun", "Jianqi Ma", "Bin Li", "Yanwei Fu", "Yao Peng", "Xiangyang Xue"], "published": "2019-04-17 15:45:13", "updated": "2020-09-04 17:21:28", "summary": "This paper studies the task of Visual Question Answering (VQA), which is\ntopical in Multimedia community recently. Particularly, we explore two critical\nresearch problems existed in VQA: (1) efficiently fusing the visual and textual\nmodalities; (2) enabling the visual reasoning ability of VQA models in\nanswering complex questions. To address these challenging problems, a novel\nQuestion Guided Modular Routing Networks (QGMRN) has been proposed in this\npaper. Particularly, The QGMRN is composed of visual, textual and routing\nnetwork. The visual and textual network serve as the backbones for the generic\nfeature extractors of visual and textual modalities. QGMRN can fuse the visual\nand textual modalities at multiple semantic levels. Typically, the visual\nreasoning is facilitated by the routing network in a discrete and stochastic\nway by using Gumbel-Softmax trick for module selection. When the input reaches\na certain modular layer, routing network newly proposed in this paper,\ndynamically selects a portion of modules from that layer to process the input\ndepending on the question features generated by the textual network. It can\nalso learn to reason by routing between the generic modules without additional\nsupervision information or expert knowledge. Benefiting from the dynamic\nrouting mechanism, QGMRN can outperform the previous classical VQA methods by a\nlarge margin and achieve the competitive results against the state-of-the-art\nmethods. Furthermore, attention mechanism is integrated into our QGMRN model\nand thus can further boost the model performance. Empirically, extensive\nexperiments on the CLEVR and CLEVR-Humans datasets validate the effectiveness\nof our proposed model, and the state-of-the-art performance has been achieved.", "comment": null, "links": []}
{"entry_id": "2009.01067", "title": "Video Captioning Using Weak Annotation", "authors": ["Jingyi Hou", "Yunde Jia", "Xinxiao wu", "Yayun Qi"], "published": "2020-09-02 13:45:01", "updated": "2020-09-02 13:45:01", "summary": "Video captioning has shown impressive progress in recent years. One key\nreason of the performance improvements made by existing methods lie in massive\npaired video-sentence data, but collecting such strong annotation, i.e.,\nhigh-quality sentences, is time-consuming and laborious. It is the fact that\nthere now exist an amazing number of videos with weak annotation that only\ncontains semantic concepts such as actions and objects. In this paper, we\ninvestigate using weak annotation instead of strong annotation to train a video\ncaptioning model. To this end, we propose a progressive visual reasoning method\nthat progressively generates fine sentences from weak annotations by inferring\nmore semantic concepts and their dependency relationships for video captioning.\nTo model concept relationships, we use dependency trees that are spanned by\nexploiting external knowledge from large sentence corpora. Through traversing\nthe dependency trees, the sentences are generated to train the captioning\nmodel. Accordingly, we develop an iterative refinement algorithm that refines\nsentences via spanning dependency trees and fine-tunes the captioning model\nusing the refined sentences in an alternative training manner. Experimental\nresults demonstrate that our method using weak annotation is very competitive\nto the state-of-the-art methods using strong annotation.", "comment": null, "links": []}
{"entry_id": "2006.11524", "title": "Neuro-Symbolic Visual Reasoning: Disentangling \"Visual\" from \"Reasoning\"", "authors": ["Saeed Amizadeh", "Hamid Palangi", "Oleksandr Polozov", "Yichen Huang", "Kazuhito Koishida"], "published": "2020-06-20 08:48:29", "updated": "2020-08-25 23:30:57", "summary": "Visual reasoning tasks such as visual question answering (VQA) require an\ninterplay of visual perception with reasoning about the question semantics\ngrounded in perception. However, recent advances in this area are still\nprimarily driven by perception improvements (e.g. scene graph generation)\nrather than reasoning. Neuro-symbolic models such as Neural Module Networks\nbring the benefits of compositional reasoning to VQA, but they are still\nentangled with visual representation learning, and thus neural reasoning is\nhard to improve and assess on its own. To address this, we propose (1) a\nframework to isolate and evaluate the reasoning aspect of VQA separately from\nits perception, and (2) a novel top-down calibration technique that allows the\nmodel to answer reasoning questions even with imperfect perception. To this\nend, we introduce a differentiable first-order logic formalism for VQA that\nexplicitly decouples question answering from visual perception. On the\nchallenging GQA dataset, this framework is used to perform in-depth,\ndisentangled comparisons between well-known VQA models leading to informative\ninsights regarding the participating models as well as the task.", "comment": "Published in Proceedings of the 37th International Conference on\n  Machine Learning (ICML), Online, PMLR 119, 2020", "links": []}
{"entry_id": "2003.12462", "title": "TextCaps: a Dataset for Image Captioning with Reading Comprehension", "authors": ["Oleksii Sidorov", "Ronghang Hu", "Marcus Rohrbach", "Amanpreet Singh"], "published": "2020-03-24 02:38:35", "updated": "2020-08-04 04:08:02", "summary": "Image descriptions can help visually impaired people to quickly understand\nthe image content. While we made significant progress in automatically\ndescribing images and optical character recognition, current approaches are\nunable to include written text in their descriptions, although text is\nomnipresent in human environments and frequently critical to understand our\nsurroundings. To study how to comprehend text in the context of an image we\ncollect a novel dataset, TextCaps, with 145k captions for 28k images. Our\ndataset challenges a model to recognize text, relate it to its visual context,\nand decide what part of the text to copy or paraphrase, requiring spatial,\nsemantic, and visual reasoning between multiple text tokens and visual\nentities, such as objects. We study baselines and adapt existing approaches to\nthis new task, which we refer to as image captioning with reading\ncomprehension. Our analysis with automatic and human studies shows that our new\nTextCaps dataset provides many new technical challenges over previous datasets.", "comment": "To appear in ECCV 2020 (oral) Project page:\n  https://textvqa.org/textcaps", "links": []}
{"entry_id": "2004.10796", "title": "VisualCOMET: Reasoning about the Dynamic Context of a Still Image", "authors": ["Jae Sung Park", "Chandra Bhagavatula", "Roozbeh Mottaghi", "Ali Farhadi", "Yejin Choi"], "published": "2020-04-22 19:02:20", "updated": "2020-08-01 13:11:10", "summary": "Even from a single frame of a still image, people can reason about the\ndynamic story of the image before, after, and beyond the frame. For example,\ngiven an image of a man struggling to stay afloat in water, we can reason that\nthe man fell into the water sometime in the past, the intent of that man at the\nmoment is to stay alive, and he will need help in the near future or else he\nwill get washed away. We propose VisualComet, the novel framework of visual\ncommonsense reasoning tasks to predict events that might have happened before,\nevents that might happen next, and the intents of the people at present. To\nsupport research toward visual commonsense reasoning, we introduce the first\nlarge-scale repository of Visual Commonsense Graphs that consists of over 1.4\nmillion textual descriptions of visual commonsense inferences carefully\nannotated over a diverse set of 60,000 images, each paired with short video\nsummaries of before and after. In addition, we provide person-grounding (i.e.,\nco-reference links) between people appearing in the image and people mentioned\nin the textual commonsense descriptions, allowing for tighter integration\nbetween images and text. We establish strong baseline performances on this task\nand demonstrate that integration between visual and textual commonsense\nreasoning is the key and wins over non-integrative alternatives.", "comment": "Project Page: http://visualcomet.xyz (ECCV 2020 Spotlight)", "links": []}
{"entry_id": "2007.12020", "title": "Few-shot Visual Reasoning with Meta-analogical Contrastive Learning", "authors": ["Youngsung Kim", "Jinwoo Shin", "Eunho Yang", "Sung Ju Hwang"], "published": "2020-07-23 14:00:34", "updated": "2020-07-23 14:00:34", "summary": "While humans can solve a visual puzzle that requires logical reasoning by\nobserving only few samples, it would require training over large amount of data\nfor state-of-the-art deep reasoning models to obtain similar performance on the\nsame task. In this work, we propose to solve such a few-shot (or low-shot)\nvisual reasoning problem, by resorting to analogical reasoning, which is a\nunique human ability to identify structural or relational similarity between\ntwo sets. Specifically, given training and test sets that contain the same type\nof visual reasoning problems, we extract the structural relationships between\nelements in both domains, and enforce them to be as similar as possible with\nanalogical learning. We repeatedly apply this process with slightly modified\nqueries of the same problem under the assumption that it does not affect the\nrelationship between a training and a test sample. This allows to learn the\nrelational similarity between the two samples in an effective manner even with\na single pair of samples. We validate our method on RAVEN dataset, on which it\noutperforms state-of-the-art method, with larger gains when the training data\nis scarce. We further meta-learn our analogical contrastive learning model over\nthe same tasks with diverse attributes, and show that it generalizes to the\nsame visual reasoning problem with unseen attributes.", "comment": null, "links": []}
{"entry_id": "1909.11740", "title": "UNITER: UNiversal Image-TExt Representation Learning", "authors": ["Yen-Chun Chen", "Linjie Li", "Licheng Yu", "Ahmed El Kholy", "Faisal Ahmed", "Zhe Gan", "Yu Cheng", "Jingjing Liu"], "published": "2019-09-25 20:02:54", "updated": "2020-07-17 22:19:59", "summary": "Joint image-text embedding is the bedrock for most Vision-and-Language (V+L)\ntasks, where multimodality inputs are simultaneously processed for joint visual\nand textual understanding. In this paper, we introduce UNITER, a UNiversal\nImage-TExt Representation, learned through large-scale pre-training over four\nimage-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU\nCaptions), which can power heterogeneous downstream V+L tasks with joint\nmultimodal embeddings. We design four pre-training tasks: Masked Language\nModeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text\nMatching (ITM), and Word-Region Alignment (WRA). Different from previous work\nthat applies joint random masking to both modalities, we use conditional\nmasking on pre-training tasks (i.e., masked language/region modeling is\nconditioned on full observation of image/text). In addition to ITM for global\nimage-text alignment, we also propose WRA via the use of Optimal Transport (OT)\nto explicitly encourage fine-grained alignment between words and image regions\nduring pre-training. Comprehensive analysis shows that both conditional masking\nand OT-based WRA contribute to better pre-training. We also conduct a thorough\nablation study to find an optimal combination of pre-training tasks. Extensive\nexperiments show that UNITER achieves new state of the art across six V+L tasks\n(over nine datasets), including Visual Question Answering, Image-Text\nRetrieval, Referring Expression Comprehension, Visual Commonsense Reasoning,\nVisual Entailment, and NLVR$^2$. Code is available at\nhttps://github.com/ChenRocks/UNITER.", "comment": "ECCV 2020", "links": []}
{"entry_id": "2007.09049", "title": "Learning to Discretely Compose Reasoning Module Networks for Video Captioning", "authors": ["Ganchao Tan", "Daqing Liu", "Meng Wang", "Zheng-Jun Zha"], "published": "2020-07-17 15:27:37", "updated": "2020-07-17 15:27:37", "summary": "Generating natural language descriptions for videos, i.e., video captioning,\nessentially requires step-by-step reasoning along the generation process. For\nexample, to generate the sentence \"a man is shooting a basketball\", we need to\nfirst locate and describe the subject \"man\", next reason out the man is\n\"shooting\", then describe the object \"basketball\" of shooting. However,\nexisting visual reasoning methods designed for visual question answering are\nnot appropriate to video captioning, for it requires more complex visual\nreasoning on videos over both space and time, and dynamic module composition\nalong the generation process. In this paper, we propose a novel visual\nreasoning approach for video captioning, named Reasoning Module Networks (RMN),\nto equip the existing encoder-decoder framework with the above reasoning\ncapacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal\nreasoning modules, and 2) a dynamic and discrete module selector trained by a\nlinguistic loss with a Gumbel approximation. Extensive experiments on MSVD and\nMSR-VTT datasets demonstrate the proposed RMN outperforms the state-of-the-art\nmethods while providing an explicit and explainable generation process. Our\ncode is available at https://github.com/tgc1997/RMN.", "comment": "Accepted at IJCAI 2020 Main Track. Sole copyright holder is IJCAI.\n  Code is available at https://github.com/tgc1997/RMN", "links": []}
{"entry_id": "2007.04670", "title": "Multi-Granularity Modularized Network for Abstract Visual Reasoning", "authors": ["Xiangru Tang", "Haoyuan Wang", "Xiang Pan", "Jiyang Qi"], "published": "2020-07-09 09:54:05", "updated": "2020-07-10 02:32:25", "summary": "Abstract visual reasoning connects mental abilities to the physical world,\nwhich is a crucial factor in cognitive development. Most toddlers display\nsensitivity to this skill, but it is not easy for machines. Aimed at it, we\nfocus on the Raven Progressive Matrices Test, designed to measure cognitive\nreasoning. Recent work designed some black-boxes to solve it in an end-to-end\nfashion, but they are incredibly complicated and difficult to explain. Inspired\nby cognitive studies, we propose a Multi-Granularity Modularized Network (MMoN)\nto bridge the gap between the processing of raw sensory information and\nsymbolic reasoning. Specifically, it learns modularized reasoning functions to\nmodel the semantic rule from the visual grounding in a neuro-symbolic and\nsemi-supervision way. To comprehensively evaluate MMoN, our experiments are\nconducted on the dataset of both seen and unseen reasoning rules. The result\nshows that MMoN is well suited for abstract visual reasoning and also\nexplainable on the generalization test.", "comment": null, "links": []}
{"entry_id": "2006.14264", "title": "Self-Segregating and Coordinated-Segregating Transformer for Focused Deep Multi-Modular Network for Visual Question Answering", "authors": ["Chiranjib Sur"], "published": "2020-06-25 09:17:03", "updated": "2020-06-25 09:17:03", "summary": "Attention mechanism has gained huge popularity due to its effectiveness in\nachieving high accuracy in different domains. But attention is opportunistic\nand is not justified by the content or usability of the content. Transformer\nlike structure creates all/any possible attention(s). We define segregating\nstrategies that can prioritize the contents for the applications for\nenhancement of performance. We defined two strategies: Self-Segregating\nTransformer (SST) and Coordinated-Segregating Transformer (CST) and used it to\nsolve visual question answering application. Self-segregation strategy for\nattention contributes in better understanding and filtering the information\nthat can be most helpful for answering the question and create diversity of\nvisual-reasoning for attention. This work can easily be used in many other\napplications that involve repetition and multiple frames of features and would\nreduce the commonality of the attentions to a great extent. Visual Question\nAnswering (VQA) requires understanding and coordination of both images and\ntextual interpretations. Experiments demonstrate that segregation strategies\nfor cascaded multi-head transformer attention outperforms many previous works\nand achieved considerable improvement for VQA-v2 dataset benchmark.", "comment": null, "links": []}
{"entry_id": "2004.00849", "title": "Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers", "authors": ["Zhicheng Huang", "Zhaoyang Zeng", "Bei Liu", "Dongmei Fu", "Jianlong Fu"], "published": "2020-04-02 07:39:28", "updated": "2020-06-22 09:09:22", "summary": "We propose Pixel-BERT to align image pixels with text by deep multi-modal\ntransformers that jointly learn visual and language embedding in a unified\nend-to-end framework. We aim to build a more accurate and thorough connection\nbetween image pixels and language semantics directly from image and sentence\npairs instead of using region-based image features as the most recent vision\nand language tasks. Our Pixel-BERT which aligns semantic connection in pixel\nand text level solves the limitation of task-specific visual representation for\nvision and language tasks. It also relieves the cost of bounding box\nannotations and overcomes the unbalance between semantic labels in visual task\nand language semantic. To provide a better representation for down-stream\ntasks, we pre-train a universal end-to-end model with image and sentence pairs\nfrom Visual Genome dataset and MS-COCO dataset. We propose to use a random\npixel sampling mechanism to enhance the robustness of visual representation and\nto apply the Masked Language Model and Image-Text Matching as pre-training\ntasks. Extensive experiments on downstream tasks with our pre-trained model\nshow that our approach makes the most state-of-the-arts in downstream tasks,\nincluding Visual Question Answering (VQA), image-text retrieval, Natural\nLanguage for Visual Reasoning for Real (NLVR). Particularly, we boost the\nperformance of a single model in VQA task by 2.17 points compared with SOTA\nunder fair comparison.", "comment": null, "links": []}
{"entry_id": "2006.11197", "title": "Abstract Diagrammatic Reasoning with Multiplex Graph Networks", "authors": ["Duo Wang", "Mateja Jamnik", "Pietro Lio"], "published": "2020-06-19 15:50:25", "updated": "2020-06-19 15:50:25", "summary": "Abstract reasoning, particularly in the visual domain, is a complex human\nability, but it remains a challenging problem for artificial neural learning\nsystems. In this work we propose MXGNet, a multilayer graph neural network for\nmulti-panel diagrammatic reasoning tasks. MXGNet combines three powerful\nconcepts, namely, object-level representation, graph neural networks and\nmultiplex graphs, for solving visual reasoning tasks. MXGNet first extracts\nobject-level representations for each element in all panels of the diagrams,\nand then forms a multi-layer multiplex graph capturing multiple relations\nbetween objects across different diagram panels. MXGNet summarises the multiple\ngraphs extracted from the diagrams of the task, and uses this summarisation to\npick the most probable answer from the given candidates. We have tested MXGNet\non two types of diagrammatic reasoning tasks, namely Diagram Syllogisms and\nRaven Progressive Matrices (RPM). For an Euler Diagram Syllogism task MXGNet\nachieves state-of-the-art accuracy of 99.8%. For PGM and RAVEN, two\ncomprehensive datasets for RPM reasoning, MXGNet outperforms the\nstate-of-the-art models by a considerable margin.", "comment": null, "links": []}
{"entry_id": "2006.05398", "title": "Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image", "authors": ["Danny Driess", "Jung-Su Ha", "Marc Toussaint"], "published": "2020-06-09 16:52:02", "updated": "2020-06-09 16:52:02", "summary": "In this paper, we propose a deep convolutional recurrent neural network that\npredicts action sequences for task and motion planning (TAMP) from an initial\nscene image. Typical TAMP problems are formalized by combining reasoning on a\nsymbolic, discrete level (e.g. first-order logic) with continuous motion\nplanning such as nonlinear trajectory optimization. Due to the great\ncombinatorial complexity of possible discrete action sequences, a large number\nof optimization/motion planning problems have to be solved to find a solution,\nwhich limits the scalability of these approaches.\n  To circumvent this combinatorial complexity, we develop a neural network\nwhich, based on an initial image of the scene, directly predicts promising\ndiscrete action sequences such that ideally only one motion planning problem\nhas to be solved to find a solution to the overall TAMP problem. A key aspect\nis that our method generalizes to scenes with many and varying number of\nobjects, although being trained on only two objects at a time. This is possible\nby encoding the objects of the scene in images as input to the neural network,\ninstead of a fixed feature vector. Results show runtime improvements of several\nmagnitudes. Video: https://youtu.be/i8yyEbbvoEk", "comment": "Robotics: Science and Systems (R:SS) 2020", "links": []}
{"entry_id": "2004.12770", "title": "Differentiable Adaptive Computation Time for Visual Reasoning", "authors": ["Cristobal Eyzaguirre", "Alvaro Soto"], "published": "2020-04-27 13:20:23", "updated": "2020-05-22 16:57:14", "summary": "This paper presents a novel attention-based algorithm for achieving adaptive\ncomputation called DACT, which, unlike existing ones, is end-to-end\ndifferentiable. Our method can be used in conjunction with many networks; in\nparticular, we study its application to the widely known MAC architecture,\nobtaining a significant reduction in the number of recurrent steps needed to\nachieve similar accuracies, therefore improving its performance to computation\nratio. Furthermore, we show that by increasing the maximum number of steps\nused, we surpass the accuracy of even our best non-adaptive MAC in the CLEVR\ndataset, demonstrating that our approach is able to control the number of steps\nwithout significant loss of performance. Additional advantages provided by our\napproach include considerably improving interpretability by discarding useless\nsteps and providing more insights into the underlying reasoning process.\nFinally, we present adaptive computation as an equivalent to an ensemble of\nmodels, similar to a mixture of expert formulation. Both the code and the\nconfiguration files for our experiments are made available to support further\nresearch in this area.", "comment": "CVPR 2020", "links": []}
{"entry_id": "2005.09183", "title": "Retrieving and Highlighting Action with Spatiotemporal Reference", "authors": ["Seito Kasai", "Yuchi Ishikawa", "Masaki Hayashi", "Yoshimitsu Aoki", "Kensho Hara", "Hirokatsu Kataoka"], "published": "2020-05-19 03:12:31", "updated": "2020-05-19 03:12:31", "summary": "In this paper, we present a framework that jointly retrieves and\nspatiotemporally highlights actions in videos by enhancing current deep\ncross-modal retrieval methods. Our work takes on the novel task of action\nhighlighting, which visualizes where and when actions occur in an untrimmed\nvideo setting. Action highlighting is a fine-grained task, compared to\nconventional action recognition tasks which focus on classification or\nwindow-based localization. Leveraging weak supervision from annotated captions,\nour framework acquires spatiotemporal relevance maps and generates local\nembeddings which relate to the nouns and verbs in captions. Through\nexperiments, we show that our model generates various maps conditioned on\ndifferent actions, in which conventional visual reasoning methods only go as\nfar as to show a single deterministic saliency map. Also, our model improves\nretrieval recall over our baseline without alignment by 2-3% on the MSR-VTT\ndataset.", "comment": "Accepted to ICIP 2020", "links": []}
{"entry_id": "2005.06035", "title": "Cross-Modality Relevance for Reasoning on Language and Vision", "authors": ["Chen Zheng", "Quan Guo", "Parisa Kordjamshidi"], "published": "2020-05-12 20:17:25", "updated": "2020-05-12 20:17:25", "summary": "This work deals with the challenge of learning and reasoning over language\nand vision data for the related downstream tasks such as visual question\nanswering (VQA) and natural language for visual reasoning (NLVR). We design a\nnovel cross-modality relevance module that is used in an end-to-end framework\nto learn the relevance representation between components of various input\nmodalities under the supervision of a target task, which is more generalizable\nto unobserved data compared to merely reshaping the original representation\nspace. In addition to modeling the relevance between the textual entities and\nvisual entities, we model the higher-order relevance between entity relations\nin the text and object relations in the image. Our proposed approach shows\ncompetitive performance on two different language and vision tasks using public\nbenchmarks and improves the state-of-the-art published results. The learned\nalignments of input spaces and their relevance representations by NLVR task\nboost the training efficiency of VQA task.", "comment": "Accepted by ACL 2020", "links": []}
{"entry_id": "2004.12193", "title": "Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning", "authors": ["Wenhe Zhang", "Chi Zhang", "Yixin Zhu", "Song-Chun Zhu"], "published": "2020-04-25 17:14:58", "updated": "2020-04-25 17:14:58", "summary": "As a comprehensive indicator of mathematical thinking and intelligence, the\nnumber sense (Dehaene 2011) bridges the induction of symbolic concepts and the\ncompetence of problem-solving. To endow such a crucial cognitive ability to\nmachine intelligence, we propose a dataset, Machine Number Sense (MNS),\nconsisting of visual arithmetic problems automatically generated using a\ngrammar model--And-Or Graph (AOG). These visual arithmetic problems are in the\nform of geometric figures: each problem has a set of geometric shapes as its\ncontext and embedded number symbols. Solving such problems is not trivial; the\nmachine not only has to recognize the number, but also to interpret the number\nwith its contexts, shapes, and relations (e.g., symmetry) together with proper\noperations. We benchmark the MNS dataset using four predominant neural network\nmodels as baselines in this visual reasoning task. Comprehensive experiments\nshow that current neural-network-based models still struggle to understand\nnumber concepts and relational operations. We show that a simple brute-force\nsearch algorithm could work out some of the problems without context\ninformation. Crucially, taking geometric context into account by an additional\nperception module would provide a sharp performance gain with fewer search\nsteps. Altogether, we call for attention in fusing the classic search-based\nalgorithms with modern neural networks to discover the essential number\nconcepts in future research.", "comment": "AAAI 2020 Oral. Project page:\n  https://sites.google.com/view/number-sense/home Code:\n  https://github.com/zwh1999anne/Machine-Number-Sense-Dataset Dataset:\n  https://drive.google.com/file/d/17KuL8KOIDAeRL-lD418oiDEm8bE6TEFb/view", "links": []}
{"entry_id": "2004.02673", "title": "SHOP-VRB: A Visual Reasoning Benchmark for Object Perception", "authors": ["Michal Nazarczuk", "Krystian Mikolajczyk"], "published": "2020-04-06 13:46:54", "updated": "2020-04-06 13:46:54", "summary": "In this paper we present an approach and a benchmark for visual reasoning in\nrobotics applications, in particular small object grasping and manipulation.\nThe approach and benchmark are focused on inferring object properties from\nvisual and text data. It concerns small household objects with their\nproperties, functionality, natural language descriptions as well as\nquestion-answer pairs for visual reasoning queries along with their\ncorresponding scene semantic representations. We also present a method for\ngenerating synthetic data which allows to extend the benchmark to other objects\nor scenes and propose an evaluation protocol that is more challenging than in\nthe existing datasets. We propose a reasoning system based on symbolic program\nexecution. A disentangled representation of the visual and textual inputs is\nobtained and used to execute symbolic programs that represent a 'reasoning\nprocess' of the algorithm. We perform a set of experiments on the proposed\nbenchmark and compare to results for the state of the art methods. These\nresults expose the shortcomings of the existing benchmarks that may lead to\nmisleading conclusions on the actual performance of the visual reasoning\nsystems.", "comment": "International Conference on Robotics and Automation (ICRA) 2020", "links": []}
{"entry_id": "2001.02359", "title": "Weakly Supervised Visual Semantic Parsing", "authors": ["Alireza Zareian", "Svebor Karaman", "Shih-Fu Chang"], "published": "2020-01-08 03:46:13", "updated": "2020-03-31 18:54:06", "summary": "Scene Graph Generation (SGG) aims to extract entities, predicates and their\nsemantic structure from images, enabling deep understanding of visual content,\nwith many applications such as visual reasoning and image retrieval.\nNevertheless, existing SGG methods require millions of manually annotated\nbounding boxes for training, and are computationally inefficient, as they\nexhaustively process all pairs of object proposals to detect predicates. In\nthis paper, we address those two limitations by first proposing a generalized\nformulation of SGG, namely Visual Semantic Parsing, which disentangles entity\nand predicate recognition, and enables sub-quadratic performance. Then we\npropose the Visual Semantic Parsing Network, VSPNet, based on a dynamic,\nattention-based, bipartite message passing framework that jointly infers graph\nnodes and edges through an iterative process. Additionally, we propose the\nfirst graph-based weakly supervised learning framework, based on a novel graph\nalignment algorithm, which enables training without bounding box annotations.\nThrough extensive experiments, we show that VSPNet outperforms weakly\nsupervised baselines significantly and approaches fully supervised performance,\nwhile being several times faster. We publicly release the source code of our\nmethod.", "comment": "To be presented at CVPR 2020 (oral paper)", "links": []}
{"entry_id": "1902.10200", "title": "Differentiable Scene Graphs", "authors": ["Moshiko Raboh", "Roei Herzig", "Gal Chechik", "Jonathan Berant", "Amir Globerson"], "published": "2019-02-26 20:22:33", "updated": "2020-03-14 16:25:32", "summary": "Reasoning about complex visual scenes involves perception of entities and\ntheir relations. Scene graphs provide a natural representation for reasoning\ntasks, by assigning labels to both entities (nodes) and relations (edges).\nUnfortunately, reasoning systems based on SGs are typically trained in a\ntwo-step procedure: First, training a model to predict SGs from images; Then, a\nseparate model is created to reason based on predicted SGs. In many domains, it\nis preferable to train systems jointly in an end-to-end manner, but SGs are not\ncommonly used as intermediate components in visual reasoning systems because\nbeing discrete and sparse, scene-graph representations are non-differentiable\nand difficult to optimize. Here we propose Differentiable Scene Graphs (DSGs),\nan image representation that is amenable to differentiable end-to-end\noptimization, and requires supervision only from the downstream tasks. DSGs\nprovide a dense representation for all regions and pairs of regions, and do not\nspend modelling capacity on areas of the images that do not contain objects or\nrelations of interest. We evaluate our model on the challenging task of\nidentifying referring relationships (RR) in three benchmark datasets, Visual\nGenome, VRD and CLEVR. We describe a multi-task objective, and train in an\nend-to-end manner supervised by the downstream RR task. Using DSGs as an\nintermediate representation leads to new state-of-the-art performance.", "comment": "Winter Conference on Applications of Computer Vision (WACV), 2020", "links": []}
{"entry_id": "1910.01442", "title": "CLEVRER: CoLlision Events for Video REpresentation and Reasoning", "authors": ["Kexin Yi", "Chuang Gan", "Yunzhu Li", "Pushmeet Kohli", "Jiajun Wu", "Antonio Torralba", "Joshua B. Tenenbaum"], "published": "2019-10-03 13:16:36", "updated": "2020-03-08 00:09:07", "summary": "The ability to reason about temporal and causal events from videos lies at\nthe core of human intelligence. Most video reasoning benchmarks, however, focus\non pattern recognition from complex visual and language input, instead of on\ncausal structure. We study the complementary problem, exploring the temporal\nand causal structures behind videos of objects with simple visual appearance.\nTo this end, we introduce the CoLlision Events for Video REpresentation and\nReasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of\ncomputational models on a wide range of reasoning tasks. Motivated by the\ntheory of human casual judgment, CLEVRER includes four types of questions:\ndescriptive (e.g., \"what color\"), explanatory (\"what is responsible for\"),\npredictive (\"what will happen next\"), and counterfactual (\"what if\"). We\nevaluate various state-of-the-art models for visual reasoning on our benchmark.\nWhile these models thrive on the perception-based task (descriptive), they\nperform poorly on the causal tasks (explanatory, predictive and\ncounterfactual), suggesting that a principled approach for causal reasoning\nshould incorporate the capability of both perceiving complex visual and\nlanguage inputs, and understanding the underlying dynamics and causal\nrelations. We also study an oracle model that explicitly combines these\ncomponents via symbolic representations.", "comment": "The first two authors contributed equally to this work. Accepted as\n  Oral Spotlight as ICLR 2020. Project page: http://clevrer.csail.mit.edu/", "links": []}
{"entry_id": "2003.01835", "title": "Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data", "authors": ["Priya Sundaresan", "Jennifer Grannen", "Brijen Thananjeyan", "Ashwin Balakrishna", "Michael Laskey", "Kevin Stone", "Joseph E. Gonzalez", "Ken Goldberg"], "published": "2020-03-03 23:43:05", "updated": "2020-03-03 23:43:05", "summary": "Robotic manipulation of deformable 1D objects such as ropes, cables, and\nhoses is challenging due to the lack of high-fidelity analytic models and large\nconfiguration spaces. Furthermore, learning end-to-end manipulation policies\ndirectly from images and physical interaction requires significant time on a\nrobot and can fail to generalize across tasks. We address these challenges\nusing interpretable deep visual representations for rope, extending recent work\non dense object descriptors for robot manipulation. This facilitates the design\nof interpretable and transferable geometric policies built on top of the\nlearned representations, decoupling visual reasoning and control. We present an\napproach that learns point-pair correspondences between initial and goal rope\nconfigurations, which implicitly encodes geometric structure, entirely in\nsimulation from synthetic depth images. We demonstrate that the learned\nrepresentation -- dense depth object descriptors (DDODs) -- can be used to\nmanipulate a real rope into a variety of different arrangements either by\nlearning from demonstrations or using interpretable geometric policies. In 50\ntrials of a knot-tying task with the ABB YuMi Robot, the system achieves a 66%\nknot-tying success rate from previously unseen configurations. See\nhttps://tinyurl.com/rope-learning for supplementary material and videos.", "comment": null, "links": []}
{"entry_id": "2003.00403", "title": "Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension", "authors": ["Zhenfang Chen", "Peng Wang", "Lin Ma", "Kwan-Yee K. Wong", "Qi Wu"], "published": "2020-03-01 04:59:38", "updated": "2020-03-01 04:59:38", "summary": "Referring expression comprehension (REF) aims at identifying a particular\nobject in a scene by a natural language expression. It requires joint reasoning\nover the textual and visual domains to solve the problem. Some popular\nreferring expression datasets, however, fail to provide an ideal test bed for\nevaluating the reasoning ability of the models, mainly because 1) their\nexpressions typically describe only some simple distinctive properties of the\nobject and 2) their images contain limited distracting information. To bridge\nthe gap, we propose a new dataset for visual reasoning in context of referring\nexpression comprehension with two main features. First, we design a novel\nexpression engine rendering various reasoning logics that can be flexibly\ncombined with rich visual properties to generate expressions with varying\ncompositionality. Second, to better exploit the full reasoning chain embodied\nin an expression, we propose a new test setting by adding additional\ndistracting images containing objects sharing similar properties with the\nreferent, thus minimising the success rate of reasoning-free cross-domain\nalignment. We evaluate several state-of-the-art REF models, but find none of\nthem can achieve promising performance. A proposed modular hard mining strategy\nperforms the best but still leaves substantial room for improvement. We hope\nthis new dataset and task can serve as a benchmark for deeper visual reasoning\nanalysis and foster the research on referring expression comprehension.", "comment": "To appear in CVPR2020", "links": []}
{"entry_id": "1908.08530", "title": "VL-BERT: Pre-training of Generic Visual-Linguistic Representations", "authors": ["Weijie Su", "Xizhou Zhu", "Yue Cao", "Bin Li", "Lewei Lu", "Furu Wei", "Jifeng Dai"], "published": "2019-08-22 17:59:30", "updated": "2020-02-18 02:59:17", "summary": "We introduce a new pre-trainable generic representation for visual-linguistic\ntasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the\nsimple yet powerful Transformer model as the backbone, and extends it to take\nboth visual and linguistic embedded features as input. In it, each element of\nthe input is either of a word from the input sentence, or a region-of-interest\n(RoI) from the input image. It is designed to fit for most of the\nvisual-linguistic downstream tasks. To better exploit the generic\nrepresentation, we pre-train VL-BERT on the massive-scale Conceptual Captions\ndataset, together with text-only corpus. Extensive empirical analysis\ndemonstrates that the pre-training procedure can better align the\nvisual-linguistic clues and benefit the downstream tasks, such as visual\ncommonsense reasoning, visual question answering and referring expression\ncomprehension. It is worth noting that VL-BERT achieved the first place of\nsingle model on the leaderboard of the VCR benchmark. Code is released at\n\\url{https://github.com/jackroos/VL-BERT}.", "comment": "Accepted by ICLR 2020", "links": []}
{"entry_id": "1911.11938", "title": "Transfer Learning in Visual and Relational Reasoning", "authors": ["T. S. Jayram", "Vincent Marois", "Tomasz Kornuta", "Vincent Albouy", "Emre Sevgen", "Ahmet S. Ozcan"], "published": "2019-11-27 03:54:15", "updated": "2020-02-15 04:26:42", "summary": "Transfer learning has become the de facto standard in computer vision and\nnatural language processing, especially where labeled data is scarce. Accuracy\ncan be significantly improved by using pre-trained models and subsequent\nfine-tuning. In visual reasoning tasks, such as image question answering,\ntransfer learning is more complex. In addition to transferring the capability\nto recognize visual features, we also expect to transfer the system's ability\nto reason. Moreover, for video data, temporal reasoning adds another dimension.\nIn this work, we formalize these unique aspects of transfer learning and\npropose a theoretical framework for visual reasoning, exemplified by the\nwell-established CLEVR and COG datasets. Furthermore, we introduce a new,\nend-to-end differentiable recurrent model (SAMNet), which shows\nstate-of-the-art accuracy and better performance in transfer learning on both\ndatasets. The improved performance of SAMNet stems from its capability to\ndecouple the abstract multi-step reasoning from the length of the sequence and\nits selective attention enabling to store only the question-relevant objects in\nthe external memory.", "comment": "18 pages; more baseline comparisons; additional clarifications", "links": []}
{"entry_id": "1910.14671", "title": "TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines", "authors": ["Jingxiang Lin", "Unnat Jain", "Alexander G. Schwing"], "published": "2019-10-31 17:59:57", "updated": "2020-01-09 15:55:26", "summary": "Reasoning is an important ability that we learn from a very early age. Yet,\nreasoning is extremely hard for algorithms. Despite impressive recent progress\nthat has been reported on tasks that necessitate reasoning, such as visual\nquestion answering and visual dialog, models often exploit biases in datasets.\nTo develop models with better reasoning abilities, recently, the new visual\ncommonsense reasoning (VCR) task has been introduced. Not only do models have\nto answer questions, but also do they have to provide a reason for the given\nanswer. The proposed baseline achieved compelling results, leveraging a\nmeticulously designed model composed of LSTM modules and attention nets. Here\nwe show that a much simpler model obtained by ablating and pruning the existing\nintricate baseline can perform better with half the number of trainable\nparameters. By associating visual features with attribute information and\nbetter text to image grounding, we obtain further improvements for our simpler\n& effective baseline, TAB-VCR. We show that this approach results in a 5.3%,\n4.4% and 6.5% absolute improvement over the previous state-of-the-art on\nquestion answering, answer justification and holistic VCR.", "comment": "Accepted to NeurIPS 2019. Project page:\n  https://deanplayerljx.github.io/tabvcr", "links": []}
{"entry_id": "1905.12506", "title": "Are Disentangled Representations Helpful for Abstract Visual Reasoning?", "authors": ["Sjoerd van Steenkiste", "Francesco Locatello", "Jürgen Schmidhuber", "Olivier Bachem"], "published": "2019-05-29 14:52:32", "updated": "2020-01-07 14:36:07", "summary": "A disentangled representation encodes information about the salient factors\nof variation in the data independently. Although it is often argued that this\nrepresentational format is useful in learning to solve many real-world\ndown-stream tasks, there is little empirical evidence that supports this claim.\nIn this paper, we conduct a large-scale study that investigates whether\ndisentangled representations are more suitable for abstract reasoning tasks.\nUsing two new tasks similar to Raven's Progressive Matrices, we evaluate the\nusefulness of the representations learned by 360 state-of-the-art unsupervised\ndisentanglement models. Based on these representations, we train 3600 abstract\nreasoning models and observe that disentangled representations do in fact lead\nto better down-stream performance. In particular, they enable quicker learning\nusing fewer samples.", "comment": "Accepted to NeurIPS 2019", "links": []}
{"entry_id": "1910.11124", "title": "Enforcing Reasoning in Visual Commonsense Reasoning", "authors": ["Hammad A. Ayyubi", "Md. Mehrab Tanjim", "David J. Kriegman"], "published": "2019-10-21 02:33:18", "updated": "2019-12-27 10:09:58", "summary": "The task of Visual Commonsense Reasoning is extremely challenging in the\nsense that the model has to not only be able to answer a question given an\nimage, but also be able to learn to reason. The baselines introduced in this\ntask are quite limiting because two networks are trained for predicting answers\nand rationales separately. Question and image is used as input to train answer\nprediction network while question, image and correct answer are used as input\nin the rationale prediction network. As rationale is conditioned on the correct\nanswer, it is based on the assumption that we can solve Visual Question\nAnswering task without any error - which is over ambitious. Moreover, such an\napproach makes both answer and rationale prediction two completely independent\nVQA tasks rendering cognition task meaningless. In this paper, we seek to\naddress these issues by proposing an end-to-end trainable model which considers\nboth answers and their reasons jointly. Specifically, we first predict the\nanswer for the question and then use the chosen answer to predict the\nrationale. However, a trivial design of such a model becomes non-differentiable\nwhich makes it difficult to train. We solve this issue by proposing four\napproaches - softmax, gumbel-softmax, reinforcement learning based sampling and\ndirect cross entropy against all pairs of answers and rationales. We\ndemonstrate through experiments that our model performs competitively against\ncurrent state-of-the-art. We conclude with an analysis of presented approaches\nand discuss avenues for further work.", "comment": null, "links": []}
{"entry_id": "1905.11666", "title": "Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning", "authors": ["Wonjae Kim", "Yoonho Lee"], "published": "2019-05-28 08:13:37", "updated": "2019-12-23 05:37:28", "summary": "Without relevant human priors, neural networks may learn uninterpretable\nfeatures. We propose Dynamics of Attention for Focus Transition (DAFT) as a\nhuman prior for machine reasoning. DAFT is a novel method that regularizes\nattention-based reasoning by modelling it as a continuous dynamical system\nusing neural ordinary differential equations. As a proof of concept, we augment\na state-of-the-art visual reasoning model with DAFT. Our experiments reveal\nthat applying DAFT yields similar performance to the original model while using\nfewer reasoning steps, showing that it implicitly learns to skip unnecessary\nsteps. We also propose a new metric, Total Length of Transition (TLT), which\nrepresents the effective reasoning step size by quantifying how much a given\nmodel's focus drifts while reasoning about a question. We show that adding DAFT\nresults in lower TLT, demonstrating that our method indeed obeys the human\nprior towards shorter reasoning paths in addition to producing more\ninterpretable attention maps. Our code is available at\nhttps://github.com/kakao/DAFT.", "comment": "20 pages, 18 figures, 2 tables", "links": []}
{"entry_id": "1912.09589", "title": "Smart Home Appliances: Chat with Your Fridge", "authors": ["Denis Gudovskiy", "Gyuri Han", "Takuya Yamaguchi", "Sotaro Tsukizawa"], "published": "2019-12-19 23:12:25", "updated": "2019-12-19 23:12:25", "summary": "Current home appliances are capable to execute a limited number of voice\ncommands such as turning devices on or off, adjusting music volume or light\nconditions. Recent progress in machine reasoning gives an opportunity to\ndevelop new types of conversational user interfaces for home appliances. In\nthis paper, we apply state-of-the-art visual reasoning model and demonstrate\nthat it is feasible to ask a smart fridge about its contents and various\nproperties of the food with close-to-natural conversation experience. Our\nvisual reasoning model answers user questions about existence, count, category\nand freshness of each product by analyzing photos made by the image sensor\ninside the smart fridge. Users may chat with their fridge using off-the-shelf\nphone messenger while being away from home, for example, when shopping in the\nsupermarket. We generate a visually realistic synthetic dataset to train\nmachine learning reasoning model that achieves 95% answer accuracy on test\ndata. We present the results of initial user tests and discuss how we modify\ndistribution of generated questions for model training based on\nhuman-in-the-loop guidance. We open source code for the whole system including\ndataset generation, reasoning model and demonstration scripts.", "comment": "NeurIPS 2019 demo track", "links": []}
{"entry_id": "1908.07490", "title": "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", "authors": ["Hao Tan", "Mohit Bansal"], "published": "2019-08-20 17:05:18", "updated": "2019-12-03 19:30:19", "summary": "Vision-and-language reasoning requires an understanding of visual concepts,\nlanguage semantics, and, most importantly, the alignment and relationships\nbetween these two modalities. We thus propose the LXMERT (Learning\nCross-Modality Encoder Representations from Transformers) framework to learn\nthese vision-and-language connections. In LXMERT, we build a large-scale\nTransformer model that consists of three encoders: an object relationship\nencoder, a language encoder, and a cross-modality encoder. Next, to endow our\nmodel with the capability of connecting vision and language semantics, we\npre-train the model with large amounts of image-and-sentence pairs, via five\ndiverse representative pre-training tasks: masked language modeling, masked\nobject prediction (feature regression and label classification), cross-modality\nmatching, and image question answering. These tasks help in learning both\nintra-modality and cross-modality relationships. After fine-tuning from our\npre-trained parameters, our model achieves the state-of-the-art results on two\nvisual question answering datasets (i.e., VQA and GQA). We also show the\ngeneralizability of our pre-trained cross-modality model by adapting it to a\nchallenging visual-reasoning task, NLVR2, and improve the previous best result\nby 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies\nto prove that both our novel model components and pre-training strategies\nsignificantly contribute to our strong results; and also present several\nattention visualizations for the different encoders. Code and pre-trained\nmodels publicly available at: https://github.com/airsplay/lxmert", "comment": "EMNLP 2019 (14 pages; with new attention visualizations)", "links": []}
{"entry_id": "1908.06066", "title": "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training", "authors": ["Gen Li", "Nan Duan", "Yuejian Fang", "Ming Gong", "Daxin Jiang", "Ming Zhou"], "published": "2019-08-16 17:26:56", "updated": "2019-12-02 10:15:38", "summary": "We propose Unicoder-VL, a universal encoder that aims to learn joint\nrepresentations of vision and language in a pre-training manner. Borrow ideas\nfrom cross-lingual pre-trained models, such as XLM and Unicoder, both visual\nand linguistic contents are fed into a multi-layer Transformer for the\ncross-modal pre-training, where three pre-trained tasks are employed, including\nMasked Language Modeling (MLM), Masked Object Classification (MOC) and\nVisual-linguistic Matching (VLM). The first two tasks learn context-aware\nrepresentations for input tokens based on linguistic and visual contents\njointly. The last task tries to predict whether an image and a text describe\neach other. After pretraining on large-scale image-caption pairs, we transfer\nUnicoder-VL to caption-based image-text retrieval and visual commonsense\nreasoning, with just one additional output layer. We achieve state-of-the-art\nor comparable results on both two tasks and show the powerful ability of the\ncross-modal pre-training.", "comment": "accepted by AAAI-2020. arXiv admin note: text overlap with\n  arXiv:1909.11740 by other authors", "links": []}
{"entry_id": "1911.07736", "title": "Modeling Gestalt Visual Reasoning on the Raven's Progressive Matrices Intelligence Test Using Generative Image Inpainting Techniques", "authors": ["Tianyu Hua", "Maithilee Kunda"], "published": "2019-11-18 16:16:55", "updated": "2019-11-26 08:32:20", "summary": "Psychologists recognize Raven's Progressive Matrices as a very effective test\nof general human intelligence. While many computational models have been\ndeveloped by the AI community to investigate different forms of top-down,\ndeliberative reasoning on the test, there has been less research on bottom-up\nperceptual processes, like Gestalt image completion, that are also critical in\nhuman test performance. In this work, we investigate how Gestalt visual\nreasoning on the Raven's test can be modeled using generative image inpainting\ntechniques from computer vision. We demonstrate that a self-supervised\ninpainting model trained only on photorealistic images of objects achieves a\nscore of 27/36 on the Colored Progressive Matrices, which corresponds to\naverage performance for nine-year-old children. We also show that models\ntrained on other datasets (faces, places, and textures) do not perform as well.\nOur results illustrate how learning visual regularities in real-world images\ncan translate into successful reasoning about artificial test stimuli. On the\nflip side, our results also highlight the limitations of such transfer, which\nmay explain why intelligence tests like the Raven's are often sensitive to\npeople's individual sociocultural backgrounds.", "comment": null, "links": []}
{"entry_id": "1907.03950", "title": "Learning by Abstraction: The Neural State Machine", "authors": ["Drew A. Hudson", "Christopher D. Manning"], "published": "2019-07-09 03:08:41", "updated": "2019-11-25 10:02:05", "summary": "We introduce the Neural State Machine, seeking to bridge the gap between the\nneural and symbolic views of AI and integrate their complementary strengths for\nthe task of visual reasoning. Given an image, we first predict a probabilistic\ngraph that represents its underlying semantics and serves as a structured world\nmodel. Then, we perform sequential reasoning over the graph, iteratively\ntraversing its nodes to answer a given question or draw a new inference. In\ncontrast to most neural architectures that are designed to closely interact\nwith the raw sensory data, our model operates instead in an abstract latent\nspace, by transforming both the visual and linguistic modalities into semantic\nconcept-based representations, thereby achieving enhanced transparency and\nmodularity. We evaluate our model on VQA-CP and GQA, two recent VQA datasets\nthat involve compositionality, multi-step inference and diverse reasoning\nskills, achieving state-of-the-art results in both cases. We provide further\nexperiments that illustrate the model's strong generalization capacity across\nmultiple dimensions, including novel compositions of concepts, changes in the\nanswer distribution, and unseen linguistic structures, demonstrating the\nqualities and efficacy of our approach.", "comment": "Published as a conference paper at NeurIPS 2019 (spotlight)", "links": []}
{"entry_id": "1911.09655", "title": "Temporal Reasoning via Audio Question Answering", "authors": ["Haytham M. Fayek", "Justin Johnson"], "published": "2019-11-21 18:26:30", "updated": "2019-11-21 18:26:30", "summary": "Multimodal question answering tasks can be used as proxy tasks to study\nsystems that can perceive and reason about the world. Answering questions about\ndifferent types of input modalities stresses different aspects of reasoning\nsuch as visual reasoning, reading comprehension, story understanding, or\nnavigation. In this paper, we use the task of Audio Question Answering (AQA) to\nstudy the temporal reasoning abilities of machine learning models. To this end,\nwe introduce the Diagnostic Audio Question Answering (DAQA) dataset comprising\naudio sequences of natural sound events and programmatically generated\nquestions and answers that probe various aspects of temporal reasoning. We\nadapt several recent state-of-the-art methods for visual question answering to\nthe AQA task, and use DAQA to demonstrate that they perform poorly on questions\nthat require in-depth temporal reasoning. Finally, we propose a new model,\nMultiple Auxiliary Controllers for Linear Modulation (MALiMo) that extends the\nrecent Feature-wise Linear Modulation (FiLM) model and significantly improves\nits temporal reasoning capabilities. We envisage DAQA to foster research on AQA\nand temporal reasoning and MALiMo a step towards models for AQA.", "comment": null, "links": []}
{"entry_id": "1911.09375", "title": "ChartNet: Visual Reasoning over Statistical Charts using MAC-Networks", "authors": ["Monika Sharma", "Shikha Gupta", "Arindam Chowdhury", "Lovekesh Vig"], "published": "2019-11-21 10:03:25", "updated": "2019-11-21 10:03:25", "summary": "Despite the improvements in perception accuracies brought about via deep\nlearning, developing systems combining accurate visual perception with the\nability to reason over the visual percepts remains extremely challenging. A\nparticular application area of interest from an accessibility perspective is\nthat of reasoning over statistical charts such as bar and pie charts. To this\nend, we formulate the problem of reasoning over statistical charts as a\nclassification task using MAC-Networks to give answers from a predefined\nvocabulary of generic answers. Additionally, we enhance the capabilities of\nMAC-Networks to give chart-specific answers to open-ended questions by\nreplacing the classification layer by a regression layer to localize the\ntextual answers present over the images. We call our network ChartNet, and\ndemonstrate its efficacy on predicting both in vocabulary and out of vocabulary\nanswers. To test our methods, we generated our own dataset of statistical chart\nimages and corresponding question answer pairs. Results show that ChartNet\nconsistently outperform other state-of-the-art methods on reasoning over these\nquestions and may be a viable candidate for applications containing images of\nstatistical charts.", "comment": null, "links": ["http://dx.doi.org/10.1109/IJCNN.2019.8852427"]}
{"entry_id": "1911.07721", "title": "Program synthesis performance constrained by non-linear spatial relations in Synthetic Visual Reasoning Test", "authors": ["Lu Yihe", "Scott C. Lowe", "Penelope A. Lewis", "Mark C. W. van Rossum"], "published": "2019-11-18 15:47:03", "updated": "2019-11-19 12:32:25", "summary": "Despite remarkable advances in automated visual recognition by machines, some\nvisual tasks remain challenging for machines. Fleuret et al. (2011) introduced\nthe Synthetic Visual Reasoning Test (SVRT) to highlight this point, which\nrequired classification of images consisting of randomly generated shapes based\non hidden abstract rules using only a few examples. Ellis et al. (2015)\ndemonstrated that a program synthesis approach could solve some of the SVRT\nproblems with unsupervised, few-shot learning, whereas they remained\nchallenging for several convolutional neural networks trained with thousands of\nexamples. Here we re-considered the human and machine experiments, because they\nfollowed different protocols and yielded different statistics. We thus proposed\na quantitative reintepretation of the data between the protocols, so that we\ncould make fair comparison between human and machine performance. We improved\nthe program synthesis classifier by correcting the image parsings, and compared\nthe results to the performance of other machine agents and human subjects. We\ngrouped the SVRT problems into different types by the two aspects of the core\ncharacteristics for classification: shape specification and location relation.\nWe found that the program synthesis classifier could not solve problems\ninvolving shape distances, because it relied on symbolic computation which\nscales poorly with input dimension and adding distances into such computation\nwould increase the dimension combinatorially with the number of shapes in an\nimage. Therefore, although the program synthesis classifier is capable of\nabstract reasoning, its performance is highly constrained by the accessible\ninformation in image parsings.", "comment": null, "links": []}
{"entry_id": "1911.05990", "title": "Attention on Abstract Visual Reasoning", "authors": ["Lukas Hahne", "Timo Lüddecke", "Florentin Wörgötter", "David Kappel"], "published": "2019-11-14 08:33:40", "updated": "2019-11-14 08:33:40", "summary": "Attention mechanisms have been boosting the performance of deep learning\nmodels on a wide range of applications, ranging from speech understanding to\nprogram induction. However, despite experiments from psychology which suggest\nthat attention plays an essential role in visual reasoning, the full potential\nof attention mechanisms has so far not been explored to solve abstract\ncognitive tasks on image data. In this work, we propose a hybrid network\narchitecture, grounded on self-attention and relational reasoning. We call this\nnew model Attention Relation Network (ARNe). ARNe combines features from the\nrecently introduced Transformer and the Wild Relation Network (WReN). We test\nARNe on the Procedurally Generated Matrices (PGMs) datasets for abstract visual\nreasoning. ARNe excels the WReN model on this task by 11.28 ppt. Relational\nconcepts between objects are efficiently learned demanding only 35% of the\ntraining samples to surpass reported accuracy of the base line model. Our\nproposed hybrid model, represents an alternative on learning abstract relations\nusing self-attention and demonstrates that the Transformer network is also well\nsuited for abstract visual reasoning.", "comment": null, "links": []}
{"entry_id": "1908.05054", "title": "Fusion of Detected Objects in Text for Visual Question Answering", "authors": ["Chris Alberti", "Jeffrey Ling", "Michael Collins", "David Reitter"], "published": "2019-08-14 10:03:12", "updated": "2019-11-03 05:04:09", "summary": "To advance models of multimodal context, we introduce a simple yet powerful\nneural architecture for data that combines vision and natural language. The\n\"Bounding Boxes in Text Transformer\" (B2T2) also leverages referential\ninformation binding words to portions of the image in a single unified\narchitecture. B2T2 is highly effective on the Visual Commonsense Reasoning\nbenchmark (https://visualcommonsense.com), achieving a new state-of-the-art\nwith a 25% relative reduction in error rate compared to published baselines and\nobtaining the best performance to date on the public leaderboard (as of May 22,\n2019). A detailed ablation analysis shows that the early integration of the\nvisual features into the text analysis is key to the effectiveness of the new\narchitecture. A reference implementation of our models is provided\n(https://github.com/google-research/language/tree/master/language/question_answering/b2t2).", "comment": null, "links": []}
{"entry_id": "1910.03343", "title": "Modulated Self-attention Convolutional Network for VQA", "authors": ["Jean-Benoit Delbrouck", "Antoine Maiorca", "Nathan Hubens", "Stéphane Dupont"], "published": "2019-10-08 11:28:38", "updated": "2019-10-31 16:59:23", "summary": "As new data-sets for real-world visual reasoning and compositional question\nanswering are emerging, it might be needed to use the visual feature extraction\nas a end-to-end process during training. This small contribution aims to\nsuggest new ideas to improve the visual processing of traditional convolutional\nnetwork for visual question answering (VQA). In this paper, we propose to\nmodulate by a linguistic input a CNN augmented with self-attention. We show\nencouraging relative improvements for future research in this direction.", "comment": "Accepted at NeurIPS 2019 workshop: ViGIL", "links": []}
{"entry_id": "1910.11475", "title": "Heterogeneous Graph Learning for Visual Commonsense Reasoning", "authors": ["Weijiang Yu", "Jingwen Zhou", "Weihao Yu", "Xiaodan Liang", "Nong Xiao"], "published": "2019-10-25 01:04:46", "updated": "2019-10-25 01:04:46", "summary": "Visual commonsense reasoning task aims at leading the research field into\nsolving cognition-level reasoning with the ability of predicting correct\nanswers and meanwhile providing convincing reasoning paths, resulting in three\nsub-tasks i.e., Q->A, QA->R and Q->AR. It poses great challenges over the\nproper semantic alignment between vision and linguistic domains and knowledge\nreasoning to generate persuasive reasoning paths. Existing works either resort\nto a powerful end-to-end network that cannot produce interpretable reasoning\npaths or solely explore intra-relationship of visual objects (homogeneous\ngraph) while ignoring the cross-domain semantic alignment among visual concepts\nand linguistic words. In this paper, we propose a new Heterogeneous Graph\nLearning (HGL) framework for seamlessly integrating the intra-graph and\ninter-graph reasoning in order to bridge vision and language domain. Our HGL\nconsists of a primal vision-to-answer heterogeneous graph (VAHG) module and a\ndual question-to-answer heterogeneous graph (QAHG) module to interactively\nrefine reasoning paths for semantic agreement. Moreover, our HGL integrates a\ncontextual voting module to exploit a long-range visual context for better\nglobal reasoning. Experiments on the large-scale Visual Commonsense Reasoning\nbenchmark demonstrate the superior performance of our proposed modules on three\ntasks (improving 5% accuracy on Q->A, 3.5% on QA->R, 5.8% on Q->AR)", "comment": "11 pages, 5 figures", "links": []}
{"entry_id": "1909.09065", "title": "Towards Explainable Neural-Symbolic Visual Reasoning", "authors": ["Adrien Bennetot", "Jean-Luc Laurent", "Raja Chatila", "Natalia Díaz-Rodríguez"], "published": "2019-09-19 16:04:57", "updated": "2019-10-22 15:23:49", "summary": "Many high-performance models suffer from a lack of interpretability. There\nhas been an increasing influx of work on explainable artificial intelligence\n(XAI) in order to disentangle what is meant and expected by XAI. Nevertheless,\nthere is no general consensus on how to produce and judge explanations. In this\npaper, we discuss why techniques integrating connectionist and symbolic\nparadigms are the most efficient solutions to produce explanations for\nnon-technical users and we propose a reasoning model, based on definitions by\nDoran et al. [2017] (arXiv:1710.00794) to explain a neural network's decision.\nWe use this explanation in order to correct bias in the network's decision\nrationale. We accompany this model with an example of its potential use, based\non the image captioning method in Burns et al. [2018] (arXiv:1803.09797).", "comment": "Accepted at IJCAI19 Neural-Symbolic Learning and Reasoning Workshop\n  (https://sites.google.com/view/nesy2019/home)", "links": []}
{"entry_id": "1812.03299", "title": "Learning to Assemble Neural Module Tree Networks for Visual Grounding", "authors": ["Daqing Liu", "Hanwang Zhang", "Feng Wu", "Zheng-Jun Zha"], "published": "2018-12-08 11:04:34", "updated": "2019-10-21 12:31:10", "summary": "Visual grounding, a task to ground (i.e., localize) natural language in\nimages, essentially requires composite visual reasoning. However, existing\nmethods over-simplify the composite nature of language into a monolithic\nsentence embedding or a coarse composition of subject-predicate-object triplet.\nIn this paper, we propose to ground natural language in an intuitive,\nexplainable, and composite fashion as it should be. In particular, we develop a\nnovel modular network called Neural Module Tree network (NMTree) that\nregularizes the visual grounding along the dependency parsing tree of the\nsentence, where each node is a neural module that calculates visual attention\naccording to its linguistic feature, and the grounding score is accumulated in\na bottom-up direction where as needed. NMTree disentangles the visual grounding\nfrom the composite reasoning, allowing the former to only focus on primitive\nand easy-to-generalize patterns. To reduce the impact of parsing errors, we\ntrain the modules and their assembly end-to-end by using the Gumbel-Softmax\napproximation and its straight-through gradient estimator, accounting for the\ndiscrete nature of module assembly. Overall, the proposed NMTree consistently\noutperforms the state-of-the-arts on several benchmarks. Qualitative results\nshow explainable grounding score calculation in great detail.", "comment": "Accepted at ICCV 2019 (Oral); Code available at\n  https://github.com/daqingliu/NMTree", "links": []}
{"entry_id": "1910.01833", "title": "Few-Shot Abstract Visual Reasoning With Spectral Features", "authors": ["Tanner Bohn", "Yining Hu", "Charles X. Ling"], "published": "2019-10-04 08:15:15", "updated": "2019-10-04 08:15:15", "summary": "We present an image preprocessing technique capable of improving the\nperformance of few-shot classifiers on abstract visual reasoning tasks. Many\nvisual reasoning tasks with abstract features are easy for humans to learn with\nfew examples but very difficult for computer vision approaches with the same\nnumber of samples, despite the ability for deep learning models to learn\nabstract features. Same-different (SD) problems represent a type of visual\nreasoning task requiring knowledge of pattern repetition within individual\nimages, and modern computer vision approaches have largely faltered on these\nclassification problems, even when provided with vast amounts of training data.\nWe propose a simple method for solving these problems based on the insight that\nremoving peaks from the amplitude spectrum of an image is capable of\nemphasizing the unique parts of the image. When combined with several\nclassifiers, our method performs well on the SD SVRT tasks with few-shot\nlearning, improving upon the best comparable results on all tasks, with average\nabsolute accuracy increases nearly 40% for some classifiers. In particular, we\nfind that combining Relational Networks with this image preprocessing approach\nimproves their performance from chance-level to over 90% accuracy on several SD\ntasks.", "comment": "11 pages, 3 figures", "links": []}
{"entry_id": "1909.08859", "title": "Procedural Reasoning Networks for Understanding Multimodal Procedures", "authors": ["Mustafa Sercan Amac", "Semih Yagcioglu", "Aykut Erdem", "Erkut Erdem"], "published": "2019-09-19 08:39:00", "updated": "2019-09-19 08:39:00", "summary": "This paper addresses the problem of comprehending procedural commonsense\nknowledge. This is a challenging task as it requires identifying key entities,\nkeeping track of their state changes, and understanding temporal and causal\nrelations. Contrary to most of the previous work, in this study, we do not rely\non strong inductive bias and explore the question of how multimodality can be\nexploited to provide a complementary semantic signal. Towards this end, we\nintroduce a new entity-aware neural comprehension model augmented with external\nrelational memory units. Our model learns to dynamically update entity states\nin relation to each other while reading the text instructions. Our experimental\nanalysis on the visual reasoning tasks in the recently proposed RecipeQA\ndataset reveals that our approach improves the accuracy of the previously\nreported models by a large margin. Moreover, we find that our model learns\neffective dynamic representations of entities even though we do not use any\nsupervision at the level of entity states.", "comment": "Accepted to CoNLL 2019. The project website with code and demo is\n  available at https://hucvl.github.io/prn/", "links": []}
{"entry_id": "1909.08164", "title": "Dynamic Graph Attention for Referring Expression Comprehension", "authors": ["Sibei Yang", "Guanbin Li", "Yizhou Yu"], "published": "2019-09-18 01:47:27", "updated": "2019-09-18 01:47:27", "summary": "Referring expression comprehension aims to locate the object instance\ndescribed by a natural language referring expression in an image. This task is\ncompositional and inherently requires visual reasoning on top of the\nrelationships among the objects in the image. Meanwhile, the visual reasoning\nprocess is guided by the linguistic structure of the referring expression.\nHowever, existing approaches treat the objects in isolation or only explore the\nfirst-order relationships between objects without being aligned with the\npotential complexity of the expression. Thus it is hard for them to adapt to\nthe grounding of complex referring expressions. In this paper, we explore the\nproblem of referring expression comprehension from the perspective of\nlanguage-driven visual reasoning, and propose a dynamic graph attention network\nto perform multi-step reasoning by modeling both the relationships among the\nobjects in the image and the linguistic structure of the expression. In\nparticular, we construct a graph for the image with the nodes and edges\ncorresponding to the objects and their relationships respectively, propose a\ndifferential analyzer to predict a language-guided visual reasoning process,\nand perform stepwise reasoning on top of the graph to update the compound\nobject representation at every node. Experimental results demonstrate that the\nproposed method can not only significantly surpass all existing\nstate-of-the-art algorithms across three common benchmark datasets, but also\ngenerate interpretable visual evidences for stepwisely locating the objects\nreferred to in complex language descriptions.", "comment": "Accepted as an Oral presentation at ICCV2019", "links": []}
{"entry_id": "1907.06794", "title": "2nd Place Solution to the GQA Challenge 2019", "authors": ["Shijie Geng", "Ji Zhang", "Hang Zhang", "Ahmed Elgammal", "Dimitris N. Metaxas"], "published": "2019-07-16 00:09:09", "updated": "2019-08-16 22:04:53", "summary": "We present a simple method that achieves unexpectedly superior performance\nfor Complex Reasoning involved Visual Question Answering. Our solution collects\nstatistical features from high-frequency words of all the questions asked about\nan image and use them as accurate knowledge for answering further questions of\nthe same image. We are fully aware that this setting is not ubiquitously\napplicable, and in a more common setting one should assume the questions are\nasked separately and they cannot be gathered to obtain a knowledge base.\nNonetheless, we use this method as an evidence to demonstrate our observation\nthat the bottleneck effect is more severe on the feature extraction part than\nit is on the knowledge reasoning part. We show significant gaps when using the\nsame reasoning model with 1) ground-truth features; 2) statistical features; 3)\ndetected features from completely learned detectors, and analyze what these\ngaps mean to researches on visual reasoning topics. Our model with the\nstatistical features achieves the 2nd place in the GQA Challenge 2019.", "comment": null, "links": []}
{"entry_id": "1908.02265", "title": "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", "authors": ["Jiasen Lu", "Dhruv Batra", "Devi Parikh", "Stefan Lee"], "published": "2019-08-06 17:33:52", "updated": "2019-08-06 17:33:52", "summary": "We present ViLBERT (short for Vision-and-Language BERT), a model for learning\ntask-agnostic joint representations of image content and natural language. We\nextend the popular BERT architecture to a multi-modal two-stream model,\npro-cessing both visual and textual inputs in separate streams that interact\nthrough co-attentional transformer layers. We pretrain our model through two\nproxy tasks on the large, automatically collected Conceptual Captions dataset\nand then transfer it to multiple established vision-and-language tasks --\nvisual question answering, visual commonsense reasoning, referring expressions,\nand caption-based image retrieval -- by making only minor additions to the base\narchitecture. We observe significant improvements across tasks compared to\nexisting task-specific models -- achieving state-of-the-art on all four tasks.\nOur work represents a shift away from learning groundings between vision and\nlanguage only as part of task training and towards treating visual grounding as\na pretrainable and transferable capability.", "comment": "11 pages, 5 figures", "links": []}
{"entry_id": "1907.12271", "title": "V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices", "authors": ["Damien Teney", "Peng Wang", "Jiewei Cao", "Lingqiao Liu", "Chunhua Shen", "Anton van den Hengel"], "published": "2019-07-29 08:28:33", "updated": "2019-07-29 08:28:33", "summary": "One of the primary challenges faced by deep learning is the degree to which\ncurrent methods exploit superficial statistics and dataset bias, rather than\nlearning to generalise over the specific representations they have experienced.\nThis is a critical concern because generalisation enables robust reasoning over\nunseen data, whereas leveraging superficial statistics is fragile to even small\nchanges in data distribution. To illuminate the issue and drive progress\ntowards a solution, we propose a test that explicitly evaluates abstract\nreasoning over visual data. We introduce a large-scale benchmark of visual\nquestions that involve operations fundamental to many high-level vision tasks,\nsuch as comparisons of counts and logical operations on complex visual\nproperties. The benchmark directly measures a method's ability to infer\nhigh-level relationships and to generalise them over image-based concepts. It\nincludes multiple training/test splits that require controlled levels of\ngeneralization. We evaluate a range of deep learning architectures, and find\nthat existing models, including those popular for vision-and-language tasks,\nare unable to solve seemingly-simple instances. Models using relational\nnetworks fare better but leave substantial room for improvement.", "comment": null, "links": []}
{"entry_id": "1811.00491", "title": "A Corpus for Reasoning About Natural Language Grounded in Photographs", "authors": ["Alane Suhr", "Stephanie Zhou", "Ally Zhang", "Iris Zhang", "Huajun Bai", "Yoav Artzi"], "published": "2018-11-01 16:47:44", "updated": "2019-07-21 05:26:36", "summary": "We introduce a new dataset for joint reasoning about natural language and\nimages, with a focus on semantic diversity, compositionality, and visual\nreasoning challenges. The data contains 107,292 examples of English sentences\npaired with web photographs. The task is to determine whether a natural\nlanguage caption is true about a pair of photographs. We crowdsource the data\nusing sets of visually rich images and a compare-and-contrast task to elicit\nlinguistically diverse language. Qualitative analysis shows the data requires\ncompositional joint reasoning, including about quantities, comparisons, and\nrelations. Evaluation using state-of-the-art visual reasoning methods shows the\ndata presents a strong challenge.", "comment": "ACL 2019 Long Paper", "links": []}
{"entry_id": "1902.03380", "title": "When Causal Intervention Meets Adversarial Examples and Image Masking for Deep Neural Networks", "authors": ["Chao-Han Huck Yang", "Yi-Chieh Liu", "Pin-Yu Chen", "Xiaoli Ma", "Yi-Chang James Tsai"], "published": "2019-02-09 06:44:13", "updated": "2019-06-25 15:07:42", "summary": "Discovering and exploiting the causality in deep neural networks (DNNs) are\ncrucial challenges for understanding and reasoning causal effects (CE) on an\nexplainable visual model. \"Intervention\" has been widely used for recognizing a\ncausal relation ontologically. In this paper, we propose a causal inference\nframework for visual reasoning via do-calculus. To study the intervention\neffects on pixel-level features for causal reasoning, we introduce pixel-wise\nmasking and adversarial perturbation. In our framework, CE is calculated using\nfeatures in a latent space and perturbed prediction from a DNN-based model. We\nfurther provide the first look into the characteristics of discovered CE of\nadversarially perturbed images generated by gradient-based methods\n\\footnote{~~https://github.com/jjaacckkyy63/Causal-Intervention-AE-wAdvImg}.\nExperimental results show that CE is a competitive and robust index for\nunderstanding DNNs when compared with conventional methods such as\nclass-activation mappings (CAMs) on the Chest X-Ray-14 dataset for\nhuman-interpretable feature(s) (e.g., symptom) reasoning. Moreover, CE holds\npromises for detecting adversarial examples as it possesses distinct\ncharacteristics in the presence of adversarial perturbations.", "comment": "Noted our camera-ready version has changed the title. \"When Causal\n  Intervention Meets Adversarial Examples and Image Masking for Deep Neural\n  Networks\" as the v3 official paper title in IEEE Proceeding. Please use it in\n  your formal reference. Accepted at IEEE ICIP 2019. Pytorch code has released\n  on https://github.com/jjaacckkyy63/Causal-Intervention-AE-wAdvImg", "links": ["http://dx.doi.org/10.1109/ICIP.2019.8803554"]}
{"entry_id": "1905.10226", "title": "Deep Reason: A Strong Baseline for Real-World Visual Reasoning", "authors": ["Chenfei Wu", "Yanzhao Zhou", "Gen Li", "Nan Duan", "Duyu Tang", "Xiaojie Wang"], "published": "2019-05-24 13:34:21", "updated": "2019-06-17 15:26:58", "summary": "This paper presents a strong baseline for real-world visual reasoning (GQA),\nwhich achieves 60.93% in GQA 2019 challenge and won the sixth place. GQA is a\nlarge dataset with 22M questions involving spatial understanding and multi-step\ninference. To help further research in this area, we identified three crucial\nparts that improve the performance, namely: multi-source features, fine-grained\nencoder, and score-weighted ensemble. We provide a series of analysis on their\nimpact on performance.", "comment": "CVPR 2019 Visual Question Answering and Dialog Workshop", "links": []}
{"entry_id": "1906.01784", "title": "Learning to Compose and Reason with Language Tree Structures for Visual Grounding", "authors": ["Richang Hong", "Daqing Liu", "Xiaoyu Mo", "Xiangnan He", "Hanwang Zhang"], "published": "2019-06-05 02:03:55", "updated": "2019-06-05 02:03:55", "summary": "Grounding natural language in images, such as localizing \"the black dog on\nthe left of the tree\", is one of the core problems in artificial intelligence,\nas it needs to comprehend the fine-grained and compositional language space.\nHowever, existing solutions rely on the association between the holistic\nlanguage features and visual features, while neglect the nature of\ncompositional reasoning implied in the language. In this paper, we propose a\nnatural language grounding model that can automatically compose a binary tree\nstructure for parsing the language and then perform visual reasoning along the\ntree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding\nTree, which is inspired by the intuition that any language expression can be\nrecursively decomposed into two constituent parts, and the grounding confidence\nscore can be recursively accumulated by calculating their grounding scores\nreturned by sub-trees. RVG-TREE can be trained end-to-end by using the\nStraight-Through Gumbel-Softmax estimator that allows the gradients from the\ncontinuous score functions passing through the discrete tree construction.\nExperiments on several benchmarks show that our model achieves the\nstate-of-the-art performance with more explainable reasoning.", "comment": "Accepted to IEEE Transactions on Pattern Analysis and Machine\n  Intelligence (T-PAMI)", "links": ["http://dx.doi.org/10.1109/TPAMI.2019.2911066"]}
{"entry_id": "1905.08621", "title": "Shortest-Path-Preserving Rounding", "authors": ["Herman Haverkort", "David Kübel", "Elmar Langetepe"], "published": "2019-05-21 13:30:37", "updated": "2019-05-21 13:30:37", "summary": "Various applications of graphs, in particular applications related to finding\nshortest paths, naturally get inputs with real weights on the edges. However,\nfor algorithmic or visualization reasons, inputs with integer weights would\noften be preferable or even required. This raises the following question: given\nan undirected graph with non-negative real weights on the edges and an error\nthreshold $\\varepsilon$, how efficiently can we decide whether we can round all\nweights such that shortest paths are maintained, and the change of weight of\neach shortest path is less than $\\varepsilon$? So far, only for path-shaped\ngraphs a polynomial-time algorithm was known. In this paper we prove, by\nreduction from 3-SAT, that, in general, the problem is NP-hard. However, if the\ngraph is a tree with $n$ vertices, the problem can be solved in $O(n^2)$ time.", "comment": "20 pages, 5 figures, pre-print of an article presented at IWOCA 2019", "links": []}
{"entry_id": "1902.09506", "title": "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", "authors": ["Drew A. Hudson", "Christopher D. Manning"], "published": "2019-02-25 18:37:49", "updated": "2019-05-10 22:24:55", "summary": "We introduce GQA, a new dataset for real-world visual reasoning and\ncompositional question answering, seeking to address key shortcomings of\nprevious VQA datasets. We have developed a strong and robust question engine\nthat leverages scene graph structures to create 22M diverse reasoning\nquestions, all come with functional programs that represent their semantics. We\nuse the programs to gain tight control over the answer distribution and present\na new tunable smoothing technique to mitigate question biases. Accompanying the\ndataset is a suite of new metrics that evaluate essential qualities such as\nconsistency, grounding and plausibility. An extensive analysis is performed for\nbaselines as well as state-of-the-art models, providing fine-grained results\nfor different question types and topologies. Whereas a blind LSTM obtains mere\n42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%,\noffering ample opportunity for new research to explore. We strongly hope GQA\nwill provide an enabling resource for the next generation of models with\nenhanced robustness, improved consistency, and deeper semantic understanding\nfor images and language.", "comment": "Published as a conference paper at CVPR 2019 (oral)", "links": []}
{"entry_id": "1904.08608", "title": "Learning to Collocate Neural Modules for Image Captioning", "authors": ["Xu Yang", "Hanwang Zhang", "Jianfei Cai"], "published": "2019-04-18 07:03:19", "updated": "2019-04-18 07:03:19", "summary": "We do not speak word by word from scratch; our brain quickly structures a\npattern like \\textsc{sth do sth at someplace} and then fill in the detailed\ndescriptions. To render existing encoder-decoder image captioners such\nhuman-like reasoning, we propose a novel framework: learning to Collocate\nNeural Modules (CNM), to generate the `inner pattern' connecting visual encoder\nand language decoder. Unlike the widely-used neural module networks in visual\nQ\\&A, where the language (ie, question) is fully observable, CNM for captioning\nis more challenging as the language is being generated and thus is partially\nobservable. To this end, we make the following technical contributions for CNM\ntraining: 1) compact module design --- one for function words and three for\nvisual content words (eg, noun, adjective, and verb), 2) soft module fusion and\nmulti-step module execution, robustifying the visual reasoning in partial\nobservation, 3) a linguistic loss for module controller being faithful to\npart-of-speech collocations (eg, adjective is before noun). Extensive\nexperiments on the challenging MS-COCO image captioning benchmark validate the\neffectiveness of our CNM image captioner. In particular, CNM achieves a new\nstate-of-the-art 127.9 CIDEr-D on Karpathy split and a single-model 126.0 c40\non the official server. CNM is also robust to few training samples, eg, by\ntraining only one sentence per image, CNM can halve the performance loss\ncompared to a strong baseline.", "comment": null, "links": []}
{"entry_id": "1901.00850", "title": "CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions", "authors": ["Runtao Liu", "Chenxi Liu", "Yutong Bai", "Alan Yuille"], "published": "2019-01-03 18:58:06", "updated": "2019-04-06 19:59:25", "summary": "Referring object detection and referring image segmentation are important\ntasks that require joint understanding of visual information and natural\nlanguage. Yet there has been evidence that current benchmark datasets suffer\nfrom bias, and current state-of-the-art models cannot be easily evaluated on\ntheir intermediate reasoning process. To address these issues and complement\nsimilar efforts in visual question answering, we build CLEVR-Ref+, a synthetic\ndiagnostic dataset for referring expression comprehension. The precise\nlocations and attributes of the objects are readily available, and the\nreferring expressions are automatically associated with functional programs.\nThe synthetic nature allows control over dataset bias (through sampling\nstrategy), and the modular programs enable intermediate reasoning ground truth\nwithout human annotators.\n  In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we\nalso propose IEP-Ref, a module network approach that significantly outperforms\nother models on our dataset. In particular, we present two interesting and\nimportant findings using IEP-Ref: (1) the module trained to transform feature\nmaps into segmentation masks can be attached to any intermediate module to\nreveal the entire reasoning process step-by-step; (2) even if all training data\nhas at least one object referred, IEP-Ref can correctly predict no-foreground\nwhen presented with false-premise referring expressions. To the best of our\nknowledge, this is the first direct and quantitative proof that neural modules\nbehave in the way they are intended.", "comment": "To appear in CVPR 2019. All data and code concerning CLEVR-Ref+ and\n  IEP-Ref have been released at https://cs.jhu.edu/~cxliu/2019/clevr-ref+", "links": []}
{"entry_id": "1811.10830", "title": "From Recognition to Cognition: Visual Commonsense Reasoning", "authors": ["Rowan Zellers", "Yonatan Bisk", "Ali Farhadi", "Yejin Choi"], "published": "2018-11-27 06:22:26", "updated": "2019-03-26 17:50:34", "summary": "Visual understanding goes well beyond object recognition. With one glance at\nan image, we can effortlessly imagine the world beyond the pixels: for\ninstance, we can infer people's actions, goals, and mental states. While this\ntask is easy for humans, it is tremendously difficult for today's vision\nsystems, requiring higher-order cognition and commonsense reasoning about the\nworld. We formalize this task as Visual Commonsense Reasoning. Given a\nchallenging question about an image, a machine must answer correctly and then\nprovide a rationale justifying its answer.\n  Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA\nproblems derived from 110k movie scenes. The key recipe for generating\nnon-trivial and high-quality problems at scale is Adversarial Matching, a new\napproach to transform rich annotations into multiple choice questions with\nminimal bias. Experimental results show that while humans find VCR easy (over\n90% accuracy), state-of-the-art vision models struggle (~45%).\n  To move towards cognition-level understanding, we present a new reasoning\nengine, Recognition to Cognition Networks (R2C), that models the necessary\nlayered inferences for grounding, contextualization, and reasoning. R2C helps\nnarrow the gap between humans and machines (~65%); still, the challenge is far\nfrom solved, and we provide analysis that suggests avenues for future work.", "comment": "CVPR 2019 oral. Project page at https://visualcommonsense.com", "links": []}
{"entry_id": "1812.01855", "title": "Explainable and Explicit Visual Reasoning over Scene Graphs", "authors": ["Jiaxin Shi", "Hanwang Zhang", "Juanzi Li"], "published": "2018-12-05 08:35:05", "updated": "2019-03-19 12:55:14", "summary": "We aim to dismantle the prevalent black-box neural architectures used in\ncomplex visual reasoning tasks, into the proposed eXplainable and eXplicit\nNeural Modules (XNMs), which advance beyond existing neural module networks\ntowards using scene graphs --- objects as nodes and the pairwise relationships\nas edges --- for explainable and explicit reasoning with structured knowledge.\nXNMs allow us to pay more attention to teach machines how to \"think\",\nregardless of what they \"look\". As we will show in the paper, by using scene\ngraphs as an inductive bias, 1) we can design XNMs in a concise and flexible\nfashion, i.e., XNMs merely consist of 4 meta-types, which significantly reduce\nthe number of parameters by 10 to 100 times, and 2) we can explicitly trace the\nreasoning-flow in terms of graph attentions. XNMs are so generic that they\nsupport a wide range of scene graph implementations with various qualities. For\nexample, when the graphs are detected perfectly, XNMs achieve 100% accuracy on\nboth CLEVR and CLEVR CoGenT, establishing an empirical performance upper-bound\nfor visual reasoning; when the graphs are noisily detected from real-world\nimages, XNMs are still robust to achieve a competitive 67.5% accuracy on\nVQAv2.0, surpassing the popular bag-of-objects attention models without graph\nstructures.", "comment": "CVPR2019", "links": []}
{"entry_id": "1711.05240", "title": "Weakly-supervised Semantic Parsing with Abstract Examples", "authors": ["Omer Goldman", "Veronica Latcinnik", "Udi Naveh", "Amir Globerson", "Jonathan Berant"], "published": "2017-11-14 18:29:05", "updated": "2019-03-13 09:30:38", "summary": "Training semantic parsers from weak supervision (denotations) rather than\nstrong supervision (programs) complicates training in two ways. First, a large\nsearch space of potential programs needs to be explored at training time to\nfind a correct program. Second, spurious programs that accidentally lead to a\ncorrect denotation add noise to training. In this work we propose that in\nclosed worlds with clear semantic types, one can substantially alleviate these\nproblems by utilizing an abstract representation, where tokens in both the\nlanguage utterance and program are lifted to an abstract form. We show that\nthese abstractions can be defined with a handful of lexical rules and that they\nresult in sharing between different examples that alleviates the difficulties\nin training. To test our approach, we develop the first semantic parser for\nCNLVR, a challenging visual reasoning dataset, where the search space is large\nand overcoming spuriousness is critical, because denotations are either TRUE or\nFALSE, and thus random programs are likely to lead to a correct denotation. Our\nmethod substantially improves performance, and reaches 82.5% accuracy, a 14.7%\nabsolute accuracy improvement compared to the best reported accuracy so far.", "comment": "CNLVR,NLVR. Accepted to ACL 2018", "links": []}
{"entry_id": "1903.02741", "title": "RAVEN: A Dataset for Relational and Analogical Visual rEasoNing", "authors": ["Chi Zhang", "Feng Gao", "Baoxiong Jia", "Yixin Zhu", "Song-Chun Zhu"], "published": "2019-03-07 06:28:44", "updated": "2019-03-07 06:28:44", "summary": "Dramatic progress has been witnessed in basic vision tasks involving\nlow-level perception, such as object recognition, detection, and tracking.\nUnfortunately, there is still an enormous performance gap between artificial\nvision systems and human intelligence in terms of higher-level vision problems,\nespecially ones involving reasoning. Earlier attempts in equipping machines\nwith high-level reasoning have hovered around Visual Question Answering (VQA),\none typical task associating vision and language understanding. In this work,\nwe propose a new dataset, built in the context of Raven's Progressive Matrices\n(RPM) and aimed at lifting machine intelligence by associating vision with\nstructural, relational, and analogical reasoning in a hierarchical\nrepresentation. Unlike previous works in measuring abstract reasoning using\nRPM, we establish a semantic link between vision and reasoning by providing\nstructure representation. This addition enables a new type of abstract\nreasoning by jointly operating on the structure representation. Machine\nreasoning ability using modern computer vision is evaluated in this newly\nproposed dataset. Additionally, we also provide human performance as a\nreference. Finally, we show consistent improvement across all models by\nincorporating a simple neural module that combines visual understanding and\nstructure reasoning.", "comment": "CVPR 2019 paper. Supplementary:\n  http://wellyzhang.github.io/attach/cvpr19zhang_supp.pdf Project:\n  http://wellyzhang.github.io/project/raven.html", "links": []}
{"entry_id": "1902.11280", "title": "From Visual to Acoustic Question Answering", "authors": ["Jerome Abdelnour", "Giampiero Salvi", "Jean Rouat"], "published": "2019-02-28 18:35:45", "updated": "2019-02-28 18:35:45", "summary": "We introduce the new task of Acoustic Question Answering (AQA) to promote\nresearch in acoustic reasoning. The AQA task consists of analyzing an acoustic\nscene composed by a combination of elementary sounds and answering questions\nthat relate the position and properties of these sounds. The kind of relational\nquestions asked, require that the models perform non-trivial reasoning in order\nto answer correctly. Although similar problems have been extensively studied in\nthe domain of visual reasoning, we are not aware of any previous studies\naddressing the problem in the acoustic domain. We propose a method for\ngenerating the acoustic scenes from elementary sounds and a number of relevant\nquestions for each scene using templates. We also present preliminary results\nobtained with two models (FiLM and MAC) that have been shown to work for visual\nreasoning.", "comment": null, "links": []}
{"entry_id": "1902.04955", "title": "Can We Automate Diagrammatic Reasoning?", "authors": ["Sk. Arif Ahmed", "Debi Prosad Dogra", "Samarjit Kar", "Partha Pratim Roy", "Dilip K. Prasad"], "published": "2019-02-13 15:43:11", "updated": "2019-02-13 15:43:11", "summary": "Learning to solve diagrammatic reasoning (DR) can be a challenging but\ninteresting problem to the computer vision research community. It is believed\nthat next generation pattern recognition applications should be able to\nsimulate human brain to understand and analyze reasoning of images. However,\ndue to the lack of benchmarks of diagrammatic reasoning, the present research\nprimarily focuses on visual reasoning that can be applied to real-world\nobjects. In this paper, we present a diagrammatic reasoning dataset that\nprovides a large variety of DR problems. In addition, we also propose a\nKnowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning\nproblems. Our proposed analysis is arguably the first work in this research\narea. Several state-of-the-art learning frameworks have been used to compare\nwith the proposed KLSTM framework in the present context. Preliminary results\nindicate that the domain is highly related to computer vision and pattern\nrecognition research with several challenging avenues.", "comment": null, "links": []}
{"entry_id": "1901.06706", "title": "Visual Entailment: A Novel Task for Fine-Grained Image Understanding", "authors": ["Ning Xie", "Farley Lai", "Derek Doran", "Asim Kadav"], "published": "2019-01-20 17:55:05", "updated": "2019-01-20 17:55:05", "summary": "Existing visual reasoning datasets such as Visual Question Answering (VQA),\noften suffer from biases conditioned on the question, image or answer\ndistributions. The recently proposed CLEVR dataset addresses these limitations\nand requires fine-grained reasoning but the dataset is synthetic and consists\nof similar objects and sentence structures across the dataset.\n  In this paper, we introduce a new inference task, Visual Entailment (VE) -\nconsisting of image-sentence pairs whereby a premise is defined by an image,\nrather than a natural language sentence as in traditional Textual Entailment\ntasks. The goal of a trained VE model is to predict whether the image\nsemantically entails the text. To realize this task, we build a dataset SNLI-VE\nbased on the Stanford Natural Language Inference corpus and Flickr30k dataset.\nWe evaluate various existing VQA baselines and build a model called Explainable\nVisual Entailment (EVE) system to address the VE task. EVE achieves up to 71%\naccuracy and outperforms several other state-of-the-art VQA based models.\nFinally, we demonstrate the explainability of EVE through cross-modal attention\nvisualizations. The SNLI-VE dataset is publicly available at\nhttps://github.com/ necla-ml/SNLI-VE.", "comment": null, "links": []}
{"entry_id": "1901.05574", "title": "Visual Reasoning of Feature Attribution with Deep Recurrent Neural Networks", "authors": ["Chuan Wang", "Takeshi Onishi", "Keiichi Nemoto", "Kwan-Liu Ma"], "published": "2019-01-17 00:57:33", "updated": "2019-01-17 00:57:33", "summary": "Deep Recurrent Neural Network (RNN) has gained popularity in many sequence\nclassification tasks. Beyond predicting a correct class for each data instance,\ndata scientists also want to understand what differentiating factors in the\ndata have contributed to the classification during the learning process. We\npresent a visual analytics approach to facilitate this task by revealing the\nRNN attention for all data instances, their temporal positions in the\nsequences, and the attribution of variables at each value level. We demonstrate\nwith real-world datasets that our approach can help data scientists to\nunderstand such dynamics in deep RNNs from the training results, hence guiding\ntheir modeling process.", "comment": null, "links": []}
{"entry_id": "1812.03631", "title": "Spatial Knowledge Distillation to aid Visual Reasoning", "authors": ["Somak Aditya", "Rudra Saha", "Yezhou Yang", "Chitta Baral"], "published": "2018-12-10 05:36:23", "updated": "2018-12-11 16:42:29", "summary": "For tasks involving language and vision, the current state-of-the-art methods\ntend not to leverage any additional information that might be present to gather\nrelevant (commonsense) knowledge. A representative task is Visual Question\nAnswering where large diagnostic datasets have been proposed to test a system's\ncapability of answering questions about images. The training data is often\naccompanied by annotations of individual object properties and spatial\nlocations. In this work, we take a step towards integrating this additional\nprivileged information in the form of spatial knowledge to aid in visual\nreasoning. We propose a framework that combines recent advances in knowledge\ndistillation (teacher-student framework), relational reasoning and\nprobabilistic logical languages to incorporate such knowledge in existing\nneural networks for the task of Visual Question Answering. Specifically, for a\nquestion posed against an image, we use a probabilistic logical language to\nencode the spatial knowledge and the spatial understanding about the question\nin the form of a mask that is directly provided to the teacher network. The\nstudent network learns from the ground-truth information as well as the\nteachers prediction via distillation. We also demonstrate the impact of\npredicting such a mask inside the teachers network using attention.\nEmpirically, we show that both the methods improve the test accuracy over a\nstate-of-the-art approach on a publicly available dataset.", "comment": "Equal contribution by first two authors. Accepted in WACV 2019", "links": []}
{"entry_id": "1812.01880", "title": "Learning to Compose Dynamic Tree Structures for Visual Contexts", "authors": ["Kaihua Tang", "Hanwang Zhang", "Baoyuan Wu", "Wenhan Luo", "Wei Liu"], "published": "2018-12-05 09:51:19", "updated": "2018-12-05 09:51:19", "summary": "We propose to compose dynamic tree structures that place the objects in an\nimage into a visual context, helping visual reasoning tasks such as scene graph\ngeneration and visual Q&A. Our visual context tree model, dubbed VCTree, has\ntwo key advantages over existing structured object representations including\nchains and fully-connected graphs: 1) The efficient and expressive binary tree\nencodes the inherent parallel/hierarchical relationships among objects, e.g.,\n\"clothes\" and \"pants\" are usually co-occur and belong to \"person\"; 2) the\ndynamic structure varies from image to image and task to task, allowing more\ncontent-/task-specific message passing among objects. To construct a VCTree, we\ndesign a score function that calculates the task-dependent validity between\neach object pair, and the tree is the binary version of the maximum spanning\ntree from the score matrix. Then, visual contexts are encoded by bidirectional\nTreeLSTM and decoded by task-specific models. We develop a hybrid learning\nprocedure which integrates end-task supervised learning and the tree structure\nreinforcement learning, where the former's evaluation result serves as a\nself-critic for the latter's structure exploration. Experimental results on two\nbenchmarks, which require reasoning over contexts: Visual Genome for scene\ngraph generation and VQA2.0 for visual Q&A, show that VCTree outperforms\nstate-of-the-art results while discovering interpretable visual context\nstructures.", "comment": null, "links": []}
{"entry_id": "1710.09490", "title": "Complete 3D Scene Parsing from an RGBD Image", "authors": ["Chuhang Zou", "Ruiqi Guo", "Zhizhong Li", "Derek Hoiem"], "published": "2017-10-25 23:04:14", "updated": "2018-11-13 18:05:14", "summary": "One major goal of vision is to infer physical models of objects, surfaces,\nand their layout from sensors. In this paper, we aim to interpret indoor scenes\nfrom one RGBD image. Our representation encodes the layout of orthogonal walls\nand the extent of objects, modeled with CAD-like 3D shapes. We parse both the\nvisible and occluded portions of the scene and all observable objects,\nproducing a complete 3D parse. Such a scene interpretation is useful for\nrobotics and visual reasoning, but difficult to produce due to the well-known\nchallenge of segmentation, the high degree of occlusion, and the diversity of\nobjects in indoor scenes. We take a data-driven approach, generating sets of\npotential object regions, matching to regions in training images, and\ntransferring and aligning associated 3D models while encouraging fit to\nobservations and spatial consistency. We use support inference to aid\ninterpretation and propose a retrieval scheme that uses convolutional neural\nnetworks (CNNs) to classify regions and retrieve objects with similar shapes.\nWe demonstrate the performance of our method on our newly annotated NYUd v2\ndataset with detailed 3D shapes.", "comment": "Accepted to International Journal of Computer Vision (IJCV), 2018\n  arXiv admin note: text overlap with arXiv:1504.02437", "links": []}
{"entry_id": "1808.04446", "title": "Visual Reasoning with Multi-hop Feature Modulation", "authors": ["Florian Strub", "Mathieu Seurin", "Ethan Perez", "Harm de Vries", "Jérémie Mary", "Philippe Preux", "Aaron Courville", "Olivier Pietquin"], "published": "2018-08-03 14:32:02", "updated": "2018-10-12 11:36:42", "summary": "Recent breakthroughs in computer vision and natural language processing have\nspurred interest in challenging multi-modal tasks such as visual\nquestion-answering and visual dialogue. For such tasks, one successful approach\nis to condition image-based convolutional network computation on language via\nFeature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and\nshifting. We propose to generate the parameters of FiLM layers going up the\nhierarchy of a convolutional network in a multi-hop fashion rather than all at\nonce, as in prior work. By alternating between attending to the language input\nand generating FiLM layer parameters, this approach is better able to scale to\nsettings with longer input sequences such as dialogue. We demonstrate that\nmulti-hop FiLM generation achieves state-of-the-art for the short input\nsequence task ReferIt --- on-par with single-hop FiLM generation --- while also\nsignificantly outperforming prior state-of-the-art and single-hop FiLM\ngeneration on the GuessWhat?! visual dialogue task.", "comment": "In Proc of ECCV 2018", "links": []}
{"entry_id": "1808.09132", "title": "Mapping Natural Language Commands to Web Elements", "authors": ["Panupong Pasupat", "Tian-Shun Jiang", "Evan Zheran Liu", "Kelvin Guu", "Percy Liang"], "published": "2018-08-28 06:09:39", "updated": "2018-10-01 03:22:58", "summary": "The web provides a rich, open-domain environment with textual, structural,\nand spatial properties. We propose a new task for grounding language in this\nenvironment: given a natural language command (e.g., \"click on the second\narticle\"), choose the correct element on the web page (e.g., a hyperlink or\ntext box). We collected a dataset of over 50,000 commands that capture various\nphenomena such as functional references (e.g. \"find who made this site\"),\nrelational reasoning (e.g. \"article by john\"), and visual reasoning (e.g.\n\"top-most article\"). We also implemented and analyzed three baseline models\nthat capture different phenomena present in the dataset.", "comment": "EMNLP 2018", "links": []}
{"entry_id": "1806.02453", "title": "Visual Reasoning by Progressive Module Networks", "authors": ["Seung Wook Kim", "Makarand Tapaswi", "Sanja Fidler"], "published": "2018-06-06 23:02:35", "updated": "2018-09-27 18:09:38", "summary": "Humans learn to solve tasks of increasing complexity by building on top of\npreviously acquired knowledge. Typically, there exists a natural progression in\nthe tasks that we learn - most do not require completely independent solutions,\nbut can be broken down into simpler subtasks. We propose to represent a solver\nfor each task as a neural module that calls existing modules (solvers for\nsimpler tasks) in a functional program-like manner. Lower modules are a black\nbox to the calling module, and communicate only via a query and an output.\nThus, a module for a new task learns to query existing modules and composes\ntheir outputs in order to produce its own output. Our model effectively\ncombines previous skill-sets, does not suffer from forgetting, and is fully\ndifferentiable. We test our model in learning a set of visual reasoning tasks,\nand demonstrate improved performances in all tasks by learning progressively.\nBy evaluating the reasoning process using human judges, we show that our model\nis more interpretable than an attention-based baseline.", "comment": "17 pages, 5 figures", "links": []}
{"entry_id": "1806.06157", "title": "Object Level Visual Reasoning in Videos", "authors": ["Fabien Baradel", "Natalia Neverova", "Christian Wolf", "Julien Mille", "Greg Mori"], "published": "2018-06-16 00:33:50", "updated": "2018-09-20 08:59:32", "summary": "Human activity recognition is typically addressed by detecting key concepts\nlike global and local motion, features related to object classes present in the\nscene, as well as features related to the global context. The next open\nchallenges in activity recognition require a level of understanding that pushes\nbeyond this and call for models with capabilities for fine distinction and\ndetailed comprehension of interactions between actors and objects in a scene.\nWe propose a model capable of learning to reason about semantically meaningful\nspatiotemporal interactions in videos. The key to our approach is a choice of\nperforming this reasoning at the object level through the integration of state\nof the art object detection networks. This allows the model to learn detailed\nspatial interactions that exist at a semantic, object-interaction relevant\nlevel. We evaluate our method on three standard datasets (Twenty-BN\nSomething-Something, VLOG and EPIC Kitchens) and achieve state of the art\nresults on all of them. Finally, we show visualizations of the interactions\nlearned by the model, which illustrate object classes and their interactions\ncorresponding to different activity classes.", "comment": "Accepted at ECCV 2018 - long version (16 pages + ref)", "links": []}
{"entry_id": "1804.06870", "title": "Object Ordering with Bidirectional Matchings for Visual Reasoning", "authors": ["Hao Tan", "Mohit Bansal"], "published": "2018-04-18 18:39:17", "updated": "2018-09-06 16:56:32", "summary": "Visual reasoning with compositional natural language instructions, e.g.,\nbased on the newly-released Cornell Natural Language Visual Reasoning (NLVR)\ndataset, is a challenging task, where the model needs to have the ability to\ncreate an accurate mapping between the diverse phrases and the several objects\nplaced in complex arrangements in the image. Further, this mapping needs to be\nprocessed to answer the question in the statement given the ordering and\nrelationship of the objects across three similar images. In this paper, we\npropose a novel end-to-end neural model for the NLVR task, where we first use\njoint bidirectional attention to build a two-way conditioning between the\nvisual information and the language phrases. Next, we use an RL-based pointer\nnetwork to sort and process the varying number of unordered objects (so as to\nmatch the order of the statement phrases) in each of the three images and then\npool over the three decisions. Our model achieves strong improvements (of 4-6%\nabsolute) over the state-of-the-art on both the structured representation and\nraw image versions of the dataset.", "comment": "NAACL 2018 (8 pages; added pointer-ordering examples)", "links": []}
{"entry_id": "1809.01943", "title": "Cascaded Mutual Modulation for Visual Reasoning", "authors": ["Yiqun Yao", "Jiaming Xu", "Feng Wang", "Bo Xu"], "published": "2018-09-06 12:26:24", "updated": "2018-09-06 12:26:24", "summary": "Visual reasoning is a special visual question answering problem that is\nmulti-step and compositional by nature, and also requires intensive text-vision\ninteractions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end\nvisual reasoning model. CMM includes a multi-step comprehension process for\nboth question and image. In each step, we use a Feature-wise Linear Modulation\n(FiLM) technique to enable textual/visual pipeline to mutually control each\nother. Experiments show that CMM significantly outperforms most related models,\nand reach state-of-the-arts on two visual reasoning benchmarks: CLEVR and NLVR,\ncollected from both synthetic and natural languages. Ablation studies confirm\nthat both our multistep framework and our visual-guided language modulation are\ncritical to the task. Our code is available at\nhttps://github.com/FlamingHorizon/CMM-VR.", "comment": "to appear in EMNLP 2018", "links": []}
{"entry_id": "1808.09068", "title": "WeSeer: Visual Analysis for Better Information Cascade Prediction of WeChat Articles", "authors": ["Quan Li", "Ziming Wu", "Lingling Yi", "Kristanto Sean N", "Huamin Qu", "Xiaojuan Ma"], "published": "2018-08-28 00:09:20", "updated": "2018-08-28 00:09:20", "summary": "Social media, such as Facebook and WeChat, empowers millions of users to\ncreate, consume, and disseminate online information on an unprecedented scale.\nThe abundant information on social media intensifies the competition of WeChat\nPublic Official Articles (i.e., posts) for gaining user attention due to the\nzero-sum nature of attention. Therefore, only a small portion of information\ntends to become extremely popular while the rest remains unnoticed or quickly\ndisappears. Such a typical `long-tail' phenomenon is very common in social\nmedia. Thus, recent years have witnessed a growing interest in predicting the\nfuture trend in the popularity of social media posts and understanding the\nfactors that influence the popularity of the posts. Nevertheless, existing\npredictive models either rely on cumbersome feature engineering or\nsophisticated parameter tuning, which are difficult to understand and improve.\nIn this paper, we study and enhance a point process-based model by\nincorporating visual reasoning to support communication between the users and\nthe predictive model for a better prediction result. The proposed system\nsupports users to uncover the working mechanism behind the model and improve\nthe prediction accuracy accordingly based on the insights gained. We use\nrealistic WeChat articles to demonstrate the effectiveness of the system and\nverify the improved model on a large scale of WeChat articles. We also elicit\nand summarize the feedback from WeChat domain experts.", "comment": "IEEE Transactions on Visualization and Computer Graphics (TVCG), 2019\n  (To appear)", "links": []}
{"entry_id": "1803.06092", "title": "A Dataset and Architecture for Visual Reasoning with a Working Memory", "authors": ["Guangyu Robert Yang", "Igor Ganichev", "Xiao-Jing Wang", "Jonathon Shlens", "David Sussillo"], "published": "2018-03-16 06:53:45", "updated": "2018-07-20 14:12:49", "summary": "A vexing problem in artificial intelligence is reasoning about events that\noccur in complex, changing visual stimuli such as in video analysis or game\nplay. Inspired by a rich tradition of visual reasoning and memory in cognitive\npsychology and neuroscience, we developed an artificial, configurable visual\nquestion and answer dataset (COG) to parallel experiments in humans and\nanimals. COG is much simpler than the general problem of video analysis, yet it\naddresses many of the problems relating to visual and logical reasoning and\nmemory -- problems that remain challenging for modern deep learning\narchitectures. We additionally propose a deep learning architecture that\nperforms competitively on other diagnostic VQA datasets (i.e. CLEVR) as well as\neasy settings of the COG dataset. However, several settings of COG result in\ndatasets that are progressively more challenging to learn. After training, the\nnetwork can zero-shot generalize to many new tasks. Preliminary analyses of the\nnetwork architectures trained on COG demonstrate that the network accomplishes\nthe task in a manner interpretable to humans.", "comment": null, "links": []}
{"entry_id": "1803.05268", "title": "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning", "authors": ["David Mascharka", "Philip Tran", "Ryan Soklaski", "Arjun Majumdar"], "published": "2018-03-14 13:33:06", "updated": "2018-07-02 18:48:31", "summary": "Visual question answering requires high-order reasoning about an image, which\nis a fundamental capability needed by machine systems to follow complex\ndirectives. Recently, modular networks have been shown to be an effective\nframework for performing visual reasoning tasks. While modular networks were\ninitially designed with a degree of model transparency, their performance on\ncomplex visual reasoning benchmarks was lacking. Current state-of-the-art\napproaches do not provide an effective mechanism for understanding the\nreasoning process. In this paper, we close the performance gap between\ninterpretable models and state-of-the-art visual reasoning methods. We propose\na set of visual-reasoning primitives which, when composed, manifest as a model\ncapable of performing complex reasoning tasks in an explicitly-interpretable\nmanner. The fidelity and interpretability of the primitives' outputs enable an\nunparalleled ability to diagnose the strengths and weaknesses of the resulting\nmodel. Critically, we show that these primitives are highly performant,\nachieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show\nthat our model is able to effectively learn generalized representations when\nprovided a small amount of data containing novel object attributes. Using the\nCoGenT generalization task, we show more than a 20 percentage point improvement\nover the current state of the art.", "comment": "CVPR 2018 pre-print", "links": ["http://dx.doi.org/10.1109/CVPR.2018.00519"]}
{"entry_id": "1806.06765", "title": "Modularity Matters: Learning Invariant Relational Reasoning Tasks", "authors": ["Jason Jo", "Vikas Verma", "Yoshua Bengio"], "published": "2018-06-18 15:19:04", "updated": "2018-06-18 15:19:04", "summary": "We focus on two supervised visual reasoning tasks whose labels encode a\nsemantic relational rule between two or more objects in an image: the MNIST\nParity task and the colorized Pentomino task. The objects in the images undergo\nrandom translation, scaling, rotation and coloring transformations. Thus these\ntasks involve invariant relational reasoning. We report uneven performance of\nvarious deep CNN models on these two tasks. For the MNIST Parity task, we\nreport that the VGG19 model soundly outperforms a family of ResNet models.\nMoreover, the family of ResNet models exhibits a general sensitivity to random\ninitialization for the MNIST Parity task. For the colorized Pentomino task, now\nboth the VGG19 and ResNet models exhibit sluggish optimization and very poor\ntest generalization, hovering around 30% test error. The CNN we tested all\nlearn hierarchies of fully distributed features and thus encode the distributed\nrepresentation prior. We are motivated by a hypothesis from cognitive\nneuroscience which posits that the human visual cortex is modularized, and this\nallows the visual cortex to learn higher order invariances. To this end, we\nconsider a modularized variant of the ResNet model, referred to as a Residual\nMixture Network (ResMixNet) which employs a mixture-of-experts architecture to\ninterleave distributed representations with more specialized, modular\nrepresentations. We show that very shallow ResMixNets are capable of learning\neach of the two tasks well, attaining less than 2% and 1% test error on the\nMNIST Parity and the colorized Pentomino tasks respectively. Most importantly,\nthe ResMixNet models are extremely parameter efficient: generalizing better\nthan various non-modular CNNs that have over 10x the number of parameters.\nThese experimental results support the hypothesis that modularity is a robust\nprior for learning invariant relational reasoning.", "comment": "Modified abstract to fit arXiv character limit", "links": []}
{"entry_id": "1712.07576", "title": "Learning to Act Properly: Predicting and Explaining Affordances from Images", "authors": ["Ching-Yao Chuang", "Jiaman Li", "Antonio Torralba", "Sanja Fidler"], "published": "2017-12-20 16:54:09", "updated": "2018-06-15 05:26:46", "summary": "We address the problem of affordance reasoning in diverse scenes that appear\nin the real world. Affordances relate the agent's actions to their effects when\ntaken on the surrounding objects. In our work, we take the egocentric view of\nthe scene, and aim to reason about action-object affordances that respect both\nthe physical world as well as the social norms imposed by the society. We also\naim to teach artificial agents why some actions should not be taken in certain\nsituations, and what would likely happen if these actions would be taken. We\ncollect a new dataset that builds upon ADE20k, referred to as ADE-Affordance,\nwhich contains annotations enabling such rich visual reasoning. We propose a\nmodel that exploits Graph Neural Networks to propagate contextual information\nfrom the scene in order to perform detailed affordance reasoning about each\nobject. Our model is showcased through various ablation studies, pointing to\nsuccesses and challenges in this complex task.", "comment": null, "links": []}
{"entry_id": "1711.06526", "title": "Multi-Label Zero-Shot Learning with Structured Knowledge Graphs", "authors": ["Chung-Wei Lee", "Wei Fang", "Chih-Kuan Yeh", "Yu-Chiang Frank Wang"], "published": "2017-11-17 13:31:57", "updated": "2018-05-26 12:48:10", "summary": "In this paper, we propose a novel deep learning architecture for multi-label\nzero-shot learning (ML-ZSL), which is able to predict multiple unseen class\nlabels for each input instance. Inspired by the way humans utilize semantic\nknowledge between objects of interests, we propose a framework that\nincorporates knowledge graphs for describing the relationships between multiple\nlabels. Our model learns an information propagation mechanism from the semantic\nlabel space, which can be applied to model the interdependencies between seen\nand unseen class labels. With such investigation of structured knowledge graphs\nfor visual reasoning, we show that our model can be applied for solving\nmulti-label classification and ML-ZSL tasks. Compared to state-of-the-art\napproaches, comparable or improved performances can be achieved by our method.", "comment": "CVPR 2018", "links": []}
{"entry_id": "1802.03390", "title": "Same-different problems strain convolutional neural networks", "authors": ["Matthew Ricci", "Junkyung Kim", "Thomas Serre"], "published": "2018-02-09 18:55:34", "updated": "2018-05-25 17:00:23", "summary": "The robust and efficient recognition of visual relations in images is a\nhallmark of biological vision. We argue that, despite recent progress in visual\nrecognition, modern machine vision algorithms are severely limited in their\nability to learn visual relations. Through controlled experiments, we\ndemonstrate that visual-relation problems strain convolutional neural networks\n(CNNs). The networks eventually break altogether when rote memorization becomes\nimpossible, as when intra-class variability exceeds network capacity. Motivated\nby the comparable success of biological vision, we argue that feedback\nmechanisms including attention and perceptual grouping may be the key\ncomputational components underlying abstract visual reasoning.\\", "comment": "6 Pages, 4 Figures", "links": []}
{"entry_id": "1803.03067", "title": "Compositional Attention Networks for Machine Reasoning", "authors": ["Drew A. Hudson", "Christopher D. Manning"], "published": "2018-03-08 12:37:14", "updated": "2018-04-24 10:25:07", "summary": "We present the MAC network, a novel fully differentiable neural network\narchitecture, designed to facilitate explicit and expressive reasoning. MAC\nmoves away from monolithic black-box neural architectures towards a design that\nencourages both transparency and versatility. The model approaches problems by\ndecomposing them into a series of attention-based reasoning steps, each\nperformed by a novel recurrent Memory, Attention, and Composition (MAC) cell\nthat maintains a separation between control and memory. By stringing the cells\ntogether and imposing structural constraints that regulate their interaction,\nMAC effectively learns to perform iterative reasoning processes that are\ndirectly inferred from the data in an end-to-end approach. We demonstrate the\nmodel's strength, robustness and interpretability on the challenging CLEVR\ndataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy,\nhalving the error rate of the previous best model. More importantly, we show\nthat the model is computationally-efficient and data-efficient, in particular\nrequiring 5x less data than existing models to achieve strong results.", "comment": "Published as a conference paper at ICLR 2018", "links": []}
{"entry_id": "1803.11189", "title": "Iterative Visual Reasoning Beyond Convolutions", "authors": ["Xinlei Chen", "Li-Jia Li", "Li Fei-Fei", "Abhinav Gupta"], "published": "2018-03-29 17:59:03", "updated": "2018-03-29 17:59:03", "summary": "We present a novel framework for iterative visual reasoning. Our framework\ngoes beyond current recognition systems that lack the capability to reason\nbeyond stack of convolutions. The framework consists of two core modules: a\nlocal module that uses spatial memory to store previous beliefs with parallel\nupdates; and a global graph-reasoning module. Our graph module has three\ncomponents: a) a knowledge graph where we represent classes as nodes and build\nedges to encode different types of semantic relationships between them; b) a\nregion graph of the current image where regions in the image are nodes and\nspatial relationships between these regions are edges; c) an assignment graph\nthat assigns regions to classes. Both the local module and the global module\nroll-out iteratively and cross-feed predictions to each other to refine\nestimates. The final predictions are made by combining the best of both modules\nwith an attention mechanism. We show strong performance over plain ConvNets,\n\\eg achieving an $8.4\\%$ absolute improvement on ADE measured by per-class\naverage precision. Analysis also shows that the framework is resilient to\nmissing regions for reasoning.", "comment": "CVPR 2018", "links": []}
{"entry_id": "1704.04882", "title": "Monoidal computer III: A coalgebraic view of computability and complexity", "authors": ["Dusko Pavlovic", "Muzamil Yahia"], "published": "2017-04-17 06:27:29", "updated": "2018-03-12 01:36:07", "summary": "Monoidal computer is a categorical model of intensional computation, where\nmany different programs correspond to the same input-output behavior. The\nupshot of yet another model of computation is that a categorical formalism\nshould provide a much needed high level language for theory of computation,\nflexible enough to allow abstracting away the low level implementation details\nwhen they are irrelevant, or taking them into account when they are genuinely\nneeded. A salient feature of the approach through monoidal categories is the\nformal graphical language of string diagrams, which supports visual reasoning\nabout programs and computations.\n  In the present paper, we provide a coalgebraic characterization of monoidal\ncomputer. It turns out that the availability of interpreters and specializers,\nthat make a monoidal category into a monoidal computer, is equivalent with the\nexistence of a *universal state space*, that carries a weakly final state\nmachine for any pair of input and output types. Being able to program state\nmachines in monoidal computers allows us to represent Turing machines, to\ncapture their execution, count their steps, as well as, e.g., the memory cells\nthat they use. The coalgebraic view of monoidal computer thus provides a\nconvenient diagrammatic language for studying computability and complexity.", "comment": "34 pages, 24 figures; in this version: added the Appendix", "links": []}
{"entry_id": "1710.07300", "title": "FigureQA: An Annotated Figure Dataset for Visual Reasoning", "authors": ["Samira Ebrahimi Kahou", "Vincent Michalski", "Adam Atkinson", "Akos Kadar", "Adam Trischler", "Yoshua Bengio"], "published": "2017-10-19 18:01:38", "updated": "2018-02-22 22:50:42", "summary": "We introduce FigureQA, a visual reasoning corpus of over one million\nquestion-answer pairs grounded in over 100,000 images. The images are\nsynthetic, scientific-style figures from five classes: line plots, dot-line\nplots, vertical and horizontal bar graphs, and pie charts. We formulate our\nreasoning task by generating questions from 15 templates; questions concern\nvarious relationships between plot elements and examine characteristics like\nthe maximum, the minimum, area-under-the-curve, smoothness, and intersection.\nTo resolve, such questions often require reference to multiple plot elements\nand synthesis of information distributed spatially throughout a figure. To\nfacilitate the training of machine learning systems, the corpus also includes\nside data that can be used to formulate auxiliary objectives. In particular, we\nprovide the numerical data used to generate each figure as well as bounding-box\nannotations for all plot elements. We study the proposed visual reasoning task\nby training several models, including the recently proposed Relation Network as\na strong baseline. Preliminary results indicate that the task poses a\nsignificant machine learning challenge. We envision FigureQA as a first step\ntowards developing models that can intuitively recognize patterns from visual\nrepresentations of data.", "comment": "workshop paper at ICLR 2018", "links": []}
{"entry_id": "1801.05302", "title": "Benchmark Visual Question Answer Models by using Focus Map", "authors": ["Wenda Qiu", "Yueyang Xianzang", "Zhekai Zhang"], "published": "2018-01-13 09:09:33", "updated": "2018-01-13 09:09:33", "summary": "Inferring and Executing Programs for Visual Reasoning proposes a model for\nvisual reasoning that consists of a program generator and an execution engine\nto avoid end-to-end models. To show that the model actually learns which\nobjects to focus on to answer the questions, the authors give a visualization\nof the norm of the gradient of the sum of the predicted answer scores with\nrespect to the final feature map. However, the authors do not evaluate the\nefficiency of focus map. This paper purposed a method for evaluating it. We\ngenerate several kinds of questions to test different keywords. We infer focus\nmaps from the model by asking these questions and evaluate them by comparing\nwith the segmentation graph. Furthermore, this method can be applied to any\nmodel if focus maps can be inferred from it. By evaluating focus map of\ndifferent models on the CLEVR dataset, we will show that CLEVR-iep model has\nlearned where to focus more than end-to-end models.", "comment": "A group project paper for course CS348. arXiv admin note: text\n  overlap with arXiv:1705.03633 by other authors", "links": []}
{"entry_id": "1707.03017", "title": "Learning Visual Reasoning Without Strong Priors", "authors": ["Ethan Perez", "Harm de Vries", "Florian Strub", "Vincent Dumoulin", "Aaron Courville"], "published": "2017-07-10 18:49:28", "updated": "2017-12-18 21:37:16", "summary": "Achieving artificial visual reasoning - the ability to answer image-related\nquestions which require a multi-step, high-level process - is an important step\ntowards artificial general intelligence. This multi-modal task requires\nlearning a question-dependent, structured reasoning process over images from\nlanguage. Standard deep learning approaches tend to exploit biases in the data\nrather than learn this underlying structure, while leading methods learn to\nvisually reason successfully but are hand-crafted for reasoning. We show that a\ngeneral-purpose, Conditional Batch Normalization approach achieves\nstate-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4%\nerror rate. We outperform the next best end-to-end method (4.5%) and even\nmethods that use extra supervision (3.1%). We probe our model to shed light on\nhow it reasons, showing it has learned a question-dependent, multi-step\nprocess. Previous work has operated under the assumption that visual reasoning\ncalls for a specialized architecture, but we show that a general architecture\nwith proper conditioning can learn to visually reason effectively.", "comment": "Full AAAI 2018 paper is at arXiv:1709.07871. Presented at ICML 2017's\n  Machine Learning in Speech and Language Processing Workshop. Code is at\n  http://github.com/ethanjperez/film", "links": []}
{"entry_id": "1709.07871", "title": "FiLM: Visual Reasoning with a General Conditioning Layer", "authors": ["Ethan Perez", "Florian Strub", "Harm de Vries", "Vincent Dumoulin", "Aaron Courville"], "published": "2017-09-22 17:54:12", "updated": "2017-12-18 21:25:53", "summary": "We introduce a general-purpose conditioning method for neural networks called\nFiLM: Feature-wise Linear Modulation. FiLM layers influence neural network\ncomputation via a simple, feature-wise affine transformation based on\nconditioning information. We show that FiLM layers are highly effective for\nvisual reasoning - answering image-related questions which require a\nmulti-step, high-level process - a task which has proven difficult for standard\ndeep learning methods that do not explicitly model reasoning. Specifically, we\nshow on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error\nfor the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are\nrobust to ablations and architectural modifications, and 4) generalize well to\nchallenging, new data from few examples or even zero-shot.", "comment": "AAAI 2018. Code available at http://github.com/ethanjperez/film .\n  Extends arXiv:1707.03017", "links": []}
{"entry_id": "1707.01932", "title": "End-to-End Learning of Semantic Grasping", "authors": ["Eric Jang", "Sudheendra Vijayanarasimhan", "Peter Pastor", "Julian Ibarz", "Sergey Levine"], "published": "2017-07-06 18:41:22", "updated": "2017-11-09 08:57:52", "summary": "We consider the task of semantic robotic grasping, in which a robot picks up\nan object of a user-specified class using only monocular images. Inspired by\nthe two-stream hypothesis of visual reasoning, we present a semantic grasping\nframework that learns object detection, classification, and grasp planning in\nan end-to-end fashion. A \"ventral stream\" recognizes object class while a\n\"dorsal stream\" simultaneously interprets the geometric relationships necessary\nto execute successful grasps. We leverage the autonomous data collection\ncapabilities of robots to obtain a large self-supervised dataset for training\nthe dorsal stream, and use semi-supervised label propagation to train the\nventral stream with only a modest amount of human supervision. We\nexperimentally show that our approach improves upon grasping systems whose\ncomponents are not learned end-to-end, including a baseline method that uses\nbounding box detection. Furthermore, we show that jointly training our model\nwith auxiliary data consisting of non-semantic grasping data, as well as\nsemantically labeled images without grasp actions, has the potential to\nsubstantially improve semantic grasping performance.", "comment": "14 pages", "links": []}
{"entry_id": "1710.00453", "title": "Visual Reasoning with Natural Language", "authors": ["Stephanie Zhou", "Alane Suhr", "Yoav Artzi"], "published": "2017-10-02 01:52:05", "updated": "2017-10-02 01:52:05", "summary": "Natural language provides a widely accessible and expressive interface for\nrobotic agents. To understand language in complex environments, agents must\nreason about the full range of language inputs and their correspondence to the\nworld. Such reasoning over language and vision is an open problem that is\nreceiving increasing attention. While existing data sets focus on visual\ndiversity, they do not display the full range of natural language expressions,\nsuch as counting, set reasoning, and comparisons.\n  We propose a simple task for natural language visual reasoning, where images\nare paired with descriptive statements. The task is to predict if a statement\nis true for the given scene. This abstract describes our existing synthetic\nimages corpus and our current work on collecting real vision data.", "comment": "AAAI NCHRC 2017", "links": []}
{"entry_id": "1504.02437", "title": "Predicting Complete 3D Models of Indoor Scenes", "authors": ["Ruiqi Guo", "Chuhang Zou", "Derek Hoiem"], "published": "2015-04-09 19:25:33", "updated": "2017-08-18 01:55:57", "summary": "One major goal of vision is to infer physical models of objects, surfaces,\nand their layout from sensors. In this paper, we aim to interpret indoor scenes\nfrom one RGBD image. Our representation encodes the layout of walls, which must\nconform to a Manhattan structure but is otherwise flexible, and the layout and\nextent of objects, modeled with CAD-like 3D shapes. We represent both the\nvisible and occluded portions of the scene, producing a complete 3D parse. Such\na scene interpretation is useful for robotics and visual reasoning, but\ndifficult to produce due to the well-known challenge of segmentation, the high\ndegree of occlusion, and the diversity of objects in indoor scene. We take a\ndata-driven approach, generating sets of potential object regions, matching to\nregions in training images, and transferring and aligning associated 3D models\nwhile encouraging fit to observations and overall consistency. We demonstrate\nencouraging results on the NYU v2 dataset and highlight a variety of\ninteresting directions for future work.", "comment": null, "links": []}
{"entry_id": "1705.08844", "title": "How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval", "authors": ["Rodrigo Toro Icarte", "Jorge A. Baier", "Cristian Ruz", "Alvaro Soto"], "published": "2017-05-24 16:22:53", "updated": "2017-05-24 16:22:53", "summary": "The knowledge representation community has built general-purpose ontologies\nwhich contain large amounts of commonsense knowledge over relevant aspects of\nthe world, including useful visual information, e.g.: \"a ball is used by a\nfootball player\", \"a tennis player is located at a tennis court\". Current\nstate-of-the-art approaches for visual recognition do not exploit these\nrule-based knowledge sources. Instead, they learn recognition models directly\nfrom training examples. In this paper, we study how general-purpose\nontologies---specifically, MIT's ConceptNet ontology---can improve the\nperformance of state-of-the-art vision systems. As a testbed, we tackle the\nproblem of sentence-based image retrieval. Our retrieval approach incorporates\nknowledge from ConceptNet on top of a large pool of object detectors derived\nfrom a deep learning technique. In our experiments, we show that ConceptNet can\nimprove performance on a common benchmark dataset. Key to our performance is\nthe use of the ESPGAME dataset to select visually relevant relations from\nConceptNet. Consequently, a main conclusion of this work is that\ngeneral-purpose commonsense ontologies improve performance on visual reasoning\ntasks when properly filtered to select meaningful visual relations.", "comment": "Accepted in IJCAI-17", "links": []}
{"entry_id": "1705.03633", "title": "Inferring and Executing Programs for Visual Reasoning", "authors": ["Justin Johnson", "Bharath Hariharan", "Laurens van der Maaten", "Judy Hoffman", "Li Fei-Fei", "C. Lawrence Zitnick", "Ross Girshick"], "published": "2017-05-10 07:08:23", "updated": "2017-05-10 07:08:23", "summary": "Existing methods for visual reasoning attempt to directly map inputs to\noutputs using black-box architectures without explicitly modeling the\nunderlying reasoning processes. As a result, these black-box models often learn\nto exploit biases in the data rather than learning to perform visual reasoning.\nInspired by module networks, this paper proposes a model for visual reasoning\nthat consists of a program generator that constructs an explicit representation\nof the reasoning process to be performed, and an execution engine that executes\nthe resulting program to produce an answer. Both the program generator and the\nexecution engine are implemented by neural networks, and are trained using a\ncombination of backpropagation and REINFORCE. Using the CLEVR benchmark for\nvisual reasoning, we show that our model significantly outperforms strong\nbaselines and generalizes better in a variety of settings.", "comment": null, "links": []}
{"entry_id": "1612.08153", "title": "EgoReID: Cross-view Self-Identification and Human Re-identification in Egocentric and Surveillance Videos", "authors": ["Shervin Ardeshir", "Sandesh Sharma", "Ali Broji"], "published": "2016-12-24 09:00:37", "updated": "2016-12-24 09:00:37", "summary": "Human identification remains to be one of the challenging tasks in computer\nvision community due to drastic changes in visual features across different\nviewpoints, lighting conditions, occlusion, etc. Most of the literature has\nbeen focused on exploring human re-identification across viewpoints that are\nnot too drastically different in nature. Cameras usually capture oblique or\nside views of humans, leaving room for a lot of geometric and visual reasoning.\nGiven the recent popularity of egocentric and top-view vision,\nre-identification across these two drastically different views can now be\nexplored. Having an egocentric and a top view video, our goal is to identify\nthe cameraman in the content of the top-view video, and also re-identify the\npeople visible in the egocentric video, by matching them to the identities\npresent in the top-view video. We propose a CRF-based method to address the two\nproblems. Our experimental results demonstrates the efficiency of the proposed\napproach over a variety of video recorded from two views.", "comment": null, "links": []}
{"entry_id": "1612.06890", "title": "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning", "authors": ["Justin Johnson", "Bharath Hariharan", "Laurens van der Maaten", "Li Fei-Fei", "C. Lawrence Zitnick", "Ross Girshick"], "published": "2016-12-20 21:40:40", "updated": "2016-12-20 21:40:40", "summary": "When building artificial intelligence systems that can reason and answer\nquestions about visual data, we need diagnostic tests to analyze our progress\nand discover shortcomings. Existing benchmarks for visual question answering\ncan help, but have strong biases that models can exploit to correctly answer\nquestions without reasoning. They also conflate multiple sources of error,\nmaking it hard to pinpoint model weaknesses. We present a diagnostic dataset\nthat tests a range of visual reasoning abilities. It contains minimal biases\nand has detailed annotations describing the kind of reasoning each question\nrequires. We use this dataset to analyze a variety of modern visual reasoning\nsystems, providing novel insights into their abilities and limitations.", "comment": null, "links": []}
{"entry_id": "1605.05462", "title": "Dual Local-Global Contextual Pathways for Recognition in Aerial Imagery", "authors": ["Alina Marcu", "Marius Leordeanu"], "published": "2016-05-18 07:37:22", "updated": "2016-05-18 07:37:22", "summary": "Visual context is important in object recognition and it is still an open\nproblem in computer vision. Along with the advent of deep convolutional neural\nnetworks (CNN), using contextual information with such systems starts to\nreceive attention in the literature. At the same time, aerial imagery is\ngaining momentum. While advances in deep learning make good progress in aerial\nimage analysis, this problem still poses many great challenges. Aerial images\nare often taken under poor lighting conditions and contain low resolution\nobjects, many times occluded by trees or taller buildings. In this domain, in\nparticular, visual context could be of great help, but there are still very few\npapers that consider context in aerial image understanding. Here we introduce\ncontext as a complementary way of recognizing objects. We propose a dual-stream\ndeep neural network model that processes information along two independent\npathways, one for local and another for global visual reasoning. The two are\nlater combined in the final layers of processing. Our model learns to combine\nlocal object appearance as well as information from the larger scene at the\nsame time and in a complementary way, such that together they form a powerful\nclassifier. We test our dual-stream network on the task of segmentation of\nbuildings and roads in aerial images and obtain state-of-the-art results on the\nMassachusetts Buildings Dataset. We also introduce two new datasets, for\nbuildings and road segmentation, respectively, and study the relative\nimportance of local appearance vs. the larger scene, as well as their\nperformance in combination. While our local-global model could also be useful\nin general recognition tasks, we clearly demonstrate the effectiveness of\nvisual context in conjunction with deep nets for aerial image understanding.", "comment": null, "links": []}
{"entry_id": "1604.04125", "title": "Filling in the details: Perceiving from low fidelity images", "authors": ["Farahnaz Ahmed Wick", "Michael L. Wick", "Marc Pomplun"], "published": "2016-04-14 12:10:23", "updated": "2016-04-14 12:10:23", "summary": "Humans perceive their surroundings in great detail even though most of our\nvisual field is reduced to low-fidelity color-deprived (e.g. dichromatic) input\nby the retina. In contrast, most deep learning architectures are\ncomputationally wasteful in that they consider every part of the input when\nperforming an image processing task. Yet, the human visual system is able to\nperform visual reasoning despite having only a small fovea of high visual\nacuity. With this in mind, we wish to understand the extent to which\nconnectionist architectures are able to learn from and reason with low acuity,\ndistorted inputs. Specifically, we train autoencoders to generate full-detail\nimages from low-detail \"foveations\" of those images and then measure their\nability to reconstruct the full-detail images from the foveated versions. By\nvarying the type of foveation, we can study how well the architectures can cope\nwith various types of distortion. We find that the autoencoder compensates for\nlower detail by learning increasingly global feature functions. In many cases,\nthe learnt features are suitable for reconstructing the original full-detail\nimage. For example, we find that the networks accurately perceive color in the\nperiphery, even when 75\\% of the input is achromatic.", "comment": null, "links": []}
{"entry_id": "1602.00753", "title": "Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects", "authors": ["Hessam Bagherinezhad", "Hannaneh Hajishirzi", "Yejin Choi", "Ali Farhadi"], "published": "2016-02-02 00:16:39", "updated": "2016-02-02 00:16:39", "summary": "Human vision greatly benefits from the information about sizes of objects.\nThe role of size in several visual reasoning tasks has been thoroughly explored\nin human perception and cognition. However, the impact of the information about\nsizes of objects is yet to be determined in AI. We postulate that this is\nmainly attributed to the lack of a comprehensive repository of size\ninformation. In this paper, we introduce a method to automatically infer object\nsizes, leveraging visual and textual information from web. By maximizing the\njoint likelihood of textual and visual observations, our method learns reliable\nrelative size estimates, with no explicit human supervision. We introduce the\nrelative size dataset and show that our method outperforms competitive textual\nand visual baselines in reasoning about size comparisons.", "comment": "To appear in AAAI 2016", "links": []}
{"entry_id": "1503.06813", "title": "Factorization of View-Object Manifolds for Joint Object Recognition and Pose Estimation", "authors": ["Haopeng Zhang", "Tarek El-Gaaly", "Ahmed Elgammal", "Zhiguo Jiang"], "published": "2015-03-23 20:05:36", "updated": "2015-04-13 02:59:41", "summary": "Due to large variations in shape, appearance, and viewing conditions, object\nrecognition is a key precursory challenge in the fields of object manipulation\nand robotic/AI visual reasoning in general. Recognizing object categories,\nparticular instances of objects and viewpoints/poses of objects are three\ncritical subproblems robots must solve in order to accurately grasp/manipulate\nobjects and reason about their environments. Multi-view images of the same\nobject lie on intrinsic low-dimensional manifolds in descriptor spaces (e.g.\nvisual/depth descriptor spaces). These object manifolds share the same topology\ndespite being geometrically different. Each object manifold can be represented\nas a deformed version of a unified manifold. The object manifolds can thus be\nparameterized by its homeomorphic mapping/reconstruction from the unified\nmanifold. In this work, we develop a novel framework to jointly solve the three\nchallenging recognition sub-problems, by explicitly modeling the deformations\nof object manifolds and factorizing it in a view-invariant space for\nrecognition. We perform extensive experiments on several challenging datasets\nand achieve state-of-the-art results.", "comment": null, "links": []}