
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering


Paper · Website · Huggingface

Abstract

Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure quality data. We evaluate our methodology by training models on our synthesized dataset and testing them on two benchmarks; our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 points in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.

[Figure: FM2DS compared to traditional dataset creation approaches]

In contrast to traditional dataset creation, which depends on human annotators, templates, and information snippets as sources, FM2DS is a fully automated approach that uses complete documents as its sources. FM2DS incorporates validation steps to ensure that the generated questions are answerable, multimodal, and multihop.

FM2DS

[Figure: overview of the five-stage FM2DS pipeline]

The Five-Stage Pipeline for FM2DS. First, we retrieve relevant documents from the Wikipedia dataset to create a pool of related documents based on hyperlinks and topics (Stage 1). In Stage 2, we select the few-shot samples from MultiModalQA (MMQA in the figure). Stage 3 focuses on generating and validating questions to make sure they are answerable, multihop, and multimodal. In Stage 4, answers are generated and validated. Finally, in Stage 5, we generate queries related to the documents, which are also validated to ensure relevance and accuracy.

M2QA-Bench

We also propose a benchmark, M2QA-Bench, to assess LVLMs' performance on a more complicated MMQA task with full documents. M2QA-Bench consists of 500 Q&A pairs, each designed to challenge the model's ability to perform a complex reasoning task. The questions are not templated into a specific structure (as in some existing works like MultimodalQA); instead, they are diverse and challenging. Additionally, answering the questions requires access to full documents, where both information extraction and reasoning across different modalities (e.g., images and tables) are essential.

[Figure: multimodal multihop reasoning example from M2QA-Bench]

Multimodal multihop reasoning example from M2QA-Bench where the model compares the release dates of two albums, "Music from Big Pink" and "Imagine," using textual and visual cues. The documents are connected through their shared topic, "music," and the answer is determined as the title of the earlier-released album.

The benchmark file, M2QA_Bench.json, is provided for evaluation (see the Evaluation section below).

How to Run

This guide provides step-by-step instructions for running the FM²DS pipeline to synthesize multimodal multihop question answering data.

Overview

Important Note: This project is designed specifically for data synthesis. The generated dataset can be used to train various multimodal models, but the actual model training is not included in this repository. For model training, please refer to each model's specific training approaches and documentation.

Prerequisites

System Requirements

  • Python 3.8+
  • CUDA-compatible GPU (recommended for LVLM inference, especially for local Llama models)
  • Sufficient storage space for datasets (~50GB+)
  • 16GB+ RAM recommended for processing large datasets

Installation

  1. Clone the repository:

    git clone https://github.com/ServiceNow/FM2DS.git
    cd FM2DS
  2. Create and activate a virtual environment (recommended):

    python -m venv fm2ds_env
    source fm2ds_env/bin/activate  # On Windows: fm2ds_env\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download required language model:

    python -m spacy download en_core_web_sm
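
After installation, the following optional snippet can serve as a quick sanity check that the core libraries and the spaCy model import correctly. It is a minimal sketch based on the dependencies listed above, not part of the FM²DS pipeline itself:

# check_install.py -- minimal post-install sanity check (a sketch, not part of the pipeline)
import importlib

# Core packages from requirements.txt
for pkg in ["datasets", "transformers", "torch", "spacy"]:
    importlib.import_module(pkg)
    print(f"{pkg}: OK")

# The spaCy model downloaded in step 4
import spacy
nlp = spacy.load("en_core_web_sm")
print("en_core_web_sm: OK")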

Dependencies

Quick Installation

Install all required dependencies using the provided requirements file:

pip install -r requirements.txt
Manual Installation

Alternatively, install the core dependencies manually:

# Core ML and data processing libraries
pip install "datasets>=2.14.0" "transformers>=4.30.0" "torch>=2.0.0" "tensorflow>=2.10.0"
pip install "numpy>=1.21.0" "scikit-learn>=1.0.0" "beautifulsoup4>=4.9.0" "requests>=2.25.0"

# Natural Language Processing
pip install "spacy>=3.4.0"

# Download spaCy language model
python -m spacy download en_core_web_sm
Model API Dependencies

For specific model APIs, ensure you have the appropriate packages:

  • OpenAI GPT: pip install "openai>=1.0.0"
  • Anthropic Claude: pip install "anthropic>=0.7.0"
  • Local Llama: pip install "vllm>=0.2.0" (requires CUDA-compatible GPU)
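
If you plan to use a hosted model, the short sketch below simply checks whether the corresponding API key is exported in your environment. The variable names assumed here match the export commands in the Configure Model Settings section later in this guide:

# check_api_keys.py -- sketch: confirm the API key for your chosen model is exported
import os

# Variable names match the export commands used in "Configure Model Settings"
for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{var}: {'set' if os.environ.get(var) else 'not set'}")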
Optional Dependencies

For development and enhanced functionality:

pip install "jupyter>=1.0.0" "matplotlib>=3.5.0" "pillow>=8.0.0"

Troubleshooting Installation

Common Issues:

  1. PyTorch CUDA compatibility: If you have a CUDA-compatible GPU, install PyTorch with CUDA support:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  2. TensorFlow GPU support: For GPU acceleration with TensorFlow:

    pip install tensorflow[and-cuda]
  3. vLLM installation issues: vLLM requires specific CUDA versions. Check the vLLM installation guide for your system.

  4. Memory issues: If you encounter out-of-memory errors during dataset processing, consider:

    • Reducing batch sizes in the configuration
    • Using a machine with more RAM
    • Processing smaller subsets of the data initially

Setup and Data Preparation

Step 0: Install Dependencies and Download Models

Install all dependencies and download the required spaCy language model:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Step 1: Download Required Datasets

1.1 Download WikiWeb2M Dataset
cd data/
bash download_wikiweb2m.sh
1.2 Download MultiModalQA Training Data
cd create_few_shot_samples/
bash download_mmqa_train.sh

Step 2: Parse and Prepare Base Dataset

# Parse WikiWeb2M dataset and save as HuggingFace format
python data/parse_and_save_dataset.py

Step 3: Create Few-Shot Examples

# Create few-shot examples from MultiModalQA
python create_few_shot_samples/create_few_shot_from_multimodalqa.py

Step 4: Create Document Pool

# Create pools of related documents for multihop reasoning
python data/create_document_pool.py

Running Data Synthesis

Configure Model Settings

Choose one of the following language models:

Option 1: OpenAI GPT (Recommended)

Set your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"
Option 2: Anthropic Claude

Set your Anthropic API key:

export ANTHROPIC_API_KEY="your-api-key-here"
Option 3: Local Llama Model

Start the Llama server:

# For Llama 3.1
bash lvlm/llama/host_llama_3_1.sh

# For Llama 3.2
bash lvlm/llama/host_llama_3_2.sh

Generate Synthetic Dataset

Run the main data synthesis pipeline:

python src/create_dataset.py \
    --model gpt \
    --num-few-shot 1 \
    --num-examples 5000 \
    --output-dataset FM2DS/data/generated_data/synth

Parameters:

  • --model: Choose from gpt, claude, or llama
  • --num-few-shot: Number of few-shot examples (default: 1)
  • --num-examples: Total number of examples to generate (default: 5000)
  • --output-dataset: Output directory for generated dataset

Generated Data Format

The synthesized dataset contains the following structure:

{
    "question": "Which country is ranked lower in EuroCup Basketball Performance...",
    "answer": "France",
    "documents": [
        {
            "title": "Document Title",
            "content": [
                {"type": "text", "value": "Text content here..."},
                {"type": "image", "value": "http://example.com/image.jpg"}
            ]
        }
    ],
    "query": ["step-by-step reasoning process", "explanation of answer derivation"]
}
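
To take a quick look at the generated examples, you can load the output the same way the conversion scripts below do. The following sketch assumes the dataset was written to the --output-dataset path used above and can be read with datasets.load_from_disk:

# inspect_synth_data.py -- sketch: load and inspect the synthesized dataset
from datasets import load_from_disk

# Path matches the --output-dataset argument used above
dataset = load_from_disk("FM2DS/data/generated_data/synth")
print(f"Number of examples: {len(dataset)}")

example = dataset[0]
print("Question:", example["question"])
print("Answer:", example["answer"])
print("Documents:", len(example["documents"]))
print("Query (reasoning steps):", example["query"])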

Using the Data for Model Training

Important Training Considerations

⚠️ Critical for Model Training: When training multimodal models with this data, include both the question-answer pairs AND the generated queries. The queries contain step-by-step reasoning that is essential for teaching models multihop reasoning capabilities.

Data Conversion Scripts

Below are example Python scripts to convert the FM²DS data format for specific model training:

Example: Converting for InternVL2 Training
# convert_for_internvl2.py
import json
from datasets import load_from_disk

def convert_fm2ds_to_internvl2(input_dataset_path, output_file):
    """
    Convert FM2DS dataset to InternVL2 training format
    """
    dataset = load_from_disk(input_dataset_path)
    converted_data = []
    
    for example in dataset:
        # Extract images from documents
        images = []
        text_content = ""
        
        for doc in example['documents']:
            for content in doc['content']:
                if content['type'] == 'image':
                    images.append(content['value'])
                elif content['type'] == 'text':
                    text_content += content['value'] + " "
        
        # Create InternVL2 format with question, answer, and reasoning
        reasoning_steps = " ".join(example['query']) if isinstance(example['query'], list) else example['query']
        
        internvl_example = {
            "id": f"fm2ds_{len(converted_data)}",
            "image": images[0] if images else None,  # InternVL2 typically uses single image
            "conversations": [
                {
                    "from": "human",
                    "value": f"Context: {text_content.strip()}\n\nQuestion: {example['question']}\n\nPlease provide step-by-step reasoning and then the final answer."
                },
                {
                    "from": "gpt", 
                    "value": f"Reasoning: {reasoning_steps}\n\nAnswer: {example['answer']}"
                }
            ]
        }
        converted_data.append(internvl_example)
    
    # Save in JSONL format
    with open(output_file, 'w') as f:
        for item in converted_data:
            f.write(json.dumps(item) + '\n')
    
    print(f"Converted {len(converted_data)} examples to {output_file}")

# Usage
convert_fm2ds_to_internvl2("FM2DS/data/generated_data/synth", "internvl2_training_data.jsonl")
Example: Converting for Generic VLM Training
# convert_for_generic_vlm.py
import json
from datasets import load_from_disk

def convert_fm2ds_to_generic_vlm(input_dataset_path, output_file):
    """
    Convert FM2DS dataset to generic VLM training format
    """
    dataset = load_from_disk(input_dataset_path)
    converted_data = []
    
    for example in dataset:
        # Prepare multimodal input
        multimodal_input = {
            "text_documents": [],
            "images": [],
            "question": example['question'],
            "reasoning_steps": example['query'],
            "answer": example['answer']
        }
        
        for doc in example['documents']:
            text_parts = []
            for content in doc['content']:
                if content['type'] == 'text':
                    text_parts.append(content['value'])
                elif content['type'] == 'image':
                    multimodal_input['images'].append({
                        "url": content['value'],
                        "caption": ""  # Add caption if available
                    })
            
            if text_parts:
                multimodal_input['text_documents'].append({
                    "title": doc['title'],
                    "content": " ".join(text_parts)
                })
        
        converted_data.append(multimodal_input)
    
    with open(output_file, 'w') as f:
        json.dump(converted_data, f, indent=2)
    
    print(f"Converted {len(converted_data)} examples to {output_file}")

# Usage
convert_fm2ds_to_generic_vlm("FM2DS/data/generated_data/synth", "generic_vlm_training_data.json")

Training Recommendations

  1. Include Reasoning Steps: Always incorporate the generated queries/reasoning steps in your training data
  2. Multimodal Alignment: Ensure your model can process both text and images from the documents
  3. Multihop Training: Structure training to encourage step-by-step reasoning across multiple documents
  4. Validation: Use the provided M²QA-Bench (M2QA_Bench.json) for evaluation

Evaluation

Use the M²QA-Bench for evaluating trained models:

import json

# Load benchmark
with open('M2QA_Bench.json', 'r') as f:
    benchmark = json.load(f)

# Each item contains:
# - question: The question to answer
# - answer: Ground truth answer  
# - modalities: Required modalities (text, image, table)
# - pages: Source Wikipedia pages

Performance Tips

  • Use --num-few-shot 3 for better generation quality
  • Start with smaller --num-examples for testing
  • Monitor validation success rates in the generation logs

Citation

@inproceedings{
  abaskohi2025fmds,
  title={{FM}2{DS}: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering},
  author={Amirhossein Abaskohi and Spandana Gella and Giuseppe Carenini and Issam H. Laradji},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=esIjdsJQtC}
}
