SILVAR 🚀

[EMNLP 2025] Reasoning Speech Instruction with a Visual Language Model for Object Localization and Text Generation

The first end-to-end speech-driven VLM for object localization & text generation! 🎙️➡️🖼️

SILVAR processes speech directly with a Whisper encoder, eliminating the intermediate speech-to-text conversion that pipelines built around models such as GPT-4o mini or Gemini 1.5 need before they can follow spoken instructions.

🔥 Key Highlights:
End-to-end Speech-to-Vision Reasoning – No text conversion needed!
State-of-the-Art Performance – Competes with top models on MMMU & ScienceQA.
Efficient & Scalable – Comparable results with fewer parameters.

📌 Why SILVAR?

Existing VLMs require text input, but SILVAR directly understands speech for object localization and reasoning, pushing the boundaries of speech-driven AI!

🛠️ Supported Models

SILVAR is designed for flexibility, allowing seamless integration with various state-of-the-art models. Currently, the supported models include:

  • 📝 Language Models: Mistral, Llama (2, 3, 3.1), Deepseek R1 (Distill Llama 8B)
  • 🖼️ Vision Encoders: CLIP and its variants (e.g., Biomed-CLIP)
  • 🎙️ Audio Encoders: Whisper and its variants

SILVAR is an end-to-end visual language model that takes speech as the input instruction for reasoning-based visual question answering and object localization.
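
As a rough illustration of this design, the sketch below wires a Whisper encoder and a CLIP vision encoder to a causal language model through two linear projections, then lets the LLM generate the textual answer (for example, text containing a bounding box). This is a minimal sketch under assumed names and checkpoints, not the repository's actual model code.

# Minimal sketch of the SilVar-style pipeline (illustrative, not the repository's code):
# speech goes through a Whisper encoder, the image through a CLIP encoder, both are
# projected into the LLM embedding space, and the LLM generates the textual answer.
import torch
import torch.nn as nn
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPVisionModel, WhisperModel)

class SilVarSketch(nn.Module):
    def __init__(self, llm_name="mistralai/Mistral-7B-Instruct-v0.2"):
        super().__init__()
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        hidden = self.llm.config.hidden_size
        # Linear projections map encoder features into the LLM token-embedding space.
        self.audio_proj = nn.Linear(self.audio_encoder.config.d_model, hidden)
        self.vision_proj = nn.Linear(self.vision_encoder.config.hidden_size, hidden)

    @torch.no_grad()
    def answer(self, audio_features, pixel_values, max_new_tokens=64):
        # audio_features: log-mel features from WhisperFeatureExtractor, (B, 80, 3000)
        # pixel_values:   preprocessed image from CLIPImageProcessor, (B, 3, 224, 224)
        speech_tokens = self.audio_proj(self.audio_encoder(audio_features).last_hidden_state)
        image_tokens = self.vision_proj(self.vision_encoder(pixel_values).last_hidden_state)
        # Image and spoken-instruction tokens are fed to the LLM as soft prompt embeddings.
        inputs_embeds = torch.cat([image_tokens, speech_tokens], dim=1)
        out = self.llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
        return self.tokenizer.batch_decode(out, skip_special_tokens=True)

Whether the encoders are updated during training or kept frozen is controlled by the freeze_vision / freeze_audio flags described under Training Configuration below.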


Installation

conda create -n silvar python=3.10.13
conda activate silvar
git clone https://github.com/Hanhpt23/SilVar.git
cd SilVar
pip install -r requirements.txt

Training

Vision encoder and audio encoder settings

We have released our checkpoint here; you can download it and use it as pretrained weights or for inference.

Training Configuration

  • Set the pretrained checkpoint for downstream tasks here at Line 10.
  • Set the training image path here at Line 35.
  • Set the training annotation path here at Line 36.
  • Set the training audio path here at Line 37.
  • Set the output directory here at Line 54.
  • Set the wandb token here at Line 69.
  • If you want to train the model end-to-end, set freeze_vision and freeze_audio to False here on Lines 17 and 18 (a scripted way to apply these edits is sketched below).
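
The settings above point into train_configs/train.yaml (the file passed as --cfg-path in the Run section). If you prefer to script the edits, a sketch like the following works; the section and key names used here (model, datasets, run, ckpt, image_path, ann_path, audio_path, output_dir, wandb_token) are assumptions for illustration, so check them against the actual file. Only freeze_vision and freeze_audio are named in the list above.

# Hypothetical helper for filling in train_configs/train.yaml programmatically.
# All section/key names below are placeholders except freeze_vision and freeze_audio;
# verify them against the real file before use.
import yaml  # pip install pyyaml

cfg_path = "train_configs/train.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}

model = cfg.setdefault("model", {})
model["ckpt"] = "pretrained_checkpoint/checkpoint_19.pth"  # pretrained checkpoint (Line 10)
model["freeze_vision"] = False                             # train end-to-end (Line 17)
model["freeze_audio"] = False                              # train end-to-end (Line 18)

data = cfg.setdefault("datasets", {})
data["image_path"] = "Silvar/train/images"                 # training images (Line 35)
data["ann_path"] = "Silvar/train/train.json"               # training annotations (Line 36)
data["audio_path"] = "Silvar/train/audio"                  # training audio (Line 37)

run = cfg.setdefault("run", {})
run["output_dir"] = "outputs/silvar"                       # output directory (Line 54)
run["wandb_token"] = "<your_wandb_token>"                  # wandb token (Line 69)

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)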

Evaluation Configuration

  • Set the checkpoint here at Line 10.
  • Set the evaluation image path here at Line 36.
  • Set the evaluation annotation path here at Line 35.
  • Set the evaluation audio path here at Line 38.
  • Set the output directory here at Line 54 (a quick path check is sketched below).
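
Before launching evaluation, a quick sanity check that the configured paths exist can save a failed run. The snippet below is illustrative; the key names (datasets, image_path, ann_path, audio_path) are assumptions, so adapt them to the actual eval_configs/evaluate.yaml.

# Illustrative pre-flight check: confirm the dataset paths set in evaluate.yaml exist.
# The key names used here are hypothetical; adjust them to the real config structure.
import os
import yaml

with open("eval_configs/evaluate.yaml") as f:
    cfg = yaml.safe_load(f) or {}

datasets = cfg.get("datasets", {})
for key in ("image_path", "ann_path", "audio_path"):
    path = datasets.get(key)
    status = "found" if path and os.path.exists(path) else "MISSING"
    print(f"{key}: {path} -> {status}")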

Run

  • To run on a terminal:
torchrun --nproc_per_node 2 train.py \
        --cfg-path train_configs/train.yaml \
        --cfg-eval-path eval_configs/evaluate.yaml \
        --eval-dataset audio_val
  • To submit to an HPC:
sbatch scripts/silvar/train.sh

Evaluation

  • To run on a terminal:
torchrun --nproc_per_node 2 evaluate.py \
      --cfg-path eval_configs/evaluate.yaml \
      --eval-dataset audio_val
  • To submit to an HPC:
sbatch scripts/silvar/evaluate.sh

Dataset structure

Silvar
├── train
│   ├── audio
│   ├── images
│   └── train.json
├── test
│   ├── audio
│   ├── images
│   └── test.json
└── pretrained_checkpoint
    └── checkpoint_19.pth
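
For orientation, here is a hypothetical shape for one record in train.json; the field names and bounding-box format below are assumptions, not the released annotation schema, so inspect the actual file.

# Hypothetical train.json record (field names and bbox format are assumptions).
import json

record = {
    "image": "images/0001.jpg",    # relative to Silvar/train/
    "audio": "audio/0001.wav",     # the spoken instruction for this sample
    "question": "Where is the red car in this image?",
    "answer": "The red car is at [120, 56, 340, 298].",  # bbox coordinates as text
}
print(json.dumps([record], indent=2))  # assuming train.json holds a list of such records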
