SILVAR 🚀

[EMNLP 2025] Reasoning Speech Instruction with a Visual Language Model for Object Localization and Text Generation

The first end-to-end speech-driven VLM for object localization & text generation! 🎙️➡️🖼️

SILVAR processes speech directly with a Whisper encoder, eliminating the intermediate speech-to-text conversion that pipelines built around models such as GPT-4o mini or Gemini 1.5 need before they can follow spoken instructions.

🔥 Key Highlights:
End-to-end Speech-to-Vision Reasoning – No text conversion needed!
State-of-the-Art Performance – Competes with top models on MMMU & ScienceQA.
Efficient & Scalable – Comparable results with fewer parameters.

📌 Why SILVAR?

Existing VLMs require text input, but SILVAR directly understands speech for object localization and reasoning, pushing the boundaries of speech-driven AI!

🛠️ Supported Models

SILVAR is designed for flexibility, allowing seamless integration with various state-of-the-art models. Currently, the supported models include:

  • 📝 Language Models: Mistral, Llama (2, 3, 3.1), Deepseek R1 (Distill Llama 8B)
  • 🖼️ Vision Encoders: CLIP and its variants (e.g., Biomed-CLIP)
  • 🎙️ Audio Encoders: Whisper and its variants

SILVAR is an end-to-end visual language model that takes speech as the input instruction for reasoning-based visual question answering and object localization.
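
As a rough illustration of this design, the sketch below wires a Whisper encoder and a CLIP vision encoder to a causal language model through two linear projections, then lets the LLM generate the textual answer (for example, text containing a bounding box). This is a minimal sketch under assumed names and checkpoints, not the repository's actual model code.

# Minimal sketch of the SilVar-style pipeline (illustrative, not the repository's code):
# speech goes through a Whisper encoder, the image through a CLIP encoder, both are
# projected into the LLM embedding space, and the LLM generates the textual answer.
import torch
import torch.nn as nn
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPVisionModel, WhisperModel)

class SilVarSketch(nn.Module):
    def __init__(self, llm_name="mistralai/Mistral-7B-Instruct-v0.2"):
        super().__init__()
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        hidden = self.llm.config.hidden_size
        # Linear projections map encoder features into the LLM token-embedding space.
        self.audio_proj = nn.Linear(self.audio_encoder.config.d_model, hidden)
        self.vision_proj = nn.Linear(self.vision_encoder.config.hidden_size, hidden)

    @torch.no_grad()
    def answer(self, audio_features, pixel_values, max_new_tokens=64):
        # audio_features: log-mel features from WhisperFeatureExtractor, (B, 80, 3000)
        # pixel_values:   preprocessed image from CLIPImageProcessor, (B, 3, 224, 224)
        speech_tokens = self.audio_proj(self.audio_encoder(audio_features).last_hidden_state)
        image_tokens = self.vision_proj(self.vision_encoder(pixel_values).last_hidden_state)
        # Image and spoken-instruction tokens are fed to the LLM as soft prompt embeddings.
        inputs_embeds = torch.cat([image_tokens, speech_tokens], dim=1)
        out = self.llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
        return self.tokenizer.batch_decode(out, skip_special_tokens=True)

Whether the encoders are updated during training or kept frozen is controlled by the freeze_vision / freeze_audio flags described under Training Configuration below.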


Installation

conda create -n silvar python=3.10.13
conda activate silvar
git clone https://github.com/Hanhpt23/SilVar.git
cd SilVar
pip install -r requirements.txt

Training

Vision encoder and audio encoder settings

We have released our checkpoint here; you can download it and use it as pretrained weights or for inference.

Training Configuration

  • Set the pretrained checkpoint for downstream tasks here at Line 10.
  • Set the training image path here at Line 35.
  • Set the training annotation path here at Line 36.
  • Set the training audio path here at Line 37.
  • Set the output directory here at Line 54.
  • Set the wandb token here at Line 69.
  • If you want to train the model end-to-end, set freeze_vision and freeze_audio to False here on Lines 17 and 18 (a scripted way to apply these edits is sketched below).
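
The settings above point into train_configs/train.yaml (the file passed as --cfg-path in the Run section). If you prefer to script the edits, a sketch like the following works; the section and key names used here (model, datasets, run, ckpt, image_path, ann_path, audio_path, output_dir, wandb_token) are assumptions for illustration, so check them against the actual file. Only freeze_vision and freeze_audio are named in the list above.

# Hypothetical helper for filling in train_configs/train.yaml programmatically.
# All section/key names below are placeholders except freeze_vision and freeze_audio;
# verify them against the real file before use.
import yaml  # pip install pyyaml

cfg_path = "train_configs/train.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}

model = cfg.setdefault("model", {})
model["ckpt"] = "pretrained_checkpoint/checkpoint_19.pth"  # pretrained checkpoint (Line 10)
model["freeze_vision"] = False                             # train end-to-end (Line 17)
model["freeze_audio"] = False                              # train end-to-end (Line 18)

data = cfg.setdefault("datasets", {})
data["image_path"] = "Silvar/train/images"                 # training images (Line 35)
data["ann_path"] = "Silvar/train/train.json"               # training annotations (Line 36)
data["audio_path"] = "Silvar/train/audio"                  # training audio (Line 37)

run = cfg.setdefault("run", {})
run["output_dir"] = "outputs/silvar"                       # output directory (Line 54)
run["wandb_token"] = "<your_wandb_token>"                  # wandb token (Line 69)

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)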

Evaluation Configuration

  • Set the checkpoint here at Line 10.
  • Set the evaluation image path here at Line 36.
  • Set the evaluation annotation path here at Line 35.
  • Set the evaluation audio path here at Line 38.
  • Set the output directory here at Line 54 (a quick path check is sketched below).
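
Before launching evaluation, a quick sanity check that the configured paths exist can save a failed run. The snippet below is illustrative; the key names (datasets, image_path, ann_path, audio_path) are assumptions, so adapt them to the actual eval_configs/evaluate.yaml.

# Illustrative pre-flight check: confirm the dataset paths set in evaluate.yaml exist.
# The key names used here are hypothetical; adjust them to the real config structure.
import os
import yaml

with open("eval_configs/evaluate.yaml") as f:
    cfg = yaml.safe_load(f) or {}

datasets = cfg.get("datasets", {})
for key in ("image_path", "ann_path", "audio_path"):
    path = datasets.get(key)
    status = "found" if path and os.path.exists(path) else "MISSING"
    print(f"{key}: {path} -> {status}")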

Run

  • To run on a terminal:
torchrun --nproc_per_node 2 train.py \
        --cfg-path train_configs/train.yaml \
        --cfg-eval-path eval_configs/evaluate.yaml \
        --eval-dataset audio_val
  • To submit to an HPC:
sbatch scripts/silvar/train.sh

Evaluation

  • To run on a terminal:
torchrun --nproc_per_node 2 evaluate.py \
      --cfg-path eval_configs/evaluate.yaml \
      --eval-dataset audio_val
  • To submit to an HPC:
sbatch scripts/silvar/evaluate.sh

Dataset structure

Silvar
├── train
│   ├── audio
│   ├── images
│   └── train.json
├── test
│   ├── audio
│   ├── images
│   └── test.json
└── pretrained_checkpoint
    └── checkpoint_19.pth
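
For orientation, here is a hypothetical shape for one record in train.json; the field names and bounding-box format below are assumptions, not the released annotation schema, so inspect the actual file.

# Hypothetical train.json record (field names and bbox format are assumptions).
import json

record = {
    "image": "images/0001.jpg",    # relative to Silvar/train/
    "audio": "audio/0001.wav",     # the spoken instruction for this sample
    "question": "Where is the red car in this image?",
    "answer": "The red car is at [120, 56, 340, 298].",  # bbox coordinates as text
}
print(json.dumps([record], indent=2))  # assuming train.json holds a list of such records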
