The first end-to-end speech-driven VLM for object localization & text generation! 🎙️➡️🖼️
SILVAR processes speech directly with a Whisper encoder, eliminating the need for intermediate text conversion before passing instructions to models like GPT-4o Mini or Gemini 1.5.
🔥 Key Highlights:
✅ End-to-end Speech-to-Vision Reasoning – No text conversion needed!
✅ State-of-the-Art Performance – Competes with top models on MMMU & ScienceQA.
✅ Efficient & Scalable – Comparable results with fewer parameters.
Existing VLMs require text input, but SILVAR directly understands speech for object localization and reasoning, pushing the boundaries of speech-driven AI!
SILVAR is designed for flexibility, allowing seamless integration with various state-of-the-art models. Currently, the supported models include:
- 📝 Language Models: Mistral, Llama (2, 3, 3.1), Deepseek R1 (Distill Llama 8B)
- 🖼️ Vision Encoders: CLIP and its variants (e.g., Biomed-CLIP)
- 🎙️ Audio Encoders: Whisper and its variants
SILVAR is an end-to-end visual language model that uses speech as the input instruction for reasoning-based visual question answering and object localization.
```bash
conda create -n silvar python=3.10.13
conda activate silvar
git clone https://github.com/Hanhpt23/SilVar.git
cd SilVar
pip install -r requirements.txt
```
We have released our checkpoint here; you can download it and use it as a pretrained weight or for inference.
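If you use the released checkpoint, place it where your configs can reference it. A minimal sketch, assuming the file is named `checkpoint_19.pth` as in the dataset layout below and that you keep it under `pretrained_checkpoint/` (the exact location is up to you):
```bash
# Example layout only: move the downloaded checkpoint into the folder
# that your training/evaluation configs point to (path is an assumption).
mkdir -p pretrained_checkpoint
mv /path/to/checkpoint_19.pth pretrained_checkpoint/
```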
- Set the pretrained checkpoint for downstream tasks here at Line 10.
- Set the training image path here at Line 35.
- Set the training annotation path here at Line 36.
- Set the training audio path here at Line 37.
- Set the output directory here at Line 54.
- Set the wandb token here at Line 69.
- If you want to train the model end-to-end, set `freeze_vision` and `freeze_audio` to `False` here on Lines 17 and 18 (a config sketch follows this list).
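The exact layout of `train_configs/train.yaml` may differ between versions; the sketch below only illustrates where the settings above typically live. Every key name except `freeze_vision` and `freeze_audio` is a placeholder, and the paths follow the dataset structure shown at the end of this section.
```yaml
# Illustrative sketch only -- every key name except freeze_vision and
# freeze_audio is a placeholder; check train_configs/train.yaml for the
# real structure and line numbers.
model:
  ckpt: pretrained_checkpoint/checkpoint_19.pth   # pretrained checkpoint (Line 10)
  freeze_vision: False                            # Line 17: False to train the vision encoder end-to-end
  freeze_audio: False                             # Line 18: False to train the audio encoder end-to-end

datasets:
  audio_train:
    image_path: Silvar/train/images               # training images (Line 35)
    ann_path: Silvar/train/train.json             # training annotations (Line 36)
    audio_path: Silvar/train/audio                # training audio (Line 37)

run:
  output_dir: outputs/silvar                      # output directory (Line 54)
  wandb_token: "<your_wandb_token>"               # wandb token (Line 69)
```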
- Set the checkpoint here at Line 10.
- Set the evaluation image path here at Line 36.
- Set the evaluation annotation path here at Line 35.
- Set the evaluation audio path here at Line 38.
- Set the output directory here at Line 54 (see the sketch below).
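Likewise, a minimal sketch of `eval_configs/evaluate.yaml` under the same assumptions (placeholder key names; only the dataset name `audio_val` is taken from the run commands below):
```yaml
# Illustrative sketch only -- placeholder key names; check eval_configs/evaluate.yaml.
model:
  ckpt: pretrained_checkpoint/checkpoint_19.pth   # checkpoint to evaluate (Line 10)

datasets:
  audio_val:                                      # dataset name passed via --eval-dataset
    ann_path: Silvar/test/test.json               # evaluation annotations (Line 35)
    image_path: Silvar/test/images                # evaluation images (Line 36)
    audio_path: Silvar/test/audio                 # evaluation audio (Line 38)

run:
  output_dir: outputs/silvar_eval                 # output directory (Line 54)
```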
- To run training on a terminal:
```bash
torchrun --nproc_per_node 2 train.py \
    --cfg-path train_configs/train.yaml \
    --cfg-eval-path eval_configs/evaluate.yaml \
    --eval-dataset audio_val
```
- To submit a training job to an HPC:
```bash
sbatch scripts/silvar/train.sh
```
- To run evaluation on a terminal:
```bash
torchrun --nproc_per_node 2 evaluate.py \
    --cfg-path eval_configs/evaluate.yaml \
    --eval-dataset audio_val
```
- To submit an evaluation job to an HPC:
```bash
sbatch scripts/silvar/evaluate.sh
```
The dataset should be structured as follows:
```
Silvar
├── train
│   ├── audio
│   ├── images
│   ├── train.json
├── test
│   ├── audio
│   ├── images
│   ├── test.json
└── pretrained_checkpoint
    └── checkpoint_19.pth
```

