Vibevoice 🎙️

Hi, I'm Marc Päpper, and I wanted to vibe code like Karpathy ;D, so I looked around and found the cool work of Vlad. I extended it to run with a local Whisper model, so I don't need to pay for OpenAI API tokens. I hope you have fun with it!

What it does 🚀

Demo Video

Simply run cli.py and start dictating text anywhere on your system:

  1. Hold down the right Control key (ctrl_r)
  2. Speak your text
  3. Release the key
  4. Watch as your spoken words are transcribed and automatically typed!

Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
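
Under the hood, this is a global key listener that records audio while the key is held and types the transcript on release. Here is a minimal sketch of that loop, assuming pynput for the key hooks and sounddevice for recording (both assumptions; transcribe() is a stand-in for the local Whisper call, not the project's actual function):

import sounddevice as sd
from pynput import keyboard

SAMPLE_RATE = 16000
chunks = []
stream = None

def transcribe(audio_chunks):
    # Stand-in for the local Whisper call; see the backend sketch under Requirements.
    return "transcribed text"

def on_press(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is None:
        chunks.clear()
        stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                                callback=lambda data, *_: chunks.append(data.copy()))
        stream.start()  # record while the key is held

def on_release(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is not None:
        stream.stop(); stream.close(); stream = None
        keyboard.Controller().type(transcribe(chunks))  # type into the focused window

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()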

NEW: LLM voice command mode:

  1. Hold down the scroll_lock key (it's rarely used for anything these days, which is why I chose it)
  2. Speak what you want the LLM to do
  3. The LLM receives your transcribed text and a screenshot of your current view
  4. The LLM's answer is typed out at your cursor as it streams in

Works everywhere on your system, and the LLM always has your screen as context.
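
In essence, command mode is one request against Ollama's REST API: the transcript becomes the prompt and the screenshot rides along as a base64-encoded image. A minimal stdlib-only sketch of that call (the endpoint and payload fields are Ollama's documented /api/generate interface; the image path is a hypothetical placeholder):

import base64, json, urllib.request

with open("/tmp/screen.png", "rb") as f:  # hypothetical screenshot path
    screenshot = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma3:27b",
    "prompt": "Summarize what is on my screen.",
    "images": [screenshot],
    "stream": True,  # Ollama streams one JSON object per line
}
req = urllib.request.Request("http://localhost:11434/api/generate",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    for line in resp:
        print(json.loads(line).get("response", ""), end="", flush=True)  # or type it out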

Installation 🛠️

git clone https://github.com/mpaepper/vibevoice.git
cd vibevoice
pip install -r requirements.txt
python src/vibevoice/cli.py

Requirements 📋

Python Dependencies

  • Python 3.12 or higher

System Requirements

  • CUDA-capable GPU (recommended); CPU use can be enabled in server.py (see the sketch after this list)
  • CUDA 12.x
  • cuBLAS
  • cuDNN 9.x
  • If you get the error OSError: PortAudio library not found, run sudo apt install libportaudio2
  • Ollama for AI command mode (with multimodal models for screenshot support)
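
For reference, here is roughly what the CPU/GPU switch looks like if the transcription backend is faster-whisper (an assumption on my part; check server.py for the actual model setup):

from faster_whisper import WhisperModel

use_gpu = True  # set False for CPU: slower, but no CUDA/cuDNN needed
model = WhisperModel(
    "large-v3",
    device="cuda" if use_gpu else "cpu",
    compute_type="float16" if use_gpu else "int8",  # int8 keeps CPU inference usable
)
segments, _info = model.transcribe("audio.wav")
print("".join(segment.text for segment in segments))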

Setting up Ollama

  1. Install Ollama by following the instructions at ollama.com
  2. Pull a model that supports both text and images for best results:
    ollama pull gemma3:27b  # A strong model that runs on an RTX 3090 or similar
  3. Make sure Ollama is running in the background:
    ollama serve
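
To confirm Ollama is reachable before starting VibeVoice, you can query its documented /api/tags endpoint with a couple of lines of Python:

import json, urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("Ollama is up; installed models:", models)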

Handling the CUDA requirements

  • Make sure that you have CUDA >= 12.4 and cuDNN >= 9.x
  • I had some trouble at first with Ubuntu 24.04, so I did the following:
sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
sudo apt install cuda-toolkit-12-8

or alternatively:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cudnn9-cuda-12
  • Then after rebooting, it worked well.
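
After the reboot you can sanity-check that the GPU stack is visible from Python. If PyTorch is in your environment (likely for a Whisper-based setup, but treat this as an assumption), a two-liner suffices:

import torch
print(torch.cuda.is_available(), torch.version.cuda)  # expect: True 12.x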

Usage 💡

  1. Start the application:
python src/vibevoice/cli.py
  2. Hold down the right Control key (ctrl_r) while speaking
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

Configuration

You can customize various aspects of VibeVoice with the following environment variables:

Keyboard Controls

  • VOICEKEY: Change the dictation activation key (default: "ctrl_r")
    export VOICEKEY="ctrl"  # Use left control instead
  • VOICEKEY_CMD: Set the key for AI command mode (default: "scroll_lock")
    export VOICEKEY_CMD="ctrl"  # Use left control instead of the Scroll Lock key

AI and Screenshot Features

  • OLLAMA_MODEL: Specify which Ollama model to use (default: "gemma3:27b")
    export OLLAMA_MODEL="gemma3:4b"  # Use a smaller VLM if you have less GPU memory
  • INCLUDE_SCREENSHOT: Enable or disable screenshots in AI command mode (default: "true")
    export INCLUDE_SCREENSHOT="false"  # Disable screenshots (they stay local anyway)
  • SCREENSHOT_MAX_WIDTH: Set the maximum width for screenshots (default: "1024")
    export SCREENSHOT_MAX_WIDTH="800"  # Smaller screenshots
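
Internally, these variables are just environment lookups with the defaults listed above. A minimal sketch of how they might map to settings (the variable names match this README; the surrounding code is hypothetical):

import os

VOICE_KEY = os.environ.get("VOICEKEY", "ctrl_r")
VOICE_KEY_CMD = os.environ.get("VOICEKEY_CMD", "scroll_lock")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "gemma3:27b")
INCLUDE_SCREENSHOT = os.environ.get("INCLUDE_SCREENSHOT", "true").lower() == "true"
SCREENSHOT_MAX_WIDTH = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))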

Screenshot Dependencies

To use the screenshot functionality:

sudo apt install gnome-screenshot
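
Capturing and downscaling the screenshot can be done by shelling out to gnome-screenshot (its -f flag writes to a file) and resizing with Pillow; a sketch under those assumptions, with a hypothetical temp path:

import subprocess
from PIL import Image

path = "/tmp/vibevoice_screen.png"  # hypothetical temp path
subprocess.run(["gnome-screenshot", "-f", path], check=True)

img = Image.open(path)
max_width = 1024  # matches the SCREENSHOT_MAX_WIDTH default
if img.width > max_width:
    img = img.resize((max_width, round(img.height * max_width / img.width)))
    img.save(path)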

Usage Modes 💡

VibeVoice supports two modes:

1. Dictation Mode

  1. Hold down the dictation key (default: right Control)
  2. Speak your text
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

2. AI Command Mode

  1. Hold down the command key (default: Scroll Lock)
  2. Ask a question or give a command
  3. Release the key
  4. The AI will analyze your request (and current screen if enabled) and type a response

Credits 🙏
