Vibevoice 🎙️

Hi, I'm Marc Päpper, and I wanted to vibe code like Karpathy ;D, so I looked around and found the cool work of Vlad. I extended it to run with a local Whisper model, so I don't need to pay for OpenAI API tokens. I hope you have fun with it!

What it does 🚀

Demo Video

Simply run cli.py and start dictating text anywhere on your system:

  1. Hold down the right Control key (ctrl_r)
  2. Speak your text
  3. Release the key
  4. Watch as your spoken words are transcribed and automatically typed!

Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
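
Under the hood, this is a global key listener that records audio while the key is held and types the transcript on release. Here is a minimal sketch of that loop, assuming pynput for the key hooks and sounddevice for recording (both assumptions; transcribe() is a stand-in for the local Whisper call, not the project's actual function):

import sounddevice as sd
from pynput import keyboard

SAMPLE_RATE = 16000
chunks = []
stream = None

def transcribe(audio_chunks):
    # Stand-in for the local Whisper call; see the backend sketch under Requirements.
    return "transcribed text"

def on_press(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is None:
        chunks.clear()
        stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                                callback=lambda data, *_: chunks.append(data.copy()))
        stream.start()  # record while the key is held

def on_release(key):
    global stream
    if key == keyboard.Key.ctrl_r and stream is not None:
        stream.stop(); stream.close(); stream = None
        keyboard.Controller().type(transcribe(chunks))  # type into the focused window

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()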

NEW: LLM voice command mode:

  1. Hold down the scroll_lock key (it's rarely used for anything these days, which is why I chose it)
  2. Speak what you want the LLM to do
  3. The LLM receives your transcribed text and a screenshot of your current view
  4. The LLM's answer is typed out at your cursor as it streams in

Works everywhere on your system, and the LLM always has your screen as context.
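
In essence, command mode is one request against Ollama's REST API: the transcript becomes the prompt and the screenshot rides along as a base64-encoded image. A minimal stdlib-only sketch of that call (the endpoint and payload fields are Ollama's documented /api/generate interface; the image path is a hypothetical placeholder):

import base64, json, urllib.request

with open("/tmp/screen.png", "rb") as f:  # hypothetical screenshot path
    screenshot = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma3:27b",
    "prompt": "Summarize what is on my screen.",
    "images": [screenshot],
    "stream": True,  # Ollama streams one JSON object per line
}
req = urllib.request.Request("http://localhost:11434/api/generate",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    for line in resp:
        print(json.loads(line).get("response", ""), end="", flush=True)  # or type it out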

Installation 🛠️

git clone https://github.com/mpaepper/vibevoice.git
cd vibevoice
pip install -r requirements.txt
python src/vibevoice/cli.py

Requirements 📋

Python Dependencies

  • Python 3.12 or higher

System Requirements

  • CUDA-capable GPU (recommended); CPU use can be enabled in server.py (see the sketch after this list)
  • CUDA 12.x
  • cuBLAS
  • cuDNN 9.x
  • If you get the error OSError: PortAudio library not found, run sudo apt install libportaudio2
  • Ollama for AI command mode (with multimodal models for screenshot support)
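
For reference, here is roughly what the CPU/GPU switch looks like if the transcription backend is faster-whisper (an assumption on my part; check server.py for the actual model setup):

from faster_whisper import WhisperModel

use_gpu = True  # set False for CPU: slower, but no CUDA/cuDNN needed
model = WhisperModel(
    "large-v3",
    device="cuda" if use_gpu else "cpu",
    compute_type="float16" if use_gpu else "int8",  # int8 keeps CPU inference usable
)
segments, _info = model.transcribe("audio.wav")
print("".join(segment.text for segment in segments))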

Setting up Ollama

  1. Install Ollama by following the instructions at ollama.com
  2. Pull a model that supports both text and images for best results:
    ollama pull gemma3:27b  # A strong model that runs on an RTX 3090 or similar
  3. Make sure Ollama is running in the background:
    ollama serve
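
To confirm Ollama is reachable before starting VibeVoice, you can query its documented /api/tags endpoint with a couple of lines of Python:

import json, urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("Ollama is up; installed models:", models)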

Handling the CUDA requirements

  • Make sure that you have CUDA >= 12.4 and cuDNN >= 9.x
  • I had some trouble at first with Ubuntu 24.04, so I did the following:
sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
sudo apt install cuda-toolkit-12-8

or alternatively:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cudnn9-cuda-12
  • Then after rebooting, it worked well.
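
After the reboot you can sanity-check that the GPU stack is visible from Python. If PyTorch is in your environment (likely for a Whisper-based setup, but treat this as an assumption), a two-liner suffices:

import torch
print(torch.cuda.is_available(), torch.version.cuda)  # expect: True 12.x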

Usage 💡

  1. Start the application:
python src/vibevoice/cli.py
  2. Hold down the right Control key (ctrl_r) while speaking
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

Configuration

You can customize various aspects of VibeVoice with the following environment variables:

Keyboard Controls

  • VOICEKEY: Change the dictation activation key (default: "ctrl_r")
    export VOICEKEY="ctrl"  # Use left control instead
  • VOICEKEY_CMD: Set the key for AI command mode (default: "scroll_lock")
    export VOICEKEY_CMD="ctrl"  # Use left control instead of the Scroll Lock key

AI and Screenshot Features

  • OLLAMA_MODEL: Specify which Ollama model to use (default: "gemma3:27b")
    export OLLAMA_MODEL="gemma3:4b"  # Use a smaller VLM if you have less GPU memory
  • INCLUDE_SCREENSHOT: Enable or disable screenshots in AI command mode (default: "true")
    export INCLUDE_SCREENSHOT="false"  # Disable screenshots (they stay local anyway)
  • SCREENSHOT_MAX_WIDTH: Set the maximum width for screenshots (default: "1024")
    export SCREENSHOT_MAX_WIDTH="800"  # Smaller screenshots
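
Internally, these variables are just environment lookups with the defaults listed above. A minimal sketch of how they might map to settings (the variable names match this README; the surrounding code is hypothetical):

import os

VOICE_KEY = os.environ.get("VOICEKEY", "ctrl_r")
VOICE_KEY_CMD = os.environ.get("VOICEKEY_CMD", "scroll_lock")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "gemma3:27b")
INCLUDE_SCREENSHOT = os.environ.get("INCLUDE_SCREENSHOT", "true").lower() == "true"
SCREENSHOT_MAX_WIDTH = int(os.environ.get("SCREENSHOT_MAX_WIDTH", "1024"))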

Screenshot Dependencies

To use the screenshot functionality:

sudo apt install gnome-screenshot
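
Capturing and downscaling the screenshot can be done by shelling out to gnome-screenshot (its -f flag writes to a file) and resizing with Pillow; a sketch under those assumptions, with a hypothetical temp path:

import subprocess
from PIL import Image

path = "/tmp/vibevoice_screen.png"  # hypothetical temp path
subprocess.run(["gnome-screenshot", "-f", path], check=True)

img = Image.open(path)
max_width = 1024  # matches the SCREENSHOT_MAX_WIDTH default
if img.width > max_width:
    img = img.resize((max_width, round(img.height * max_width / img.width)))
    img.save(path)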

Usage Modes 💡

VibeVoice supports two modes:

1. Dictation Mode

  1. Hold down the dictation key (default: right Control)
  2. Speak your text
  3. Release to transcribe
  4. Your text appears wherever your cursor is!

2. AI Command Mode

  1. Hold down the command key (default: Scroll Lock)
  2. Ask a question or give a command
  3. Release the key
  4. The AI will analyze your request (and current screen if enabled) and type a response

Credits 🙏
