VoiceBridge 🎙️ ↔️ 📝

A bidirectional voice-text bridge. Convert speech to text and text to speech from the command line, with real-time processing and hotkey-driven workflows.

🚀 What is VoiceBridge?

VoiceBridge removes the friction between voice and text. Whether you're transcribing interviews, creating accessible content, building voice-driven workflows, or dictating text hands-free, VoiceBridge provides one flexible CLI for both directions.

VoiceBridge is built on OpenAI's Whisper for speech recognition and on VibeVoice for natural text-to-speech synthesis.

🎯 What Problems Does It Solve?

Content Creators: Transcribe podcasts, interviews, and videos with timestamp precision
Accessibility: Convert text to natural speech for screen readers and audio content
Productivity: Voice-to-text note-taking with hotkey triggers during meetings
Developers: Integrate speech processing into applications and workflows
Researchers: Batch process audio data with confidence analysis and quality metrics
Writers: Dictate drafts and have articles read back with custom voices

✨ Key Features

🎤 Speech-to-Text (STT)

Real-time transcription with hotkeys (F9 toggle/hold modes)
Interactive mode with press-and-hold 'r' to record
File processing (MP3, WAV, M4A, FLAC, OGG) with chunked processing
Batch transcription of entire directories with parallel workers
Resume capability for interrupted long transcriptions with session management
Streaming transcription with real-time output and live updates
GPU acceleration (CUDA/Metal) with automatic device detection
Memory optimization with configurable limits and streaming
Custom vocabulary management for domain-specific terms
Export formats: JSON, SRT, VTT, plain text, CSV with timestamps and confidence
Confidence analysis and quality assessment with detailed reporting
Webhook integration for external notifications and automation
Post-processing with spell check, grammar correction, and custom rules
Profile management for different use cases and configurations
Performance monitoring with comprehensive metrics and benchmarking

🗣️ Text-to-Speech (TTS)

High-quality voice synthesis with VibeVoice neural models
Multiple input modes: clipboard monitoring, text selection, direct input
Custom voice samples with automatic detection and voice cloning
Streaming and non-streaming modes for real-time or complete generation
Daemon mode for background processing and system integration
Hotkey controls for hands-free operation (F12 generate, Ctrl+Alt+S stop)
Voice management with sample validation and quality checks
GPU acceleration for faster synthesis and model loading
Configuration profiles for different voice settings and use cases
Audio output options: play immediately, save to file, or both

🔧 Advanced Processing

Audio enhancement: noise reduction, normalization, silence trimming, fade effects
Audio splitting: by duration, silence detection, or file size with smart segmentation
Confidence analysis and quality assessment with detailed statistics
Session management with progress tracking, resume capability, and persistence
Performance monitoring with GPU benchmarking, memory usage, and operation tracking
Webhook integration for external notifications and workflow automation
Profile management for different use cases and quick configuration switching
Vocabulary management for improved recognition of technical terms and proper nouns
Post-processing pipeline with spell check, grammar correction, and custom rules
API server for integration with external applications and services
Comprehensive testing with E2E test suites for all major functionality

🚀 Quick Start

Installation

VoiceBridge uses uv for fast dependency management. Install uv first if you don't have it:

# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install with uv
uv pip install voicebridge

Basic Usage

# Listen for speech and transcribe with hotkeys
voicebridge stt listen

# Transcribe an audio file
voicebridge stt transcribe audio.mp3 --output transcript.txt

# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"

# Start clipboard monitoring for TTS
voicebridge tts listen-clipboard

📖 Examples

1. Content Creator Workflow

# Transcribe a podcast episode with timestamps
voicebridge stt transcribe podcast_episode.mp3 \
  --format srt \
  --output episode_subtitles.srt \
  --language en

# Analyze transcription quality
voicebridge stt confidence analyze session_12345 --detailed
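
The `--format srt` option above produces subtitle files. As a rough illustration of what that format involves, here is a minimal Python sketch that renders timed segments as SRT. The function names and the segment dictionary layout (`start`, `end`, `text`) are assumptions made for this example, not VoiceBridge's internal API; only the `HH:MM:SS,mmm` timestamp layout comes from the SRT format itself.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments with 'start', 'end' (seconds) and 'text' keys as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

Timestamps are computed in whole milliseconds so that rounding happens once, before the value is split into hours, minutes, and seconds.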

2. Accessibility Content

# Convert article to speech with custom voice
voicebridge tts generate \
  --voice en-Alice_woman \
  --output article_audio.wav \
  "$(cat article.txt)"

# Batch transcribe a directory of audio files into text transcripts
voicebridge stt batch-transcribe articles/ \
  --output-dir transcripts/ \
  --workers 4
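
The `--workers 4` flag above implies that files are handled by a pool of parallel workers. The following Python sketch shows one way such a pool could look. It is an illustration under stated assumptions, not VoiceBridge's implementation: `transcribe` is a hypothetical callable that turns an audio path into text, and the suffix list simply mirrors the file types named under Key Features.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

# File types listed in this README's feature overview.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def find_audio_files(directory: Path) -> list[Path]:
    """Return audio files in `directory`, sorted for a stable order."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def batch_transcribe(
    directory: Path,
    output_dir: Path,
    transcribe: Callable[[Path], str],  # hypothetical: audio path -> text
    workers: int = 4,
) -> list[Path]:
    """Transcribe every audio file using a pool of `workers` threads."""
    output_dir.mkdir(parents=True, exist_ok=True)
    files = find_audio_files(directory)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps input order even if work finishes out of order.
        texts = list(pool.map(transcribe, files))
    written = []
    for audio, text in zip(files, texts):
        target = output_dir / f"{audio.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Threads are used here for simplicity. A real implementation might prefer processes or a single GPU queue, depending on where the model runs.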

3. Developer Integration

# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard

# Set up webhook notifications
voicebridge stt webhook add https://api.example.com/transcription-complete

# Real-time transcription with streaming
voicebridge stt realtime \
  --chunk-duration 2.0 \
  --output-format live

4. Research & Analysis

# Process interview recordings with resumable capability
voicebridge stt listen-resumable interview.wav \
  --session-name "interview-2024-01-15" \
  --language en

# Export results in multiple formats
voicebridge stt export session session_12345 \
  --format json \
  --include-confidence \
  --output transcript.json
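
This README does not document how confidence values are computed or how the exported JSON is laid out. As background, Whisper's transcription result contains a list of segments, each with an `avg_logprob` field (the mean token log-probability). A common heuristic, assumed here purely for illustration, is to turn that value into a 0 to 1 score with `exp()` and flag segments below a threshold. The threshold of 0.6 and both function names are arbitrary choices for this sketch.

```python
import math


def segment_confidence(avg_logprob: float) -> float:
    """Map an average token log-probability to a 0..1 score via exp()."""
    return min(1.0, math.exp(avg_logprob))


def low_confidence_segments(segments: list[dict], threshold: float = 0.6) -> list[dict]:
    """Return segments whose derived confidence falls below `threshold`."""
    flagged = []
    for seg in segments:
        score = segment_confidence(seg["avg_logprob"])
        if score < threshold:
            flagged.append({
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
                "confidence": round(score, 3),
            })
    return flagged
```

Flagged segments are good candidates for manual review, or for re-transcription with a larger model.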

🛠️ Local Development Setup

Prerequisites

Python 3.10+
uv (Python package manager)
FFmpeg (for audio processing)
CUDA (optional, for GPU acceleration)

Installation

# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/PatrickKoss/VoiceBridge.git
cd VoiceBridge

# 3. Choose your setup:
make prepare        # CPU version
make prepare-cuda   # With CUDA support
make prepare-tray   # With system tray support

# 4. Install system dependencies
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg

# macOS:
brew install ffmpeg

# Windows (with Chocolatey):
choco install ffmpeg

TTS Setup

VoiceBridge includes comprehensive text-to-speech capabilities powered by VibeVoice.

Prerequisites

Install VibeVoice dependencies (if using a local model):

# Clone and install VibeVoice
git clone https://github.com/WestZhang/VibeVoice.git
cd VibeVoice
pip install -e .

Voice Samples: Sample files are included in the voices/ directory:

voices/
├── en-Alice_woman.wav
├── en-Carter_man.wav
├── en-Frank_man.wav
├── en-Maya_woman.wav
├── en-Patrick.wav
└── ... (additional voices)

Configuration

VoiceBridge works out of the box with sensible defaults. Configuration can be set via:

Config file (~/.config/voicebridge/config.json):

{
  "tts_enabled": true,
  "tts_config": {
    "model_path": "aoi-ot/VibeVoice-7B",
    "voice_samples_dir": "voices",
    "default_voice": "en-Alice_woman",
    "cfg_scale": 1.3,
    "inference_steps": 10,
    "tts_mode": "clipboard",
    "streaming_mode": "non_streaming",
    "output_mode": "play",
    "tts_toggle_key": "f11",
    "tts_generate_key": "f12",
    "tts_stop_key": "ctrl+alt+s",
    "sample_rate": 24000,
    "auto_play": true,
    "use_gpu": true,
    "max_text_length": 2000,
    "chunk_text_threshold": 500
  }
}

Command-line flags (override config file):

# Generate with custom settings
voicebridge tts generate "Hello world" \
  --voice en-Patrick \
  --streaming \
  --output speech.wav \
  --cfg-scale 1.5 \
  --inference-steps 15
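
The precedence described above (command-line flags override the config file, which overrides defaults) can be sketched as a three-step merge. This is an assumption-laden illustration, not the project's loader: `load_tts_config` is a hypothetical name, and the `DEFAULTS` values are copied from the example config file rather than from the real built-in defaults.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Illustrative values taken from the example config in this README.
DEFAULTS: dict[str, Any] = {
    "default_voice": "en-Alice_woman",
    "cfg_scale": 1.3,
    "inference_steps": 10,
}


def load_tts_config(
    config_path: Path,
    cli_overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge defaults, then the file's tts_config, then CLI overrides."""
    config = dict(DEFAULTS)
    if config_path.exists():
        file_data = json.loads(config_path.read_text(encoding="utf-8"))
        config.update(file_data.get("tts_config", {}))
    # Flags left unset on the command line arrive as None and must not
    # clobber values that came from the file.
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            config[key] = value
    return config
```

With this ordering, `--cfg-scale 1.5` on the command line wins over `"cfg_scale": 1.3` in the file, while any option not passed keeps its file or default value.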

Voice Sample Requirements

Format: WAV (recommended), MP3, FLAC
Sample Rate: 24 kHz (recommended); 16-48 kHz supported
Channels: Mono (preferred)
Duration: 3-10 seconds
Quality: Clear, single speaker, minimal background noise
Naming: language-name_gender.wav (e.g., en-Alice_woman.wav); some bundled samples, such as en-Patrick.wav, omit the gender suffix
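
The requirements above can be checked before adding a new sample. The sketch below is a minimal illustration using Python's standard `wave` module, so it only covers WAV files. The function names are hypothetical, the gender suffix is treated as optional because the bundled `en-Patrick.wav` has none, and the two-letter language code is an assumption based on the listed samples.

```python
import re
import wave
from pathlib import Path
from typing import Optional

# language-name_gender; gender is optional here (assumption, see lead-in).
VOICE_NAME = re.compile(
    r"^(?P<language>[a-z]{2})-(?P<name>[A-Za-z]+)(?:_(?P<gender>[a-z]+))?$"
)


def parse_voice_name(path: Path) -> Optional[dict]:
    """Split a sample file name into language, name and optional gender."""
    match = VOICE_NAME.match(path.stem)
    return match.groupdict() if match else None


def check_wav_sample(path: Path) -> list[str]:
    """Return a list of problems found in a WAV sample (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if wav.getnchannels() != 1:
            problems.append("not mono (mono is preferred)")
        if not 16_000 <= rate <= 48_000:
            problems.append(f"sample rate {rate} Hz outside 16-48 kHz")
        if not 3.0 <= duration <= 10.0:
            problems.append(f"duration {duration:.1f} s outside 3-10 s")
    return problems
```

Audio quality (a clear single speaker, minimal background noise) cannot be verified from the file header and still needs a listening check.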

Quick Test

# Test TTS with default settings
voicebridge tts generate "Hello, this is VoiceBridge text-to-speech!"

# List available voices
voicebridge tts voices

# Show current TTS configuration
voicebridge tts config show

Development Commands

make help           # Show all available commands
make lint           # Run ruff linting and formatting
make test           # Run all tests with coverage
make test-fast      # Quick tests without coverage
make test-unit      # Run only unit tests (exclude e2e)
make test-e2e       # Run comprehensive end-to-end tests
make test-e2e-smoke # Run quick E2E smoke tests
make test-e2e-stt   # Run STT E2E tests only
make test-e2e-tts   # Run TTS E2E tests only
make test-e2e-audio # Run audio E2E tests only
make test-e2e-gpu   # Run GPU E2E tests only
make test-e2e-api   # Run API E2E tests only
make clean          # Clean cache and temporary files

Configuration

# Show current STT configuration
voicebridge stt config show

# Set STT configuration values
voicebridge stt config set use_gpu true

# Show TTS configuration
voicebridge tts config show

# Set up profiles for different use cases
voicebridge stt profile save research-setup
voicebridge stt profile load research-setup

🎮 Usage Guide

Speech-to-Text (STT) Commands

Real-time Recognition

# Listen with hotkeys (F9 to start/stop)
voicebridge stt listen

# Interactive mode (press and hold 'r' to record)
voicebridge stt interactive

# Global hotkey listener with custom key
voicebridge stt hotkey --key f9 --mode toggle

File Processing

# Transcribe single file
voicebridge stt transcribe audio.mp3 --output transcript.txt

# Batch process directory
voicebridge stt batch-transcribe /path/to/audio/ --workers 4

# Long file with resume capability
voicebridge stt listen-resumable large_file.wav --session-name "my-session"

# Real-time streaming
voicebridge stt realtime --chunk-duration 2.0 --output-format live

Session Management

# List all sessions
voicebridge stt sessions list

# Resume interrupted session
voicebridge stt sessions resume --session-name "my-session"

# Clean up old sessions
voicebridge stt sessions cleanup

# Delete specific session
voicebridge stt sessions delete session_id
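
Resumable sessions rest on a simple idea: persist progress after each unit of work, and skip finished units on the next run. The Python sketch below illustrates that idea only. The JSON session file layout, the `transcribe_resumable` name, and the `transcribe_chunk` callable are all hypothetical, and VoiceBridge's actual session storage may differ.

```python
import json
from pathlib import Path
from typing import Callable


def transcribe_resumable(
    chunks: list[str],
    session_file: Path,
    transcribe_chunk: Callable[[str], str],  # hypothetical: chunk id -> text
) -> list[str]:
    """Transcribe chunks in order, persisting progress after each one."""
    if session_file.exists():
        state = json.loads(session_file.read_text(encoding="utf-8"))
    else:
        state = {"results": []}
    # Chunks already recorded in the session file are skipped on resume.
    for chunk in chunks[len(state["results"]):]:
        state["results"].append(transcribe_chunk(chunk))
        session_file.write_text(json.dumps(state), encoding="utf-8")
    return state["results"]
```

If the process is interrupted partway through, calling the function again with the same session file continues from the first chunk that has no stored result.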

Advanced Features

# Add vocabulary words for better recognition
voicebridge stt vocabulary add "technical,terms,here" --type technical

# Export with confidence analysis
voicebridge stt export session session_id --format srt --confidence

# Set up webhooks for notifications
voicebridge stt webhook add https://api.example.com/notify

Text-to-Speech (TTS) Commands

Basic Generation

# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"

# Use specific voice and save to file
voicebridge tts generate "Hello world" --voice en-Alice_woman --output speech.wav

# Generate speech from a text file
voicebridge tts generate-file document.txt --output document.wav
voicebridge tts generate-file article.md --voice en-Patrick --streaming

# List available voices
voicebridge tts voices

Background Monitoring

# Monitor clipboard for text changes
voicebridge tts listen-clipboard --streaming

# Monitor text selections (use hotkey to trigger)
voicebridge tts listen-selection

# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard
voicebridge tts daemon status
voicebridge tts daemon stop

Configuration

# Show TTS settings
voicebridge tts config show

# Configure TTS settings
voicebridge tts config set --default-voice en-Alice_woman --cfg-scale 1.5

Audio Processing

# Get audio file information
voicebridge audio info audio.mp3

# List supported formats
voicebridge audio formats

# Split large audio file
voicebridge audio split recording.mp3 \
  --method duration \
  --chunk-duration 300

# Enhance audio quality
voicebridge audio preprocess input.wav output.wav \
  --noise-reduction 0.8 \
  --normalize \
  --trim-silence

# Test audio setup
voicebridge audio test
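
Splitting with `--method duration --chunk-duration 300` amounts to cutting the file at fixed time boundaries, with a shorter final chunk when the length is not an exact multiple. A minimal Python sketch of that boundary calculation follows. The function name is hypothetical, and the real command may handle edge cases (such as very short trailing chunks) differently.

```python
def split_by_duration(total_seconds: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering the file in fixed-length chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    boundaries = []
    start = 0.0
    while start < total_seconds:
        # The last chunk is clipped to the end of the file.
        end = min(start + chunk_seconds, total_seconds)
        boundaries.append((start, end))
        start = end
    return boundaries
```

For a 750-second recording and 300-second chunks, this yields two full chunks followed by one 150-second remainder.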

System & Performance

# Check GPU status and acceleration
voicebridge gpu status

# Benchmark GPU performance
voicebridge gpu benchmark --model base

# View STT performance statistics
voicebridge stt performance stats

# Manage active operations
voicebridge stt operations list
voicebridge stt operations cancel operation_id

API Server

# Start API server
voicebridge api start --host localhost --port 8000

# Check API status
voicebridge api status

# Get API information
voicebridge api info

# Stop API server
voicebridge api stop

📋 Complete Command Reference

VoiceBridge uses a hierarchical command structure with five main categories:

🎤 `stt` - Speech-to-Text Commands

stt listen              # Real-time transcription with hotkeys
stt interactive         # Interactive mode: press and hold 'r' to record
stt hotkey              # Global hotkey listener
stt transcribe          # Transcribe single audio file
stt batch-transcribe    # Batch process directory
stt listen-resumable    # Long file with resume capability
stt realtime            # Real-time streaming transcription

# Session Management
stt sessions list       # List all sessions
stt sessions resume     # Resume interrupted session
stt sessions cleanup    # Clean up old sessions
stt sessions delete     # Delete specific session

# Advanced Features
stt vocabulary add      # Add custom vocabulary
stt vocabulary remove   # Remove vocabulary
stt vocabulary list     # List vocabulary
stt vocabulary import   # Import from file
stt vocabulary export   # Export to file

stt export session      # Export session data
stt export formats      # List export formats

stt confidence analyze  # Analyze transcription confidence
stt confidence analyze-all # Analyze all sessions

stt postproc config     # Configure post-processing
stt postproc test       # Test post-processing

stt webhook add         # Add webhook notification
stt webhook remove      # Remove webhook
stt webhook list        # List webhooks
stt webhook test        # Test webhook

stt performance stats   # Performance statistics
stt operations list     # List active operations
stt operations cancel   # Cancel operation
stt operations status   # Check operation status

stt config show         # Show configuration
stt config set          # Set configuration

stt profile save        # Save configuration profile
stt profile load        # Load configuration profile
stt profile list        # List profiles
stt profile delete      # Delete profile

🗣️ `tts` - Text-to-Speech Commands

tts generate            # Generate speech from text
tts generate-file       # Generate speech from text file (txt, md, etc.)
tts listen-clipboard    # Monitor clipboard changes
tts listen-selection    # Monitor text selections with hotkey
tts voices              # List available voices

# Daemon Management
tts daemon start        # Start TTS daemon
tts daemon stop         # Stop TTS daemon
tts daemon status       # Check daemon status

# Configuration
tts config show         # Show TTS configuration
tts config set          # Configure TTS settings

🔊 `audio` - Audio Processing Commands

audio info              # Show audio file information
audio formats           # List supported formats
audio split             # Split audio file into chunks
audio preprocess        # Enhance audio quality
audio test              # Test audio setup

🖥️ `gpu` - GPU and System Commands

gpu status              # Show GPU status
gpu benchmark           # Benchmark GPU performance

🌐 `api` - API Server Management

api start               # Start API server
api stop                # Stop API server
api status              # Check API status
api info                # Show API information

🏗️ Architecture

VoiceBridge follows hexagonal architecture principles:

voicebridge/
├── domain/          # Core business logic and models
├── ports/           # Interfaces and abstractions
├── adapters/        # External integrations (Whisper, VibeVoice, etc.)
├── services/        # Application services and orchestration
├── cli/             # Command-line interface
└── tests/           # Comprehensive test suite

Key Components

Domain Layer: Core models, configurations, and business rules
Ports: Abstract interfaces for transcription, TTS, audio processing
Adapters: Concrete implementations for Whisper, VibeVoice, FFmpeg
Services: Orchestration, session management, performance monitoring
CLI: Typer-based command interface with sub-commands
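
To make the layering concrete, here is a small Python sketch of how a port, an adapter, and a service can relate in a hexagonal design. All class names are invented for this illustration and are not taken from the VoiceBridge codebase. The fake adapter stands in for a real engine such as Whisper.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class Transcript:
    """Domain model: the result of transcribing one audio file."""
    text: str
    language: str


class TranscriptionPort(Protocol):
    """Port: what the application needs from any speech-to-text engine."""

    def transcribe(self, audio: Path, language: str) -> Transcript: ...


class FakeTranscriptionAdapter:
    """Adapter: a stand-in engine, useful in tests instead of a real model."""

    def transcribe(self, audio: Path, language: str) -> Transcript:
        return Transcript(text=f"transcript of {audio.name}", language=language)


class TranscriptionService:
    """Service: orchestrates the use case and depends only on the port."""

    def __init__(self, engine: TranscriptionPort) -> None:
        self._engine = engine

    def transcribe_to_file(self, audio: Path, output: Path, language: str = "en") -> Transcript:
        transcript = self._engine.transcribe(audio, language)
        output.write_text(transcript.text, encoding="utf-8")
        return transcript
```

Because the service only knows the port, swapping the fake adapter for a Whisper-backed one requires no change to the service or to the CLI code that calls it.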

🤝 Contributing

We welcome contributions! Here's how to get started:

Development Workflow

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Install development dependencies: make install-dev
Make your changes following our coding standards
Test your changes: make test
Lint your code: make lint
Commit your changes: git commit -m 'Add amazing feature'
Push to your branch: git push origin feature/amazing-feature
Open a Pull Request

Coding Standards

Python 3.10+ with comprehensive type hints
uv for fast dependency management and virtual environments
Ruff for linting and formatting (replaces Black and isort)
Pytest for testing with >90% coverage target
Hexagonal architecture for new features and clean separation of concerns
Comprehensive documentation for public APIs and CLI commands
E2E testing for all major CLI workflows and functionality
Makefile for standardized development commands

Areas for Contribution

🎯 New audio formats and processing capabilities
🌍 Language support and localization
🔧 Performance optimizations and GPU utilization
📱 Platform integrations (mobile, web interfaces)
🧪 Test coverage and edge case handling
📚 Documentation and usage examples
🎨 Voice samples and TTS improvements

Reporting Issues

Please use our issue templates:

🐛 Bug Report: Describe the issue with reproduction steps
💡 Feature Request: Propose new functionality
📚 Documentation: Report unclear or missing docs
🏃 Performance: Report slow or resource-intensive operations

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

OpenAI Whisper - State-of-the-art speech recognition
VibeVoice - High-quality text-to-speech synthesis
FFmpeg - Comprehensive audio processing
Typer - Modern CLI framework
PyTorch - Machine learning infrastructure

PatrickKoss/VoiceBridge
