The ultimate bidirectional voice-text bridge. Seamlessly convert speech to text and text to speech with professional-grade accuracy, real-time processing, and hotkey-driven workflows.
VoiceBridge eliminates the friction between voice and text. Whether you're transcribing interviews, creating accessible content, building voice-driven workflows, or simply need hands-free text input, VoiceBridge provides a powerful, flexible CLI that adapts to your needs.
Built on OpenAI's Whisper for world-class speech recognition and VibeVoice for natural text-to-speech synthesis.
- Content Creators: Transcribe podcasts, interviews, and videos with timestamp precision
- Accessibility: Convert text to natural speech for screen readers and audio content
- Productivity: Voice-to-text note-taking with hotkey triggers during meetings
- Developers: Integrate speech processing into applications and workflows
- Researchers: Batch process audio data with confidence analysis and quality metrics
- Writers: Dictate drafts and have articles read back with custom voices
- Real-time transcription with hotkeys (F9 toggle/hold modes)
- Interactive mode with press-and-hold 'r' to record
- File processing (MP3, WAV, M4A, FLAC, OGG) with chunked processing
- Batch transcription of entire directories with parallel workers
- Resume capability for interrupted long transcriptions with session management
- Streaming transcription with real-time output and live updates
- GPU acceleration (CUDA/Metal) with automatic device detection
- Memory optimization with configurable limits and streaming
- Custom vocabulary management for domain-specific terms
- Export formats: JSON, SRT, VTT, plain text, CSV with timestamps and confidence
- Confidence analysis and quality assessment with detailed reporting
- Webhook integration for external notifications and automation
- Post-processing with spell check, grammar correction, and custom rules
- Profile management for different use cases and configurations
- Performance monitoring with comprehensive metrics and benchmarking
- High-quality voice synthesis with VibeVoice neural models
- Multiple input modes: clipboard monitoring, text selection, direct input
- Custom voice samples with automatic detection and voice cloning
- Streaming and non-streaming modes for real-time or complete generation
- Daemon mode for background processing and system integration
- Hotkey controls for hands-free operation (F12 generate, Ctrl+Alt+S stop)
- Voice management with sample validation and quality checks
- GPU acceleration for faster synthesis and model loading
- Configuration profiles for different voice settings and use cases
- Audio output options: play immediately, save to file, or both
- Audio enhancement: noise reduction, normalization, silence trimming, fade effects
- Audio splitting: by duration, silence detection, or file size with smart segmentation
- Confidence analysis and quality assessment with detailed statistics
- Session management with progress tracking, resume capability, and persistence
- Performance monitoring with GPU benchmarking, memory usage, and operation tracking
- Webhook integration for external notifications and workflow automation
- Profile management for different use cases and quick configuration switching
- Vocabulary management for improved recognition of technical terms and proper nouns
- Post-processing pipeline with spell check, grammar correction, and custom rules
- API server for integration with external applications and services
- Comprehensive testing with E2E test suites for all major functionality
VoiceBridge uses uv for fast dependency management. Install uv first if you don't have it:
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install with uv
uv pip install voicebridge

# Listen for speech and transcribe with hotkeys
voicebridge stt listen
# Transcribe an audio file
voicebridge stt transcribe audio.mp3 --output transcript.txt
# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"
# Start clipboard monitoring for TTS
voicebridge tts listen-clipboard

# Transcribe a podcast episode with timestamps
voicebridge stt transcribe podcast_episode.mp3 \
--format srt \
--output episode_subtitles.srt \
--language en
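The `srt` output follows the standard SubRip layout: a cue index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, then the text. As an illustration of that format (this is a standalone sketch, not VoiceBridge's internal exporter):

```python
# Convert a time in seconds to the SubRip (SRT) timestamp format HH:MM:SS,mmm.
def srt_timestamp(seconds: float) -> str:
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

# Render a single subtitle cue in SRT layout.
def srt_cue(index: int, start: float, end: float, text: str) -> str:
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```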
# Analyze transcription quality
voicebridge stt confidence analyze session_12345 --detailed

# Convert article to speech with custom voice
voicebridge tts generate \
--voice en-Alice_woman \
--output article_audio.wav \
"$(cat article.txt)"
# Batch convert multiple documents
voicebridge stt batch-transcribe articles/ \
--output-dir transcripts/ \
--workers 4

# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard
# Set up webhook notifications
voicebridge stt webhook add https://api.example.com/transcription-complete
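The webhook payload format isn't specified here, so for local testing you can point the webhook at a catch-all HTTP receiver and inspect whatever JSON arrives. A minimal standard-library sketch (the port and the idea of a JSON POST body are assumptions, not documented VoiceBridge behavior):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # collected payloads, for inspection

class CatchAllHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body and keep whatever JSON (or raw text) arrived.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            received.append(json.loads(body))
        except json.JSONDecodeError:
            received.append({"raw": body.decode("utf-8", "replace")})
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the console quiet
        pass

def receive_one(port: int = 8000) -> dict:
    """Block until a single webhook call arrives, then return its payload."""
    with HTTPServer(("127.0.0.1", port), CatchAllHandler) as server:
        server.handle_request()
    return received[-1]
```

Register `http://localhost:8000/` with `voicebridge stt webhook add` and call `receive_one()` to see exactly what your VoiceBridge version sends.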
# Real-time transcription with streaming
voicebridge stt realtime \
--chunk-duration 2.0 \
--output-format live

# Process interview recordings with resume capability
voicebridge stt listen-resumable interview.wav \
--session-name "interview-2024-01-15" \
--language en
# Export results in multiple formats
voicebridge stt export session session_12345 \
--format json \
--include-confidence \
--output transcript.json

- Python 3.10+
- uv (Python package manager)
- FFmpeg (for audio processing)
- CUDA (optional, for GPU acceleration)
# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and setup
git clone https://github.com/yourusername/voicebridge.git
cd voicebridge
# 3. Choose your setup:
make prepare # CPU version
make prepare-cuda # With CUDA support
make prepare-tray # With system tray support
# 4. Install system dependencies
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows (with Chocolatey):
choco install ffmpeg

VoiceBridge includes comprehensive text-to-speech capabilities powered by VibeVoice.
- Install VibeVoice dependencies (if using the local model):

      # Clone and install VibeVoice
      git clone https://github.com/WestZhang/VibeVoice.git
      cd VibeVoice
      pip install -e .

- Voice samples: included in the `voices/` directory:

      voices/
      ├── en-Alice_woman.wav
      ├── en-Carter_man.wav
      ├── en-Frank_man.wav
      ├── en-Maya_woman.wav
      ├── en-Patrick.wav
      └── ... (additional voices)
VoiceBridge works out of the box with sensible defaults. Configuration can be set via:
- Config file (`~/.config/voicebridge/config.json`):

      {
        "tts_enabled": true,
        "tts_config": {
          "model_path": "aoi-ot/VibeVoice-7B",
          "voice_samples_dir": "voices",
          "default_voice": "en-Alice_woman",
          "cfg_scale": 1.3,
          "inference_steps": 10,
          "tts_mode": "clipboard",
          "streaming_mode": "non_streaming",
          "output_mode": "play",
          "tts_toggle_key": "f11",
          "tts_generate_key": "f12",
          "tts_stop_key": "ctrl+alt+s",
          "sample_rate": 24000,
          "auto_play": true,
          "use_gpu": true,
          "max_text_length": 2000,
          "chunk_text_threshold": 500
        }
      }

- Command-line flags (override the config file):

      # Generate with custom settings
      voicebridge tts generate "Hello world" \
        --voice en-Patrick \
        --streaming \
        --output speech.wav \
        --cfg-scale 1.5 \
        --inference-steps 15
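The precedence described above is: built-in defaults, then the config file, then command-line flags. A sketch of that merge order (the real logic lives inside VoiceBridge; the default values here are illustrative):

```python
import json
from pathlib import Path

# Illustrative built-in fallbacks, not VoiceBridge's actual defaults.
DEFAULTS = {"cfg_scale": 1.3, "inference_steps": 10}

def effective_tts_config(cli_flags: dict, config_path: Path) -> dict:
    """Later sources win: defaults < config file < CLI flags."""
    file_cfg = {}
    if config_path.exists():
        file_cfg = json.loads(config_path.read_text()).get("tts_config", {})
    return {**DEFAULTS, **file_cfg, **cli_flags}
```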
- Format: WAV (recommended), MP3, FLAC
- Sample Rate: 24kHz (recommended), 16kHz-48kHz supported
- Channels: Mono (preferred)
- Duration: 3-10 seconds
- Quality: Clear, single speaker, minimal background noise
- Naming: `language-name_gender.wav` (e.g., `en-Alice_woman.wav`)
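A hypothetical helper (not part of VoiceBridge) that checks a filename against this convention; the gender suffix is treated as optional, since samples such as `en-Patrick.wav` omit it:

```python
import re

# language code, voice name, optional _man/_woman suffix, .wav extension
SAMPLE_NAME = re.compile(r"^[a-z]{2}-[A-Za-z]+(?:_(?:man|woman))?\.wav$")

def is_valid_sample_name(filename: str) -> bool:
    """Return True if the filename follows the voice-sample naming convention."""
    return SAMPLE_NAME.fullmatch(filename) is not None
```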
# Test TTS with default settings
voicebridge tts generate "Hello, this is VoiceBridge text-to-speech!"
# List available voices
voicebridge tts voices
# Show current TTS configuration
voicebridge tts config show

make help # Show all available commands
make lint # Run ruff linting and formatting
make test # Run all tests with coverage
make test-fast # Quick tests without coverage
make test-unit # Run only unit tests (exclude e2e)
make test-e2e # Run comprehensive end-to-end tests
make test-e2e-smoke # Run quick E2E smoke tests
make test-e2e-stt # Run STT E2E tests only
make test-e2e-tts # Run TTS E2E tests only
make test-e2e-audio # Run audio E2E tests only
make test-e2e-gpu # Run GPU E2E tests only
make test-e2e-api # Run API E2E tests only
make clean # Clean cache and temporary files

# Show current STT configuration
voicebridge stt config show
# Set STT configuration values
voicebridge stt config set use_gpu true
# Show TTS configuration
voicebridge tts config show
# Set up profiles for different use cases
voicebridge stt profile save research-setup
voicebridge stt profile load research-setup

# Listen with hotkeys (F9 to start/stop)
voicebridge stt listen
# Interactive mode (press 'r' to record)
voicebridge stt interactive
# Global hotkey listener with custom key
voicebridge stt hotkey --key f9 --mode toggle

# Transcribe single file
voicebridge stt transcribe audio.mp3 --output transcript.txt
# Batch process directory
voicebridge stt batch-transcribe /path/to/audio/ --workers 4
# Long file with resume capability
voicebridge stt listen-resumable large_file.wav --session-name "my-session"
# Real-time streaming
voicebridge stt realtime --chunk-duration 2.0 --output-format live

# List all sessions
voicebridge stt sessions list
# Resume interrupted session
voicebridge stt sessions resume --session-name "my-session"
# Clean up old sessions
voicebridge stt sessions cleanup
# Delete specific session
voicebridge stt sessions delete session_id

# Add vocabulary words for better recognition
voicebridge stt vocabulary add "technical,terms,here" --type technical
# Export with confidence analysis
voicebridge stt export session session_id --format srt --confidence
# Set up webhooks for notifications
voicebridge stt webhook add https://api.example.com/notify

# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"
# Use specific voice and save to file
voicebridge tts generate "Hello world" --voice en-Alice_woman --output speech.wav
# Generate speech from a text file
voicebridge tts generate-file document.txt --output document.wav
voicebridge tts generate-file article.md --voice en-Patrick --streaming
# List available voices
voicebridge tts voices

# Monitor clipboard for text changes
voicebridge tts listen-clipboard --streaming
# Monitor text selections (use hotkey to trigger)
voicebridge tts listen-selection
# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard
voicebridge tts daemon status
voicebridge tts daemon stop

# Show TTS settings
voicebridge tts config show
# Configure TTS settings
voicebridge tts config set --default-voice en-Alice_woman --cfg-scale 1.5

# Get audio file information
voicebridge audio info audio.mp3
# List supported formats
voicebridge audio formats
# Split large audio file
voicebridge audio split recording.mp3 \
--method duration \
--chunk-duration 300
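With `--method duration`, splitting reduces to simple arithmetic over the timeline. An illustrative computation of the chunk boundaries (in seconds) that a 300-second chunk size produces; this is a sketch, not VoiceBridge's actual splitter:

```python
def chunk_spans(total_seconds: float, chunk_duration: float) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering the file; the last chunk may be shorter."""
    spans = []
    start = 0.0
    while start < total_seconds:
        spans.append((start, min(start + chunk_duration, total_seconds)))
        start += chunk_duration
    return spans
```

For a 650-second recording with `--chunk-duration 300`, this yields chunks covering 0-300 s, 300-600 s, and a final 600-650 s remainder.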
# Enhance audio quality
voicebridge audio preprocess input.wav output.wav \
--noise-reduction 0.8 \
--normalize \
--trim-silence
# Test audio setup
voicebridge audio test

# Check GPU status and acceleration
voicebridge gpu status
# Benchmark GPU performance
voicebridge gpu benchmark --model base
# View STT performance statistics
voicebridge stt performance stats
# Manage active operations
voicebridge stt operations list
voicebridge stt operations cancel operation_id

# Start API server
voicebridge api start --host localhost --port 8000
# Check API status
voicebridge api status
# Get API information
voicebridge api info
# Stop API server
voicebridge api stop

VoiceBridge uses a hierarchical command structure with five main categories: `stt`, `tts`, `audio`, `gpu`, and `api`.
stt listen # Real-time transcription with hotkeys
stt interactive # Press-and-hold 'r' to record mode
stt hotkey # Global hotkey listener
stt transcribe # Transcribe single audio file
stt batch-transcribe # Batch process directory
stt listen-resumable # Long file with resume capability
stt realtime # Real-time streaming transcription
# Session Management
stt sessions list # List all sessions
stt sessions resume # Resume interrupted session
stt sessions cleanup # Clean up old sessions
stt sessions delete # Delete specific session
# Advanced Features
stt vocabulary add # Add custom vocabulary
stt vocabulary remove # Remove vocabulary
stt vocabulary list # List vocabulary
stt vocabulary import # Import from file
stt vocabulary export # Export to file
stt export session # Export session data
stt export formats # List export formats
stt confidence analyze # Analyze transcription confidence
stt confidence analyze-all # Analyze all sessions
stt postproc config # Configure post-processing
stt postproc test # Test post-processing
stt webhook add # Add webhook notification
stt webhook remove # Remove webhook
stt webhook list # List webhooks
stt webhook test # Test webhook
stt performance stats # Performance statistics
stt operations list # List active operations
stt operations cancel # Cancel operation
stt operations status # Check operation status
stt config show # Show configuration
stt config set # Set configuration
stt profile save # Save configuration profile
stt profile load # Load configuration profile
stt profile list # List profiles
stt profile delete # Delete profile
tts generate # Generate speech from text
tts generate-file # Generate speech from text file (txt, md, etc.)
tts listen-clipboard # Monitor clipboard changes
tts listen-selection # Monitor text selections with hotkey
tts voices # List available voices
# Daemon Management
tts daemon start # Start TTS daemon
tts daemon stop # Stop TTS daemon
tts daemon status # Check daemon status
# Configuration
tts config show # Show TTS configuration
tts config set # Configure TTS settings
audio info # Show audio file information
audio formats # List supported formats
audio split # Split audio file into chunks
audio preprocess # Enhance audio quality
audio test # Test audio setup
gpu status # Show GPU status
gpu benchmark # Benchmark GPU performance
api start # Start API server
api stop # Stop API server
api status # Check API status
api info # Show API information
VoiceBridge follows hexagonal architecture principles:
voicebridge/
├── domain/ # Core business logic and models
├── ports/ # Interfaces and abstractions
├── adapters/ # External integrations (Whisper, VibeVoice, etc.)
├── services/ # Application services and orchestration
├── cli/ # Command-line interface
└── tests/ # Comprehensive test suite
- Domain Layer: Core models, configurations, and business rules
- Ports: Abstract interfaces for transcription, TTS, audio processing
- Adapters: Concrete implementations for Whisper, VibeVoice, FFmpeg
- Services: Orchestration, session management, performance monitoring
- CLI: Typer-based command interface with sub-commands
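The separation between these layers can be sketched as below. Class and method names here are illustrative, not VoiceBridge's actual interfaces: services depend only on an abstract port, and concrete adapters (wrapping Whisper, VibeVoice, FFmpeg, ...) plug in behind it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Domain: a plain model with no external dependencies.
@dataclass
class Transcript:
    text: str
    confidence: float

# Port: the abstract interface the application layer depends on.
class TranscriptionPort(ABC):
    @abstractmethod
    def transcribe(self, audio_path: str) -> Transcript: ...

# Adapter: one concrete implementation (a real one would wrap Whisper).
class FakeTranscriptionAdapter(TranscriptionPort):
    def transcribe(self, audio_path: str) -> Transcript:
        return Transcript(text=f"transcript of {audio_path}", confidence=1.0)

# Service: orchestrates via the port, never via a concrete adapter.
class TranscriptionService:
    def __init__(self, engine: TranscriptionPort):
        self.engine = engine

    def run(self, audio_path: str) -> str:
        return self.engine.transcribe(audio_path).text
```

Because the service only sees the port, tests can inject a fake adapter and the CLI can inject the real one without changing the service.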
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Install development dependencies: `make install-dev`
- Make your changes following our coding standards
- Test your changes: `make test`
- Lint your code: `make lint`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to your branch: `git push origin feature/amazing-feature`
- Open a Pull Request
- Python 3.10+ with comprehensive type hints
- uv for fast dependency management and virtual environments
- Ruff for linting and formatting (replaces Black and isort)
- Pytest for testing with >90% coverage target
- Hexagonal architecture for new features and clean separation of concerns
- Comprehensive documentation for public APIs and CLI commands
- E2E testing for all major CLI workflows and functionality
- Makefile for standardized development commands
- 🎯 New audio formats and processing capabilities
- 🌍 Language support and localization
- 🔧 Performance optimizations and GPU utilization
- 📱 Platform integrations (mobile, web interfaces)
- 🧪 Test coverage and edge case handling
- 📚 Documentation and usage examples
- 🎨 Voice samples and TTS improvements
Please use our issue templates:
- 🐛 Bug Report: Describe the issue with reproduction steps
- 💡 Feature Request: Propose new functionality
- 📚 Documentation: Report unclear or missing docs
- 🏃 Performance: Report slow or resource-intensive operations
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper - State-of-the-art speech recognition
- VibeVoice - High-quality text-to-speech synthesis
- FFmpeg - Comprehensive audio processing
- Typer - Modern CLI framework
- PyTorch - Machine learning infrastructure