Real-Time Transcription with Speaker Identification and Streaming Confidence Scores
This repository is a significantly enhanced fork of the original whisper_streaming project, featuring:
- Real-time speaker diarization using Resemblyzer embeddings and Silero VAD
- Streaming word-level confidence probabilities for quality assessment
- Modular architecture with pluggable ASR backends
- Development container environment with GPU support and VS Code integration
- Optimized performance for real-time multi-speaker scenarios
Speaker diarization:
- Real-time speaker identification during streaming transcription
- Automatic speaker detection with similarity-based clustering
- Voice activity detection using Silero VAD models
- Speaker persistence across audio segments

Confidence scoring:
- Word-level confidence scores for each transcribed word
- Real-time quality assessment for streaming applications
- Probability aggregation at sentence and utterance levels
- Quality-based filtering for improved accuracy

Architecture and tooling:
- Modular ASR backends: faster-whisper, whisper-timestamped, OpenAI API, MLX-whisper
- Containerized development with full GPU support
- Flexible audio processing with VAD and voice activity control
- Streaming-optimized buffering with local agreement policies
This enhanced streaming system is built with a modular architecture centered around these core components:
- online_processor.py - Main streaming processor with probability tracking
- streaming_diarizer.py - Real-time speaker diarization engine
- hypothesis_buffer.py - Advanced buffering with probability aggregation
- impl/ - Pluggable ASR backends (faster-whisper, OpenAI API, MLX, etc.)
- server.py - TCP server for real-time streaming
- vac_processor.py - Voice Activity Controller integration
The system now provides real-time confidence assessment:
- Word-level probabilities from Whisper model outputs
- Sentence-level aggregation using probability averaging
- Quality filtering based on confidence thresholds
- Real-time feedback for streaming applications
Advanced speaker identification using modern ML models:
- Resemblyzer embeddings for speaker characterization
- Silero VAD for voice activity detection
- Similarity clustering for automatic speaker identification
- Streaming-optimized processing with minimal latency
The diarizer automatically:
- Detects voice activity in real-time
- Extracts speaker embeddings from speech segments
- Clusters similar voices into speaker profiles
- Assigns speaker labels to transcribed text
# Example output structure:
(start_time, end_time, text, avg_probability, [SPEAKER_<id>] [(word, prob), ...])
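The similarity-based clustering idea can be sketched roughly as follows. This is not the implementation inside streaming_diarizer.py; the direct use of resemblyzer's VoiceEncoder, the 0.75 similarity threshold, and the running-average centroid update are illustrative assumptions:

import numpy as np
from resemblyzer import VoiceEncoder  # pretrained speaker-embedding model

encoder = VoiceEncoder()
centroids = []            # one running centroid per known speaker
SIM_THRESHOLD = 0.75      # assumed cutoff for "same speaker"

def assign_speaker(speech_segment_16k: np.ndarray) -> str:
    """Embed a voiced 16 kHz segment and attach it to the most similar known
    speaker, or open a new speaker profile if nothing is similar enough."""
    emb = encoder.embed_utterance(speech_segment_16k)  # L2-normalized vector
    if centroids:
        sims = [float(np.dot(emb, c) / (np.linalg.norm(c) + 1e-8)) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            # Nudge the centroid toward the new embedding and reuse the label.
            centroids[best] = 0.9 * centroids[best] + 0.1 * emb
            return f"SPEAKER_{best:02d}"
    centroids.append(emb)
    return f"SPEAKER_{len(centroids) - 1:02d}"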
Multiple ASR backends supported with consistent interfaces:
- faster-whisper (recommended) - GPU-optimized, up to 4x faster than the reference Whisper implementation
- whisper-timestamped - Enhanced timestamp accuracy
- OpenAI API - Cloud-based processing
- MLX-whisper - Apple Silicon optimization
from asr.utils import create_asr_engine, add_shared_args
from asr.online_processor import OnlineASRProcessor

# Initialize ASR with probability tracking
asr = create_asr_engine("faster-whisper", language="en", model="large-v3")
processor = OnlineASRProcessor(asr)

# Process audio chunks
while audio_available:
    audio_chunk = get_audio_chunk()
    processor.insert_audio_chunk(audio_chunk)

    # Get transcription with probabilities
    result = processor.process_iter()
    if result[0] is not None:  # New confirmed text
        start, end, text, avg_prob, word_probs = result
        print(f"[{start:.1f}-{end:.1f}] {text} (confidence: {avg_prob:.2f})")
from asr.streaming_diarizer import StreamingDiarizer

# Initialize diarization (requires resemblyzer and silero-vad)
diarizer = StreamingDiarizer()

# Process with speaker identification
while audio_available:
    audio_chunk = get_audio_chunk()

    # Add audio to both processor and diarizer
    processor.insert_audio_chunk(audio_chunk)
    diarizer.add_audio_chunk(audio_chunk)

    # Get transcription
    result = processor.process_iter()

    # Get speaker assignments
    speaker_assignments = diarizer.process_chunk()

    if result[0] is not None:
        start, end, text, avg_prob, word_probs = result

        # Find speaker for this timestamp
        speaker = diarizer.get_speaker_for_timestamp((start + end) / 2)
        print(f"[{speaker or 'UNKNOWN'}] {text} (confidence: {avg_prob:.2f})")
This project provides a complete containerized development environment with GPU acceleration support for efficient development and testing.
- Docker with NVIDIA container runtime (for GPU support)
- VS Code with Dev Containers extension
- NVIDIA GPU (recommended) with drivers ≥470.57.02
- Git with LFS support
- Clone the repository:
  git clone https://github.com/your-username/whisper_streaming.git
  cd whisper_streaming
- Open in VS Code:
  code .
- Reopen in Container:
  - Press Ctrl+Shift+P (or Cmd+Shift+P on Mac)
  - Select "Dev Containers: Reopen in Container"
  - VS Code will build and start the development container
The development container includes:
- Base: Ubuntu 22.04 with CUDA 12.3.2 and cuDNN 9
- Python: Latest Python 3 with optimized package installations
- GPU Support: Full NVIDIA GPU passthrough for acceleration
- Extensions: Pre-configured Python development tools
- Port Forwarding: Automatic forwarding for streaming server (port 43007)
GPU Access: The container runs with --gpus all to access all available GPUs:
{
  "runArgs": ["--privileged", "--gpus", "all"]
}
Python Environment: All dependencies are pre-installed via requirements.txt:
- Core ML libraries: torch, torchaudio, faster-whisper
- Audio processing: librosa, soundfile
- Diarization models: resemblyzer, silero-vad
- Text processing: opus-fast-mosestokenizer, wtpsplit
VS Code Integration: Pre-configured extensions for Python development:
- Python language support with IntelliSense
- Task runner for custom commands
- Markdown editing capabilities
Start the streaming server in the container:
# Basic server with default settings
python asr/server.py --port 43007 --model large-v3
# With diarization enabled
python asr/server.py --port 43007 --model large-v3 --diarization
# With custom settings
python asr/server.py \
--port 43007 \
--model large-v3-turbo \
--language en \
--vad \
--buffer_trimming_sec 2
From the host machine or another container:
# Stream audio file to server
python client/client.py --server localhost:43007 --file assets/jfk.flac
# Real-time microphone streaming (requires audio setup)
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43007
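For custom integrations, a bare-bones Python client could look like the sketch below. It assumes the server accepts raw 16 kHz mono 16-bit PCM on the socket (as in the arecord example above) and replies with plain text lines; client/client.py remains the reference for the actual protocol:

import socket
import librosa
import numpy as np

HOST, PORT = "localhost", 43007

# Decode and resample any audio file to 16 kHz mono float32.
audio, _ = librosa.load("assets/jfk.flac", sr=16000, mono=True)
pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

with socket.create_connection((HOST, PORT)) as sock:
    # Send ~1-second chunks (16000 samples * 2 bytes at 16 kHz mono).
    for i in range(0, len(pcm16), 32000):
        sock.sendall(pcm16[i:i + 32000])
    sock.shutdown(socket.SHUT_WR)  # signal end of audio

    # Print transcription lines as the server emits them.
    buffer = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        buffer += data
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            print(line.decode("utf-8", errors="replace"))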
asr/ # Core ASR processing modules
├── base.py # ASR backend interface
├── online_processor.py # Main streaming processor
├── streaming_diarizer.py # Speaker diarization
├── hypothesis_buffer.py # Streaming buffer management
├── impl/ # ASR backend implementations
│ ├── faster_whisper.py
│ ├── openai_api.py
│ └── mlx_whisper.py
└── server.py # TCP streaming server
client/ # Client implementations
└── client.py # Example streaming client
common/ # Shared utilities
└── line_packet.py # Network packet handling
- Create new implementation in asr/impl/your_backend.py
- Inherit from ASRBase in asr/base.py
- Implement required methods: load_model(), transcribe(), ts_words()
- Register backend in asr/utils.py
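A hypothetical skeleton for such a backend is shown below. The constructor handling and exact method signatures are assumptions; asr/base.py and the existing modules under asr/impl/ are the authoritative reference:

# asr/impl/your_backend.py (hypothetical)
from asr.base import ASRBase

class YourBackendASR(ASRBase):
    sep = " "  # separator used when joining confirmed words (assumption)

    def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
        # Load and return the underlying model object for your engine.
        raise NotImplementedError

    def transcribe(self, audio, init_prompt=""):
        # Run inference on a 16 kHz float32 numpy array, passing init_prompt
        # as context, and return the engine's native result object.
        raise NotImplementedError

    def ts_words(self, result):
        # Convert the native result into word tuples with timestamps
        # (and per-word probabilities in this fork) for the hypothesis buffer.
        raise NotImplementedError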
# Quick setup check (no audio file required)
python -c "
from asr.utils import create_asr_engine
from asr.online_processor import OnlineASRProcessor
import numpy as np
# Test basic functionality
asr = create_asr_engine('faster-whisper', language='en', model='base')
processor = OnlineASRProcessor(asr)
print('Setup successful!')
"
# Test with sample audio
python client/client.py --file assets/jfk.flac --simulate-realtime
For development with limited GPU memory:
# Use smaller models
export WHISPER_MODEL=base # Instead of large-v3
# Monitor GPU usage
nvidia-smi -l 1
# Or use CPU-only mode for debugging
export CUDA_VISIBLE_DEVICES=""
Enable detailed logging:
# Set log level for debugging
export LOG_LEVEL=DEBUG
# Or configure in code
import logging
logging.getLogger("whisper_streaming").setLevel(logging.DEBUG)
Common debug scenarios:
- Model loading issues: Check CUDA version compatibility
- Audio format problems: Ensure 16kHz mono input
- Diarization failures: Verify resemblyzer/silero-vad installation
- Network issues: Check port forwarding and firewall settings
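For the audio-format case, converting the input to 16 kHz mono up front usually resolves the problem. A small helper using librosa and soundfile (both already in the dependency list); the file names are placeholders:

import librosa
import soundfile as sf

def to_16k_mono(in_path: str, out_path: str) -> None:
    """Resample an arbitrary audio file to 16 kHz mono PCM, the format the
    streaming pipeline expects."""
    audio, _ = librosa.load(in_path, sr=16000, mono=True)  # float32 in [-1, 1]
    sf.write(out_path, audio, 16000, subtype="PCM_16")

to_16k_mono("input.wav", "input_16k_mono.wav")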
# Optimize for RTX series GPUs
export WHISPER_COMPUTE_TYPE=float16
export WHISPER_DEVICE=cuda
# For older GPUs or stability issues
export WHISPER_COMPUTE_TYPE=int8_float16
# Reduce buffer sizes for lower latency
export BUFFER_TRIMMING_SEC=1
# Increase for better accuracy
export BUFFER_TRIMMING_SEC=5
This enhanced system provides rich output with probabilities and speaker information:
Each line contains:
<emission_time> <start_ms> <end_ms> [<speaker>] <text> (confidence: <probability>)
Example Output:
2691.44 300 1380 [SPEAKER_00] Chairman, thank you. (confidence: 0.94)
6914.55 1940 4940 [SPEAKER_01] If the debate today had a (confidence: 0.87)
9019.03 5160 7160 [SPEAKER_01] the subject the situation in (confidence: 0.91)
10065.13 7180 7480 [SPEAKER_01] Gaza (confidence: 0.95)
11058.36 7480 9460 [SPEAKER_02] Strip, I might (confidence: 0.89)
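A rough sketch of parsing these lines on the client side (the regular expression mirrors the format shown above and is an assumption to adapt if the output layout differs):

import re

LINE_RE = re.compile(
    r"^(?P<emit>\d+\.\d+)\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?:\[(?P<speaker>[^\]]+)\]\s+)?(?P<text>.*?)\s+"
    r"\(confidence:\s*(?P<conf>[\d.]+)\)$"
)

def parse_line(line: str):
    """Split one output line into its fields; returns None if it doesn't match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    return {
        "emission_time": float(m["emit"]),
        "start_ms": int(m["start"]),
        "end_ms": int(m["end"]),
        "speaker": m["speaker"],   # None when diarization is disabled
        "text": m["text"],
        "confidence": float(m["conf"]),
    }

print(parse_line("2691.44 300 1380 [SPEAKER_00] Chairman, thank you. (confidence: 0.94)"))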
When using the Python API, responses include detailed probability information:
# process_iter() returns:
(start_time, end_time, text, avg_probability, word_probabilities)
# Example:
(0.3, 1.38, "Chairman, thank you.", 0.94,
[("Chairman,", 0.96), ("thank", 0.93), ("you.", 0.93)])
When speaker diarization is enabled:
from asr.streaming_diarizer import StreamingDiarizer
diarizer = StreamingDiarizer()
# ... process audio ...
# Get speaker assignments
speaker_assignments = diarizer.process_chunk()
# Returns: {(start_time, end_time): "SPEAKER_ID", ...}
# Get speaker for specific timestamp
speaker = diarizer.get_speaker_for_timestamp(timestamp)
# Returns: "SPEAKER_00", "SPEAKER_01", etc., or None
# Get overall statistics
stats = diarizer.get_speaker_stats()
# Returns: {
# 'total_speakers': 3,
# 'speaker_names': ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02'],
# 'total_processed_time': 125.3,
# 'models_loaded': True
# }
- Word-level probabilities (0.0-1.0): Individual word confidence from Whisper
- Average probabilities (0.0-1.0): Sentence/utterance average for quality assessment
- Thresholds for filtering:
  - > 0.9: High confidence, likely accurate
  - 0.7-0.9: Good confidence, generally reliable
  - 0.5-0.7: Moderate confidence, may need review
  - < 0.5: Low confidence, likely errors
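As a minimal illustration of applying these thresholds downstream (the helper is not part of the library; the bucket boundaries simply mirror the list above):

def average_probability(word_probs) -> float:
    """Aggregate word-level (word, prob) pairs into one utterance-level score."""
    probs = [p for _, p in word_probs]
    return sum(probs) / len(probs) if probs else 0.0

def quality_label(avg_probability: float) -> str:
    """Map an average probability to the rough quality buckets listed above."""
    if avg_probability > 0.9:
        return "high"
    if avg_probability >= 0.7:
        return "good"
    if avg_probability >= 0.5:
        return "moderate"
    return "low"

words = [("Chairman,", 0.96), ("thank", 0.93), ("you.", 0.93)]
print(quality_label(average_probability(words)))  # -> "high"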
For structured data applications:
# Enable JSON output in server mode
python asr/server.py --output-format json

# Client receives:
{
  "timestamp": 2691.44,
  "start_ms": 300,
  "end_ms": 1380,
  "text": "Chairman, thank you.",
  "speaker": "SPEAKER_00",
  "confidence": 0.94,
  "word_probabilities": [
    {"word": "Chairman,", "probability": 0.96},
    {"word": "thank", "probability": 0.93},
    {"word": "you.", "probability": 0.93}
  ]
}
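A small sketch of consuming this output on the client side, assuming the server emits one JSON object per line (verify against asr/server.py):

import json

def handle_json_line(line: str) -> None:
    """Process one JSON event line received from the streaming server."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return  # ignore partial or non-JSON lines
    speaker = event.get("speaker") or "UNKNOWN"
    flag = " (low confidence)" if event["confidence"] < 0.5 else ""
    print(f"[{speaker}] {event['text']}{flag}")

handle_json_line(
    '{"timestamp": 2691.44, "start_ms": 300, "end_ms": 1380, '
    '"text": "Chairman, thank you.", "speaker": "SPEAKER_00", '
    '"confidence": 0.94, "word_probabilities": []}'
)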
This enhanced system builds upon the foundational streaming Whisper architecture while adding significant new capabilities for real-world applications.
The base streaming approach addresses Whisper's original 30-second chunk limitation through:
- Local Agreement Policy: Consecutive updates must agree on transcript prefixes before confirmation
- Dynamic Buffer Management: Smart audio buffer trimming based on sentence/segment boundaries
- Init Prompt Handling: Proper context management for continuous processing
- Timestamp Synchronization: Accurate alignment between audio and text outputs
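The Local Agreement Policy can be summarized with a small conceptual sketch; the real logic in hypothesis_buffer.py additionally tracks timestamps and word probabilities:

def locally_agreed_prefix(prev_hypothesis, new_hypothesis):
    """Return the longest common word prefix of two consecutive hypotheses.
    Only this prefix is emitted as confirmed text; the rest stays tentative."""
    confirmed = []
    for prev_word, new_word in zip(prev_hypothesis, new_hypothesis):
        if prev_word != new_word:
            break
        confirmed.append(prev_word)
    return confirmed

prev = ["Chairman,", "thank", "you", "for"]
new = ["Chairman,", "thank", "you.", "If", "the"]
print(locally_agreed_prefix(prev, new))  # -> ['Chairman,', 'thank']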
On top of this base, the fork adds:

Probability tracking:
- Word-level confidence tracking from Whisper model outputs
- Real-time quality assessment for streaming applications
- Probability aggregation for sentence and utterance-level confidence
- Quality-based filtering to improve transcription reliability

Speaker diarization:
- Resemblyzer embeddings for robust speaker characterization
- Silero VAD integration for precise voice activity detection
- Streaming-optimized clustering with minimal latency impact
- Speaker persistence across audio segments and sessions

Architecture improvements:
- Pluggable ASR backends supporting multiple Whisper implementations
- Containerized development with full GPU acceleration
- Enhanced buffering with probability-aware processing
- Network-optimized streaming for real-time applications
The system processes audio through these key stages:
- Audio Buffering: Accumulate chunks with Voice Activity Detection
- ASR Processing: Extract transcriptions with word-level probabilities
- Speaker Analysis: Generate embeddings and assign speaker identities
- Hypothesis Management: Buffer and confirm transcripts using local agreement
- Output Generation: Combine text, timing, speakers, and confidence scores
This approach achieves low-latency streaming while maintaining high accuracy and providing rich metadata for downstream applications.
Based on the original research and enhancements:
- Latency: ~3.3 seconds for high-quality transcription
- Accuracy: Maintains Whisper model quality with streaming optimizations
- Speaker Identification: Real-time diarization with minimal overhead
- Scalability: Efficient GPU utilization for multiple concurrent streams
This work builds upon significant contributions from the research and open-source communities:
- Dominik Macháček, Raj Dabre, Ondřej Bojar for the foundational streaming Whisper research
- Paper: "Turning Whisper into Real-Time Transcription System" (IJCNLP-AACL 2023)
- Peter Polák for the original streaming demo concept
- UEDIN team of the ELITR project for the original line_packet.py
- Silero Team for their VAD model and VADIterator implementation
- Original whisper_streaming contributors for the foundational codebase
- Resemblyzer team for speaker embedding technology
- faster-whisper developers for optimized inference engines
- OpenAI for the Whisper model family
This work builds upon the foundational research by Macháček et al. (2023) on real-time Whisper streaming:
Paper PDF | Demo video | Slides
For the original research, please cite:
@inproceedings{machacek-etal-2023-turning,
title = "Turning Whisper into Real-Time Transcription System",
author = "Mach{\'a}{\v{c}}ek, Dominik and Dabre, Raj and Bojar, Ond{\v{r}}ej",
booktitle = "Proceedings of IJCNLP-AACL 2023: System Demonstrations",
year = "2023",
url = "https://aclanthology.org/2023.ijcnlp-demo.3",
pages = "17--24"
}
For questions about this enhanced implementation:
- Issues: Please use GitHub Issues for bug reports and feature requests
- Discussions: Use GitHub Discussions for general questions and usage help
- Original Research: Contact Dominik Macháček ([email protected]) for research-related questions