A collection of realtime AI resources focused on audio and video, including APIs, clients, projects, models, and more.
Provider | API Reference | Modality | Main Feature |
---|---|---|---|
openai | Realtime API | text/audio/video | GPT-4o (see the WebSocket sketch after this table). |
minimax | Realtime API | text/audio | Compatible with the OpenAI API. |
Vapi | API Reference | text/audio | Vapi lets developers build, test, and deploy voice AI agents in minutes rather than months, solving the foundational challenges voice AI applications face: 1. Simulating the flow of natural human conversation. 2. Realtime/low-latency demands. 3. Taking actions (function calling). 4. Extracting conversation data (reviewing conversation audio, transcripts, and metadata). SDKs: https://docs.vapi.ai/introduction#explore-our-sdks |
hume.ai | API Reference | text/audio | Real-time, customizable voice intelligence powered by empathic AI. |
google | API Reference | text/audio/video | Gemini Multimodal Live API. |
retell.ai | API Reference | text/audio | Retell is a comprehensive platform for building, testing, deploying, and monitoring reliable AI phone agents. |
doubao | volcengine | text/audio | Cascaded ASR + LLM + TTS; WebRTC transport. |
elevenlabs | Conversational AI | text/audio | Cascaded ASR + LLM + TTS; WebSocket transport. |
aliyun | Realtime AI | text/audio | Alibaba Cloud's realtime AI service. |
zhipuAI | GLM-Realtime | text/audio/video | ZhipuAI's end-to-end realtime model. |
SenseTime | SenseNova-5o | text/audio/video | - |
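
Several of the hosted offerings above expose a WebSocket event protocol. Below is a minimal sketch of opening an OpenAI Realtime API session with the `websockets` package; the model name, beta header, and event shapes follow OpenAI's published docs at the time of writing and may change.

```python
# Minimal sketch of an OpenAI Realtime API session over WebSocket.
# Assumes: `pip install websockets` and OPENAI_API_KEY in the environment.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header required at the time of writing
    }
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model for a spoken + written greeting.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text", "audio"],
                "instructions": "Say hello in one short sentence.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. response.audio.delta, response.done
            if event["type"] == "response.done":
                break

asyncio.run(main())
```

Since MiniMax advertises OpenAI API compatibility, the same event flow should apply there with a different endpoint and key.
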
Maintainer | Link | Main Feature |
---|---|---|
@jhakulin | realtime-ai | Experimental, well-structured Python SDK for OpenAI's Realtime API. |
@Pipecat | Pipecat | Pipecat is an open source Python framework for building voice and multimodal conversational agents. It handles the complex orchestration of AI services, network transport, audio processing, and multimodal interactions, letting you focus on creating engaging experiences. |
@TEN-framework | TEN Framework | TEN (Transformative Extensions Network) is a voice agent framework for creating conversational AI. It offers the following advantages: 1. Native support for high-performance, real-time multimodal interactions. 2. Support for multiple languages and platforms. 3. Edge-cloud integration. 4. Flexibility beyond model limitations. 5. Real-time agent state management. 6. And more. (A simplified sketch of the loop these frameworks orchestrate follows this table.) |
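
Frameworks like Pipecat and TEN exist largely to orchestrate the cascaded speech loop (ASR → LLM → TTS) with streaming transports and interruption handling. The sketch below is a deliberately simplified, turn-based version of that loop; all three stage functions are hypothetical stubs, not APIs from either framework.

```python
# Hypothetical sketch of the cascaded ASR -> LLM -> TTS turn loop that
# frameworks like Pipecat and TEN orchestrate; all three stages are stubs.
import asyncio

async def asr(audio_chunk: bytes) -> str:
    """Stand-in for a streaming speech-to-text service."""
    return "hello agent"

async def llm(history: list[dict], user_text: str) -> str:
    """Stand-in for a chat-completion call; history carries multi-turn context."""
    history.append({"role": "user", "content": user_text})
    reply = f"echo: {user_text}"
    history.append({"role": "assistant", "content": reply})
    return reply

async def tts(text: str) -> bytes:
    """Stand-in for a text-to-speech service returning raw audio."""
    return text.encode()

async def handle_turn(history: list[dict], audio_chunk: bytes) -> bytes:
    # Real frameworks run these stages concurrently with token/audio streaming
    # and barge-in handling; this sketch is strictly turn-based for clarity.
    text = await asr(audio_chunk)
    reply = await llm(history, text)
    return await tts(reply)

if __name__ == "__main__":
    print(asyncio.run(handle_turn([], b"\x00\x01")))
```

Production frameworks replace each stub with a streaming service and add interruption (barge-in) handling, which is where most of the orchestration effort goes.
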
Contributor | Model | Main Feature |
---|---|---|
FDU | SpeechGPT2 | SpeechGPT2 is an end-to-end speech dialogue language model, similar to GPT-4o. It can perceive and express emotions, and provide appropriate voice responses in various styles such as rap, drama, robot, funny, and whisper, based on context and human instructions. |
ZhipuAI | GLM-4-Voice | GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions. |
kyutai-labs | Moshi | Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. |
tencent | Freeze-Omni | A smart, low-latency speech-to-speech dialogue model with a frozen LLM. |
tencent | VITA | Open-source interactive omni-multimodal LLM. |
InternLM | Intern-OmniLive | A comprehensive multimodal system for long-term streaming video and audio interactions. |
THU | mini-omni / mini-omni2 | English only; base language model: Qwen2-0.5B. |
ICT/CAS | LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level. |
OpenBMB | MiniCPM-o | MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone |
SJTU | SLAM-Omni | Qwen2-0.5B base, Chinese/English. SLAM-Omni outperforms prior models of similar scale while requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. |
Fixie | Ultravox | Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox can extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. Versions have been trained on Llama 3, Mistral, and Gemma. |
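
As a concrete example of running one of these open models, Ultravox checkpoints are published on Hugging Face with custom pipeline code. The sketch below loosely follows the model card; the checkpoint id, input keys (`audio`, `turns`, `sampling_rate`), and remote-code pipeline interface are assumptions to verify against the current release.

```python
# Hedged sketch of running Ultravox via Hugging Face transformers.
# Assumes: `pip install transformers numpy` plus the model's own dependencies.
import numpy as np
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4",  # assumed checkpoint name
    trust_remote_code=True,          # the audio projector ships as custom code
)

sample_rate = 16_000
audio = np.zeros(sample_rate, dtype=np.float32)  # 1 s of silence as a stand-in

turns = [{"role": "system", "content": "You are a helpful voice assistant."}]
result = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sample_rate},
    max_new_tokens=64,
)
print(result)
```

Most other models in this table ship their own inference stacks; check each repo for the supported entry point.
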