# AwesomeRealtimeAI

A collection of realtime AI resources focused on audio and video, including APIs, clients, projects, and models.

## API

| provider | API reference | modality | main feature |
| --- | --- | --- | --- |
| OpenAI | Realtime API | text/audio/video | GPT-4o. (See the connection sketch below the table.) |
| MiniMax | Realtime API | text/audio | Compatible with the OpenAI API. |
| Vapi | API Reference | text/audio | Vapi lets developers build, test, and deploy voice AI agents in minutes rather than months, solving the foundational challenges voice AI applications face:<br>1. Simulating the flow of natural human conversation.<br>2. Realtime/low-latency demands.<br>3. Taking actions (function calling).<br>4. Extracting conversation data (review conversation audio, transcripts, and metadata).<br>SDK: https://docs.vapi.ai/introduction#explore-our-sdks |
| hume.ai | API Reference | text/audio | Real-time, customizable voice intelligence powered by empathic AI. |
| Google | API Reference | text/audio/video | Gemini Multimodal Live API. |
| retell.ai | API Reference | text/audio | Retell is a comprehensive platform for building, testing, deploying, and monitoring reliable AI phone agents. |
| Doubao | Volcengine | text/audio | ASR + LLM + TTS, WebRTC. |
| ElevenLabs | Conversational AI | text/audio | ASR + LLM + TTS, WebSocket. |
| Aliyun | Realtime AI | text/audio | Aliyun realtime AI. |
| ZhipuAI | GLM-Realtime | text/audio/video | ZhipuAI realtime. |
| SenseTime | SenseNova-5o | text/audio/video | - |
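
To make the table concrete, here is a minimal sketch of talking to the first entry, OpenAI's Realtime API, over a raw WebSocket. This is an illustration rather than official sample code: it assumes the `websockets` Python package, an `OPENAI_API_KEY` environment variable, and the model name and event shapes in OpenAI's docs at the time of writing.

```python
# Minimal sketch: open a Realtime API WebSocket, request one text
# response, and print server event types as they stream back.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: websockets >= 13 takes `additional_headers`; older
    # releases use `extra_headers` instead.
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text"],
                "instructions": "Say hello in one sentence.",
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break


asyncio.run(main())
```

Audio works the same way: the client streams `input_audio_buffer.append` events carrying base64-encoded PCM chunks, and the server streams audio deltas back over the same socket.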

## Projects

| maintainer | link | main feature |
| --- | --- | --- |
| @jhakulin | realtime-ai | Experimental Python SDK for OpenAI's Realtime API; well structured. |
| @Pipecat | Pipecat | Pipecat is an open-source Python framework for building voice and multimodal conversational agents. It handles the complex orchestration of AI services, network transport, audio processing, and multimodal interactions, letting you focus on creating engaging experiences. (See the loop sketch below the table.) |
| @TEN-framework | TEN Framework | TEN stands for Transformative Extensions Network; it is a voice agent framework for creating conversational AI. The TEN framework offers the following advantages:<br>1. Native support for high-performance, real-time multimodal interactions.<br>2. Support for multiple languages and platforms.<br>3. Edge-cloud integration.<br>4. Flexibility beyond model limitations.<br>5. Real-time agent state management.<br>6. And more... |
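
Under the hood, frameworks like Pipecat and TEN orchestrate either a single end-to-end model or a cascaded ASR + LLM + TTS loop behind a realtime transport (WebRTC/WebSocket). The sketch below shows the shape of that cascade; all three stage functions are hypothetical stand-ins, not any framework's actual API.

```python
# Illustrative shape of a cascaded voice-agent loop. The three stage
# functions are hypothetical stand-ins for real ASR/LLM/TTS services.
import asyncio
from typing import AsyncIterator, Callable


async def transcribe(chunk: bytes) -> str | None:
    """Hypothetical ASR stand-in: pretend each chunk is a full utterance."""
    text = chunk.decode("utf-8", errors="ignore").strip()
    return text or None


async def chat(utterance: str) -> str:
    """Hypothetical LLM stand-in: echo with a prefix."""
    return f"You said: {utterance}"


async def synthesize(text: str) -> bytes:
    """Hypothetical TTS stand-in: return reply text as raw bytes."""
    return text.encode("utf-8")


async def agent_loop(mic: AsyncIterator[bytes],
                     play: Callable[[bytes], None]) -> None:
    # The core cascade: audio in -> ASR -> LLM -> TTS -> audio out.
    async for chunk in mic:
        utterance = await transcribe(chunk)
        if utterance is None:  # still mid-utterance; keep listening
            continue
        reply = await chat(utterance)
        play(await synthesize(reply))  # ship back over WebRTC/WebSocket


async def demo() -> None:
    async def mic() -> AsyncIterator[bytes]:
        yield b"hello agent"

    await agent_loop(mic(), lambda audio: print(audio.decode()))


asyncio.run(demo())
```

Real frameworks add what this sketch omits: turn detection, interruption handling, streaming partial results, and transport plumbing, which is exactly the orchestration the table entries describe.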

## Models (Open Source)

| contributor | model | main feature |
| --- | --- | --- |
| FDU | SpeechGPT2 | SpeechGPT2 is an end-to-end speech dialogue language model, similar to GPT-4o. It can perceive and express emotions and provide appropriate voice responses in various styles such as rap, drama, robot, funny, and whisper, based on context and human instructions. |
| ZhipuAI | GLM-4-Voice | GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. It can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions. |
| kyutai-labs | Moshi | Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. |
| Tencent | Freeze-Omni | A smart and low-latency speech-to-speech dialogue model with a frozen LLM. |
| Tencent | VITA | Open-source interactive omni-multimodal LLM. |
| InternLM | Intern-OmniLive | A comprehensive multimodal system for long-term streaming video and audio interactions. |
| THU | mini-omni / mini-omni2 | English only; language model: Qwen2-0.5B. |
| ICT/CAS | LLaMA-Omni | LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built on Llama-3.1-8B-Instruct, aiming to reach GPT-4o-level speech capabilities. |
| OpenBMB | MiniCPM-o | MiniCPM-o 2.6: a GPT-4o-level MLLM for vision, speech, and multimodal live streaming on your phone. (See the loading sketch below the table.) |
| SJTU | SLAM-Omni | Qwen2-0.5B, Chinese/English. SLAM-Omni outperforms prior models of similar scale while requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. |
| Fixie | Ultravox | Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox can extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. Versions have been trained on Llama 3, Mistral, and Gemma. |
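
Most of the open-source models above ship checkpoints on the Hugging Face Hub with custom modeling code. Below is a minimal loading sketch using MiniCPM-o as the example; the repo id and dtype are assumptions, and each project's model card documents its own inference API.

```python
# Minimal sketch: load an open-weight realtime model from the HF Hub.
# The repo id below is an assumption; check the project's model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,   # these repos ship custom modeling code
    torch_dtype=torch.bfloat16,
    device_map="auto",        # requires the `accelerate` package
)
model.eval()

# Inference entry points (chat, audio in/out, streaming) vary per
# project; see each model card for the exact call signatures.
```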
