A collection of realtime AI resources focused on audio and video, including APIs, clients, projects, models, and more.
Provider | API Reference | Modality | Main Feature |
---|---|---|---|
openai | Realtime API | text/audio/video | GPT-4o (see the WebSocket sketch after this table). |
minimax | Realtime API | text/audio | Compatible with the OpenAI API. |
Vapi | API Reference | text/audio | Vapi lets developers build, test, and deploy voice AI agents in minutes rather than months, solving the foundational challenges voice AI applications face: 1. Simulating the flow of natural human conversation. 2. Realtime/low-latency demands. 3. Taking actions (function calling). 4. Extracting conversation data (reviewing conversation audio, transcripts, and metadata). SDKs: https://docs.vapi.ai/introduction#explore-our-sdks |
hume.ai | API Reference | text/audio | Real-time, customizable voice intelligence powered by empathic AI. |
google | API Reference | text/audio/video | Gemini Multimodal Live API. |
retell.ai | API Reference | text/audio | Retell is a comprehensive platform for building, testing, deploying, and monitoring reliable AI phone agents. |
doubao | volcengine | text/audio | Cascaded ASR + LLM + TTS; WebRTC transport. |
elevenlabs | Conversational AI | text/audio | Cascaded ASR + LLM + TTS; WebSocket transport. |
aliyun | Realtime AI | text/audio | Alibaba Cloud's realtime AI service. |
zhipuAI | GLM-Realtime | text/audio/video | ZhipuAI's end-to-end realtime model. |
SenseTime | SenseNova-5o | text/audio/video | - |
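
Several of the hosted offerings above expose a WebSocket event protocol. Below is a minimal sketch of opening an OpenAI Realtime API session with the `websockets` package; the model name, beta header, and event shapes follow OpenAI's published docs at the time of writing and may change.

```python
# Minimal sketch of an OpenAI Realtime API session over WebSocket.
# Assumes: `pip install websockets` and OPENAI_API_KEY in the environment.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header required at the time of writing
    }
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model for a spoken + written greeting.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text", "audio"],
                "instructions": "Say hello in one short sentence.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. response.audio.delta, response.done
            if event["type"] == "response.done":
                break

asyncio.run(main())
```

Since MiniMax advertises OpenAI API compatibility, the same event flow should apply there with a different endpoint and key.
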
Maintainer | Link | Main Feature |
---|---|---|
@jhakulin | realtime-ai | Experimental, well-structured Python SDK for OpenAI's Realtime API. |
@Pipecat | Pipecat | Pipecat is an open source Python framework for building voice and multimodal conversational agents. It handles the complex orchestration of AI services, network transport, audio processing, and multimodal interactions, letting you focus on creating engaging experiences. |
@TEN-framework | TEN Framework | TEN (Transformative Extensions Network) is a voice agent framework for creating conversational AI. It offers the following advantages: 1. Native support for high-performance, real-time multimodal interactions. 2. Support for multiple languages and platforms. 3. Edge-cloud integration. 4. Flexibility beyond model limitations. 5. Real-time agent state management. 6. And more. (A simplified sketch of the loop these frameworks orchestrate follows this table.) |
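
Frameworks like Pipecat and TEN exist largely to orchestrate the cascaded speech loop (ASR → LLM → TTS) with streaming transports and interruption handling. The sketch below is a deliberately simplified, turn-based version of that loop; all three stage functions are hypothetical stubs, not APIs from either framework.

```python
# Hypothetical sketch of the cascaded ASR -> LLM -> TTS turn loop that
# frameworks like Pipecat and TEN orchestrate; all three stages are stubs.
import asyncio

async def asr(audio_chunk: bytes) -> str:
    """Stand-in for a streaming speech-to-text service."""
    return "hello agent"

async def llm(history: list[dict], user_text: str) -> str:
    """Stand-in for a chat-completion call; history carries multi-turn context."""
    history.append({"role": "user", "content": user_text})
    reply = f"echo: {user_text}"
    history.append({"role": "assistant", "content": reply})
    return reply

async def tts(text: str) -> bytes:
    """Stand-in for a text-to-speech service returning raw audio."""
    return text.encode()

async def handle_turn(history: list[dict], audio_chunk: bytes) -> bytes:
    # Real frameworks run these stages concurrently with token/audio streaming
    # and barge-in handling; this sketch is strictly turn-based for clarity.
    text = await asr(audio_chunk)
    reply = await llm(history, text)
    return await tts(reply)

if __name__ == "__main__":
    print(asyncio.run(handle_turn([], b"\x00\x01")))
```

Production frameworks replace each stub with a streaming service and add interruption (barge-in) handling, which is where most of the orchestration effort goes.
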
Contributor | Model | Main Feature |
---|---|---|
FDU | SpeechGPT2 | SpeechGPT2 is an end-to-end speech dialogue language model, similar to GPT-4o. It can perceive and express emotions, and provide appropriate voice responses in various styles such as rap, drama, robot, funny, and whisper, based on context and human instructions. |
ZhipuAI | GLM-4-Voice | GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions. |
kyutai-labs | Moshi | Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. |
tencent | Freeze-Omni | A smart, low-latency speech-to-speech dialogue model with a frozen LLM. |
tencent | VITA | Open-source interactive omni-multimodal LLM. |
InternLM | Intern-OmniLive | A comprehensive multimodal system for long-term streaming video and audio interactions. |
THU | mini-omni / mini-omni2 | English only; base language model: Qwen2-0.5B. |
ICT/CAS | LLaMA-Omni | LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level. |
OpenBMB | MiniCPM-o | MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone |
SJTU | SLAM-Omni | Qwen2-0.5B base, Chinese/English. SLAM-Omni outperforms prior models of similar scale while requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. |
Fixie | Ultravox | Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox can extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. Versions have been trained on Llama 3, Mistral, and Gemma. |
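
As a concrete example of running one of these open models, Ultravox checkpoints are published on Hugging Face with custom pipeline code. The sketch below loosely follows the model card; the checkpoint id, input keys (`audio`, `turns`, `sampling_rate`), and remote-code pipeline interface are assumptions to verify against the current release.

```python
# Hedged sketch of running Ultravox via Hugging Face transformers.
# Assumes: `pip install transformers numpy` plus the model's own dependencies.
import numpy as np
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4",  # assumed checkpoint name
    trust_remote_code=True,          # the audio projector ships as custom code
)

sample_rate = 16_000
audio = np.zeros(sample_rate, dtype=np.float32)  # 1 s of silence as a stand-in

turns = [{"role": "system", "content": "You are a helpful voice assistant."}]
result = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sample_rate},
    max_new_tokens=64,
)
print(result)
```

Most other models in this table ship their own inference stacks; check each repo for the supported entry point.
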