NOTE: This is a very early developer preview!
An open source toolkit for building voice assistants.
Rhasspy focuses on:
- Privacy - no data leaves your computer unless you want it to
- Broad language support - more than just English
- Customization - everything can be changed
To get started:
- Check out the tutorial
- Connect Rhasspy to Home Assistant
- Install the Rhasspy 3 add-on
- Run one or more satellites
- Join the community
This is a developer preview, so there are lots of things missing:
- A user friendly web UI
- An automated method for installing programs/services and downloading models
- Support for custom speech to text grammars
- Intent systems besides Home Assistant
- The ability to accumulate context within a pipeline
Rhasspy is organized by domain:
- mic - audio input
- wake - wake word detection
- asr - speech to text
- vad - voice activity detection
- intent - intent recognition from text
- handle - intent or text input handling
- tts - text to speech
- snd - audio output
Rhasspy talks to external programs using the Wyoming protocol. You can add your own programs by implementing the protocol, or by using an adapter: a small script that lives in bin/ and bridges an existing program into the Wyoming protocol. For example, a speech to text program (asr) that accepts a WAV file and outputs text can use asr_adapter_wav2text.py
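At its core, Wyoming frames each event as a JSON header line, optionally followed by raw payload bytes (such as audio). The sketch below is a simplified approximation of that framing, just to show the shape an adapter has to speak; the field names here are illustrative, and the real format is defined by the Wyoming protocol spec.

```python
import io
import json

def write_event(stream, event_type: str, data: dict, payload: bytes = b"") -> None:
    """Write one event: a JSON header line, then optional binary payload.

    Simplified illustration of Wyoming-style framing; field names are
    approximate, not the authoritative protocol definition.
    """
    header = {"type": event_type, "data": data, "payload_length": len(payload)}
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        stream.write(payload)

def read_event(stream):
    """Read back one event written by write_event."""
    header = json.loads(stream.readline())
    payload = stream.read(header["payload_length"])
    return header["type"], header["data"], payload

# Round-trip an audio chunk through an in-memory stream:
buf = io.BytesIO()
write_event(buf, "audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01")
buf.seek(0)
etype, data, payload = read_event(buf)
```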
Complete voice loop from microphone input (mic) to speaker output (snd). Stages are:
- detect (optional) - wait until wake word is detected in mic
- transcribe - listen until vad detects silence, then convert audio to text
- recognize (optional) - recognize an intent from text
- handle - handle an intent or text, producing a text response
- speak - convert handle output text to speech, and speak it through snd
Some programs take a while to load, so it's best to leave them running as servers. Use bin/server_run.py, or add --server <domain> <name> when running the HTTP server. See the servers section of the configuration.yaml file.
- mic
- wake
- vad
- asr
- handle
- tts
- snd
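A server entry in configuration.yaml pairs one of the domains above with a named program to keep running. The fragment below is purely illustrative: the program name faster-whisper and its command are placeholders, and the exact schema is defined by your own configuration.yaml.

```yaml
# Hypothetical sketch of a servers section -- consult your
# configuration.yaml for the real schema and available programs.
servers:
  asr:
    faster-whisper:
      command: |
        script/server "${data_dir}"
```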
HTTP endpoints are available at http://localhost:13331/<endpoint>
Unless overridden, the pipeline named "default" is used.
/pipeline/run
- Runs a full pipeline from mic to snd
- Produces JSON
- Override pipeline or: wake_program, asr_program, intent_program, handle_program, tts_program, snd_program
- Skip stages with start_after:
    - wake - skip detection; body is detection name (text)
    - asr - skip recording; body is transcript (text) or WAV audio
    - intent - skip recognition; body is intent/not-recognized event (JSON)
    - handle - skip handling; body is handle/not-handled event (JSON)
    - tts - skip synthesis; body is WAV audio
- Stop early with stop_after:
    - wake - only detection
    - asr - detection and transcription
    - intent - detection, transcription, recognition
    - handle - detection, transcription, recognition, handling
    - tts - detection, transcription, recognition, handling, synthesis
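As a concrete illustration, a pipeline run can be triggered over plain HTTP. This sketch assumes the default server on localhost:13331; it only constructs the request (skipping recording via start_after and stopping after handling), and the commented lines actually send it.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "http://localhost:13331"  # assumes the default HTTP server port

def pipeline_run_request(body: bytes, **params) -> Request:
    """Build a POST request for /pipeline/run.

    params may include pipeline, start_after, stop_after, or any of the
    *_program overrides listed above.
    """
    query = "?" + urlencode(params) if params else ""
    return Request(BASE + "/pipeline/run" + query, data=body, method="POST")

# Skip recording (body is the transcript) and stop after handling:
req = pipeline_run_request(b"turn on the light", start_after="asr", stop_after="handle")

# To actually run it (requires a running Rhasspy HTTP server):
# with urlopen(req) as resp:
#     print(resp.read())
```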
/wake/detect
- Detect wake word in WAV input
- Produces JSON
- Override wake_program or pipeline
/asr/transcribe
- Transcribe audio from WAV input
- Produces JSON
- Override asr_program or pipeline
/intent/recognize
- Recognizes intent from text body (POST) or text (GET)
- Produces JSON
- Override intent_program or pipeline
/handle/handle
- Handles intent/text from body (POST) or input (GET)
- Content-Type must be application/json for intent input
- Override handle_program or pipeline
/tts/synthesize
- Synthesizes audio from text body (POST) or text (GET)
- Produces WAV audio
- Override tts_program or pipeline
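For example, both request forms can be built with the standard library. This sketch assumes the default server address; only the request construction runs here, and the commented lines fetch and save the WAV response.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "http://localhost:13331"  # default HTTP port from this document

# GET form: text passed as a query parameter
url = BASE + "/tts/synthesize?text=" + quote("hello world")

# POST form: text in the request body
req = Request(BASE + "/tts/synthesize", data="hello world".encode("utf-8"), method="POST")

# With a running server, the response body is WAV audio:
# with urlopen(req) as resp, open("output.wav", "wb") as f:
#     f.write(resp.read())
```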
/tts/speak
- Plays audio from text body (POST) or text (GET)
- Produces JSON
- Override tts_program, snd_program, or pipeline
/snd/play
- Plays WAV audio via snd
- Override snd_program or pipeline
/config
- Returns JSON config
/version
- Returns version info
WebSocket endpoints are available at ws://localhost:13331/<endpoint>
Audio streams are raw PCM in binary messages.
Use the rate, width, and channels parameters to set the sample rate (hertz), sample width (bytes), and channel count. By default, input audio is 16 kHz 16-bit mono, and output audio is 22 kHz 16-bit mono.
The client can "end" the audio stream by sending an empty binary message.
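A streaming client can be sketched with the third-party websockets package (an illustrative choice of client library, assuming the default server address). Only the URL builder runs here; the coroutine streams raw PCM chunks and ends the stream with an empty binary message, as described above.

```python
import json
from urllib.parse import urlencode

def stream_url(endpoint: str, rate: int = 16000, width: int = 2, channels: int = 1) -> str:
    """Build a ws:// URL carrying the audio-format parameters described above."""
    params = urlencode({"rate": rate, "width": width, "channels": channels})
    return f"ws://localhost:13331{endpoint}?{params}"

async def transcribe(pcm_chunks) -> dict:
    """Stream raw PCM chunks to /asr/transcribe and return the final JSON message."""
    # Imported here so the sketch loads even without the dependency installed:
    import websockets  # pip install websockets

    async with websockets.connect(stream_url("/asr/transcribe")) as ws:
        for chunk in pcm_chunks:
            await ws.send(chunk)            # binary message: raw PCM audio
        await ws.send(b"")                  # empty binary message ends the stream
        return json.loads(await ws.recv())  # JSON result when the stream ends

# With a running Rhasspy server:
# import asyncio
# result = asyncio.run(transcribe(chunks))
```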
/pipeline/asr-tts
- Run pipeline from asr (stream in) to tts (stream out)
- Produces JSON messages as events happen
- Override pipeline or: asr_program, vad_program, handle_program, tts_program
- Use in_rate, in_width, in_channels for the audio input format
- Use out_rate, out_width, out_channels for the audio output format
/wake/detect
- Detect wake word from websocket audio stream
- Produces a JSON message when audio stream ends
- Override wake_program or pipeline
/asr/transcribe
- Transcribe a websocket audio stream
- Produces a JSON message when audio stream ends
- Override asr_program or pipeline
/snd/play
- Play a websocket audio stream
- Produces a JSON message when audio stream ends
- Override snd_program or pipeline