speech is a modular service that provides text-to-speech (TTS) and speech-to-text (STT) capabilities for machines running on the Viam platform.
This module implements the Speech Service API (`viam-labs:service:speech`). See the documentation for that API to learn more about using it with the Viam SDKs.
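For a sense of how a client interacts with the service, here is a minimal sketch using the Viam Python SDK together with the `speech_service_api` client package published for this API. The machine address, API key, and the service name `speechio` are placeholders for your own values, and the exact package and method signatures may differ slightly; treat this as illustrative and check the Speech Service API documentation.

```python
import asyncio

from viam.robot.client import RobotClient
from speech_service_api import SpeechService  # client library for the viam-labs speech API

async def main():
    # Connect to the machine; replace the address and API key with your machine's values.
    opts = RobotClient.Options.with_api_key(api_key="<API-KEY>", api_key_id="<API-KEY-ID>")
    machine = await RobotClient.at_address("<MACHINE-ADDRESS>", opts)

    # "speechio" is the name given to the speech service in the machine's configuration.
    speech = SpeechService.from_robot(machine, "speechio")

    # Convert text to speech and play it through the machine's audio output.
    await speech.say("Hello from the speech module!", True)

    await machine.close()

asyncio.run(main())
```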
On Linux, `run.sh` will automatically install the following system dependencies if they are not already set up on the machine:

- `python3-pyaudio`
- `portaudio19-dev`
- `alsa-tools`
- `alsa-utils`
- `flac`

On macOS, `run.sh` will install the following dependency using Homebrew before adding the modular resource:

brew install portaudio
Before configuring your speech service, you must also create a machine.
To use this module, follow these instructions to add a module from the Viam Registry and select the `viam-labs:speech:speechio` model from the `speech` module.
Navigate to the Config tab of your machine's page in the Viam app.
Click on the Services subtab and click Create service.
Select the `speech` type, then select the `speech:speechio` model.
Click Add module, then enter a name for your speech service and click Create.
On the new service panel, copy and paste the following attribute template into your service's Attributes box:
{
"speech_provider": "google|elevenlabs",
"speech_provider_key": "<SECRET-KEY>",
"speech_voice": "<VOICE-OPTION>",
"completion_provider": "openai",
"completion_model": "gpt-4|gpt-3.5-turbo",
"completion_provider_org": "<org-abc123>",
"completion_provider_key": "<sk-mykey>",
"completion_persona": "<PERSONA>",
"listen": true,
"listen_provider": "google",
"listen_trigger_say": "<TRIGGER-PHRASE>",
"listen_trigger_completion": "<COMPLETION-PHRASE>",
"listen_trigger_command": "<COMMAND-TO-RETRIEVE-STORED-TEXT>",
"listen_command_buffer_length": 10,
"listen_phrase_time_limit": 5,
"mic_device_name": "myMic",
"cache_ahead_completions": false,
"disable_mic": false
}
Note
For more information, see Configure a Machine.
The following attributes are available for the `viam-labs:speech:speechio` speech service:
Name | Type | Inclusion | Description |
---|---|---|---|
`speech_provider` | string | Optional | The speech provider for the voice service: `"google"` or `"elevenlabs"`. Default: `"google"`. |
`speech_provider_key` | string | Required | The secret key for the provider; only required for `elevenlabs`. Default: `""`. |
`speech_voice` | string | Optional | If the `speech_provider` (for example, `elevenlabs`) provides voice options, you can select the voice here. Default: `"Josh"`. |
`completion_provider` | string | Optional | `"openai"`. Other providers may be supported in the future. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"openai"`. |
`completion_model` | string | Optional | `"gpt-4o"`, `"gpt-4o-mini"`, etc. `completion_provider_org` and `completion_provider_key` must also be provided. Default: `"gpt-4o"`. |
`completion_provider_org` | string | Optional | Your organization ID for the completion provider. Default: `""`. |
`completion_provider_key` | string | Optional | Your API key for the completion provider. Default: `""`. |
`completion_persona` | string | Optional | If set, will pass "As `<completion_persona>` respond to '`<completion_text>`'" to all `completion()` requests. Default: `""`. |
`listen` | boolean | Optional | If set to `true` and the machine has an available microphone device, enables listening in the background. When enabled, the service responds to the configured `listen_trigger_say`, `listen_trigger_completion`, and `listen_trigger_command` phrases, based on input audio converted to text. If `listen` is enabled and `listen_triggers_active` is disabled, triggers occur when `listen_trigger()` is called. Note that background (ambient) noise and microphone quality are important factors in the quality of the STT conversion. Currently, Google STT is used. Default: `false`. |
`listen_provider` | string | Optional | Can be set to the name of a configured speech service that provides `to_text` and `listen` commands, like `stt-vosk`*. Otherwise, the Google STT API is used. Default: `"google"`. |
`listen_phrase_time_limit` | float | Optional | The maximum number of seconds a phrase may continue before listening stops and the part of the phrase processed before the time limit is returned. The resulting audio is the phrase cut off at the time limit. If set to `None`, there is no phrase time limit. Note: if you see instances where phrases are not returned for much longer than you expect, try setting this to ~5. Default: `None`. |
`listen_trigger_say` | string | Optional | If `listen` is `true`, any audio converted to text that is prefixed with `listen_trigger_say` will be converted to speech and repeated back by the robot. Default: `"robot say"`. |
`listen_trigger_completion` | string | Optional | If `listen` is `true`, any audio converted to text that is prefixed with `listen_trigger_completion` will be sent to the completion provider (if configured), converted to speech, and repeated back by the robot. Default: `"hey robot"`. |
`listen_trigger_command` | string | Optional | If `listen` is `true`, any audio converted to text that is prefixed with `listen_trigger_command` will be stored in a LIFO buffer (list of strings) of size `listen_command_buffer_length` that can be retrieved via `get_commands()`, enabling programmatic voice control of the robot. Default: `"robot can you"`. |
`listen_command_buffer_length` | integer | Optional | The length of the command buffer. Default: `10`. |
`mic_device_name` | string | Optional | If not set, the first available microphone device is used. If set, the service attempts to use the microphone device with this name. Available microphone device names are logged on module startup. Default: `""`. |
`cache_ahead_completions` | boolean | Optional | If `true`, requests a second completion for the same prompt and caches it for the next time a matching request is made. This is useful for faster completions when completion text is less variable. Default: `false`. |
`disable_mic` | boolean | Optional | If `true`, no listening capabilities are configured. This must be set to `true` if you do not have a valid microphone attached to your system. Default: `false`. |
`disable_audioout` | boolean | Optional | If `true`, no audio output capabilities are configured. This must be set to `true` if you do not have a valid audio output device attached to your system. Default: `false`. |
*If the `listen_provider` is another speech service, it should be set as a dependency for the "speechio" service. This must be done using the "Raw JSON" editor within the robot configuration, by setting the `"depends_on"` field for the service:
{
"name": "speechio",
"type": "speech",
"namespace": "viam-labs",
"model": "viam-labs:speech:speechio",
"attributes": {
"listen_provider": "vosk",
/* other configuration for the service */
},
"depends_on": [
"vosk"
]
}
In the above case, the `listen_provider` and `depends_on` values are set to the name of the `viam-labs:speech:stt-vosk` service configured for the robot.
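As a sketch of the programmatic voice control described in the attribute table above (phrases prefixed with `listen_trigger_command` land in a command buffer), a client can periodically drain that buffer with `get_commands()`. This assumes the same `speech_service_api` client and connected `machine` as in the earlier sketch; the exact `get_commands()` signature may differ, so treat this as illustrative.

```python
import asyncio

from speech_service_api import SpeechService

async def poll_commands(machine):
    # `machine` is a connected RobotClient, as in the earlier sketch.
    speech = SpeechService.from_robot(machine, "speechio")
    while True:
        # Phrases spoken as "robot can you <command>" are buffered by the service;
        # retrieve up to one buffered command and hand it to your own control logic.
        for command in await speech.get_commands(1):
            print("heard command:", command)
        await asyncio.sleep(1)
```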
The following configuration sets up listening mode with local speech-to-text, uses an ElevenLabs voice "Antoni", makes AI completions available, and uses a 'Gollum' persona for AI completions:
{
"completion_provider_org": "org-abc123",
"completion_provider_key": "sk-mykey",
"completion_persona": "Gollum",
"listen": true,
"listen_provider": "stt",
"speech_provider": "elevenlabs",
"speech_provider_key": "keygoeshere",
"speech_voice": "Antoni",
"mic_device_name": "myMic"
}
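With a configuration like this, a client could ask for an AI completion (answered in the configured 'Gollum' persona and spoken in the ElevenLabs "Antoni" voice) and also receive the text back. A rough sketch under the same assumptions as the earlier examples:

```python
from speech_service_api import SpeechService

async def ask(machine, question: str) -> str:
    # `machine` is a connected RobotClient, as in the earlier sketches.
    speech = SpeechService.from_robot(machine, "speechio")
    # The answer is generated by the configured completion provider, spoken aloud,
    # and returned here as text.
    return await speech.completion(question, True)
```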
A USB audio device may not consistently come up as the default audio device. To ensure that it does, there are a couple of things you can try:
- Run `aplay -l`; you will see output similar to:
**** List of PLAYBACK Hardware Devices ****
card 1: rockchipdp0 [rockchip-dp0], device 0: rockchip-dp0 spdif-hifi-0 [rockchip-dp0 spdif-hifi-0]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 2: rockchiphdmi0 [rockchip-hdmi0], device 0: rockchip-hdmi0 i2s-hifi-0 [rockchip-hdmi0 i2s-hifi-0]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 3: rockchipes8388 [rockchip-es8388], device 0: dailink-multicodecs ES8323 HiFi-0 [dailink-multicodecs ES8323 HiFi-0]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 4: UACDemoV10 [UACDemoV1.0], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
Identify the device card number you wish to use for output.
In our example, we'll use the USB audio device (card 4).
As root, add the following to /etc/asound.conf:
defaults.pcm.card 4
defaults.ctl.card 4
- Check the existing ALSA modules:
cat /proc/asound/modules
This will output something like:
0 snd_usb_audio
2 snd_soc_meson_card_utils
3 snd_usb_audio
- Ensure the USB device comes up first by editing /etc/modprobe.d/alsa-base.conf, adding content similar to:
options snd slots=snd-usb-audio,snd_soc_meson_card_utils