NAOqi Audio API
Author: ancient-sentinel - Last Updated: 11/03/2020
## ALSpeechRecognition
- Pre-initialized with a list of phrases that should be recognized by the robot
- While operating, ALSpeechRecognition sets a boolean indicating whether a speaker is currently heard
- If a speaker is heard, the element of the known phrase list that best matches what was heard is placed in the ALMemory keys WordRecognized and WordRecognizedAndGrammar
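
A minimal sketch of this flow with the Python NAOqi SDK is shown below. The robot address, port, and the subscriber name `wiki_asr` are placeholder assumptions; `setVocabulary`, `subscribe`, and the `WordRecognized` key are from the standard ALSpeechRecognition API.

```python
import time
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"  # assumed robot address
PORT = 9559

asr = ALProxy("ALSpeechRecognition", ROBOT_IP, PORT)
memory = ALProxy("ALMemory", ROBOT_IP, PORT)

# Pre-initialize the engine with the phrases it should recognize.
asr.pause(True)
asr.setLanguage("English")
asr.setVocabulary(["hello", "goodbye", "play", "stop"], False)
asr.pause(False)

# Start the engine; while subscribed, it listens and fills the ALMemory keys.
asr.subscribe("wiki_asr")
time.sleep(5)  # say one of the phrases during this window

# WordRecognized holds the best-matching phrase(s) with confidence scores.
print(memory.getData("WordRecognized"))

asr.unsubscribe("wiki_asr")
```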
## ALVoiceEmotionAnalysis
- Performs emotion analysis independent of what is said by the speaker
- Note: depends on ALSpeechRecognition for speech detection, so that module must be started first
- Identifies the emotion expressed by the speaker's voice
- At the end of the utterance, the engine processes the emotion levels and raises an event
- Event structure (raised as ALVoiceEmotionAnalysis/EmotionRecognized):
  [
    [matched emotion index, matched emotion level],
    emotion levels: [calm, anger, joy, sorrow, laughter],
    excitement level
  ]
Where:
- “Matched emotion index” is the index of the dominant emotion selected by the engine in the emotion levels vector:
  - 0 = unknown
  - 1 = calm
  - 2 = anger
  - 3 = joy
  - 4 = sorrow

  Laughter is not really considered an emotion, so it will never be the dominant one.
- “Matched emotion level” is the level of the dominant emotion
- “Emotion levels” is the vector of the emotions’ scores. Scores are integers between 0 and 10.
- “Excitement level” is a separate value that measures the amount of excitement in the voice. High excitement is often linked with joy or anger.
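
A minimal sketch of unpacking this event value follows; the structure matches the description above, and the event name comes from the ALVoiceEmotionAnalysis documentation.

```python
EMOTIONS = ["unknown", "calm", "anger", "joy", "sorrow"]

def describe_emotion(value):
    """Unpack an ALVoiceEmotionAnalysis/EmotionRecognized event value."""
    index, level = value[0]    # dominant emotion and its level
    emotion_levels = value[1]  # scores for [calm, anger, joy, sorrow, laughter]
    excitement = value[2]      # separate excitement score
    print("Dominant emotion: %s (level %d, excitement %d)"
          % (EMOTIONS[index], level, excitement))
    print("Levels [calm, anger, joy, sorrow, laughter]: %s" % emotion_levels)

# Example with a made-up event value where joy (index 3) dominates:
describe_emotion([[3, 6], [1, 0, 6, 0, 2], 7])
```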
## ALSoundLocalization
- Identifies the direction of any sound loud enough to be heard by the robot
- Uses the Time Difference of Arrival (TDOA) of sounds detected by the robot's four microphones to determine the source location
- Each time a sound is detected, its location is computed and published in an event (see the sketch after the use cases below)
- Event structure (raised as ALSoundLocalization/SoundLocated):
  [
    [time(sec), time(usec)],
    [azimuth(rad), elevation(rad), confidence, energy],
    [Head Position[6D]] in FRAME_TORSO,
    [Head Position[6D]] in FRAME_ROBOT
  ]
- Maximum theoretical accuracy is 10 degrees
- Limited by how clearly the source can be heard with respect to background noise
- Will detect and locate any loud sound, since it cannot tell whether or not the source is human
- Less reliable when exposed to several loud noises at the same time
- Potential use cases
- Noisy event localization
- use sound localization to anticipate human presence and bring the sound source into camera view
- Sound source separation
- use localization estimates to strengthen the signal-to-noise ratio in the corresponding direction (beamforming) to enhance subsequent audio-based algorithms
- Multimodal applications
- security: track noises in an empty room and take pictures of the source location
- entertainment: identify speakers and understand what is being said, allowing the robot to take part in games with humans
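
A minimal sketch of reading the SoundLocated event with the Python NAOqi SDK follows. The robot address, port, and the subscriber name `wiki_loc` are placeholder assumptions.

```python
import time
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"  # assumed robot address
PORT = 9559

loc = ALProxy("ALSoundLocalization", ROBOT_IP, PORT)
memory = ALProxy("ALMemory", ROBOT_IP, PORT)

loc.subscribe("wiki_loc")  # start the localization engine
time.sleep(5)              # make a sharp noise (e.g. a clap) now

# Last published value, shaped as described above:
# [[sec, usec], [azimuth, elevation, confidence, energy], head6D_torso, head6D_robot]
value = memory.getData("ALSoundLocalization/SoundLocated")
azimuth, elevation, confidence, energy = value[1]
print("Sound at azimuth %.2f rad, elevation %.2f rad (confidence %.2f)"
      % (azimuth, elevation, confidence))

loc.unsubscribe("wiki_loc")
```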
## ALTextToSpeech
- Allows the robot to speak
- Sends commands to a text-to-speech engine and supports voice customization; the resulting synthesis is sent to the robot's speakers
- Tags embedded in the text can change pitch, speed, and volume in the middle of a sentence, add pauses between words, and change emphasis
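
A minimal sketch with the Python NAOqi SDK is shown below; the IP and port are placeholder assumptions, while the `\\rspd\\` (speed), `\\vct\\` (pitch), and `\\pau\\` (pause) tags are standard ALTextToSpeech markup.

```python
from naoqi import ALProxy

tts = ALProxy("ALTextToSpeech", "192.168.1.10", 9559)

tts.setVolume(0.8)  # overall output volume, 0.0 to 1.0

# Inline tags: slow to 80% speed, pause 500 ms mid-sentence,
# then raise the pitch for the final phrase.
tts.say("\\rspd=80\\ Hello. \\pau=500\\ \\vct=120\\ Nice to meet you!")
```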
## ALAnimatedSpeech
- Allows the robot to talk in an expressive way
- Process:
  - The module receives a text that can be annotated with instructions
  - The text is split into small chunks
  - The module analyzes the text and annotations and adds contextual moves for the things it recognizes
  - Any parts that are not annotated are filled with animations launched by ALSpeakingMovement
  - The module prepares the robot to execute each instruction
  - The module ensures that the robot says the text and launches the instructions at the same time
- The annotated text is a string combining the text to be said with instructions managing behaviors, as in the sketch below
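
A minimal sketch of annotated text with the Python NAOqi SDK; the IP and port are placeholder assumptions, and the `^start`/`^wait` annotations with the `Hey_1` gesture come from the standard NAOqi animation annotation syntax.

```python
from naoqi import ALProxy

animated = ALProxy("ALAnimatedSpeech", "192.168.1.10", 9559)

# ^start launches an animation while the text is spoken; ^wait blocks
# until it finishes. Unannotated spans get automatic gestures from
# ALSpeakingMovement.
annotated_text = ("Hello! ^start(animations/Stand/Gestures/Hey_1) "
                  "Nice to meet you! ^wait(animations/Stand/Gestures/Hey_1)")
animated.say(annotated_text)
```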