NAOqi Audio API
Author: ancient-sentinel - Last Updated: 11/03/2020
## ALSpeechRecognition
- Pre-initialized with a list of phrases that should be recognized by the robot
- While operating, ALSpeechRecognition sets a boolean indicating whether a speaker is currently heard
- If a speaker is heard, the element of the known phrase list that best matches what was heard is placed in the ALMemory keys WordRecognized and WordRecognizedAndGrammar
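
A minimal sketch of this flow with the Python NAOqi SDK is shown below. The robot address, port, and the subscriber name `wiki_asr` are placeholder assumptions; `setVocabulary`, `subscribe`, and the `WordRecognized` key are from the standard ALSpeechRecognition API.

```python
import time
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"  # assumed robot address
PORT = 9559

asr = ALProxy("ALSpeechRecognition", ROBOT_IP, PORT)
memory = ALProxy("ALMemory", ROBOT_IP, PORT)

# Pre-initialize the engine with the phrases it should recognize.
asr.pause(True)
asr.setLanguage("English")
asr.setVocabulary(["hello", "goodbye", "play", "stop"], False)
asr.pause(False)

# Start the engine; while subscribed, it listens and fills the ALMemory keys.
asr.subscribe("wiki_asr")
time.sleep(5)  # say one of the phrases during this window

# WordRecognized holds the best-matching phrase(s) with confidence scores.
print(memory.getData("WordRecognized"))

asr.unsubscribe("wiki_asr")
```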
## ALVoiceEmotionAnalysis
- Performs emotion analysis independent of what is said by the speaker
- Note: depends on ALSpeechRecognition for speech detection, so that module must be started first
- Identifies the emotion expressed by the speaker's voice
- At the end of the utterance, the engine processes the emotion levels and raises an event
- Event structure (raised as ALVoiceEmotionAnalysis/EmotionRecognized):
  [
    [matched emotion index, matched emotion level],
    emotion levels: [calm, anger, joy, sorrow, laughter],
    excitement level
  ]
Where:
- “Matched emotion index” is the index of the dominant emotion selected by the engine in the emotion levels vector:
  - 0 = unknown
  - 1 = calm
  - 2 = anger
  - 3 = joy
  - 4 = sorrow

  Laughter is not really considered an emotion, so it will never be the dominant one.
- “Matched emotion level” is the level of the dominant emotion
- “Emotion levels” is the vector of the emotions’ scores. Scores are integers between 0 and 10.
- “Excitement level” is a separate value that measures the amount of excitement in the voice. High excitement is often linked with joy or anger.
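
A minimal sketch of unpacking this event value follows; the structure matches the description above, and the event name comes from the ALVoiceEmotionAnalysis documentation.

```python
EMOTIONS = ["unknown", "calm", "anger", "joy", "sorrow"]

def describe_emotion(value):
    """Unpack an ALVoiceEmotionAnalysis/EmotionRecognized event value."""
    index, level = value[0]    # dominant emotion and its level
    emotion_levels = value[1]  # scores for [calm, anger, joy, sorrow, laughter]
    excitement = value[2]      # separate excitement score
    print("Dominant emotion: %s (level %d, excitement %d)"
          % (EMOTIONS[index], level, excitement))
    print("Levels [calm, anger, joy, sorrow, laughter]: %s" % emotion_levels)

# Example with a made-up event value where joy (index 3) dominates:
describe_emotion([[3, 6], [1, 0, 6, 0, 2], 7])
```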
## ALSoundLocalization
- Identifies the direction of any sound loud enough to be heard by the robot
- Uses the Time Difference of Arrival (TDOA) of sounds detected by the robot's four microphones to determine the source location
- Each time a sound is detected, its location is computed and published in an event (see the sketch after the use cases below)
- Event structure (raised as ALSoundLocalization/SoundLocated):
  [
    [time(sec), time(usec)],
    [azimuth(rad), elevation(rad), confidence, energy],
    [Head Position[6D]] in FRAME_TORSO,
    [Head Position[6D]] in FRAME_ROBOT
  ]
- Maximum theoretical accuracy is 10 degrees
- Limited by how clearly the source can be heard with respect to background noise
- Will detect and locate any loud sound, since it cannot tell whether or not the source is human
- Less reliable when exposed to several loud noises at the same time
- Potential use cases
- Noisy event localization
- use sound localization to anticipate human presence and bring the sound source into camera view
- Sound source separation
- use localization estimates to strengthen the signal-to-noise ratio in the corresponding direction (beamforming) to enhance subsequent audio-based algorithms
- Multimodal applications
- security: track noises in an empty room and take pictures of the source location
- entertainment: identify speakers and understand what is being said, allowing the robot to take part in games with humans
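
A minimal sketch of reading the SoundLocated event with the Python NAOqi SDK follows. The robot address, port, and the subscriber name `wiki_loc` are placeholder assumptions.

```python
import time
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"  # assumed robot address
PORT = 9559

loc = ALProxy("ALSoundLocalization", ROBOT_IP, PORT)
memory = ALProxy("ALMemory", ROBOT_IP, PORT)

loc.subscribe("wiki_loc")  # start the localization engine
time.sleep(5)              # make a sharp noise (e.g. a clap) now

# Last published value, shaped as described above:
# [[sec, usec], [azimuth, elevation, confidence, energy], head6D_torso, head6D_robot]
value = memory.getData("ALSoundLocalization/SoundLocated")
azimuth, elevation, confidence, energy = value[1]
print("Sound at azimuth %.2f rad, elevation %.2f rad (confidence %.2f)"
      % (azimuth, elevation, confidence))

loc.unsubscribe("wiki_loc")
```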
## ALTextToSpeech
- Allows the robot to speak
- Sends commands to a text-to-speech engine and supports voice customization; the resulting synthesis is sent to the robot's speakers
- Tags embedded in the text can change pitch, speed, and volume in the middle of a sentence, add pauses between words, and change emphasis
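
A minimal sketch with the Python NAOqi SDK is shown below; the IP and port are placeholder assumptions, while the `\\rspd\\` (speed), `\\vct\\` (pitch), and `\\pau\\` (pause) tags are standard ALTextToSpeech markup.

```python
from naoqi import ALProxy

tts = ALProxy("ALTextToSpeech", "192.168.1.10", 9559)

tts.setVolume(0.8)  # overall output volume, 0.0 to 1.0

# Inline tags: slow to 80% speed, pause 500 ms mid-sentence,
# then raise the pitch for the final phrase.
tts.say("\\rspd=80\\ Hello. \\pau=500\\ \\vct=120\\ Nice to meet you!")
```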
## ALAnimatedSpeech
- Allows the robot to talk in an expressive way
- Process:
  - The module receives a text that can be annotated with instructions
  - The text is split into small chunks
  - The module analyzes the text and annotations and adds contextual moves for the things it recognizes
  - Any parts that are not annotated are filled with animations launched by ALSpeakingMovement
  - The module prepares the robot to execute each instruction
  - The module ensures that the robot says the text and launches the instructions at the same time
- The annotated text is a string combining the text to be said with instructions managing behaviors, as in the sketch below
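
A minimal sketch of annotated text with the Python NAOqi SDK; the IP and port are placeholder assumptions, and the `^start`/`^wait` annotations with the `Hey_1` gesture come from the standard NAOqi animation annotation syntax.

```python
from naoqi import ALProxy

animated = ALProxy("ALAnimatedSpeech", "192.168.1.10", 9559)

# ^start launches an animation while the text is spoken; ^wait blocks
# until it finishes. Unannotated spans get automatic gestures from
# ALSpeakingMovement.
annotated_text = ("Hello! ^start(animations/Stand/Gestures/Hey_1) "
                  "Nice to meet you! ^wait(animations/Stand/Gestures/Hey_1)")
animated.say(annotated_text)
```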