
NAOqi Audio API


Author: ancient-sentinel - Last Updated: 11/03/2020

Module Summaries

Speech Recognition

https://developer.softbankrobotics.com/nao6/naoqi-developer-guide/naoqi-apis/naoqi-audio/alspeechrecognition/alspeechrecognition-api

  • Pre-initialize with a list of phrases that should be recognized by the robot
  • While operating, ALSpeechRecognition sets a boolean indicating whether a speaker is currently heard
  • If a speaker is heard, the element of the known phrases list that best matches what is heard is placed in the WordRecognized and WordRecognizedAndGrammar keys
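
A minimal usage sketch with the Python NAOqi SDK (ALProxy), assuming a robot reachable at a hypothetical NAO_IP; the vocabulary, subscriber name, and timing below are illustrative only.

```python
from naoqi import ALProxy
import time

NAO_IP = "192.168.1.10"   # hypothetical robot address
PORT = 9559

asr = ALProxy("ALSpeechRecognition", NAO_IP, PORT)
memory = ALProxy("ALMemory", NAO_IP, PORT)

asr.setLanguage("English")

# Pre-initialize with the list of phrases the robot should recognize.
asr.setVocabulary(["hello", "goodbye", "play", "stop"], False)

# Start recognition; the subscriber name is arbitrary.
asr.subscribe("WikiDemo_ASR")
try:
    time.sleep(10)                              # listen for a while
    print(memory.getData("SpeechDetected"))     # True while a speaker is heard
    print(memory.getData("WordRecognized"))     # [best phrase, confidence, ...]
finally:
    asr.unsubscribe("WikiDemo_ASR")
```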

Vocal Emotion Analysis

https://developer.softbankrobotics.com/nao-naoqi-2-1/naoqi-developer-guide/naoqi-framework/naoqi-apis/naoqi-audio/alvoiceemotionanalysis

  • Emotion analysis independent of what is said by the speaker
    • Note: depends on ALSpeechRecognition for speech detection, so that module must be started as well
  • Identifies the emotion expressed by the speaker's voice
  • At the end of the utterance, the engine processes the emotion level and raises an event
    • Event structure:
[
  [ matched emotion index, matched emotion level ],
  emotion levels: [ calm, anger, joy, sorrow, laughter ],
  excitement level
]

Where:

  • “Matched emotion index” is the index of the dominant emotion selected by the engine in the emotion levels vector:

    0 = unknown
    1 = calm
    2 = anger
    3 = joy
    4 = sorrow

Laughter is not really considered an emotion, so it will never be the dominant one.

  • “Matched emotion level” is the level of the dominant emotion
  • “Emotion levels” is the vector of the emotions’ scores. Scores are integers between 0 and 10.
  • “Excitement level” is a separate value that measures the amount of excitement in the voice. High excitement is often linked with joy or anger.
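
A sketch of reading that event with the Python SDK. The event key ALVoiceEmotionAnalysis/EmotionRecognized, the robot address, and the one-word vocabulary used just to start ALSpeechRecognition are assumptions to verify against your NAOqi version.

```python
from naoqi import ALProxy
import time

NAO_IP = "192.168.1.10"   # hypothetical robot address
PORT = 9559

asr = ALProxy("ALSpeechRecognition", NAO_IP, PORT)
vea = ALProxy("ALVoiceEmotionAnalysis", NAO_IP, PORT)
memory = ALProxy("ALMemory", NAO_IP, PORT)

# ALVoiceEmotionAnalysis relies on ALSpeechRecognition for speech detection,
# so both extractors are started; the tiny vocabulary only keeps ASR running.
asr.setVocabulary(["hello"], False)
asr.subscribe("WikiDemo_VEA")
vea.subscribe("WikiDemo_VEA")
try:
    time.sleep(15)   # speak an utterance during this window
    event = memory.getData("ALVoiceEmotionAnalysis/EmotionRecognized")  # assumed event key
    (matched_index, matched_level), levels, excitement = event
    names = ["unknown", "calm", "anger", "joy", "sorrow"]
    print("Dominant emotion: %s (level %d)" % (names[matched_index], matched_level))
    print("Scores [calm, anger, joy, sorrow, laughter]: %s" % levels)
    print("Excitement level: %d" % excitement)
finally:
    vea.unsubscribe("WikiDemo_VEA")
    asr.unsubscribe("WikiDemo_VEA")
```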

Sound Localization

https://developer.softbankrobotics.com/nao-naoqi-2-1/naoqi-developer-guide/naoqi-framework/naoqi-apis/naoqi-audio/alsoundlocalization

  • Identifies the direction of any sound loud enough to be heard by the robot
  • Uses Time Difference Of Arrival (TDOA) from sounds detected by the robot's four microphones to determine the source location
  • Each time a sound is detected, the location is computed and published in an event
    • Event structure (parsed in the sketch at the end of this section):
[ 
  [time(sec), time(usec)],
  [azimuth(rad), elevation(rad), confidence, energy],
  [Head Position[6D]] in FRAME_TORSO,
  [Head Position[6D]] in FRAME_ROBOT
]
  • Maximum theoretical accuracy is 10 degrees
  • Limited by how clearly the source can be heard with respect to background noise
  • Detects and locates any sufficiently loud sound, since it cannot tell whether or not the source is human
  • Less reliable when exposed to several loud noises at the same time
  • Potential use cases
    • Noisy event localization
      • use sound localization to anticipate human presence and bring the sound source into the camera's field of view
    • Sound source separation
      • use the localization estimate to strengthen the signal-to-noise ratio in the corresponding direction (beamforming), enhancing subsequent audio-based algorithms
    • Multimodal applications
      • security: track noises in empty room and take pictures of source location
      • entertainment: identify speakers and understand what is being said to allow robot to take part in games with humans
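
A sketch that starts ALSoundLocalization and reads the last SoundLocated event value from ALMemory, then converts azimuth and elevation to degrees; the robot address and timing are placeholders.

```python
from naoqi import ALProxy
import math
import time

NAO_IP = "192.168.1.10"   # hypothetical robot address
PORT = 9559

loc = ALProxy("ALSoundLocalization", NAO_IP, PORT)
memory = ALProxy("ALMemory", NAO_IP, PORT)

loc.subscribe("WikiDemo_SoundLoc")
try:
    time.sleep(10)   # clap or speak somewhere around the robot
    event = memory.getData("ALSoundLocalization/SoundLocated")
    timestamp = event[0]                                   # [sec, usec]
    azimuth, elevation, confidence, energy = event[1]
    head_in_torso, head_in_robot = event[2], event[3]      # 6D head positions
    print("Sound at azimuth %.1f deg, elevation %.1f deg (confidence %.2f)"
          % (math.degrees(azimuth), math.degrees(elevation), confidence))
finally:
    loc.unsubscribe("WikiDemo_SoundLoc")
```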

Text-To-Speech

https://developer.softbankrobotics.com/nao6/naoqi-developer-guide/naoqi-apis/naoqi-audio/altexttospeech/altexttospeech-api

  • Allows the robot to speak
  • Sends commands to a text-to-speech engine and allows voice customization; the resulting synthesized audio is sent to the robot's loudspeakers
  • Tags in the text can change pitch, speed, and volume in the middle of a sentence, add pauses between words, and change emphasis
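
A sketch of ALTextToSpeech with a few of the inline tags (pause, speaking rate, pitch); the exact tag set depends on the speech engine, so treat the values below as illustrative and check the API page above.

```python
from naoqi import ALProxy

NAO_IP = "192.168.1.10"   # hypothetical robot address
tts = ALProxy("ALTextToSpeech", NAO_IP, 9559)

tts.setLanguage("English")
tts.setVolume(0.8)          # master volume, 0.0 - 1.0

# Plain speech.
tts.say("Hello, I am NAO.")

# Inline tags modify the voice mid-sentence:
#   \pau=NNN\  pause (ms)       \rspd=NN\  speaking rate (percent)
#   \vct=NN\   pitch (percent)
tts.say("I can speak \\rspd=60\\ slowly, \\rspd=140\\ quickly, \\rspd=100\\ "
        "\\pau=500\\ after a pause, and \\vct=120\\ with a higher pitch.")
```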

Animated Speech

https://developer.softbankrobotics.com/nao6/naoqi-developer-guide/naoqi-apis/naoqi-audio/alanimatedspeech

  • Allows the robot to talk in an expressive way
  • Process
    • Module receives a text that can be annotated with instructions
    • Text is split into little chunks
    • Analyzes the text and annotations and adds contextual moves for the elements it recognizes
    • Any parts that aren't annotated are filled with animations launched by ALSpeakingMovement
    • Module prepares robot to execute each instruction
    • Module ensures that the robot says the text and launches the instructions at the same time
  • Annotated text is a string combining the text to be said with instructions managing behaviors
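
A sketch of ALAnimatedSpeech with annotated text; the ^start/^wait annotations and the bodyLanguageMode option come from the API page above, while the animation path and robot address are examples whose availability depends on the installed animation packages.

```python
from naoqi import ALProxy

NAO_IP = "192.168.1.10"   # hypothetical robot address
animated = ALProxy("ALAnimatedSpeech", NAO_IP, 9559)

# ^start(...) launches an animation in parallel with the speech that follows it;
# ^wait(...) holds the speech until that animation finishes.
annotated_text = ("Hello! ^start(animations/Stand/Gestures/Hey_1) "
                  "Nice to meet you. ^wait(animations/Stand/Gestures/Hey_1)")

# Unannotated parts of the text are filled with moves chosen by ALSpeakingMovement.
animated.say(annotated_text)

# The body-language mode for unannotated text can be overridden per call.
animated.say("I can also stay still while talking.",
             {"bodyLanguageMode": "disabled"})
```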