Voice Activation #25

RicardoEPRodrigues · 2024-04-08T14:22:14Z

Hey,

I saw you had voice activation planned for this tool. I ended up implementing a basic solution on my end and wanted to share it with you.

In my use case, I really needed an open mic where the player spoke when they wanted. So I implemented voice detection through changes in the intensity of the audio (not perfect but it kinda works). I leave the code below if you want to try it out.

In essence, I pick up the data points from the CapturableSoundWave and create an average over time. If for a given time frame, the values deviate enough from the average, then I activate the voice recording. The recording stops after a second if the data points return to the average (the player stopped speaking).

Header

#pragma once

#include "CoreMinimal.h"
#include "SpeechRecognizer.h"
#include "MyRecorderSpeechRecognition.generated.h"

/**
 * 
 */
UCLASS(Blueprintable)
class MY_API UMyRecorderSpeechRecognition : public UObject
{
	GENERATED_BODY()

public:
	virtual void BeginDestroy() override;

	virtual void Init();
	virtual void StartAudioSession(bool bIsMuted);
	virtual void StopAudioSession();
	virtual void SetAudioSessionMute(const bool bIsMuted);
	virtual bool GetAudioSessionMute();

	/**
	 * Audio Input Device ID (aka which microphone to use).
	 */
	UPROPERTY(BlueprintReadWrite, EditAnywhere, Category="STT")
	int DeviceId = 0;

	/**
	 * Window of captured audio data to be stored prior to voice activation.
	 */
	UPROPERTY(BlueprintReadWrite, EditAnywhere, Category="STT|Voice Activation")
	int VoiceActivationWindow = 5;

	/**
	 * Time to wait after voice activation and before sending the audio data to be recognized.
	 */
	UPROPERTY(BlueprintReadWrite, EditAnywhere, Category="STT|Voice Activation")
	float VoiceActivationDelay = 1.0f;

	/**
	 * Multiplier of the deviation to detect a significant shift in volume to trigger voice activation.
	 */
	UPROPERTY(BlueprintReadWrite, EditAnywhere, Category="STT|Voice Activation")
	float VoiceActivationDeviationMultiplier = 2.0f;

	/**
	 * Minimum value the deviation can have to offer better results. (Most relevant in quiet environments.)
	 */
	UPROPERTY(BlueprintReadWrite, EditAnywhere, Category="STT|Voice Activation")
	float VoiceActivationMinDeviation = 0.5f;

protected:

	UPROPERTY()
	TArray<float> AudioData;

	float ActiveAverage = -1;
	float ActiveDeviation = -1;
	
	// STT Section
	UPROPERTY()
	bool IsAudioSessionMuted = false;

	UPROPERTY(BlueprintReadWrite)
	class USpeechRecognizer* SpeechRecognizer;

	UPROPERTY(BlueprintReadWrite)
	class UCapturableSoundWave* CapturableSoundWave;

	UPROPERTY()
	FOnSpeechRecognitionStartedDynamic OnStartSpeechRecognitionEvent;

	UFUNCTION()
	void OnRecognitionFinished();
	UFUNCTION()
	void OnRecognitionError(const FString& ShortErrorMessage, const FString& LongErrorMessage);
	UFUNCTION()
	void OnRecognizedTextSegment(const FString& RecognizedWords);

	UFUNCTION()
	void OnStartSpeechRecognition(bool bSucceeded);

	UFUNCTION()
	void OnPopulateAudioData(const TArray<float>& PopulatedAudioData);

	UFUNCTION()
	void Reset();

	bool IsVoiceActivated = false;
	
	FTimerDelegate VoiceActivationDelegate{};
	FTimerHandle VoiceActivationTimerHandle{};

	UFUNCTION()
	void OnVoiceActivation();
};

Code

// Fill out your copyright notice in the Description page of Project Settings.


#include "MyRecorderSpeechRecognition.h"

#include "Sound/CapturableSoundWave.h"

void UMyRecorderSpeechRecognition::BeginDestroy()
{
	if (SpeechRecognizer)
	{
		SpeechRecognizer->StopSpeechRecognition();
		SpeechRecognizer->OnRecognitionFinished.RemoveDynamic(
			this, &UMyRecorderSpeechRecognition::OnRecognitionFinished);
		SpeechRecognizer->OnRecognitionError.
		                  RemoveDynamic(this, &UMyRecorderSpeechRecognition::OnRecognitionError);
		SpeechRecognizer->OnRecognizedTextSegment.RemoveDynamic(
			this, &UMyRecorderSpeechRecognition::OnRecognizedTextSegment);
	}

	OnStartSpeechRecognitionEvent.Clear();

	if (CapturableSoundWave)
	{
		CapturableSoundWave->StopCapture();
		CapturableSoundWave->OnPopulateAudioData.RemoveDynamic(
			this, &UMyRecorderSpeechRecognition::OnPopulateAudioData);
	}

	if (GetWorld() && GetWorld()->GetTimerManager().IsTimerActive(VoiceActivationTimerHandle))
	{
		GetWorld()->GetTimerManager().ClearTimer(VoiceActivationTimerHandle);
	}

	Super::BeginDestroy();
}

void UMyRecorderSpeechRecognition::Init()
{
	SpeechRecognizer = USpeechRecognizer::CreateSpeechRecognizer();
	SpeechRecognizer->SetLanguage(ESpeechRecognizerLanguage::En);
	SpeechRecognizer->OnRecognitionFinished.AddUniqueDynamic(
		this, &UMyRecorderSpeechRecognition::OnRecognitionFinished);
	SpeechRecognizer->OnRecognitionError.AddUniqueDynamic(this, &UMyRecorderSpeechRecognition::OnRecognitionError);
	SpeechRecognizer->OnRecognizedTextSegment.AddUniqueDynamic(
		this, &UMyRecorderSpeechRecognition::OnRecognizedTextSegment);
	SpeechRecognizer->SetStreamingDefaults();
	SpeechRecognizer->SetSuppressBlank(true);
	SpeechRecognizer->SetSuppressNonSpeechTokens(true);
	SpeechRecognizer->SetNumOfThreads(0);
	SpeechRecognizer->SetStepSize(0);

	OnStartSpeechRecognitionEvent.BindDynamic(this, &UMyRecorderSpeechRecognition::OnStartSpeechRecognition);

	CapturableSoundWave = UCapturableSoundWave::CreateCapturableSoundWave();
	CapturableSoundWave->OnPopulateAudioData.AddUniqueDynamic(
		this, &UMyRecorderSpeechRecognition::OnPopulateAudioData);
}

void UMyRecorderSpeechRecognition::StartAudioSession(bool bIsMuted)
{
	if (!SpeechRecognizer)
	{
		UE_LOG(LogTemp, Error, TEXT("Unable to start audio session. Speech Recognizer is not defined."));
		return;
	}

	SpeechRecognizer->StartSpeechRecognition(OnStartSpeechRecognitionEvent);
	IsAudioSessionMuted = bIsMuted;
}

void UMyRecorderSpeechRecognition::StopAudioSession()
{
	if (CapturableSoundWave)
	{
		CapturableSoundWave->StopCapture();
	}
	if (SpeechRecognizer)
	{
		SpeechRecognizer->StopSpeechRecognition();
	}
	Reset();
}

void UMyRecorderSpeechRecognition::SetAudioSessionMute(const bool bIsMuted)
{
	if (IsAudioSessionMuted == bIsMuted)
	{
		return;
	}

	IsAudioSessionMuted = bIsMuted;
	if (!CapturableSoundWave)
	{
		return;
	}

	CapturableSoundWave->ToggleMute(IsAudioSessionMuted);
}

bool UMyRecorderSpeechRecognition::GetAudioSessionMute()
{
	return IsAudioSessionMuted;
}

void UMyRecorderSpeechRecognition::OnRecognitionFinished()
{
}

void UMyRecorderSpeechRecognition::OnRecognitionError(const FString& ShortErrorMessage,
                                                           const FString& LongErrorMessage)
{
	UE_LOG(LogTemp, Error, TEXT("Speech Recognition Error. %s. %s"), *ShortErrorMessage, *LongErrorMessage);
}

void UMyRecorderSpeechRecognition::OnRecognizedTextSegment(const FString& RecognizedWords)
{
	// Send text to game.
}

void UMyRecorderSpeechRecognition::OnStartSpeechRecognition(bool bSucceeded)
{
	if (!CapturableSoundWave)
	{
		UE_LOG(LogTemp, Error, TEXT("Unable to start speech recognition. CapturableSoundWave is not defined."));
		return;
	}

	Reset();

	CapturableSoundWave->StartCapture(DeviceId);
	if (!IsAudioSessionMuted) SetAudioSessionMute(IsAudioSessionMuted);
}

float MathSumAbs(const TArray<float>& Population)
{
	float Std = 0;

	for (const auto Data : Population)
	{
		Std += FMath::Abs(Data);
	}
	return Std;
}

void UMyRecorderSpeechRecognition::OnPopulateAudioData(const TArray<float>& PopulatedAudioData)
{
	if (!SpeechRecognizer)
	{
		UE_LOG(LogTemp, Error, TEXT("Unable to process audio data. SpeechRecognizer is not defined."));
		return;
	}
	if (!CapturableSoundWave)
	{
		UE_LOG(LogTemp, Error, TEXT("Unable to process audio data. CapturableSoundWave is not defined."));
		return;
	}

	const float Sum = MathSumAbs(PopulatedAudioData);
	const float Deviation = FMath::Abs(ActiveAverage - Sum);

	if (ActiveAverage <= 0)
	{
		ActiveAverage = Sum;
		ActiveDeviation = FMath::Max(Deviation, VoiceActivationMinDeviation);
		return;
	}

	if (Sum > ActiveAverage + (ActiveDeviation * VoiceActivationDeviationMultiplier))
	{
		FTimerManager& TimerManager = GetWorld()->GetTimerManager();
		if (TimerManager.IsTimerActive(VoiceActivationTimerHandle))
		{
			TimerManager.ClearTimer(VoiceActivationTimerHandle);
		}
		else
		{
			UE_LOG(LogTemp, Log, TEXT("Voice Activation: ON"));
		}

		VoiceActivationDelegate.BindUObject(this, &UMyRecorderSpeechRecognition::OnVoiceActivation);
		TimerManager.SetTimer(VoiceActivationTimerHandle, VoiceActivationDelegate,
		                      VoiceActivationDelay, false);
		IsVoiceActivated = true;
	}

	ActiveAverage = (ActiveAverage * 0.3f) + (Sum * 0.7f);
	ActiveDeviation = FMath::Max((ActiveDeviation + Deviation) * 0.5f, VoiceActivationMinDeviation);

	if (!IsVoiceActivated)
	{
		const int WindowNum = PopulatedAudioData.Num() * (VoiceActivationWindow - 1);
		if (AudioData.Num() > WindowNum)
		{
			const int NumDataToRemove = AudioData.Num() - WindowNum;
			AudioData.RemoveAt(0, NumDataToRemove);
		}
	}
	AudioData.Append(PopulatedAudioData);

	// SpeechRecognizer->ProcessAudioData(PopulatedAudioData, CapturableSoundWave->GetSampleRate(),
	//                                    CapturableSoundWave->GetNumOfChannels(), false);
}

void UMyRecorderSpeechRecognition::Reset()
{
	ActiveAverage = -1;
	ActiveDeviation = VoiceActivationMinDeviation;

	IsVoiceActivated = false;

	AudioData.Empty();
}

void UMyRecorderSpeechRecognition::OnVoiceActivation()
{
	UE_LOG(LogTemp, Log, TEXT("Voice Activation: OFF"));

	SpeechRecognizer->ProcessAudioData(AudioData, CapturableSoundWave->GetSampleRate(),
	                                   CapturableSoundWave->GetNumOfChannels(), true);

	AudioData.Empty();
	IsVoiceActivated = false;
}

gtreshchev · 2024-04-18T21:28:54Z

Hi, thank you for sharing! :) I'll see if it can be integrated into the plugin, maybe as a separate function 👍

gtreshchev · 2024-06-18T18:32:49Z

VAD is implemented within gtreshchev/RuntimeAudioImporter@a52e376, so this is no longer relevant

RicardoEPRodrigues · 2024-06-19T08:44:21Z

Just a quick note for which I can open an issue on the Audio Importer if relevant enough. The VAD is too sensitive, any small amount of background noise triggers the system to start recording. I had to keep my average-based solution on top of the VAD to have a good experience.

gtreshchev · 2024-06-19T15:45:43Z

Just a quick note for which I can open an issue on the Audio Importer if relevant enough. The VAD is too sensitive, any small amount of background noise triggers the system to start recording. I had to keep my average-based solution on top of the VAD to have a good experience.

I wonder if you're not using the VeryAggressive VAD mode on the capturable sound wave, which might be why it's too sensitive for you?

RicardoEPRodrigues · 2024-06-19T17:06:10Z

No no, I am using the Very Aggressive mode. Apparently it is an issue with libfvad as you can see here: dpirch/libfvad#23 (comment)

I tested my implementation (with VAD and with VAD + my solution) with 3 microphones and set up a fan (with multiple different speeds) pointed in their general direction. Ideally, VAD would only activate when hearing my voice, which was not the case.

gtreshchev · 2024-06-21T12:04:26Z

No no, I am using the Very Aggressive mode. Apparently it is an issue with libfvad as you can see here: dpirch/libfvad#23 (comment)

I tested my implementation (with VAD and with VAD + my solution) with 3 microphones and set up a fan (with multiple different speeds) pointed in their general direction. Ideally, VAD would only activate when hearing my voice, which was not the case.

I wonder if you could test this with the Very Aggressive mode? I've also been able to reproduce the behavior where noise is mistaken for voice in VAD, but when using the Very Aggressive mode, it tends to be very selective towards speech.

RicardoEPRodrigues · 2024-06-21T13:59:33Z

As previously stated I am using the very aggressive mode on this. Perhaps it is just a difference in microphones and setup that causes the issue.

gtreshchev added bug Something isn't working enhancement New feature or request and removed bug Something isn't working labels Apr 25, 2024

gtreshchev closed this as completed Jun 18, 2024

gtreshchev reopened this Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Voice Activation #25

Voice Activation #25

RicardoEPRodrigues commented Apr 8, 2024

gtreshchev commented Apr 18, 2024

gtreshchev commented Jun 18, 2024 •

edited

Loading

RicardoEPRodrigues commented Jun 19, 2024

gtreshchev commented Jun 19, 2024

RicardoEPRodrigues commented Jun 19, 2024 •

edited

Loading

gtreshchev commented Jun 21, 2024

RicardoEPRodrigues commented Jun 21, 2024

Voice Activation #25

Voice Activation #25

Comments

RicardoEPRodrigues commented Apr 8, 2024

gtreshchev commented Apr 18, 2024

gtreshchev commented Jun 18, 2024 • edited Loading

RicardoEPRodrigues commented Jun 19, 2024

gtreshchev commented Jun 19, 2024

RicardoEPRodrigues commented Jun 19, 2024 • edited Loading

gtreshchev commented Jun 21, 2024

RicardoEPRodrigues commented Jun 21, 2024

gtreshchev commented Jun 18, 2024 •

edited

Loading

RicardoEPRodrigues commented Jun 19, 2024 •

edited

Loading