
Comparing Human and LLM Transcription of Emotional Speech in Noise

In a recent experiment by Alexander & Llanos, human listeners were tasked with transcribing emotional speech in background noise. Unsurprisingly, human transcribers performed better at higher SNRs. Of greater interest, performance was also better for happy and angry prosodies relative to neutral. Given recent work comparing human performance to the capacities of speech-based large language models (Patman & Chodroff, 2024; Kim et al., 2024), I wondered: how might speech-to-text LLMs fare with emotional speech? This mini study extracts transcriptions from five speech-to-text models (Wav2Vec2.0 (base), Wav2Vec2.0 (large), Whisper (base), Whisper (large), and SpeechT5) and compares their performance to that of human listeners.

Repository License: CC BY-SA 4.0

Written Summary with Plots

emoSPIN-summary.Rmd

Please open the knitted .html file in a web browser to read about the project.

Data Extraction

speecht5.py, wav2vec2_base.py, wav2vec2_large.py, whisper_base.py, whisper_large.py

Each .py file extracts transcriptions for all stimuli from the associated LLM. Transcriptions are stored in outputs/.
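For reference, the extraction scripts all follow the same basic pattern. The sketch below shows how one of them (e.g., whisper_base.py) might load stimuli and produce transcriptions with Hugging Face Transformers; the stimuli/ directory and the output filename are illustrative assumptions, not the repository's actual paths.

```python
# Minimal sketch (illustrative, not the repository's exact script): transcribe
# every .wav stimulus with Whisper (base) and write the results to outputs/.
# The stimuli/ folder and output filename below are assumptions for illustration.
import csv
from pathlib import Path

from transformers import pipeline

# Load a pretrained ASR pipeline (openai/whisper-base from the Hugging Face Hub).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

stimuli = sorted(Path("stimuli").glob("*.wav"))  # hypothetical stimulus folder
rows = []
for wav in stimuli:
    result = asr(str(wav))                        # returns a dict with a "text" field
    rows.append((wav.name, result["text"].strip()))

# Store transcriptions alongside the other model outputs.
out_path = Path("outputs") / "whisper_base_transcriptions.csv"  # hypothetical filename
out_path.parent.mkdir(exist_ok=True)
with out_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["stimulus", "transcription"])
    writer.writerows(rows)
```

The Wav2Vec2.0 and SpeechT5 scripts would swap in the corresponding model checkpoints but otherwise loop over the same stimuli and write to outputs/ in the same way.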

Human Data

Human performance data is available on OSF.

