In a recent experiment by Alexander & Llanos, human listeners were tasked with transcribing emotional speech in background noise. Unsurprisingly, human transcribers performed better at higher SNRs. Of greater interest, performance was also better for happy and angry prosodies relative to neutral prosody. Given recent work comparing human performance to the capacities of speech-based large language models (Patman & Chodroff, 2024; Kim et al., 2024), I wondered: how might speech-to-text LLMs fare with emotional speech? This mini study extracts transcriptions from five speech-to-text models: Wav2Vec2.0 (base), Wav2Vec2.0 (large), Whisper (base), Whisper (large), and SpeechT5. It then compares their performance to that of human listeners.
Repository License: CC BY-SA 4.0
Please open the knitted .html in a web browser to read about the project.
Each .py file extracts transcriptions for all stimuli from the associated LLM. Transcriptions are stored in outputs/.
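As a rough illustration of what one of these extraction scripts might look like, here is a minimal sketch using the Hugging Face transformers ASR pipeline with Whisper (base). The stimuli/ directory, the output file name, and the pipeline-based approach are assumptions for illustration, not the repository's actual code.

```python
from pathlib import Path
from transformers import pipeline

# Hypothetical extraction sketch: assumes the stimuli are .wav files in a
# local stimuli/ directory. Model name and paths are illustrative only.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)

rows = []
for wav in sorted(Path("stimuli").glob("*.wav")):
    result = asr(str(wav))  # the pipeline returns a dict with a "text" field
    rows.append(f"{wav.name}\t{result['text']}")

# One tab-separated line per stimulus: filename, transcription
(out_dir / "whisper_base_transcriptions.tsv").write_text("\n".join(rows))
```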
Human performance data are available on OSF.
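Comparing the models with the human listeners requires scoring the model transcriptions against the reference sentences. Below is a minimal scoring sketch using word error rate from the jiwer package; the file layout, file names, and choice of metric are assumptions, not necessarily the procedure used in the analysis.

```python
import jiwer

# Hypothetical scoring sketch: assumes reference sentences and model
# transcriptions are stored as parallel tab-separated files keyed by
# stimulus filename. File names and the WER metric are assumptions.
def load_tsv(path):
    with open(path) as f:
        return dict(line.rstrip("\n").split("\t", 1) for line in f if line.strip())

references = load_tsv("stimuli/reference_sentences.tsv")
hypotheses = load_tsv("outputs/whisper_base_transcriptions.tsv")

# Score only stimuli present in both files, in a stable order.
shared = sorted(references.keys() & hypotheses.keys())
wer = jiwer.wer([references[k] for k in shared], [hypotheses[k] for k in shared])
print(f"Word error rate over {len(shared)} stimuli: {wer:.3f}")
```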