A comprehensive list of open source voice and music datasets.
There are two main types of audio datasets: speech datasets and audio event/music datasets.

### Speech datasets
* [AESDD](http://m3c.web.auth.gr/research/aesdd-speech-emotion-recognition/) - around 500 utterances from a diverse group of actors (over 5 actors) simulating various emotions.
* [ANAD](https://www.kaggle.com/suso172/arabic-natural-audio-dataset) - 1,384 recordings by multiple speakers; 3 emotions: angry, happy, surprised.
* [Arabic Speech Corpus](http://en.arabicspeechcorpus.com/) - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.
* [ASR datasets](https://github.com/robmsmt/ASR_Audio_Data_Links) - A list of publicly available audio data that anyone can download for ASR or other speech activities.
* [AudioMNIST](https://github.com/soerenab/AudioMNIST) - The dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
* [Awesome_Diarization](https://github.com/jim-schwoebel/awesome-diarization) - A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
* [BAVED](https://www.kaggle.com/a13x10/basic-arabic-vocal-emotions-dataset) - 1,935 recordings by 61 speakers (45 male and 16 female).
* [CaFE](https://www.gel.usherbrooke.ca/audio/cafe.htm) - 6 different sentences by 12 speakers (6 females + 6 males).
* [Common Voice](https://voice.mozilla.org/) - Common Voice is Mozilla's initiative to help teach machines how real people speak. 12GB in size; spoken text based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora.
* [CHIME](https://archive.org/details/chime-home) - This is a noisy speech recognition challenge dataset (~4 GB in size). It contains real, simulated, and clean voice recordings: real are actual recordings of 4 speakers across nearly 9,000 recordings in 4 noisy locations; simulated are generated by combining multiple environments with speech utterances; and clean are non-noisy recordings.
* [Coswara](https://github.com/iiscleap/Coswara-Data) - A database that contains respiratory sounds, namely, cough, breath, and speech of healthy and COVID-19 positive individuals.
* [CMU-MOSEI](https://www.amir-zadeh.com/datasets) - 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 emotions (happiness, sadness, anger, fear, disgust, surprise) plus a Likert scale.
* [CMU-MOSI](https://www.amir-zadeh.com/datasets) - 2199 opinion utterances with annotated sentiment; Sentiment annotated between very negative to very positive in seven Likert steps.
* [CMU Wilderness](http://festvox.org/cmu_wilderness/) - (noncommercial) - not directly downloadable, but a great speech dataset covering many accents, with speakers reciting passages from the Bible.
* [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D) - CREMA-D is a data set of 7,442 original clips from 91 actors: 48 male and 43 female, between the ages of 20 and 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
* [DAPS Dataset](https://archive.org/details/daps_dataset) - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (which provides about 14 minutes of data per speaker).
* [Deep Clustering Dataset](https://www.merl.com/demos/deep-clustering) - Training deep discriminative embeddings to solve the cocktail party problem.
* [DEMoS](https://zenodo.org/record/2544829) - 9,365 emotional and 332 neutral samples produced by 68 native speakers (23 female, 45 male); 6 basic emotions (anger, sadness, happiness, fear, surprise, disgust) plus the secondary emotion guilt.
* [DIPCO](https://arxiv.org/abs/1909.13447) - Dinner Party Corpus - The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes.
* [EmoFilm](https://zenodo.org/record/1326428) - 1,115 audio instances of sentences extracted from various films.
* [EmoSynth](https://zenodo.org/record/3727593) - 144 audio files labelled by 40 listeners; emotion (no speech) defined in terms of valence and arousal.
* [Emotional Voices Database](https://github.com/numediart/EmoV-DB) - 5 voice actors simulating various emotions (amused, angry, disgusted, neutral, sleepy).
* [Emotional Voice dataset - Nature](https://www.nature.com/articles/s41562-019-0533-6) - 2,519 speech samples produced by 100 actors from 5 cultures. Using large-scale statistical inference methods, the authors find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across 2 cultures.
* [EmotionTTS](https://github.com/emotiontts/emotiontts_open_db) - Recordings and their associated transcriptions by a diverse group of speakers - 4 emotions: general, joy, anger, and sadness.
* [Emov-DB](https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg!mYwUnI4K) - Recordings of 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepiness, anger, disgust, and amused.
* [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html) - 6 actors each performing 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness.
* [Free Spoken Digit Dataset](https://github.com/Jakobovski/free-spoken-digit-dataset) - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
* [Flickr Audio Caption](https://groups.csail.mit.edu/sls/downloads/flickraudio/) - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
* [GEMEP corpus](https://www.unige.ch/cisa/gemep) - 10 actors portraying 10 states; 12 emotions: amusement, anxiety, cold anger (irritation), despair, hot anger (rage), fear (panic), interest, joy (elation), pleasure (sensory), pride, relief, and sadness. Plus 5 additional emotions: admiration, contempt, disgust, surprise, and tenderness.
* [ISOLET Data Set](https://data.world/uci/isolet) - This 38.7 GB dataset helps predict which letter-name was spoken — a simple classification task.
* [JL corpus](https://www.kaggle.com/tli725/jl-corpus) - 2,400 recordings of 240 sentences by 4 actors (2 males and 2 females); 5 primary emotions: angry, sad, neutral, happy, excited. 5 secondary emotions: anxious, apologetic, pensive, worried, enthusiastic.
* [Libriadapt](https://github.com/akhilmathurs/libriadapt) - Primarily designed to facilitate domain adaptation research for ASR models; contains three types of domain shift in the data.
* [Libri-CSS](https://github.com/chenzhuo1011/libri_css) - derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
* [LibriMix](https://github.com/JorisCos/LibriMix) - LibriMix is an open source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.
* [Librispeech](https://www.openslr.org/12) - LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech derived from read audiobooks from the LibriVox project (a minimal loading sketch appears after this list).
* [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
* [Microsoft Scalable Noisy Speech Dataset](https://github.com/microsoft/MS-SNSD) - The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and speech-to-noise ratio (SNR) levels desired (a sketch of SNR-based mixing appears after this list).
* [MSP-IMPROV](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Improv.html) - 20 sentences by 12 actors; 4 emotions (angry, sad, happy, neutral), plus "other" and "no agreement" labels.
* [MSP Podcast Corpus](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) - 100 hours by over 100 speakers - annotated with emotional labels using attribute-based descriptors (activation, dominance and valence) and categorical labels (anger, happiness, sadness, disgust, surprised, fear, contempt, neutral and other).
* [Multimodal EmotionLines Dataset (MELD)](https://github.com/SenticNet/MELD) - Created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines but also encompasses audio and visual modalities along with text. It has more than 1,400 dialogues and 13,000 utterances from the Friends TV series, with each utterance labeled as Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear.
* [MuSe-CAR](https://zenodo.org/record/4134758) - 40 hours, 6,000+ recordings of 25,000+ sentences by 70+ English speakers (15 GB).
* [NISQA Speech Quality Corpus](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus) - includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness.
* [Noisy Dataset](https://datashare.is.ed.ac.uk/handle/10283/2791) - Clean and noisy parallel speech database, designed to train and test speech enhancement methods that operate at 48 kHz. Also known as VBD (Voice Bank + DEMAND); speech samples are from the VCTK dataset.
* [OGVC](https://sites.google.com/site/ogcorpus/home/en) - 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors (two male and two female); 9 emotional states: fear, surprise, sadness, disgust, anger, anticipation, joy, acceptance and the neutral state.
* [OpenSLR](https://openslr.org) - Many audio datasets (>109) published for speech recognition purposes.
* [Parkinson's speech dataset](https://archive.ics.uci.edu/ml/datasets/Parkinson+Speech+Dataset+with++Multiple+Types+of+Sound+Recordings) - The training data comes from 20 Parkinson's Disease (PD) patients and 20 healthy subjects; 26 types of sound recordings are taken from each subject (20 MB total).
* [Persian Consonant Vowel Combination (PCVC) Speech Dataset](https://github.com/S-Malek/PCVC) - A Modern Persian speech corpus for speech recognition as well as speaker recognition. The dataset contains 23 Persian consonants and 6 vowels; the sound samples are all possible combinations of vowels and consonants (138 samples per speaker), each 30,000 data samples long.
* [RECOLA](https://diuf.unifr.ch/main/diva/recola/download.html) - 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal).
* [The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)](https://zenodo.org/record/1188976#.XrC7a5NKjOR) - RAVDESS contains 7,356 files (total size: 24.8 GB) from 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions (a filename-parsing sketch appears after this list).
* [sample_voice_data](https://github.com/jim-schwoebel/sample_voice_data) - 52 audio files per class (males and females) for testing purposes.
* [SAVEE Dataset](http://kahlan.eps.surrey.ac.uk/savee/) - 4 male actors in 7 different emotions, 480 British English utterances in total.
* [SEWA](https://db.sewaproject.eu/) - more than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
* [ShEMO](https://github.com/mansourehk/ShEMO) - 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers; 6 emotions: anger, fear, happiness, sadness, neutral and surprise.
* [SparseLibriMix](https://github.com/popcornell/SparseLibriMix) - An open source dataset for source separation in noisy environments with variable overlap ratio. Due to insufficient noise material, this is a test-set-only version.
* [Speech Accent Archive](https://www.kaggle.com/rtatman/speech-accent-archive/version/1) - For various accent detection tasks.
* [Speech Commands Dataset](http://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html) - The dataset (1.4 GB) has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.
* [Ted-LIUM](https://www.openslr.org/51/) - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
* [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts/) - German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
* [TIMIT dataset](https://catalog.ldc.upenn.edu/LDC93S1) - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (paid license).
* [URDU-Dataset](https://github.com/siddiquelatif/urdu-dataset) - 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
* [VCTK dataset](https://datashare.is.ed.ac.uk/handle/10283/3443) - 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
* [VCTK-2Mix](https://github.com/JorisCos/VCTK-2Mix) - VCTK-2Mix is an open source dataset for source separation in noisy environments. It is derived from VCTK signals and WHAM noise. It is meant as a test set. It will also enable cross-dataset experiments.
* [VIVAE](https://zenodo.org/record/4066235) - 1,085 non-speech audio files by ~12 speakers; 6 emotions: achievement, anger, fear, pain, pleasure, and surprise, each at 4 emotional intensities (low, moderate, strong, peak).
* [Voice Gender Detection](https://github.com/jim-schwoebel/voice_gender_detection) - GitHub repo for Voice gender detection using the VoxCeleb dataset (7000+ unique speakers and utterances, 3683 males / 2312 females).
* [VOiCES Dataset](https://iqtlabs.github.io/voices/) - The Voices Obscured in Complex Environmental Settings (VOiCES) corpus is a creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.
* [VoxCeleb](https://github.com/andabi/voice-vector) - VoxCeleb is a large-scale speaker identification dataset containing around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males comprise 55%), and the celebrities span a diverse range of accents, professions, and ages. There is no overlap between the development and test sets. It's an intriguing use case for isolating and identifying which superstar a voice belongs to.
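
Most of these corpora can be loaded with standard tooling. As a minimal sketch for the LibriSpeech entry above (assuming `torchaudio` is installed and there is disk space for the roughly 6 GB `train-clean-100` split), the corpus can be pulled through torchaudio's built-in dataset class:

```python
# Minimal sketch: load LibriSpeech via torchaudio's built-in dataset wrapper.
# Assumes torchaudio is installed; train-clean-100 is ~6 GB to download.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data",          # where to download/extract the corpus
    url="train-clean-100",  # one of the official LibriSpeech subsets
    download=True,
)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)  # 16 kHz read English speech
```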
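
Several of the noisy-speech entries, MS-SNSD in particular, revolve around mixing clean speech with noise at chosen SNR levels. Below is a minimal sketch of that core operation, assuming NumPy waveforms at the same sample rate; the function name and scaling logic are illustrative, not taken from the MS-SNSD codebase:

```python
# Sketch: mix clean speech with noise at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)     # loop/trim the noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Choose scale so that 10*log10(speech_power / (scale**2 * noise_power)) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```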
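
Many of the emotion corpora encode their labels in filenames. For RAVDESS, each filename consists of seven hyphen-separated numeric fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), with the emotion code in the third field, per the dataset's published naming convention. A small parsing sketch (Python 3.9+ for `str.removesuffix`):

```python
# Sketch: recover the emotion label from a RAVDESS filename such as
# "03-01-06-01-02-01-12.wav" (third field "06" -> "fearful").
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_emotion(filename: str) -> str:
    fields = filename.removesuffix(".wav").split("-")
    return RAVDESS_EMOTIONS[fields[2]]  # third field is the emotion code

print(ravdess_emotion("03-01-06-01-02-01-12.wav"))  # -> fearful
```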