Update README.md (jim-schwoebel, Jun 12, 2021, commit f6fef18)
There are two main types of audio datasets: speech datasets and audio event/music datasets.
* [DAPS Dataset](https://archive.org/details/daps_dataset) - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (which provides about 14 minutes of data per speaker).
* [Deep Clustering Dataset](https://www.merl.com/demos/deep-clustering) - Training deep discriminative embeddings to solve the cocktail party problem.
* [DEMoS](https://zenodo.org/record/2544829) - 9,365 emotional and 332 neutral samples produced by 68 native speakers (23 female, 45 male); 6 primary emotions (anger, sadness, happiness, fear, surprise, disgust) plus the secondary emotion guilt.
* [DES](http://kom.aau.dk/~tb/speech/Emotions/) - 4 speakers (2 males and 2 females); 5 emotions: neutral, surprise, happiness, sadness and anger.
* [DIPCO](https://arxiv.org/abs/1909.13447) - Dinner Party Corpus - Participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of 10 sessions, each lasting between 15 and 45 minutes.
* [EEKK](https://metashare.ut.ee/repository/download/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/) - 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger and neutral.
* [Emo-DB](http://emodb.bilderbar.info/index-1280.html) - 800 recordings spoken by 10 actors (5 male and 5 female); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust.
* [EmoFilm](https://zenodo.org/record/1326428) - 1,115 audio instances of sentences extracted from various films.
* [EmoSynth](https://zenodo.org/record/3727593) - 144 audio files labelled by 40 listeners; emotion (no speech) defined in terms of valence and arousal.
* [Emotional Voices Database](https://github.com/numediart/EmoV-DB) - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
* [Emotional Voice dataset - Nature](https://www.nature.com/articles/s41562-019-0533-6) - 2,519 speech samples produced by 100 actors from 5 cultures. Using large-scale statistical inference methods, the authors find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across two cultures.
* [EmotionTTS](https://github.com/emotiontts/emotiontts_open_db) - Recordings and their associated transcriptions by a diverse group of speakers - 4 emotions: general, joy, anger, and sadness.
* [Emov-DB](https://mega.nz/#F!KBp32apT!gLIgyWf9iQ-yqnWFUFuUHg!mYwUnI4K) - Recordings for 4 speakers (2 male and 2 female); the emotional styles are neutral, sleepiness, anger, disgust and amused.
* [EMOVO](http://voice.fub.it/activities/corpora/emovo/index.html) - 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness.
* [eNTERFACE05](http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip) - Videos of 42 subjects from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust.
* [Free Spoken Digit Dataset](https://github.com/Jakobovski/free-spoken-digit-dataset) - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
* [Flickr Audio Caption](https://groups.csail.mit.edu/sls/downloads/flickraudio/) - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
* [GEMEP corpus](https://www.unige.ch/cisa/gemep) - 10 actors portraying 10 states; 12 emotions: amusement, anxiety, cold anger (irritation), despair, hot anger (rage), fear (panic), interest, joy (elation), pleasure (sensory), pride, relief, and sadness. Plus, 5 additional emotions: admiration, contempt, disgust, surprise, and tenderness.
* [IEMOCAP](https://sail.usc.edu/iemocap/iemocap_release.htm) - 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration and neutral.
* [ISOLET Data Set](https://data.world/uci/isolet) - This 38.7 GB dataset helps predict which letter-name was spoken — a simple classification task.
* [JL corpus](https://www.kaggle.com/tli725/jl-corpus) - 2,400 recordings of 240 sentences by 4 actors (2 male and 2 female); 5 primary emotions: angry, sad, neutral, happy, excited. 5 secondary emotions: anxious, apologetic, pensive, worried, enthusiastic.
* [Keio-ESD](http://research.nii.ac.jp/src/en/Keio-ESD.html) - A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joyful, disgusting, downgrading, funny, worried, gentle, relief, indignation, shameful, etc.
* [LEGO Corpus](https://www.ultes.eu/ressources/lego-spoken-dialogue-corpus/) - 347 dialogs with 9,083 system-user exchanges; emotions classified as garbage, non-angry, slightly angry and very angry.
* [Libriadapt](https://github.com/akhilmathurs/libriadapt) - Primarily designed to facilitate domain adaptation research for ASR models; contains three types of domain shift in the data: recording microphone, speaker accent, and acoustic environment.
* [Libri-CSS](https://github.com/chenzhuo1011/libri_css) - derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
* [LibriMix](https://github.com/JorisCos/LibriMix) - LibriMix is an open source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.
* [The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)](https://zenodo.org/record/1188976#.XrC7a5NKjOR) - 7,356 files (total size: 24.8 GB) from 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions; song contains calm, happy, sad, angry, and fearful emotions.
* [sample_voice_data](https://github.com/jim-schwoebel/sample_voice_data) - 52 audio files per class (males and females) for testing purposes.
* [SAVEE Dataset](http://kahlan.eps.surrey.ac.uk/savee/) - 4 male actors in 7 different emotions, 480 British English utterances in total.
* [SEMAINE](https://semaine-db.eu/) - 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions; 5 FeelTrace annotations: activation, valence, dominance, power, intensity.
* [SER Datasets](https://github.com/SuperKogito/SER-datasets) - A collection of datasets for the purpose of emotion recognition/detection in speech.
* [SEWA](https://db.sewaproject.eu/) - more than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
* [ShEMO](https://github.com/mansourehk/ShEMO) - 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers; 6 emotions: anger, fear, happiness, sadness, neutral and surprise.
* [SparseLibriMix](https://github.com/popcornell/SparseLibriMix) - An open source dataset for source separation in noisy environments and with variable overlap-ratio. Due to insufficient noise material this is a test-set-only version.
* [Spoken Wikipedia Corpora](https://nats.gitlab.io/swc/) - 38 GB in size; available for download both with and without the audio files.
* [Tatoeba](https://tatoeba.org/eng/downloads) - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
* [Ted-LIUM](https://www.openslr.org/51/) - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
* [TESS](https://tspace.library.utoronto.ca/handle/1807/24487) - 2,800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
* [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts/) - German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
* [TIMIT dataset](https://catalog.ldc.upenn.edu/LDC93S1) - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (have to pay).
* [URDU-Dataset](https://github.com/siddiquelatif/urdu-dataset) - 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
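Many of the emotional-speech corpora above encode their labels directly in the file names rather than in a separate metadata file. As a minimal sketch, the snippet below parses the RAVDESS naming convention, where each file name consists of 7 hyphen-separated, zero-padded fields and the third field encodes the emotion; the example file name is purely illustrative.

```python
# Minimal sketch: extracting labels from RAVDESS-style file names.
# RAVDESS names encode 7 fields:
# modality-channel-emotion-intensity-statement-repetition-actor.wav
from pathlib import Path

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess(filename: str) -> dict:
    """Return emotion, actor id, and gender for one RAVDESS file name."""
    fields = Path(filename).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"not a RAVDESS-style name: {filename}")
    actor = int(fields[6])
    return {
        "emotion": EMOTIONS[fields[2]],
        "actor": actor,
        # In RAVDESS, odd-numbered actors are male, even-numbered female.
        "gender": "male" if actor % 2 == 1 else "female",
    }

# Hypothetical file name used purely for illustration:
print(parse_ravdess("03-01-06-01-02-01-12.wav"))
```

Other corpora use similar but distinct schemes (e.g. the Free Spoken Digit Dataset uses `digit_speaker_index.wav`), so a small per-dataset parser like this is usually needed before training.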