Latest commit

69f5808 · Aug 10, 2019

The Abuse Project Audio Dataset (TAPAD)

World's largest profanity audio dataset

PICTURE logo
Dataset consists of 26,365 audio files
Click here for documentation

See The Abuse Project

TAPAD (∿) is an open dataset, meaning it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.

Current Status & ID3

| Category | Value |
|---|---|
| Total files | 26,365 |
| Dataset updated | July 30, 2019 |
| Language classes | 75 |
| File Type | MP3 |
| Mime Type | audio/mpeg |
| Mpeg Audio Version | 2 |
| Audio Layer | 3 |
| Audio Bitrate | 32 kbps |
| Sample Rate | 24000 Hz |
| Channel Mode | Single Channel |
| Ms Stereo | Off |
| Intensity Stereo | Off |
| Codec Type | audio |
| Codec Time Base | 1/24000 |
| Codec Tag | 0x0000 |
| Sample Fmt | fltp |
| Sample Rate | 24000 Hz |
| Channels | 1 |
| Channel Layout | mono |
| Bits Per Sample | 0 |
| R Frame Rate | 0/0 |
| Avg Frame Rate | 0/0 |
| Time Base | 1/14112000 |
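
Several of the values above (MPEG version 2, Layer 3, 32 kbps, 24000 Hz, single channel) are encoded in the first four bytes of every MPEG audio frame, so they can be spot-checked without external tools. The following is a minimal sketch, not part of the TAPAD scripts: the function name `parse_mpeg2_l3_header` is hypothetical, the lookup tables cover only MPEG-2 Layer III, and the input is assumed to start exactly at a frame header (a real file may begin with an ID3v2 tag that has to be skipped first).

```python
# Bitrates (kbps) for MPEG-2 Layer III, indexed by the 4-bit bitrate field.
# Index 0 means "free format" and 15 is invalid, so both map to None.
V2_L3_BITRATES = [None, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, None]
# Sample rates (Hz) for MPEG-2, indexed by the 2-bit sample-rate field (3 is reserved).
V2_SAMPLE_RATES = [22050, 24000, 16000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_mpeg2_l3_header(header: bytes) -> dict:
    """Decode the 4-byte frame header of an MPEG-2 Layer III frame (hypothetical helper)."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    word = int.from_bytes(header[:4], "big")
    # The top 11 bits of a frame header are all ones (frame sync).
    if word >> 21 != 0x7FF:
        raise ValueError("no frame sync found")
    version_bits = (word >> 19) & 0b11  # 0b10 means MPEG version 2
    layer_bits = (word >> 17) & 0b11    # 0b01 means Layer III
    if version_bits != 0b10 or layer_bits != 0b01:
        raise ValueError("not an MPEG-2 Layer III frame")
    return {
        "bitrate_kbps": V2_L3_BITRATES[(word >> 12) & 0xF],
        "sample_rate_hz": V2_SAMPLE_RATES[(word >> 10) & 0b11],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }


# Hand-built example header (not read from the dataset) that encodes
# MPEG-2, Layer III, 32 kbps, 24000 Hz, mono.
example = parse_mpeg2_l3_header(bytes([0xFF, 0xF3, 0x44, 0xC0]))
```

For the hand-built header, `example` holds a bitrate of 32 kbps, a sample rate of 24000 Hz, and the channel mode `"mono"`, matching the table.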

Language codes must be two letters, normally the language's two-letter ISO 639-1 code; see ISO_639-1.
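
A naming rule like this can be checked mechanically before contributing new data. The sketch below is an assumption about how such a check might look, not an existing TAPAD utility: `is_valid_language_dir` is a hypothetical name, and the pattern also accepts an optional two-letter region suffix because directory names such as `en-au` and `zh-cn` appear in the dataset structure. It checks only the shape of the name, not whether the code is a registered ISO 639-1 code.

```python
import re

# Assumed pattern: two lowercase letters, optionally followed by "-" and a
# two-letter region suffix (as in "en-au" or "zh-cn").
LANGUAGE_DIR = re.compile(r"[a-z]{2}(-[a-z]{2})?")


def is_valid_language_dir(name: str) -> bool:
    """Return True if the directory name has the expected language-code shape."""
    return LANGUAGE_DIR.fullmatch(name) is not None
```

With this pattern, `is_valid_language_dir("en")` and `is_valid_language_dir("pt-br")` return `True`, while `"eng"` and `"en_us"` are rejected.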

Scripts & Utilities

| Filename | Location | Description | Type |
|---|---|---|---|
| record.py | acquire\custom | Records audio in WAV format (default: 3 sec) | Helper script |
| wingen.py | acquire\generate | TTS conversion using SAPI.SpVoice | Helper script |
| gTTSgen.py | acquire\generate | TTS conversion using gTTS & abuse 0.1.1 | Helper script |
| gspectogram.py | utils | Generates spectrogram of a WAV file | Utility tool |

Structure

.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│   ├───1 (340 wav files)
│   └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw

Most of these audio classes have 347 MP3 files of ~5.783 minutes each. MP3 was long encumbered by patents, but according to Wikipedia, "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired".
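
Durations like the one quoted above can be roughly cross-checked from file sizes, because the status table lists a fixed bitrate of 32 kbps. The helper below is a hypothetical sketch: it assumes constant-bitrate encoding and ignores the bytes taken up by ID3 tags, so it slightly overestimates the duration of tagged files.

```python
def estimate_duration_seconds(size_bytes: int, bitrate_kbps: int = 32) -> float:
    """Estimate playback time of a constant-bitrate MP3 from its size on disk."""
    # bits in the file divided by bits consumed per second of audio
    return size_bytes * 8 / (bitrate_kbps * 1000)
```

At 32 kbps one second of audio takes 4,000 bytes, so a 24,000-byte file comes out at about 6 seconds.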

Checking files

find audio/ -type f | wc -l
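
The same count can be broken down per language class, which is useful for spotting classes with missing files. This is a minimal sketch under the assumption that the dataset root contains one directory per language, as in the structure above; `count_files` is a hypothetical helper, not one of the repository's scripts.

```python
from collections import Counter
from pathlib import Path


def count_files(root: str) -> Counter:
    """Count regular files under root, grouped by top-level directory."""
    root_path = Path(root)
    counts = Counter()
    for path in root_path.rglob("*"):
        if path.is_file():
            # The first path component below the root is the language class.
            # A file placed directly in the root is counted under its own name.
            counts[path.relative_to(root_path).parts[0]] += 1
    return counts
```

For a tree without symbolic links, `sum(count_files("audio").values())` should agree with the output of the `find` command, and `count_files("audio")["en"]` gives the number of files in the `en` class, including its subdirectories.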

Made with TAPAD

Did you use or see TAPAD in a paper, project, or app? Add it here!

Maintainers

The dataset is maintained by:

LICENSE

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.