A large, modern, and evolving dataset for automatic speech recognition.
Audio Source | Hours | Acoustic Conditions | Domain |
---|---|---|---|
Podcast | 3,498 | near-field, indoor, spontaneous, background music, various genders/ages/accents | daily topics |
YouTube | 3,845 | near & far field, indoor & outdoor, read & spontaneous, clean & loud background music/noise, various genders/ages/accents | wide coverage of topics/domains |
Audiobook | 2,655 | slow, clear, read speech | books, stories |
Total | 10,000 | | |
We organize the entire dataset into 5 subsets, targeting different use cases.
Subset | Notation | Size (Hours) | Target Usage |
---|---|---|---|
eXtra Small | XS | 10 | pipeline/recipe coding & debugging |
Small | S | 250 | quick research experiment for new ideas |
Medium | M | 1000 | serious research experiment / quick industrial experiment |
Large | L | 2500 | serious industrial experiment |
eXtra Large | XL | 10000 | industrial system building |
Larger subsets are supersets of smaller subsets: XL includes L, which includes M, which includes S, which includes XS.
- Training subsets (XS, S, M, L, XL):
  - source: randomly drawn audio from the crawled podcast & YouTube data
- DEV/TEST sets:
  - source: independent of the crawling process; the GigaSpeech authors manually collected podcasts & videos from the internet to cover wider scenarios & domains
  - the DEV + TEST sets contain 50~60 hours of raw data, labeled by paid professional human annotators
To download the dataset, follow these steps:

- Put `aliyun_ossutil.cfg` in the `SAFEBOX` folder
- Run the following command to download the dataset:

  ```bash
  utils/gigaspeech_download.sh /download/destination/dir/for/GigaSpeechDataset
  ```

The entire dataset will then be downloaded to your local directory. If your network is interrupted or the download breaks, simply rerun the command above; it will resume the previous download. You can also rerun the command to update your local GigaSpeech copy to the newest GigaSpeech release.
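Putting both steps together, a minimal sketch (assuming you have already obtained the `aliyun_ossutil.cfg` credential file and are in the repository root; the source path below is a placeholder):

```bash
# Place the Aliyun credential file into the repository's SAFEBOX folder.
# /path/to/aliyun_ossutil.cfg is a placeholder for wherever you saved the file.
cp /path/to/aliyun_ossutil.cfg SAFEBOX/

# Download the dataset; rerunning the same command resumes an interrupted
# download or refreshes an existing local copy to the newest release.
utils/gigaspeech_download.sh /download/destination/dir/for/GigaSpeechDataset
```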
We maintain data preparation scripts for different speech recognition toolkits in this repository, so that when we update the dataset (note, this is an evolving dataset), we don't have to update the scripts in the downstream toolkits. These scripts live in the `toolkits/` folder, e.g., `toolkits/kaldi` for the Kaldi speech recognition toolkit.
To use the data preparation scripts, do the following in your toolkit (here we use Kaldi as an example):

```bash
git clone https://github.com/SpeechColab/GigaSpeech.git
cd GigaSpeech
utils/gigaspeech_download.sh /disk1/audio_data/gigaspeech
toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XL /disk1/audio_data/gigaspeech ../data
cd ..
```
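For quick pipeline/recipe debugging you may want to start with a smaller training subset instead of XL; for example (same paths as above, only the `--train-subset` value changes):

```bash
# Prepare only the 10-hour XS subset, e.g., for debugging the recipe
toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XS /disk1/audio_data/gigaspeech ../data
```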
- By design, the labels contain punctuation. Specifically, 4 punctuation tags may appear in an utterance's `text_tn` field: `<COMMA>`, `<PERIOD>`, `<QUESTIONMARK>`, `<EXCLAMATIONPOINT>`. If you do not want punctuation in your training targets, see the sketch after this list for one way to strip these tags.
- Garbage utterance tags in the DEV/TEST sets: our DEV/TEST sets are labeled by human annotators, who are instructed to label every single piece of the entire audio. If part of the audio is not human speech, they label it with one of a set of garbage utterance tags. The complete list of garbage tags is: `<SIL>`, `<MUSIC>`, `<NOISE>`, `<OTHER>`. Utterances with these garbage tags are not considered valid speech, so our recommendation is to discard them in downstream training/testing (see the filtering sketch after this list). We keep these tags to preserve the integrity of the human labels, so there is no "gap" inside the DEV/TEST labels.
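If your downstream system does not model punctuation, one option is to remove the punctuation tags mentioned above before training. A minimal sketch, assuming the transcripts are in a plain-text file named `text` (the file name and location are assumptions; adapt them to whatever your toolkit's data preparation produces):

```bash
# Strip the four punctuation tags and squeeze the leftover double spaces.
# "text" is a placeholder for your toolkit's transcript file.
sed -e 's/<COMMA>//g' -e 's/<PERIOD>//g' \
    -e 's/<QUESTIONMARK>//g' -e 's/<EXCLAMATIONPOINT>//g' \
    -e 's/  */ /g' text > text.nopunct
```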
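For the garbage tags, a minimal sketch of the recommended filtering, assuming a Kaldi-style `text` file where each line is `<utterance-id> <transcript>` (the file path below is an assumption based on the Kaldi example above):

```bash
# Drop utterances whose transcript consists only of garbage tags
# (<SIL>, <MUSIC>, <NOISE>, <OTHER>), i.e., segments with no real speech.
awk '{
  keep = 0
  for (i = 2; i <= NF; i++)
    if ($i != "<SIL>" && $i != "<MUSIC>" && $i != "<NOISE>" && $i != "<OTHER>")
      keep = 1
  if (keep) print
}' data/dev/text > data/dev/text.filtered
```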
To add data preparation support for a new toolkit, please follow `toolkits/kaldi/gigaspeech_data_prep.sh` and add similar scripts for your own toolkit. For example, for ESPnet2, you would add `toolkits/espnet2/gigaspeech_data_prep.sh` to prepare the dataset, and maintain all other related scripts under `toolkits/espnet2`.
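As a starting point, here is a hypothetical skeleton for such a script, mirroring the command-line interface of the Kaldi example above (`--train-subset`, GigaSpeech root, output data directory); the actual conversion logic is toolkit-specific and left as a placeholder:

```bash
#!/usr/bin/env bash
# Hypothetical skeleton for toolkits/espnet2/gigaspeech_data_prep.sh
# Usage: gigaspeech_data_prep.sh [--train-subset XS|S|M|L|XL] <gigaspeech-root> <output-data-dir>
set -e

train_subset=XL
if [ "$1" = "--train-subset" ]; then
  train_subset=$2
  shift 2
fi

gigaspeech_root=$1
data_dir=$2

[ -d "$gigaspeech_root" ] || { echo "GigaSpeech root '$gigaspeech_root' not found" >&2; exit 1; }
mkdir -p "$data_dir"

# TODO: convert the GigaSpeech metadata for subset "$train_subset" (plus DEV/TEST)
# into the manifest/file layout that your toolkit (here ESPnet2) expects under "$data_dir".
echo "Preparing GigaSpeech subset $train_subset from $gigaspeech_root into $data_dir"
```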