Dataset Structure
Karn Watcharasupat edited this page Jul 8, 2024 · 3 revisions
.
└── multi/
    ├── audio/
    │   ├── train/
    │   │   └── {clip-id}/
    │   │       ├── speech.flac
    │   │       ├── music.flac
    │   │       ├── sfx.flac
    │   │       ├── sfx_fg.flac
    │   │       ├── sfx_bg.flac
    │   │       └── mixture.flac
    │   ├── val/
    │   │   └── {clip-id}/
    │   │       └── ...
    │   └── test/
    │       └── {clip-id}/
    │           └── ...
    ├── manifest/
    │   ├── train/
    │   │   └── {clip-id}/
    │   │       ├── speech.csv
    │   │       ├── music.csv
    │   │       ├── sfx_fg.csv
    │   │       └── sfx_bg.csv
    │   ├── val/
    │   │   └── {clip-id}/
    │   │       └── ...
    │   └── test/
    │       └── {clip-id}/
    │           └── ...
    └── audio_metadata/
        ├── train/
        │   └── {clip-id}.csv
        ├── val/
        │   └── {clip-id}.csv
        └── test/
            └── {clip-id}.csv
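The layout above can be turned into a small path helper. The following is a minimal sketch, not part of the dataset tooling: the function name `clip_paths` and the clip ID `"0001"` are hypothetical, and it assumes that the `val` and `test` clip folders (elided as `...` in the tree) contain the same file names as `train`.

```python
from pathlib import Path

# Stem names as they appear under audio/train/{clip-id}/ in the tree above.
AUDIO_STEMS = ("speech", "music", "sfx", "sfx_fg", "sfx_bg", "mixture")
# Manifests are listed only for these four stems.
MANIFEST_STEMS = ("speech", "music", "sfx_fg", "sfx_bg")


def clip_paths(root, split, clip_id):
    """Build the expected file paths for one clip (hypothetical helper).

    Assumption: val/ and test/ mirror the train/ layout.
    """
    root = Path(root)
    audio_dir = root / "audio" / split / clip_id
    manifest_dir = root / "manifest" / split / clip_id
    return {
        "audio": {s: audio_dir / f"{s}.flac" for s in AUDIO_STEMS},
        "manifest": {s: manifest_dir / f"{s}.csv" for s in MANIFEST_STEMS},
        "audio_metadata": root / "audio_metadata" / split / f"{clip_id}.csv",
    }


if __name__ == "__main__":
    paths = clip_paths("multi", "train", "0001")  # "0001" is a made-up clip ID
    print(paths["audio"]["mixture"])
    print(paths["audio_metadata"])
```

Returning plain `Path` objects keeps the helper independent of any particular audio or CSV library.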
The audio files are mono, 60 seconds long, and sampled at 48 kHz with a bit depth of 24 bits. They are provided in lossless FLAC format to reduce the archive size. You can use
ffmpeg -i input.flac -c:a pcm_s24le output.wav
to convert an audio file back to WAV.
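To convert many files at once, the same ffmpeg invocation can be wrapped in a short script. This is a sketch under stated assumptions: `flac_to_wav_command` and `convert_tree` are hypothetical helper names, `ffmpeg` is assumed to be on the `PATH`, and each WAV file is written next to its FLAC source.

```python
import subprocess
from pathlib import Path


def flac_to_wav_command(src, dst=None):
    """Return the ffmpeg argument list for one file (hypothetical helper).

    Mirrors the command above: ffmpeg -i input.flac -c:a pcm_s24le output.wav
    """
    src = Path(src)
    # Assumption: by default, write the WAV next to the FLAC file.
    dst = Path(dst) if dst is not None else src.with_suffix(".wav")
    return ["ffmpeg", "-i", str(src), "-c:a", "pcm_s24le", str(dst)]


def convert_tree(audio_root):
    """Convert every .flac file found under audio_root, one ffmpeg call each."""
    for src in sorted(Path(audio_root).rglob("*.flac")):
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(flac_to_wav_command(src), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_tree("multi/audio") to run it.
    print(" ".join(flac_to_wav_command("input.flac", "output.wav")))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect what would be executed before converting a whole split.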
The manifest files are CSV files in which each row represents a sound event. Each CSV contains the following columns:
- file: path to the raw audio event
- start_sample, start_seconds: start time relative to the track
- length_sample, length_seconds: duration of the event in the track
- end_seconds: end time relative to the track
- segment_start_sample: start time relative to the raw file
- lufs: nominal event loudness in LKFS
- submix_lufs: actual track loudness in LKFS (same across all rows)
- submix_lufs_target: nominal track loudness in LKFS (same across all rows)
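The sample-based and second-based columns appear to be related through the 48 kHz sample rate stated earlier. The sketch below illustrates that relationship. It is an inference from the example manifest on this page rather than a documented guarantee, and the function names are hypothetical.

```python
SAMPLE_RATE = 48_000  # Hz, as stated for the audio files on this page


def samples_to_seconds(num_samples):
    """Convert a sample count to seconds at the assumed 48 kHz rate."""
    return num_samples / SAMPLE_RATE


def event_end_seconds(start_sample, length_sample):
    """End time of an event relative to the track, in seconds.

    Assumption: end_seconds = (start_sample + length_sample) / 48000,
    which agrees with the example manifest rows on this page.
    """
    return (start_sample + length_sample) / SAMPLE_RATE


if __name__ == "__main__":
    # Values taken from the second row of the example manifest.
    print(samples_to_seconds(364685))         # compare with start_seconds
    print(samples_to_seconds(184320))         # compare with length_seconds
    print(event_end_seconds(364685, 184320))  # compare with end_seconds
```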
Example
file,start_sample,length_sample,segment_start_sample,start_seconds,length_seconds,end_seconds,lufs,submix_lufs,submix_lufs_target
speech-kazakh-slr140/audio/full/48k/test/878_188.wav,0,374976,0,0.0,7.812,7.812,-20.631814741589345,-20.3484730207178,-20.3484730207178
speech-yoruba-slr86-google/audio/full/48k/test/yom_02484_01663235147.wav,364685,184320,0,7.597604166666667,3.84,11.437604166666667,-28.772182172953954,-20.3484730207178,-20.3484730207178
speech-indic-slr-google/audio/full/48k/test/ban_02194_00413042161.wav,678539,192512,0,14.136229166666666,4.010666666666666,18.146895833333332,-26.709850320179136,-20.3484730207178,-20.3484730207178
speech-english-slr12-librispeech-hq/audio/clean-100h/48k/test/7021/79730/7021-79730-0007.wav,1005221,598080,0,20.942104166666667,12.46,33.40210416666667,-31.115058841609176,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB08170448.wav,1732212,145445,0,36.08775,3.030104166666667,39.11785416666667,-24.0928698524041,-20.3484730207178,-20.3484730207178
speech-english-slr83-google-british-isles/audio/full/48k/test/nom_07508_01121578934.wav,1871093,323584,0,38.98110416666667,6.741333333333333,45.7224375,-27.721156680891227,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB13400390.wav,2160552,242309,0,45.0115,5.048104166666667,50.05960416666667,-13.796885474819632,-20.3484730207178,-20.3484730207178
speech-indic-slr-google/audio/full/48k/test/mrt_04310_01923290054.wav,2383229,417792,0,49.65060416666667,8.704,58.35460416666667,-21.24941089227363,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB07360485.wav,2795026,73561,0,58.229708333333335,1.5325208333333333,59.762229166666664,-30.01494917037914,-20.3484730207178,-20.3484730207178
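A manifest such as the one above can be read with Python's standard `csv` module. The following sketch parses the rows into typed dictionaries and flags consecutive events that overlap in time. `read_manifest` and `overlapping_pairs` are hypothetical helpers, and `SAMPLE` contains only the header and the first two rows of the example.

```python
import csv
import io

# Header and first two rows copied from the example manifest above.
SAMPLE = """\
file,start_sample,length_sample,segment_start_sample,start_seconds,length_seconds,end_seconds,lufs,submix_lufs,submix_lufs_target
speech-kazakh-slr140/audio/full/48k/test/878_188.wav,0,374976,0,0.0,7.812,7.812,-20.631814741589345,-20.3484730207178,-20.3484730207178
speech-yoruba-slr86-google/audio/full/48k/test/yom_02484_01663235147.wav,364685,184320,0,7.597604166666667,3.84,11.437604166666667,-28.772182172953954,-20.3484730207178,-20.3484730207178
"""

INT_COLUMNS = {"start_sample", "length_sample", "segment_start_sample"}


def read_manifest(text):
    """Parse manifest CSV text into a list of dicts with typed values."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        event = {}
        for key, value in row.items():
            if key == "file":
                event[key] = value          # path stays a string
            elif key in INT_COLUMNS:
                event[key] = int(value)     # sample counts
            else:
                event[key] = float(value)   # seconds and loudness values
        events.append(event)
    return events


def overlapping_pairs(events):
    """Return (earlier_file, later_file) for consecutive events that overlap.

    Only neighbours in start order are compared, so this is a quick check,
    not an exhaustive overlap search.
    """
    ordered = sorted(events, key=lambda e: e["start_sample"])
    pairs = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["start_sample"] < prev["start_sample"] + prev["length_sample"]:
            pairs.append((prev["file"], cur["file"]))
    return pairs


if __name__ == "__main__":
    events = read_manifest(SAMPLE)
    print(len(events), "events")
    print(overlapping_pairs(events))
```

To read a manifest from disk, pass the file contents instead, for example `read_manifest(Path(manifest_path).read_text())` with `pathlib.Path`.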
The audio metadata file lists the integrated loudness, true peak, and naive peak of each stem and of the mixture.
Example
,loudness_integrated,true_peak,naive_peak
speech,-25.499287264933606,-5.192297023017089,-5.186877826519533
music,-36.3792485121433,-18.407167845133067,-18.407241622817892
sfx_fg,-30.47502611345025,-1.9941356820023546,-1.9990573851479834
sfx_bg,-44.26378922558933,-14.588409210045402,-14.586635164569977
mixture,-25.359276162635503,-1.1463566178057358,-1.230612943333371
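Because the first header cell of the audio metadata file is empty, it is convenient to treat the first column as a row label. The sketch below shows one way to load the file into a nested dictionary. `read_audio_metadata` is a hypothetical helper, and `SAMPLE` reuses the header and two rows from the example above. The units of the values are not stated on this page, so the code does not assume any.

```python
import csv
import io

# Header plus two rows copied from the example above.
SAMPLE = """\
,loudness_integrated,true_peak,naive_peak
speech,-25.499287264933606,-5.192297023017089,-5.186877826519533
mixture,-25.359276162635503,-1.1463566178057358,-1.230612943333371
"""


def read_audio_metadata(text):
    """Parse audio metadata CSV text into {stem: {column: value}}."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)      # first cell is empty; the rest are column names
    stats = {}
    for row in reader:
        if not row:            # skip blank lines, if any
            continue
        stem, values = row[0], row[1:]
        stats[stem] = {name: float(v) for name, v in zip(header[1:], values)}
    return stats


if __name__ == "__main__":
    stats = read_audio_metadata(SAMPLE)
    for stem, columns in stats.items():
        print(stem, columns["loudness_integrated"], columns["true_peak"])
```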