
Dataset Structure

Karn Watcharasupat edited this page Jul 8, 2024 · 3 revisions

Folder Structure

.
└── multi/
    ├── audio/
    │   ├── train/
    │   │   └── {clip-id}/
    │   │       ├── speech.flac
    │   │       ├── music.flac
    │   │       ├── sfx.flac
    │   │       ├── sfx_fg.flac
    │   │       ├── sfx_bg.flac
    │   │       └── mixture.flac
    │   ├── val/
    │   │   └── {clip-id}/
    │   │       └── ...
    │   └── test/
    │       └── {clip-id}/
    │           └── ...
    ├── manifest/
    │   ├── train/
    │   │   └── {clip-id}/
    │   │       ├── speech.csv
    │   │       ├── music.csv
    │   │       ├── sfx_fg.csv
    │   │       └── sfx_bg.csv
    │   ├── val/
    │   │   └── {clip-id}/
    │   │       └── ...
    │   └── test/
    │       └── {clip-id}/
    │           └── ...
    └── audio_metadata/
        ├── train/
        │   └── {clip-id}.csv
        ├── val/
        │   └── {clip-id}.csv
        └── test/
            └── {clip-id}.csv
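
As a rough illustration of the layout above, the following sketch builds the expected paths for one clip. The helper name `clip_paths`, the root folder location, and the clip id `example-clip` are hypothetical and only serve the example; the stem names and folder names are taken from the tree.

```python
from pathlib import Path

# Stem names as they appear in the tree above.
AUDIO_STEMS = ("speech", "music", "sfx", "sfx_fg", "sfx_bg", "mixture")
MANIFEST_STEMS = ("speech", "music", "sfx_fg", "sfx_bg")


def clip_paths(root, split, clip_id):
    """Return the expected file paths for one clip (hypothetical helper)."""
    root = Path(root)
    return {
        "audio": {
            stem: root / "audio" / split / clip_id / f"{stem}.flac"
            for stem in AUDIO_STEMS
        },
        "manifest": {
            stem: root / "manifest" / split / clip_id / f"{stem}.csv"
            for stem in MANIFEST_STEMS
        },
        "audio_metadata": root / "audio_metadata" / split / f"{clip_id}.csv",
    }


# "example-clip" is a placeholder, not a real clip id from the dataset.
paths = clip_paths("multi", "train", "example-clip")
print(paths["audio"]["mixture"].as_posix())
print(paths["audio_metadata"].as_posix())
```

Note that, per the tree, `sfx` and `mixture` have audio files but no manifest of their own.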

Audio Files

The audio files are mono and 60 seconds in duration. All files are sampled at 48 kHz with a bit depth of 24 bits. The audio files are provided in lossless FLAC format to reduce the archive size. You can use

ffmpeg -i input.flac -c:a pcm_s24le output.wav

to convert an audio file back to 24-bit PCM WAV.
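
To convert every stem of a clip in one go, the same ffmpeg command can be wrapped in a short script. This is a minimal sketch, assuming `ffmpeg` is on the PATH; the function names `flac_to_wav_command` and `convert_clip` and the example path are hypothetical. By default it only builds the commands (a dry run) so they can be inspected before anything is executed.

```python
import subprocess
from pathlib import Path


def flac_to_wav_command(flac_path):
    """Build the ffmpeg command shown above for a single FLAC file."""
    flac_path = Path(flac_path)
    wav_path = flac_path.with_suffix(".wav")  # same folder, .wav extension
    return ["ffmpeg", "-i", str(flac_path), "-c:a", "pcm_s24le", str(wav_path)]


def convert_clip(clip_dir, dry_run=True):
    """Build (and optionally run) one command per FLAC file in a clip folder."""
    commands = [flac_to_wav_command(p) for p in sorted(Path(clip_dir).glob("*.flac"))]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if ffmpeg exits with an error.
            subprocess.run(command, check=True)
    return commands


# Placeholder path; prints the command without running ffmpeg.
print(" ".join(flac_to_wav_command("multi/audio/train/example-clip/speech.flac")))
```

Pass `dry_run=False` to `convert_clip` to actually run the conversions.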

Manifests

Each manifest is a CSV file in which every row represents one sound event. Each CSV contains the following columns:

  • file: path to the raw audio event
  • start_sample, start_seconds: start time of the event relative to the track, in samples and in seconds
  • length_sample, length_seconds: duration of the event relative to the track, in samples and in seconds
  • end_seconds: end time of the event relative to the track
  • segment_start_sample: start time relative to the raw file, in samples
  • lufs: nominal event loudness in LKFS
  • submix_lufs: actual track loudness in LKFS (same across all rows)
  • submix_lufs_target: nominal track loudness in LKFS (same across all rows)

Example

file,start_sample,length_sample,segment_start_sample,start_seconds,length_seconds,end_seconds,lufs,submix_lufs,submix_lufs_target
speech-kazakh-slr140/audio/full/48k/test/878_188.wav,0,374976,0,0.0,7.812,7.812,-20.631814741589345,-20.3484730207178,-20.3484730207178
speech-yoruba-slr86-google/audio/full/48k/test/yom_02484_01663235147.wav,364685,184320,0,7.597604166666667,3.84,11.437604166666667,-28.772182172953954,-20.3484730207178,-20.3484730207178
speech-indic-slr-google/audio/full/48k/test/ban_02194_00413042161.wav,678539,192512,0,14.136229166666666,4.010666666666666,18.146895833333332,-26.709850320179136,-20.3484730207178,-20.3484730207178
speech-english-slr12-librispeech-hq/audio/clean-100h/48k/test/7021/79730/7021-79730-0007.wav,1005221,598080,0,20.942104166666667,12.46,33.40210416666667,-31.115058841609176,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB08170448.wav,1732212,145445,0,36.08775,3.030104166666667,39.11785416666667,-24.0928698524041,-20.3484730207178,-20.3484730207178
speech-english-slr83-google-british-isles/audio/full/48k/test/nom_07508_01121578934.wav,1871093,323584,0,38.98110416666667,6.741333333333333,45.7224375,-27.721156680891227,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB13400390.wav,2160552,242309,0,45.0115,5.048104166666667,50.05960416666667,-13.796885474819632,-20.3484730207178,-20.3484730207178
speech-indic-slr-google/audio/full/48k/test/mrt_04310_01923290054.wav,2383229,417792,0,49.65060416666667,8.704,58.35460416666667,-21.24941089227363,-20.3484730207178,-20.3484730207178
speech-chinese-slr93-aishell3/audio/full/48k/test/SSB07360485.wav,2795026,73561,0,58.229708333333335,1.5325208333333333,59.762229166666664,-30.01494917037914,-20.3484730207178,-20.3484730207178
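
A manifest can be read with Python's standard `csv` module. The sketch below parses the first two rows of the example and checks how the sample and second columns relate. The relationships it checks (seconds = samples / 48000, and end = start + length) are inferred from the example rows and the 48 kHz rate stated under Audio Files; they are assumptions, not documented guarantees. The names `load_manifest` and `is_consistent` are hypothetical.

```python
import csv
import io
import math

SAMPLE_RATE = 48_000  # from the Audio Files section

# First two rows of the example manifest above.
EXAMPLE = """\
file,start_sample,length_sample,segment_start_sample,start_seconds,length_seconds,end_seconds,lufs,submix_lufs,submix_lufs_target
speech-kazakh-slr140/audio/full/48k/test/878_188.wav,0,374976,0,0.0,7.812,7.812,-20.631814741589345,-20.3484730207178,-20.3484730207178
speech-yoruba-slr86-google/audio/full/48k/test/yom_02484_01663235147.wav,364685,184320,0,7.597604166666667,3.84,11.437604166666667,-28.772182172953954,-20.3484730207178,-20.3484730207178
"""

INT_COLUMNS = {"start_sample", "length_sample", "segment_start_sample"}


def load_manifest(text_stream):
    """Parse manifest rows, converting sample counts to int and the rest to float."""
    events = []
    for row in csv.DictReader(text_stream):
        event = {}
        for key, value in row.items():
            if key == "file":
                event[key] = value
            elif key in INT_COLUMNS:
                event[key] = int(value)
            else:
                event[key] = float(value)
        events.append(event)
    return events


def is_consistent(event, sample_rate=SAMPLE_RATE):
    """Check the assumed relationships between the sample and second columns."""
    return (
        math.isclose(event["start_sample"] / sample_rate, event["start_seconds"], abs_tol=1e-9)
        and math.isclose(event["length_sample"] / sample_rate, event["length_seconds"], abs_tol=1e-9)
        and math.isclose(
            event["start_seconds"] + event["length_seconds"], event["end_seconds"], abs_tol=1e-9
        )
    )


events = load_manifest(io.StringIO(EXAMPLE))
print(all(is_consistent(event) for event in events))
```

For a real manifest, replace the `io.StringIO` stream with a file opened via `open(path, newline="")`.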

Audio Metadata

The audio metadata file lists the loudness and peak information for each stem and for the mixture.

Example

,loudness_integrated,true_peak,naive_peak
speech,-25.499287264933606,-5.192297023017089,-5.186877826519533
music,-36.3792485121433,-18.407167845133067,-18.407241622817892
sfx_fg,-30.47502611345025,-1.9941356820023546,-1.9990573851479834
sfx_bg,-44.26378922558933,-14.588409210045402,-14.586635164569977
mixture,-25.359276162635503,-1.1463566178057358,-1.230612943333371
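
The metadata file has an empty first header cell, with the stem name in the first column of each row. A minimal sketch for reading it into a nested dictionary with the standard `csv` module follows; the function name `load_audio_metadata` is hypothetical, and the embedded text is the example above.

```python
import csv
import io

# The example audio metadata file shown above.
EXAMPLE = """\
,loudness_integrated,true_peak,naive_peak
speech,-25.499287264933606,-5.192297023017089,-5.186877826519533
music,-36.3792485121433,-18.407167845133067,-18.407241622817892
sfx_fg,-30.47502611345025,-1.9941356820023546,-1.9990573851479834
sfx_bg,-44.26378922558933,-14.588409210045402,-14.586635164569977
mixture,-25.359276162635503,-1.1463566178057358,-1.230612943333371
"""


def load_audio_metadata(text_stream):
    """Return {stem: {column: value}} from an audio metadata CSV."""
    reader = csv.reader(text_stream)
    header = next(reader)
    columns = header[1:]  # the first header cell is empty; that column holds the stem name
    return {
        row[0]: dict(zip(columns, (float(value) for value in row[1:])))
        for row in reader
        if row  # skip blank lines
    }


metadata = load_audio_metadata(io.StringIO(EXAMPLE))
for stem, values in metadata.items():
    print(stem, values["loudness_integrated"], values["true_peak"])
```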