This package helps generate training data for fine-tuning Whisper. It synthesizes `.srt` files from individual sentences, concatenating them so that the result mimics real data.
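To make the idea of synthesizing subtitles by concatenation concrete, here is a minimal sketch, not the package's actual implementation. It assumes each sentence comes with a known clip duration and that clips are played back to back without gaps; `format_timestamp` and `sentences_to_srt` are hypothetical names introduced only for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def sentences_to_srt(clips):
    """Build SRT text from (sentence, duration_in_seconds) pairs played back to back."""
    blocks = []
    start = 0.0
    for index, (sentence, duration) in enumerate(clips, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{sentence}\n"
        )
        start = end  # the next clip begins where this one ends
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


print(sentences_to_srt([("Hello there.", 1.5), ("How are you?", 2.25)]))
```

Each subtitle block carries a running index, a start and end time accumulated from the clip durations, and the sentence text.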
- Data File (.tsv):
  - Create a `.tsv` file with two required columns:
    - `path`: The relative path to the `.mp3` file.
    - `sentence`: The text corresponding to the audio file.
  - Optional: If a `client_id` is included, it can be used to increase the probability that following sentences are from the same speaker. Refer to `generate_fold` in `src/whisper_prep/generation/generate.py` for additional features.
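As an illustration of the data file layout described above, the following sketch writes such a TSV with Python's standard `csv` module. The helper name `write_sentence_tsv`, the clip paths, and the speaker IDs are made up for the example; the `client_id` column is only written when at least one row provides it.

```python
import csv
import tempfile
from pathlib import Path


def write_sentence_tsv(rows, tsv_path):
    """Write a list of dicts with 'path', 'sentence' and optionally 'client_id' to a TSV."""
    fieldnames = ["path", "sentence"]
    if any("client_id" in row for row in rows):
        fieldnames.append("client_id")  # optional speaker column
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row.get(name, "") for name in fieldnames})


# Hypothetical example rows; the paths are relative to wherever the audio lives.
rows = [
    {"path": "clips/0001.mp3", "sentence": "Guten Morgen.", "client_id": "speaker_a"},
    {"path": "clips/0002.mp3", "sentence": "Wie geht es dir?", "client_id": "speaker_a"},
]

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "data.tsv"
    write_sentence_tsv(rows, tsv_path)
    print(tsv_path.read_text(encoding="utf-8"))
```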
1a. Timestamp-based TSV (.tsv):
  - Create a `.tsv` file with four required columns:
    - `srt_path`: Path to the `.srt` file containing subtitles.
    - `language`: ISO language code for the subtitles (e.g., `de`, `en`).
    - `id`: Unique identifier for the audio/transcript pair.
    - `audio_path`: Path to the corresponding `.mp3` file.
  - This TSV can be used to process existing SRT transcripts and audio files without directory globbing.
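One way to produce a timestamp-based TSV is to pair subtitle and audio files yourself and write the four columns out. The sketch below assumes that each `.srt` file has an `.mp3` with the same file stem and that the stem is unique enough to serve as `id`; both are assumptions of this example, and `build_transcript_rows` and `write_transcripts_tsv` are hypothetical helper names.

```python
import csv
from pathlib import Path

COLUMNS = ["srt_path", "language", "id", "audio_path"]


def build_transcript_rows(srt_dir, audio_dir, language):
    """Pair each .srt file with the .mp3 of the same stem; skip subtitles without audio."""
    rows = []
    for srt_path in sorted(Path(srt_dir).glob("*.srt")):
        audio_path = Path(audio_dir) / f"{srt_path.stem}.mp3"
        if not audio_path.is_file():
            continue  # no matching audio, so leave this transcript out
        rows.append(
            {
                "srt_path": str(srt_path),
                "language": language,
                "id": srt_path.stem,  # assumption: the file stem is unique
                "audio_path": str(audio_path),
            }
        )
    return rows


def write_transcripts_tsv(rows, tsv_path):
    """Write the rows with the four required columns as a tab-separated file."""
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_transcripts_tsv(build_transcript_rows("subs", "audio", "de"), "transcripts.tsv")` would then write one row per matched pair; the directory and file names here are placeholders.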
- Configuration File (.yaml):
  - Set up a `.yaml` configuration file. An example can be found at `example.yaml`.
  - (Optional) To load data directly from a HuggingFace dataset with `audio` and `srt` columns, set the `hu_dataset` field to the dataset identifier; this will bypass TSV-based generation and process existing subtitles. For sentence-based datasets without an `srt` column, synthetic SRT files will be generated from the sentences.
  - (Optional) To process existing SRT files and audio paths without directory globbing, specify a TSV via `transcripts_tsv`. The TSV must include columns `srt_path`, `audio_path`, `language`, and `id` to map each transcript to its audio file and language.
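Before pointing `transcripts_tsv` at a file, it can help to confirm that the header contains the four required columns. The following check is a small standalone sketch and not part of the package; `missing_columns` is a hypothetical helper name.

```python
import csv

REQUIRED_COLUMNS = {"srt_path", "audio_path", "language", "id"}


def missing_columns(tsv_path):
    """Return the required columns that are absent from the TSV header, sorted by name."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = set(reader.fieldnames or [])  # an empty file has no header
    return sorted(REQUIRED_COLUMNS - header)
```

An empty list means the header is complete; any returned names are the columns still to be added.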
- Running the Generation Script:
  - Run `whisper_prep -c <path_to_your_yaml_file>`.
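If the generation step needs to be triggered from a Python script rather than a shell, the command above can be wrapped with `subprocess`. This is a minimal sketch that assumes the `whisper_prep` executable is on `PATH`; `build_command` and `run_generation` are hypothetical helper names, and `config.yaml` is a placeholder path.

```python
import subprocess
from pathlib import Path


def build_command(config_path):
    """Return the argument list for the documented `whisper_prep -c <config>` call."""
    return ["whisper_prep", "-c", str(Path(config_path))]


def run_generation(config_path):
    """Run whisper_prep and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(config_path), check=True)


# Only show the command here; call run_generation("config.yaml") to execute it.
print(build_command("config.yaml"))
```

Passing the arguments as a list avoids shell quoting problems when the configuration path contains spaces.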
- Upload a TSV as an ASR Dataset:
  - A helper script `upload_asr_dataset.py` can convert a `.tsv` file (with at least `path` and `sentence` columns) into a Hugging Face ASR dataset and push it to the Hub:

        python upload_asr_dataset.py --tsv path/to/data.tsv \
          --repo_id username/dataset_name --split train
- Upload to Huggingface.com:
Vincenzo Timmel - [email protected]
Distributed under the MIT License. See `LICENSE` for more information.