chore: Augmenting current Vietnamese speech dataset #148

hahuyhoang411 · 2024-12-12T08:49:26Z

Problem

Most current Vietnamese speech datasets are built with the same pipeline (YouTube -> VAD -> Whisper -> Normalization). The normalization step strips all punctuation from the transcriptions, so information is lost.
e.g. Bud500

or gigaspeech2
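
To make the loss concrete, here is a minimal sketch of what a typical normalization step does to a transcript. The function name `normalize_transcript` and the exact rules (lowercasing, dropping every non-word character) are assumptions for illustration, not the actual normalizer used by Bud500 or gigaspeech2.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer: lowercase and strip punctuation."""
    # NFC keeps Vietnamese letters precomposed, so diacritics survive the regex below.
    text = unicodedata.normalize("NFC", text).lower()
    # Drop everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse whitespace runs left behind by the removed characters.
    return " ".join(text.split())


# The comma, the question mark, and the capital letter are all gone afterwards.
print(normalize_transcript("Xin chào, bạn khỏe không?"))
```

After this step a question and a statement with the same words become indistinguishable, which is the information the issue wants to restore.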

Goal

Improve the current datasets by adding back natural punctuation.

Draft solution

Use Whisper large-v3 to transcribe the audio in those datasets -> whisper_transcription_format
Use Llama3.2 8B with both the ground_truth and the whisper_transcription_format to reformat the ground_truth
Note: We only use the structure of the Whisper large-v3 output, not its labels
Architecture:
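
A rough sketch of the reformatting step described above, not the architecture diagram itself. It only shows the data flow: the ground_truth words are the label, and the Whisper text is passed along purely as a punctuation hint. The prompt wording, `build_prompt`, `reformat_label`, and the `generate` callable (standing in for whatever Llama client ends up being used) are all hypothetical.

```python
from typing import Callable

# Hypothetical prompt; the real wording would need tuning against the model.
PROMPT_TEMPLATE = """You restore punctuation and capitalization in Vietnamese transcripts.
Copy the words of GROUND_TRUTH exactly; do not add, remove, or correct any word.
Use WHISPER only as a hint for where punctuation and capital letters go.

GROUND_TRUTH: {ground_truth}
WHISPER: {whisper_text}
OUTPUT:"""


def build_prompt(ground_truth: str, whisper_text: str) -> str:
    """Fill the template with one dataset label and its Whisper transcription."""
    return PROMPT_TEMPLATE.format(
        ground_truth=ground_truth.strip(),
        whisper_text=whisper_text.strip(),
    )


def reformat_label(
    ground_truth: str,
    whisper_text: str,
    generate: Callable[[str], str],
) -> str:
    """Ask the model (any prompt -> completion callable) for the punctuated label."""
    return generate(build_prompt(ground_truth, whisper_text)).strip()


# Usage with a stub in place of a real model call:
fake_llm = lambda prompt: " Xin chào, bạn khỏe không? "
print(reformat_label("xin chào bạn khỏe không", "Xin chào, bạn khoẻ không?", fake_llm))
```

Keeping the model behind a plain callable means the same function works with a local model or a hosted endpoint, and it can be tested without loading any weights.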

Tasklist

Transcribe Bud500 and gigaspeech refined vi using Whisper large-v3
Set up the distil label pipeline
QA dataset


bachvudinh · 2024-12-16T02:29:17Z

We need to discuss the pipeline further because:

The results from Whisper large-v3 are bad on Bud500 audio.
Llama tends to complete or modify the label instead of only correcting typos.
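
One possible guard against the second point, as a sketch only: compare the model output with the original label after stripping punctuation and case, and reject any sample where the word sequence changed. `words_only` and `keeps_original_words` are hypothetical helpers, and the check assumes the ground_truth itself contains no punctuation that carries meaning.

```python
import re
import unicodedata


def words_only(text: str) -> list[str]:
    """Lowercased word sequence with punctuation removed."""
    text = unicodedata.normalize("NFC", text).lower()
    # Replace punctuation with spaces so "chào,bạn" still splits into two words.
    return re.sub(r"[^\w\s]", " ", text).split()


def keeps_original_words(ground_truth: str, reformatted: str) -> bool:
    """True only if the model changed nothing but punctuation and casing."""
    return words_only(ground_truth) == words_only(reformatted)


label = "xin chào bạn"
print(keeps_original_words(label, "Xin chào, bạn."))      # punctuation only
print(keeps_original_words(label, "Xin chào các bạn."))   # a word was inserted
```

Samples that fail the check could be dropped or kept with their original unpunctuated label, so a model that rewrites text cannot silently change the ground truth.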

hahuyhoang411 added the "type: epic" (A major feature or initiative) label Dec 12, 2024

hahuyhoang411 added this to the Ichigo v0.5 milestone Dec 12, 2024

hahuyhoang411 assigned tuanlda78202 Dec 12, 2024

hahuyhoang411 added this to Jan & Cortex Dec 12, 2024

github-project-automation bot moved this to Investigating in Jan & Cortex Dec 12, 2024

hahuyhoang411 assigned bachvudinh and unassigned tuanlda78202 Dec 12, 2024
