Problem
Most current Vietnamese datasets are built with the same pipeline (YouTube -> VAD -> Whisper -> Normalization). The normalization step strips all punctuation from the transcriptions, which loses information.
e.g. Bud500 or GigaSpeech 2
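To make the information loss concrete, here is a minimal sketch of a normalization step. The `normalize` function is hypothetical and written only for illustration; the actual normalizers used by these datasets may differ.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase the text and drop all punctuation."""
    # Compose characters first so Vietnamese diacritics are single code points
    # and are not mistaken for punctuation by the regex below.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace every character that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the whitespace left behind by the removed punctuation.
    return re.sub(r"\s+", " ", text).strip()


question = "Xin chào, bạn khỏe không?"
print(normalize(question))
```

After normalization the comma and the question mark are gone, so the transcript no longer records the pause or that the sentence is a question.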
Goal
Improve the current datasets by restoring natural punctuation.
Draft solution
Use Whisper large-v3 to transcribe the audio in those datasets -> `whisper_transcription_format`
Use Llama3.2 8B with the `ground_truth` and the `whisper_transcription_format` to reformat the `ground_truth`
Note: We only use the structure of the Whisper large output (its punctuation), not its labels (the transcribed words).
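The reformatting step and the note above can be sketched as follows. This is only an illustration under assumptions: `build_prompt`, `words_only`, and `keeps_labels` are hypothetical helpers, the prompt wording is made up, restoring capitalization is an assumption beyond the stated goal, and the call to the LLM itself is left out.

```python
import re
import unicodedata


def words_only(text: str) -> list[str]:
    """Lowercased word sequence with punctuation removed, used for comparison."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"[^\w\s]", " ", text).split()


def build_prompt(ground_truth: str, whisper_transcription_format: str) -> str:
    """Hypothetical prompt asking the LLM to copy only the structure of the Whisper output."""
    return (
        "Add punctuation and capitalization to GROUND_TRUTH, using WHISPER only "
        "as a guide for where punctuation goes. Do not add, remove, or change "
        "any word of GROUND_TRUTH.\n"
        f"GROUND_TRUTH: {ground_truth}\n"
        f"WHISPER: {whisper_transcription_format}\n"
        "OUTPUT:"
    )


def keeps_labels(ground_truth: str, reformatted: str) -> bool:
    """True if the reformatted text has exactly the same words as the ground truth."""
    return words_only(ground_truth) == words_only(reformatted)


ground_truth = "xin chào bạn khỏe không"
whisper_transcription_format = "Xin chào, bạn khỏe hông?"  # Whisper misheard one word
prompt = build_prompt(ground_truth, whisper_transcription_format)

# A good LLM output keeps the ground-truth words and borrows only the punctuation.
print(keeps_labels(ground_truth, "Xin chào, bạn khỏe không?"))
# Copying the Whisper text verbatim would change a label and is rejected.
print(keeps_labels(ground_truth, whisper_transcription_format))
```

A check like `keeps_labels` could be run on every LLM output so that samples where the model altered the words fall back to the original `ground_truth`.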
Architecture:
Tasklist