chore: Augmenting current Vietnamese speech dataset #148

Open
3 tasks
hahuyhoang411 opened this issue Dec 12, 2024 · 1 comment
Assignees
Labels
type: epic A major feature or initiative
Milestone


hahuyhoang411 commented Dec 12, 2024

Problem

Most current Vietnamese speech datasets are built with the same pipeline (YouTube -> VAD -> Whisper -> Normalization). This pipeline strips all punctuation from the transcriptions, so information is lost.
e.g. Bud500

Screenshot 2024-12-12 at 09 32 31

or GigaSpeech 2
Screenshot 2024-12-12 at 09 37 30
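
To make the information loss concrete, here is a minimal sketch of what a normalization step of this kind typically does. It is an assumption for illustration, not the actual code used to build Bud500 or GigaSpeech 2; the function name `normalize_transcript` is hypothetical.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase the text and drop punctuation, keeping Vietnamese diacritics.

    Hypothetical stand-in for the "Normalization" stage of the pipeline.
    """
    # Compose characters first so every accented letter is one code point
    # that \w matches (combining marks alone are not matched by \w).
    text = unicodedata.normalize("NFC", text).lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_transcript("Xin chào, bạn khỏe không?"))
```

After this step the comma, the question mark and the capitalization are gone, so sentence boundaries and questions can no longer be recovered from the label alone.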

Goal

Improve the current datasets by adding back natural punctuation.

Draft solution

  • Use Whisper large v3 to transcribe the audio in those datasets -> whisper_transcription_format

  • Use Llama3.2 8B, given the ground_truth and the whisper_transcription_format, to reformat the ground_truth

  • Note: We only use the structure (punctuation and formatting) of the Whisper large output, not its labels

  • Architecture:

Screenshot 2024-12-12 at 09 48 12
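
The "structure only, not the label" idea can be sketched without an LLM. The example below is a deterministic stand-in for the Llama step, not the pipeline proposed above: it aligns the two transcripts word by word with `difflib.SequenceMatcher` and copies Whisper's casing and punctuation only onto words where both transcripts agree. The function name `transfer_structure` and the whitespace tokenization are assumptions.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def _key(token: str) -> str:
    """Comparison key: lowercase token with punctuation removed."""
    return re.sub(r"[^\w]", "", token.lower())


def transfer_structure(ground_truth: str, whisper_text: str) -> str:
    """Copy casing/punctuation from Whisper onto matching ground-truth words.

    Hypothetical helper: words that disagree keep the ground-truth form,
    so the label itself is never replaced by Whisper's guess.
    """
    gt_words = unicodedata.normalize("NFC", ground_truth).split()
    wh_tokens = [
        t for t in unicodedata.normalize("NFC", whisper_text).split() if _key(t)
    ]
    matcher = SequenceMatcher(
        a=[_key(w) for w in gt_words],
        b=[_key(t) for t in wh_tokens],
        autojunk=False,
    )
    out = list(gt_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue  # transcripts disagree here: keep the ground truth as is
        for i, j in zip(range(i1, i2), range(j1, j2)):
            out[i] = wh_tokens[j]  # same word, Whisper's surface form
    return " ".join(out)


if __name__ == "__main__":
    print(transfer_structure("xin chào bạn khỏe không",
                             "Xin chào, bạn có khỏe không?"))
```

A limitation of this sketch is that punctuation attached to a mismatched word is dropped, which is one reason an LLM pass may still be preferable for the final reformatting.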

Tasklist

  • Transcribe bud500, gigaspeech refined vi using Whisper large v3
  • Set up distil label pipeline
  • QA dataset
hahuyhoang411 added the type: epic (A major feature or initiative) label Dec 12, 2024
hahuyhoang411 added this to the Ichigo v0.5 milestone Dec 12, 2024
github-project-automation bot moved this to Investigating in Jan & Cortex Dec 12, 2024
@bachvudinh

We need to discuss the pipeline further because:

  • Results from Whisper large v3 are poor on bud500 audio.
  • Llama tends to complete or modify the label instead of only correcting typos.
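
One possible guard against the second point is a QA filter that rejects any reformatted label whose words differ from the ground truth once casing and punctuation are ignored. This is a minimal sketch under that assumption; `label_preserved` is a hypothetical name, and the check would also reject legitimate typo fixes, so it only suits a strict "punctuation only" mode.

```python
import re
import unicodedata


def _words(text: str) -> list[str]:
    """Lowercased word sequence with punctuation removed."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"[^\w\s]", " ", text).split()


def label_preserved(ground_truth: str, reformatted: str) -> bool:
    """True if the reformatted text changes only casing and punctuation."""
    return _words(ground_truth) == _words(reformatted)


if __name__ == "__main__":
    gt = "xin chào bạn khỏe không"
    print(label_preserved(gt, "Xin chào, bạn khỏe không?"))      # punctuation only
    print(label_preserved(gt, "Xin chào, bạn có khỏe không ạ?"))  # words were added
```

Samples that fail the check could be routed to manual review or kept with their original unpunctuated label.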
