task: Instruct Dataset Creation for Multilingual Speech (Phase 2) #121

Open
1 of 4 tasks
Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 3 comments
Labels: P1: important (Important feature / fix)

Comments

hahuyhoang411 commented Nov 19, 2024

Goal

Create a speech instruction fine-tuning dataset to make Ichigo better at conversation.

Tasklist

  • Gather Vietnamese + English text instruction datasets.
  • Clean prompts in the instruction dataset (e.g., using distilabel)
  • Optimize the pipeline
  • Draft research report
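
As a rough illustration of the prompt-cleaning task, here is a minimal sketch in plain Python. It does not use distilabel, and the `instruction` field name, the `max_chars` limit, and the filtering patterns are all assumptions chosen for illustration: the idea is simply to normalize text and drop prompts that would not read well aloud (code fences, URLs, markup, table rows).

```python
import re
import unicodedata

# Hypothetical patterns for content that does not read well aloud:
# code fences, URLs, HTML-like tags, and pipe-delimited table rows.
_UNSPEAKABLE = re.compile(r"```|https?://|<[^>]+>|\|.*\|")


def clean_prompt(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_speakable(text: str, max_chars: int = 500) -> bool:
    """Keep short, non-empty prompts without code, URLs, or markup."""
    return bool(text) and len(text) <= max_chars and not _UNSPEAKABLE.search(text)


def filter_dataset(rows):
    """Yield cleaned rows whose instruction passes the speakability check."""
    for row in rows:
        prompt = clean_prompt(row["instruction"])  # assumed field name
        if is_speakable(prompt):
            yield {**row, "instruction": prompt}
```

A model-based cleaning step could replace `is_speakable` later; the heuristic version is only meant as a cheap first pass.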
hahuyhoang411 changed the title from "Multi-lingual Instruct Speech Dataset Creation (Issue: )" to "task: Multi-lingual Instruct Speech Dataset Creation" on Nov 19, 2024
hahuyhoang411 added the "P1: important" (Important feature / fix) label on Nov 19, 2024
bachvudinh self-assigned this on Nov 20, 2024

bachvudinh commented Nov 20, 2024

Based on my experience, I have concerns about the reliability of Text2Semantic (T2S). When I modified the T2S model parameters to stabilize the semantic tokens, the pipeline's processing time increased significantly compared to the standard Text2Speech + Speech2Semantic pipeline run without saving the audio. Therefore, I recommend we proceed with the Text2Speech + Speech2Semantic pipeline. cc @tuanlda78202
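
For reference, the two-stage pipeline without saving audio can be sketched roughly as below. This is only an illustration under stated assumptions: `tts` and `s2s` are hypothetical callables standing in for the real Text2Speech and Speech2Semantic models, and the toy `fake_tts` / `fake_s2s` functions exist only so the sketch runs end to end.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical type aliases: a waveform is a sequence of float samples and
# semantic tokens are integer IDs. Real models would likely use tensors.
Waveform = Sequence[float]
SemanticTokens = List[int]


def text_to_semantic(
    texts: Iterable[str],
    tts: Callable[[str], Waveform],
    s2s: Callable[[Waveform], SemanticTokens],
) -> Iterator[SemanticTokens]:
    """Run Text2Speech then Speech2Semantic, keeping audio in memory only."""
    for text in texts:
        audio = tts(text)  # synthesized waveform, never written to disk
        yield s2s(audio)   # discrete semantic tokens for this sample


# Toy stand-ins (assumptions, not real models) so the sketch is runnable.
def fake_tts(text: str) -> Waveform:
    return [ord(ch) / 255.0 for ch in text]


def fake_s2s(audio: Waveform) -> SemanticTokens:
    return [int(sample * 255) % 512 for sample in audio]


tokens = list(text_to_semantic(["xin chao", "hello"], fake_tts, fake_s2s))
```

Because the function is a generator, each waveform can be released as soon as its tokens are produced, which is the point of skipping the intermediate audio files.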

@tuanlda78202

We can use viXTTS for speech synthesis. It works really well!

github-project-automation (bot) moved this to Investigating in Jan & Cortex on Nov 22, 2024
tikikun moved this from Investigating to In Progress in Jan & Cortex on Nov 25, 2024
dan-menlo changed the title from "task: Multi-lingual Instruct Speech Dataset Creation" to "task: Instruct Dataset Creation for Multilingual Speech" on Nov 27, 2024
hahuyhoang411 changed the title from "task: Instruct Dataset Creation for Multilingual Speech" to "task: Instruct Dataset Creation for Multilingual Speech (Phase 2)" on Nov 27, 2024
hahuyhoang411 moved this from In Progress to Scheduled in Jan & Cortex on Dec 1, 2024

bachvudinh commented Dec 12, 2024

Gathering all Vietnamese instruction data sources here:

| Data Source | Number of Samples | Note |
| --- | --- | --- |
| Viettel x Nvidia dataset | 4.5M | Instruct data with 55.9% CoT data, 25.7% QnA data, and other. |
| Sailor2 dataset stage 1 | TBD | TBD |
| Sailor2 dataset stage 2 | TBD | TBD |
| Sailor2 dataset preference | TBD | TBD |
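
To get a feel for the Viettel x Nvidia split, the percentages can be turned into approximate sample counts. This is a back-of-the-envelope sketch that assumes the shares apply to exactly 4.5M samples, which is itself a rounded figure.

```python
# Approximate per-category counts, assuming exactly 4.5M samples in total.
total = 4_500_000
shares = {"CoT": 0.559, "QnA": 0.257}

counts = {name: round(total * share) for name, share in shares.items()}
counts["other"] = total - sum(counts.values())  # remaining 18.4%

for name, n in counts.items():
    print(f"{name}: ~{n:,}")
```

Under that assumption this gives roughly 2.5M CoT samples, 1.2M QnA samples, and 0.8M in other categories.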
