task: Instruct Dataset Creation for Multilingual Speech (Phase 2) #121

hahuyhoang411 · 2024-11-19T16:56:15Z

Goal

Create a speech instruction fine-tuning dataset to make Ichigo better at conversation.

Tasklist

Gather Vietnamese + English text instruction datasets
Clean prompts in the instruction dataset (e.g., using distilabel)
Optimize the pipeline
Draft research report
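
A rough sketch of what the prompt-cleaning step could look like before anything is sent to speech synthesis. The issue does not specify the cleaning rules, so every heuristic here is an assumption: dropping prompts that contain URLs or inline code (hard to read aloud), the 500-character cap, and case-insensitive deduplication. The name `clean_prompts` is hypothetical, and this is plain Python rather than a distilabel pipeline.

```python
import re
from typing import Iterable, List

# Assumption: prompts with URLs are not suitable for speech synthesis.
URL_RE = re.compile(r"https?://\S+")


def clean_prompts(prompts: Iterable[str], max_chars: int = 500) -> List[str]:
    """Normalize whitespace, drop unspeakable or overly long prompts, dedupe."""
    seen = set()
    cleaned = []
    for prompt in prompts:
        text = " ".join(prompt.split())  # collapse runs of whitespace
        if not text or len(text) > max_chars:
            continue
        # Assumption: a backtick signals inline code, which reads poorly aloud.
        if URL_RE.search(text) or "`" in text:
            continue
        key = text.lower()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


sample = ["  Hello   there ", "hello there", "See https://example.com", "Run `ls`", ""]
result = clean_prompts(sample)
```

Only the first prompt survives in this sample; the others are a duplicate, a URL, inline code, and an empty string.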

bachvudinh · 2024-11-20T02:55:27Z

Based on my experience, I have concerns about the reliability of Text2Semantic (T2S). When I modified the T2S model parameters to stabilize the semantic tokens, the pipeline's processing time increased significantly compared to the standard Text2Speech + Speech2Semantic pipeline run without saving the audio. I therefore recommend we proceed with the Text2Speech + Speech2Semantic pipeline approach. cc @tuanlda78202
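
For illustration only, the Text2Speech + Speech2Semantic flow without saving audio might be structured like this. The function `text_to_semantic_tokens`, the `synthesize` / `quantize` callables, and the stand-in models are all hypothetical names, not the actual Ichigo pipeline code; real TTS and quantizer models would be plugged in where the fakes are.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical aliases: a waveform is a sequence of float samples,
# semantic tokens are a list of integer codebook ids.
Waveform = Sequence[float]
SemanticTokens = List[int]


def text_to_semantic_tokens(
    texts: Iterable[str],
    synthesize: Callable[[str], Waveform],
    quantize: Callable[[Waveform], SemanticTokens],
) -> Iterator[dict]:
    """Run Text2Speech, then Speech2Semantic, keeping audio in memory only."""
    for text in texts:
        waveform = synthesize(text)  # Text2Speech step
        tokens = quantize(waveform)  # Speech2Semantic step
        # The waveform is never written to disk; only text and tokens are kept.
        yield {"text": text, "tokens": tokens}


# Stand-in models so the sketch runs without a real TTS model or quantizer.
def fake_tts(text: str) -> Waveform:
    return [float(ord(ch) % 7) for ch in text]


def fake_quantizer(waveform: Waveform) -> SemanticTokens:
    return [int(sample) for sample in waveform]


rows = list(text_to_semantic_tokens(["xin chào"], fake_tts, fake_quantizer))
```

Passing the two models in as callables keeps the loop independent of which TTS or quantizer is chosen.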

tuanlda78202 · 2024-11-20T16:25:31Z

We can use viXTTS for speech synthesis; it works really well!

bachvudinh · 2024-12-12T20:24:37Z

Gathering all Vietnamese instruction data sources here:

Data Source	Number of Samples	Note
Viettel x Nvidia dataset	4.5M	Instruction data: 55.9% CoT, 25.7% QnA, and the rest other types.
Sailor2 dataset stage 1	TBD	TBD
Sailor2 dataset stage 2	TBD	TBD
Sailor2 dataset preference	TBD	TBD
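
As a quick back-of-the-envelope check, the percentages in the first row translate into approximate sample counts as follows. This assumes the rounded 4.5M total is exact, so the results are estimates only; the variable names are illustrative.

```python
# Approximate per-category counts for the Viettel x Nvidia dataset.
# Assumption: "4.5M" is treated as exactly 4,500,000 samples.
total_samples = 4_500_000

# Percentages from the table, stored in tenths of a percent so the
# arithmetic stays in exact integers.
cot_permille = 559  # 55.9% CoT
qna_permille = 257  # 25.7% QnA

cot_samples = total_samples * cot_permille // 1000
qna_samples = total_samples * qna_permille // 1000
other_samples = total_samples - cot_samples - qna_samples

print(f"CoT:   ~{cot_samples:,}")
print(f"QnA:   ~{qna_samples:,}")
print(f"Other: ~{other_samples:,}")
```

Under that assumption this gives roughly 2.5M CoT samples, 1.2M QnA samples, and 0.8M other samples.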

hahuyhoang411 mentioned this issue Nov 19, 2024

milestone: Ichigo v0.5 Multi-lingual #116

Open

7 tasks

hahuyhoang411 changed the title ~~Multi-lingual Instruct Speech Dataset Creation (Issue: )~~ task: Multi-lingual Instruct Speech Dataset Creation Nov 19, 2024

hahuyhoang411 assigned tuanlda78202 Nov 19, 2024

hahuyhoang411 added the P1: important Important feature / fix label Nov 19, 2024

bachvudinh self-assigned this Nov 20, 2024

hiento09 added this to Jan & Cortex Nov 22, 2024

github-project-automation bot moved this to Investigating in Jan & Cortex Nov 22, 2024

tikikun moved this from Investigating to In Progress in Jan & Cortex Nov 25, 2024

hahuyhoang411 added this to the Ichigo v0.5 - Multilingual milestone Nov 25, 2024

dan-menlo changed the title ~~task: Multi-lingual Instruct Speech Dataset Creation~~ task: Instruct Dataset Creation for Multilingual Speech Nov 27, 2024

hahuyhoang411 changed the title ~~task: Instruct Dataset Creation for Multilingual Speech~~ task: Instruct Dataset Creation for Multilingual Speech (Phase 2) Nov 27, 2024

hahuyhoang411 moved this from In Progress to Scheduled in Jan & Cortex Dec 1, 2024

hahuyhoang411 assigned hahuyhoang411 and unassigned tuanlda78202 Dec 12, 2024
