integration of dataset #8

Open
mshbaita-jo opened this issue Jan 31, 2024 · 1 comment
Comments

@mshbaita-jo

The dataset you are using contains the audio and text in one data frame. In my case, I have a folder containing the audio files in MP3 format and a separate TSV file containing the audio file names, the texts, and the speaker_id. How can I handle this dataset and integrate it with the JSON configuration file?

@ylacombe
Owner

Hey @DhanaTechAi, sorry for the wait. What I'd recommend is reading the TSV file and transforming it into a dictionary that looks like this:

data_dict = {
    "audio": LIST_OF_ABSOLUTE_PATH_TO_AUDIO,
    "text": LIST_OF_TEXT,
    "speaker_id": LIST_OF_SPEAKER_ID,
}

Then do:

from datasets import Dataset, Audio

# Build one Dataset from the column-wise dict, then cast the path strings to the Audio feature
dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
dataset.push_to_hub(REPO_ID)

Then you can use the configuration file, replacing the dataset id with REPO_ID and the audio and text column names with "audio" and "text". Hope that helps!
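For the first step (reading the TSV into `data_dict`), here is a minimal sketch using only the standard library. It assumes the TSV has a header row; the function name `build_data_dict` and the column names `path`, `sentence`, and `speaker_id` are hypothetical placeholders, so adjust them to whatever your file actually uses.

```python
import csv
from pathlib import Path


def build_data_dict(tsv_path, audio_dir,
                    file_col="path", text_col="sentence", speaker_col="speaker_id"):
    """Read a TSV with a header row and return column-wise lists.

    The default column names are assumptions for illustration only.
    """
    audio_dir = Path(audio_dir).resolve()  # make audio paths absolute
    data_dict = {"audio": [], "text": [], "speaker_id": []}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            data_dict["audio"].append(str(audio_dir / row[file_col]))
            data_dict["text"].append(row[text_col])
            data_dict["speaker_id"].append(row[speaker_col])
    return data_dict
```

Calling `build_data_dict("metadata.tsv", "clips/")` (both paths are examples) returns a dictionary with the three keys shown above, ready to be passed to the `datasets` step.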
