integration of dataset #8

mshbaita-jo · 2024-01-31T08:08:34Z

The dataset you are using contains the audio and the text in one data frame. In my case, I have a folder containing the audio files in MP3 format and a separate TSV file containing the audio file names, the texts, and the speaker_id. How can I handle this dataset and integrate it with the JSON configuration file?

ylacombe · 2024-02-13T11:20:45Z

Hey @DhanaTechAi, sorry for the wait. What I'd recommend is reading the TSV file and transforming it into a dictionary that looks like this:

data_dict = {
    "audio": LIST_OF_ABSOLUTE_PATH_TO_AUDIO,
    "text": LIST_OF_TEXT,
    "speaker_id": LIST_OF_SPEAKER_ID,
}
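
For reference, here is a minimal sketch of how such a dictionary could be built from the TSV file using only the standard library. It assumes the TSV has a header row and that the file-name column includes the `.mp3` extension. The function name `build_data_dict` and the default column names (`path`, `sentence`, `speaker_id`) are hypothetical, so adjust them to match your file.

```python
import csv
from pathlib import Path


def build_data_dict(tsv_path, audio_dir,
                    path_col="path", text_col="sentence", speaker_col="speaker_id"):
    """Read a TSV with a header row and return column-wise lists.

    The column names are assumptions; pass the ones your TSV actually uses.
    """
    audio_dir = Path(audio_dir).resolve()  # make audio paths absolute
    data_dict = {"audio": [], "text": [], "speaker_id": []}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside transcripts as literal text
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            data_dict["audio"].append(str(audio_dir / row[path_col]))
            data_dict["text"].append(row[text_col])
            data_dict["speaker_id"].append(row[speaker_col])
    return data_dict
```

Calling `build_data_dict("metadata.tsv", "clips/")` (both paths are placeholders) would then give a `data_dict` with the three lists shown above.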

Then do:

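from datasets import Dataset, Audio

# Dataset.from_dict builds a single dataset from the column lists;
# cast_column turns the "audio" paths into a decodable Audio feature.
dataset = Dataset.from_dict(data_dict).cast_column("audio", Audio())
dataset.push_to_hub(REPO_ID)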

Then you can use the configuration file, replacing the dataset ID with REPO_ID and the audio and text column names with "audio" and "text". Hope that helps!

ylacombe mentioned this issue Feb 13, 2024

uploading dataset to hugging face #11

Open
