Cannot create a dataset with relative audio path #7313

sedol1339 · 2024-12-09T07:34:20Z

Describe the bug

Hello! I want to create a dataset of parquet files, with the audio stored as separate .mp3 files. However, loading it fails with "No such file or directory" (see the reproducing code).

Steps to reproduce the bug

Creating a dataset

from pathlib import Path
from datasets import Dataset, load_dataset, Audio

Path('my_dataset/audio').mkdir(parents=True, exist_ok=True)
Path('my_dataset/audio/file.mp3').touch(exist_ok=True)
Dataset.from_list(
    [{'audio': {'path': 'audio/file.mp3'}}]
).to_parquet('my_dataset/data.parquet')

Result:

# my_dataset
# ├── audio
# │   └── file.mp3
# └── data.parquet

Trying to load the dataset

dataset = (
    load_dataset('my_dataset', split='train')
    .cast_column('audio', Audio(sampling_rate=16_000))
)
dataset[0]

>>> FileNotFoundError: [Errno 2] No such file or directory: 'audio/file.mp3'

Expected behavior

I expect the dataset to load correctly.

I've found 2 workarounds, but they are not very good:

1. I can specify an absolute path to the audio, but it will stop working when I move the folder or upload it to HF.
2. I can set 'path': 'file.mp3' and load with load_dataset('my_dataset', data_dir='audio'). This seems to work, but does it mean that anyone on Hugging Face who wants to use this dataset must also pass the data_dir argument, or it won't work?

Environment info

datasets 3.1.0, Ubuntu 24.04.1


lhoestq · 2024-12-11T13:35:33Z

Hello! When you call cast_column, the paths need to be absolute paths or paths relative to your working directory, not to the original dataset directory.
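
For illustration, one way to act on this without storing absolute paths in the parquet file is to resolve the relative paths against the dataset directory at load time, before casting. This is a minimal sketch, not an official recipe: the helper names `absolutize_audio_path` and `load_with_resolved_paths` are hypothetical, and it assumes the `audio` column is a struct with a `path` field, as in the reproducing code above.

```python
from pathlib import Path


def absolutize_audio_path(example: dict, root: str) -> dict:
    """Return the audio field with its path resolved against `root`."""
    path = Path(example["audio"]["path"])
    if not path.is_absolute():
        # Resolve relative to the dataset directory, not the working directory.
        path = Path(root).resolve() / path
    return {"audio": {"path": str(path)}}


def load_with_resolved_paths(dataset_dir: str = "my_dataset"):
    """Load the parquet dataset, fix up the paths, then cast to Audio."""
    # Imported lazily so the helper above is usable without `datasets`.
    from datasets import Audio, load_dataset

    dataset = load_dataset(dataset_dir, split="train")
    # Assumption: rewriting the paths before the cast is enough for decoding.
    dataset = dataset.map(absolutize_audio_path, fn_kwargs={"root": dataset_dir})
    return dataset.cast_column("audio", Audio(sampling_rate=16_000))
```

Because the paths are rewritten on each load, the folder can still be moved. Decoding will still fail if the file is not valid audio, as with the empty file.mp3 created by touch() in the reproducing code.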

Though I'd recommend structuring your dataset as an AudioFolder which automatically links a metadata.jsonl or csv to the audio files via relative paths within the dataset repository: https://huggingface.co/docs/datasets/v3.2.0/en/audio_load#audiofolder

sedol1339 · 2024-12-11T17:43:58Z

@lhoestq thank you, but there are two problems with using AudioFolder:

1. It is said that AudioFolder requires metadata.csv. However, my dataset is too large and contains nested and np.ndarray fields, so I can't use CSV.
2. It is said that I need to load the dataset with load_dataset("audiofolder", ...). However, if possible, I want my dataset to load as usual with load_dataset(dataset_name) after I upload it to HF.

lhoestq · 2024-12-12T13:46:37Z

You can use metadata.jsonl if you have nested data :)

And actually if you have a dataset structured as an AudioFolder then load_dataset(dataset_name) does work after uploading to HF
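
As a rough sketch of the metadata.jsonl suggestion: the AudioFolder convention links each row to its audio through a `file_name` column holding a path relative to the metadata file. The helper name `write_audiofolder_metadata` and the example fields (`speaker`, `embedding`) are hypothetical. An np.ndarray would need converting first, for example with `.tolist()`, since JSON cannot serialize it directly.

```python
import json
from pathlib import Path


def write_audiofolder_metadata(root: str, rows: list[dict]) -> Path:
    """Write one JSON object per line to <root>/metadata.jsonl."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    metadata_path = root_path / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for row in rows:
            # Each row is expected to carry a relative `file_name`.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return metadata_path


# Hypothetical rows: nested fields are fine in JSON Lines.
example_rows = [
    {
        "file_name": "audio/file.mp3",
        "speaker": {"id": 1, "tags": ["a", "b"]},
        "embedding": [0.1, 0.2, 0.3],  # e.g. some_array.tolist()
    }
]
```

Calling `write_audiofolder_metadata("my_dataset", example_rows)` would place metadata.jsonl next to the audio directory from the reproducing code, after which the folder can be loaded or uploaded as described above.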
