Cannot create a dataset with relative audio path #7313

Open
sedol1339 opened this issue Dec 9, 2024 · 3 comments


sedol1339 commented Dec 9, 2024

Describe the bug

Hello! I want to create a dataset of parquet files, with the audio stored as separate .mp3 files. However, loading it fails with "No such file or directory" (see the reproducing code).

Steps to reproduce the bug

Creating a dataset

from pathlib import Path
from datasets import Dataset, load_dataset, Audio

Path('my_dataset/audio').mkdir(parents=True, exist_ok=True)
Path('my_dataset/audio/file.mp3').touch(exist_ok=True)
Dataset.from_list(
    [{'audio': {'path': 'audio/file.mp3'}}]
).to_parquet('my_dataset/data.parquet')

Result:

# my_dataset
# ├── audio
# │   └── file.mp3
# └── data.parquet

Trying to load the dataset

dataset = (
    load_dataset('my_dataset', split='train')
    .cast_column('audio', Audio(sampling_rate=16_000))
)
dataset[0]

>>> FileNotFoundError: [Errno 2] No such file or directory: 'audio/file.mp3'

Expected behavior

I expect the dataset to load correctly.

I've found 2 workarounds, but they are not very good:

  1. I can specify an absolute path to the audio, however, when I move the folder or upload to HF it will stop working.
  2. I can set 'path': 'file.mp3', and load with load_dataset('my_dataset', data_dir='audio') - it seems to work, but does this mean that anyone from Hugging Face who wants to use this dataset should also pass the data_dir argument, otherwise it won't work?

Environment info

datasets 3.1.0, Ubuntu 24.04.1

Member

lhoestq commented Dec 11, 2024

Hello! When you use cast_column, the paths need to be absolute paths or paths relative to your working directory, not to the original dataset directory.

Though I'd recommend structuring your dataset as an AudioFolder which automatically links a metadata.jsonl or csv to the audio files via relative paths within the dataset repository: https://huggingface.co/docs/datasets/v3.2.0/en/audio_load#audiofolder
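To illustrate the point about working-directory-relative paths, here is a minimal sketch that rewrites each stored path against the dataset root before casting. The helper name `make_absolute` and the `root` argument are hypothetical and not part of the `datasets` API; the sketch assumes each row has the `{'audio': {'path': ...}}` shape from the reproduction above.

```python
from pathlib import Path


def make_absolute(example: dict, root: str) -> dict:
    """Return a copy of the row with audio['path'] resolved against root.

    `root` is assumed to be the dataset directory (hypothetical argument).
    Paths that are already absolute are left unchanged.
    """
    audio = dict(example["audio"])  # copy so the input row is not mutated
    path = Path(audio["path"])
    if not path.is_absolute():
        audio["path"] = str((Path(root) / path).resolve())
    return {"audio": audio}


# Same row shape as in the reproduction code.
example = {"audio": {"path": "audio/file.mp3"}}
fixed = make_absolute(example, root="my_dataset")
```

With `datasets`, this could presumably be applied before the cast via `dataset.map(make_absolute, fn_kwargs={"root": "my_dataset"})`, followed by `cast_column('audio', Audio(sampling_rate=16_000))`. Because the absolute path is computed at load time rather than stored in the parquet file, moving the folder should not break it, but every consumer of the dataset would need to run the same step.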

Author

sedol1339 commented Dec 11, 2024

@lhoestq thank you, but there are two problems with using AudioFolder:

  1. It is said that AudioFolder requires metadata.csv. However, my dataset is too large and contains nested and np.ndarray fields, so I can't use csv.
  2. It is said that I need to load the dataset with load_dataset("audiofolder", ...). However, if possible, I want my dataset to be loaded as usual with load_dataset(dataset_name) after I upload it to HF.

Member

lhoestq commented Dec 12, 2024

You can use metadata.jsonl if you have nested data :)

And actually, if you have a dataset structured as an AudioFolder, then load_dataset(dataset_name) does work after uploading to HF.
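As a rough sketch of the metadata.jsonl suggestion, the snippet below writes one JSON object per line next to the audio folder, using only the standard library. The `file_name` key follows the AudioFolder convention of a path relative to the metadata file; the `speaker` and `embedding` fields and the `write_metadata` helper are hypothetical examples of nested data, and an np.ndarray would need to be converted with `.tolist()` before serialization.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(root: Path, rows: list) -> Path:
    """Write one JSON object per line to <root>/metadata.jsonl."""
    root.mkdir(parents=True, exist_ok=True)
    out = root / "metadata.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return out


# Build the same layout as in the issue, inside a temporary directory.
root = Path(tempfile.mkdtemp()) / "my_dataset"
(root / "audio").mkdir(parents=True)
(root / "audio" / "file.mp3").touch()

rows = [
    {
        "file_name": "audio/file.mp3",              # relative to metadata.jsonl
        "speaker": {"id": 7, "tags": ["a", "b"]},   # hypothetical nested field
        "embedding": [0.1, 0.2, 0.3],               # e.g. an np.ndarray via .tolist()
    }
]
metadata_path = write_metadata(root, rows)
```

Per the reply above, a folder laid out this way should then load with a plain `load_dataset(dataset_name)` once uploaded, without passing "audiofolder" explicitly.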
