Pushing to Hub without embedding "external" files #7338

cifkao · 2024-12-17T13:02:38Z


I have a dataset of audio files and associated metadata. Due to the complex structure of the dataset, I needed to write custom code to build the dataset in the 🤗 Datasets format. However, I still want to keep the original audio files in the repository for easy access.

If I just push the dataset to the Hub using push_to_hub(), the data will be unnecessarily duplicated: one copy as raw audio files and one copy embedded in the Parquet files (potentially more than one, if several configs share files). I thought I could avoid this by setting embed_external_files=False. However, the pushed dataset then still references the local file paths on my machine, so it fails to load wherever those files are not present.

Is there a way to make the dataset reference the audio files in the repo (in an audio/ directory in the root of the repo) without embedding them, and still have it load successfully with load_dataset() and have a working dataset viewer?
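One direction that might fit, sketched here under assumptions rather than confirmed as a solution: the Hub's folder-based audio loading reads a metadata.csv whose file_name column holds paths relative to the metadata file. The helper below only writes such a file with the standard library; write_metadata_csv, the audio_key parameter, and the speaker column are hypothetical names used for illustration, and this layout probably does not cover a dataset with several configs that share files.

```python
import csv
from pathlib import Path


def write_metadata_csv(repo_root, rows, audio_key="audio_path"):
    """Write metadata.csv at the repo root with repo-relative audio paths.

    `rows` is a list of dicts; `audio_key` (a hypothetical column name)
    holds the audio path, every other key is copied through as metadata.
    """
    repo_root = Path(repo_root).resolve()
    out_path = repo_root / "metadata.csv"
    extra_keys = [k for k in rows[0] if k != audio_key]

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", *extra_keys])
        writer.writeheader()
        for row in rows:
            path = Path(row[audio_key])
            if not path.is_absolute():
                # Interpret relative paths as relative to the repo root.
                path = repo_root / path
            # Raises ValueError if the file lies outside the repo root.
            rel = path.resolve().relative_to(repo_root)
            record = {"file_name": rel.as_posix()}
            record.update({k: row[k] for k in extra_keys})
            writer.writerow(record)
    return out_path
```

For a row pointing at audio/clip_0001.wav inside the repo, the file_name value is written as audio/clip_0001.wav regardless of where the repo is checked out, so nothing in the metadata refers to the local machine.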

The way I could imagine this working is that the main branch would have the lightweight version of the Parquet file (only referencing the audio by relative path) and then the Parquet converter would create a "complete" version with the audio embedded and put this in the refs/convert/parquet branch.

If this is not possible, I'm wondering what the use case for embed_external_files=False is, since it will result in a broken dataset being uploaded to the Hub.


Replies: 0 comments
