I have a dataset of audio files and associated metadata. Due to the complex structure of the dataset, I needed to write custom code to build the dataset in the 🤗 Datasets format. However, I still want to keep the original audio files in the repository for easy access.
If I just push the dataset to the Hub using push_to_hub(), the data will be unnecessarily duplicated: one copy as raw audio files and one copy in Parquet files (potentially more than one if several configs share files). I thought I could avoid this by setting embed_external_files=False. However, the dataset then still references the local files on my machine, and hence fails to load if they're not around.
Is there a way to make the dataset reference the audio files in the repo (in an audio/ directory in the root of the repo) without embedding them, and still have it load successfully with load_dataset() and have a working dataset viewer?
The way I could imagine this working is that the main branch would have the lightweight version of the Parquet file (only referencing the audio by relative path) and then the Parquet converter would create a "complete" version with the audio embedded and put this in the refs/convert/parquet branch.
If this is not possible, I'm wondering what the use case for embed_external_files=False is, since it will result in a broken dataset being uploaded to the Hub.
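To make the "reference the audio by relative path" idea concrete, here is a minimal sketch of the kind of lightweight index I have in mind. It only uses the standard library and writes a metadata.csv at the repo root with a file_name column holding paths relative to that root, which is the layout the AudioFolder loader documents. The function name write_relative_metadata, its parameters, and the extension list are my own placeholders, and I have not confirmed that this layout covers multiple configs sharing files.

```python
import csv
from pathlib import Path


def write_relative_metadata(repo_root, audio_subdir="audio",
                            extensions=(".wav", ".flac", ".mp3")):
    """Write <repo_root>/metadata.csv listing audio files by relative path.

    Hypothetical helper: the name, parameters and extension list are
    assumptions for illustration, not part of the 🤗 Datasets API.
    """
    root = Path(repo_root)
    rows = []
    # Walk the audio directory in a deterministic order.
    for path in sorted((root / audio_subdir).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            # Store a POSIX-style path relative to the repo root,
            # e.g. "audio/clip.wav", never an absolute local path.
            rows.append({"file_name": path.relative_to(root).as_posix()})

    # Write the index next to the audio directory, at the repo root.
    with (root / "metadata.csv").open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Extra metadata columns (transcripts, speaker IDs, and so on) would go alongside file_name in the same rows. What I don't know is whether something equivalent can be expressed from a Dataset object via push_to_hub().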