Describe the bug
Hi, I encountered a small issue when saving datasets that led to the saving taking up to multiple hours.
Specifically, Dataset.save_to_disk is a lot slower when using num_proc>1 than when using num_proc=1.
The documentation mentions that "Multiprocessing is disabled by default.", but there is no explanation on how to enable it.
Steps to reproduce the bug
This results in:
With larger datasets it can take hours, but I didn't benchmark that for this bug report.
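For reference, a comparison along these lines can be sketched as follows. This is not the original reproduction script: the toy dataset, its size, the num_proc values, and the helper names time_call and benchmark_save are all assumptions for illustration. It only relies on Dataset.from_dict and the num_proc argument of Dataset.save_to_disk.

```python
import tempfile
import time
from pathlib import Path


def time_call(fn, *args, **kwargs):
    """Return the wall-clock seconds taken by fn(*args, **kwargs)."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def benchmark_save(num_rows=100_000, num_procs=(1, 2, 4)):
    """Time Dataset.save_to_disk for several num_proc values.

    Hypothetical benchmark: the dataset contents and size are made up
    and are not taken from the bug report.
    """
    # Imported here so the timing helper above has no third-party dependency.
    from datasets import Dataset

    ds = Dataset.from_dict(
        {"id": list(range(num_rows)), "text": ["x" * 100] * num_rows}
    )
    timings = {}
    for n in num_procs:
        # A fresh temporary directory per run, so no run sees leftover files.
        with tempfile.TemporaryDirectory() as tmp:
            timings[n] = time_call(
                ds.save_to_disk, str(Path(tmp) / "ds"), num_proc=n
            )
    return timings
```

Calling benchmark_save() from under an `if __name__ == "__main__":` guard and printing the returned dictionary gives one timing per num_proc value. The guard matters on platforms where worker processes are started by re-importing the main module.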
Expected behavior
I would expect using num_proc>1 to be faster instead of slower than num_proc=1.
Environment info
datasets version: 3.1.0
huggingface_hub version: 0.26.2
fsspec version: 2024.6.1