Dataset.save_to_disk hangs when using num_proc > 1 #7290

Open
JohannesAck opened this issue Nov 14, 2024 · 0 comments

Describe the bug

Hi, I encountered an issue when saving datasets that led to the save taking up to several hours.
Specifically, Dataset.save_to_disk is a lot slower when using num_proc>1 than when using num_proc=1.

The documentation states that "Multiprocessing is disabled by default", but it does not explain how to enable it.

Steps to reproduce the bug

import numpy as np
from datasets import Dataset

# 4M rows, each holding 100 random integer tokens in [0, 100)
n_samples = int(4e6)
n_tokens_sample = 100
data_dict = {
    'tokens': np.random.randint(0, 100, (n_samples, n_tokens_sample)),
}

dataset = Dataset.from_dict(data_dict)

# Save the same dataset with increasing num_proc to compare timings
dataset.save_to_disk('test_dataset', num_proc=1)
dataset.save_to_disk('test_dataset', num_proc=4)
dataset.save_to_disk('test_dataset', num_proc=8)

This results in:

>>> dataset.save_to_disk('test_dataset', num_proc=1)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [00:17<00:00, 228075.15 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=4)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [01:49<00:00, 36583.75 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=8)
Saving the dataset (8/8 shards): 100%|██████████████| 4000000/4000000 [02:11<00:00, 30518.43 examples/s]

With larger datasets it can take hours, but I didn't benchmark that for this bug report.
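
For anyone who wants to reproduce the timings without reading them off the progress bars, here is a minimal sketch of a timing harness. The helper names `time_call` and `benchmark_save` are hypothetical (they are not part of datasets); the only library call assumed is `save_to_disk(path, num_proc=...)` as used in the reproduction above. Each run writes to a fresh temporary directory so that earlier output cannot influence later runs.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_call(fn, *args, **kwargs):
    """Return (elapsed_seconds, result) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def benchmark_save(dataset, num_procs=(1, 4, 8)):
    """Time dataset.save_to_disk once per num_proc value.

    `dataset` is any object exposing save_to_disk(path, num_proc=...).
    Returns a dict mapping num_proc to elapsed wall-clock seconds.
    """
    timings = {}
    for num_proc in num_procs:
        # Fresh output directory per run, removed afterwards.
        out_dir = Path(tempfile.mkdtemp(prefix=f"save_np{num_proc}_"))
        try:
            elapsed, _ = time_call(
                dataset.save_to_disk, str(out_dir), num_proc=num_proc
            )
        finally:
            shutil.rmtree(out_dir, ignore_errors=True)
        timings[num_proc] = elapsed
    return timings
```

With the `dataset` object from the reproduction script, `benchmark_save(dataset)` should return one wall-clock time per `num_proc` value. A single run per setting is noisy, so repeating the measurement a few times would give a fairer comparison.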

Expected behavior

I would expect using num_proc>1 to be faster instead of slower than num_proc=1.
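
To put a number on the gap, the slowdown can be computed directly from the throughput values in the progress bars above. This is only arithmetic on the reported figures from a single run each, so treat the factors as rough.

```python
# Throughput in examples/s, copied from the progress-bar output above.
throughput = {1: 228075.15, 4: 36583.75, 8: 30518.43}

baseline = throughput[1]
# Slowdown factor relative to num_proc=1 (higher means slower).
slowdown = {n: baseline / rate for n, rate in throughput.items()}

for n, factor in slowdown.items():
    print(f"num_proc={n}: {factor:.1f}x the time of num_proc=1")
```

On these numbers, num_proc=4 is about 6.2x slower and num_proc=8 about 7.5x slower than num_proc=1.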

Environment info

  • datasets version: 3.1.0
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.26.2
  • PyArrow version: 18.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.6.1