`Dataset.save_to_disk` hangs when using num_proc > 1 #7290

JohannesAck · 2024-11-14T05:25:13Z

Describe the bug

Hi, I've encountered a small issue when saving datasets that led to the save taking up to multiple hours.
Specifically, Dataset.save_to_disk is a lot slower when using num_proc>1 than when using num_proc=1.

The documentation mentions that "Multiprocessing is disabled by default.", but it does not explain how to enable it.

Steps to reproduce the bug

import numpy as np
from datasets import Dataset

# 4 million samples of 100 random integer tokens each
n_samples = int(4e6)
n_tokens_sample = 100
data_dict = {
    'tokens' : np.random.randint(0, 100, (n_samples, n_tokens_sample)),
}

dataset = Dataset.from_dict(data_dict)
# Save the same dataset with 1, 4 and 8 processes and compare the progress-bar timings
dataset.save_to_disk('test_dataset', num_proc=1)
dataset.save_to_disk('test_dataset', num_proc=4)
dataset.save_to_disk('test_dataset', num_proc=8)

This results in:

>>> dataset.save_to_disk('test_dataset', num_proc=1)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [00:17<00:00, 228075.15 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=4)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [01:49<00:00, 36583.75 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=8)
Saving the dataset (8/8 shards): 100%|██████████████| 4000000/4000000 [02:11<00:00, 30518.43 examples/s]

With larger datasets it can take hours, but I didn't benchmark that for this bug report.
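
To make the comparison easier to repeat, here is a minimal timing sketch using only the standard library. The helper names `time_runs` and `slowdown_vs_baseline` are hypothetical and not part of `datasets`. The `reported` values are the wall-clock times read off the progress bars above (17 s, 1:49, 2:11), not new measurements.

```python
import time
from typing import Callable, Dict, Iterable


def time_runs(save_fn: Callable[[int], None], num_procs: Iterable[int]) -> Dict[int, float]:
    """Call save_fn(n) once per value of n and return wall-clock seconds per n.

    Hypothetical helper: save_fn is any callable that takes num_proc.
    """
    timings = {}
    for n in num_procs:
        start = time.perf_counter()
        save_fn(n)
        timings[n] = time.perf_counter() - start
    return timings


def slowdown_vs_baseline(timings: Dict[int, float], baseline: int = 1) -> Dict[int, float]:
    """Express each timing as a multiple of the baseline timing."""
    base = timings[baseline]
    return {n: t / base for n, t in timings.items()}


# Elapsed times taken from the progress bars in this report, in seconds.
reported = {1: 17.0, 4: 109.0, 8: 131.0}
ratios = slowdown_vs_baseline(reported)
for n, r in sorted(ratios.items()):
    print(f"num_proc={n}: {r:.1f}x the num_proc=1 time")
```

With the dataset from the reproduction, the same helper could be called as `time_runs(lambda n: dataset.save_to_disk(f'test_dataset_{n}', num_proc=n), [1, 4, 8])`. A separate output directory per run is an assumption added here so that the runs do not write to the same path.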

Expected behavior

I would expect using num_proc>1 to be faster instead of slower than num_proc=1.

Environment info

datasets version: 3.1.0
Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python version: 3.10.12
huggingface_hub version: 0.26.2
PyArrow version: 18.0.0
Pandas version: 2.2.3
fsspec version: 2024.6.1
