Generating train split takes a long time #7080

alexanderswerdlow · 2024-07-29T01:42:43Z

Describe the bug

Loading a simple WebDataset-format dataset with load_dataset takes ~45 minutes, most of it spent in the "Generating train split" step.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("PixArt-alpha/SAM-LLaVA-Captions10M")

Expected behavior

The dataset should load immediately, as it does with a normal indexed WebDataset loader. Generating splits should be optional, and a message should explain how to disable it.
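
A possible way to skip the split-generation step, sketched here rather than confirmed for this dataset, is the streaming=True argument of load_dataset. It returns an IterableDataset that reads samples on demand instead of first writing the whole split to a local Arrow cache. The helper name first_samples and its parameters are hypothetical. It is also an assumption that this repository streams cleanly and that streaming is fast enough to stand in for an indexed loader.

```python
from itertools import islice


def first_samples(repo_id, n=2, split="train"):
    """Return the first n samples of a Hub dataset without preparing the full split.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields an IterableDataset: samples are fetched on demand,
    # so no "Generating train split" pass over the whole dataset is run up front.
    stream = load_dataset(repo_id, split=split, streaming=True)

    # islice stops after n samples instead of iterating the full stream.
    return list(islice(stream, n))


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for sample in first_samples("PixArt-alpha/SAM-LLaVA-Captions10M"):
        print(sorted(sample.keys()))
```

The trade-off is that a streamed dataset is iterable only: there is no random access by index and no len(), so this fits sequential training loops better than workloads that need indexed lookups.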

Environment info

datasets version: 2.20.0
Platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.28
Python version: 3.10.14
huggingface_hub version: 0.24.1
PyArrow version: 16.1.0
Pandas version: 2.2.2
fsspec version: 2024.5.0
