Issue Title: Improve CombinedStreamingDataset to handle multiple subdatasets efficiently
Description:
As raised by Emile Clastres, there is a performance issue when using CombinedStreamingDataset on hundreds of sub-datasets. The dataset crashes after a few batches, making it impractical for scenarios where multiple subdatasets need to be combined.
"I have an S3 bucket in which I have hundreds of directories containing the output of LitData's optimize on some subset of my full data. This structure comes from the fact that each subdataset was processed independently on different machines and called its own optimize. I was very happy about this since it also allows me to combine subdatasets flexibly to create train/val/test splits.
However, it seems that CombinedStreamingDataset has terrible performance when used on hundreds of sub-datasets. It even crashes after a few batches have been yielded."
Suggested Solution:
As per @tchaton's suggestion, we could re-think the CombinedDataset to virtually re-combine the different index.json files into one and decide how to fetch the chunks accordingly, using the right remote path for each. This would be pretty interesting: it would give the same performance as a normal StreamingDataset, with no need to physically merge the sub-datasets or create data copies.