Issue Title: Improve CombinedStreamingDataset to handle multiple subdatasets efficiently
Description:
As raised by Emile Clastres, there is a performance issue when using CombinedStreamingDataset on hundreds of sub-datasets. The dataset crashes after a few batches, making it impractical for scenarios where multiple subdatasets need to be combined.
"I have an S3 bucket in which I have hundreds of directories containing the output of LitData's optimize on some subset of my full data. This structure comes from the fact that each subdataset was processed independently on different machines and called its own optimize. I was very happy about this since it also allows me to combine subdatasets flexibly to create train/val/test splits.
However, it seems that CombinedStreamingDataset has terrible performance when used on hundreds of sub-datasets. It even crashes after a few batches have been yielded."
Suggested Solution:
As per @tchaton's suggestion, we could re-think the CombinedDataset to virtually re-combine the different index.json files into one and decide how to fetch the chunks accordingly, using the right remote path for each. This would be pretty interesting: it would give the same performance as a normal StreamingDataset, with no need to physically merge the sub-datasets or create data copies.