Improve CombinedStreamingDataset to handle multiple subdatasets efficiently #386

Open
bhimrazy opened this issue Oct 2, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

bhimrazy commented Oct 2, 2024

Issue Title: Improve CombinedStreamingDataset to handle multiple subdatasets efficiently

Description:

As raised by Emile Clastres, CombinedStreamingDataset performs poorly when used on hundreds of sub-datasets and crashes after only a few batches, which makes it impractical for scenarios where many sub-datasets need to be combined.

Initial User Inquiry:

"I have an S3 bucket in which I have hundreds of directories containing the output of LitData's optimize on some subset of my full data. This structure comes from the fact that each subdataset was processed independently on different machines and called its own optimize. I was very happy about this since it also allows me to combine subdatasets flexibly to create train/val/test splits.

However, it seems that CombinedStreamingDataset has terrible performance when used on hundreds of sub-datasets. It even crashes after a few batches have been yielded."

Suggested Solution:

As per @tchaton's suggestion, we could also rethink the CombinedDataset to virtually recombine the different index.json files into one and decide how to fetch the chunks accordingly, using the right remote path for each. This would be pretty interesting: the same performance as a normal StreamingDataset, but with no need to physically combine the sub-datasets and create a copy of the data.
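To make the idea concrete, here is a minimal sketch of what "virtually recombining" the indexes could look like. It is not LitData's implementation: `VirtualChunk`, `merge_indexes` and `locate` are hypothetical names, and the sketch assumes each parsed index.json exposes a `chunks` list whose entries carry `filename` and `chunk_size` (number of samples) fields. Each merged chunk remembers which remote directory it came from, so a global sample index can be resolved to the right remote path without copying any data.

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VirtualChunk:
    """Hypothetical record: one chunk plus the sub-dataset it belongs to."""
    remote_dir: str   # sub-dataset directory the chunk lives in
    filename: str     # chunk file name as listed in that sub-dataset's index
    num_samples: int  # number of samples stored in the chunk


def merge_indexes(indexes: Dict[str, dict]) -> List[VirtualChunk]:
    """Flatten several parsed index.json dicts into a single chunk list.

    `indexes` maps a remote directory to its parsed index.json content.
    The "chunks", "filename" and "chunk_size" keys are assumed field names.
    """
    merged: List[VirtualChunk] = []
    for remote_dir, index in indexes.items():
        for chunk in index["chunks"]:
            merged.append(
                VirtualChunk(remote_dir.rstrip("/"), chunk["filename"], chunk["chunk_size"])
            )
    return merged


def locate(merged: List[VirtualChunk], global_idx: int) -> Tuple[str, int]:
    """Map a global sample index to (remote chunk path, index inside that chunk)."""
    # Cumulative sample counts: ends[i] is the first global index after chunk i.
    ends = list(accumulate(c.num_samples for c in merged))
    total = ends[-1] if ends else 0
    if not 0 <= global_idx < total:
        raise IndexError(f"sample index {global_idx} out of range (0..{total - 1})")
    pos = bisect_right(ends, global_idx)
    chunk = merged[pos]
    start = ends[pos] - chunk.num_samples
    return f"{chunk.remote_dir}/{chunk.filename}", global_idx - start
```

In a real implementation the cumulative counts would be computed once rather than on every lookup, and shuffling and distributed sharding would operate on the merged chunk list exactly as they do for a single dataset.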

@bhimrazy bhimrazy added the enhancement New feature or request label Oct 2, 2024