Open
Labels
enhancement: New feature or request
Description
🚀 Feature Request
Hey MosaicML team! Thank you so much for this awesome project! I was wondering whether there are any plans to make it framework-agnostic, i.e. to remove the dependency on PyTorch.
Motivation
The general idea of StreamingDataset is very useful, and I believe the wider ML community would benefit if it were decoupled from PyTorch.
Implementation
Here are my thoughts on how we can go about this:
- `torch.utils.data.Dataset` is a simple class that does not depend on the rest of PyTorch (the same is true of `IterableDataset`), so it can be re-implemented here very easily.
- Porting the `distributed.py` file is more challenging, but this is where the CuPy project comes to the rescue. We can have seamless interoperability between CuPy, Jax, Tensorflow and PyTorch tensors via the DLPack API with no copies, and most of the functions in `distributed.py` have similar implementations in CuPy's distributed API.
- As for `StreamingDataLoader`, we can make it an optional install that is only pulled in with the PyTorch backend.
- So my suggestion is: if we use CuPy instead of PyTorch, we can keep this framework-neutral and also have zero-copy interoperability between Jax, TF and Torch.
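
To make the first point concrete, here is a minimal sketch of what framework-free base classes could look like. It assumes only the standard library; the class names mirror PyTorch's, but `RangeStream` is a hypothetical example and nothing here is taken from the actual streaming codebase. It also leaves out conveniences that the PyTorch classes provide, such as dataset concatenation.

```python
from typing import Any, Iterator


class Dataset:
    """Map-style dataset: subclasses implement __getitem__ and __len__."""

    def __getitem__(self, index: int) -> Any:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError


class IterableDataset:
    """Iterable-style dataset: subclasses implement __iter__."""

    def __iter__(self) -> Iterator[Any]:
        raise NotImplementedError


class RangeStream(IterableDataset):
    """Hypothetical example dataset: yields the integers 0..n-1 as samples."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:
        # A real streaming dataset would fetch and decode shards here.
        return iter(range(self.n))
```

Any consumer that only relies on iteration, for example `for sample in RangeStream(3)`, would work unchanged regardless of which tensor library is installed.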
Additional context
If made framework agnostic:
- This can be used with `tf.data` pipelines, which work well with Jax and Tensorflow.
- It fits perfectly into `keras.utils.Sequence`, so we can also use it with Keras 3, which is compatible with TF/Jax/PyTorch backends.
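
As a rough illustration of the second point, the batch protocol that `keras.utils.Sequence` expects is just `__len__` (number of batches) plus `__getitem__` (one batch). The sketch below is duck-typed and uses only the standard library; `BatchedSequence` is a hypothetical name, and a real adapter would presumably subclass `keras.utils.Sequence` and return arrays rather than lists.

```python
import math
from typing import Any, List, Sequence


class BatchedSequence:
    """Hypothetical duck-typed sketch of the __len__/__getitem__ batch protocol."""

    def __init__(self, samples: Sequence[Any], batch_size: int) -> None:
        self.samples = samples
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, batch_index: int) -> List[Any]:
        if not 0 <= batch_index < len(self):
            raise IndexError(batch_index)
        start = batch_index * self.batch_size
        return list(self.samples[start:start + self.batch_size])
```

For instance, `BatchedSequence(list(range(5)), batch_size=2)` exposes three batches, the last of which holds the single leftover sample.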
I would also be happy to help with this if you think it is a potential future direction!