Making Streaming Dataset framework agnostic: Removing PyTorch dependency #551

@Abhijit-2592

🚀 Feature Request

Hey MosaicML team! Thank you so much for this awesome project! I was wondering whether there are any plans to make it framework agnostic, i.e. to remove the dependency on PyTorch.

Motivation

The general idea of StreamingDataset is very useful, and I believe the wider ML community would benefit if it were decoupled from PyTorch.

Implementation

Here are my thoughts on how we can go about this:

  • torch.utils.data.Dataset is a simple class that does not depend on the rest of PyTorch (the same is true of IterableDataset), so both can easily be re-implemented here.
  • Porting the distributed.py file is more challenging, but this is where the CuPy project comes to the rescue. CuPy, JAX, TensorFlow and PyTorch tensors can interoperate seamlessly, with no copies, via the DLPack API, and most of the functions in distributed.py have similar implementations in CuPy's distributed API.
  • StreamingDataLoader could become an optional install, available only when installing with the PyTorch backend.
  • So my suggestion is to use CuPy instead of PyTorch: this keeps the library framework neutral and also gives zero-copy interoperability between JAX, TF and Torch.
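To illustrate the first point above, here is a minimal sketch of what pure-Python stand-ins for the two base classes might look like. It assumes only the map-style (`__getitem__`) and iterable-style (`__iter__`) contracts matter; `RangeSquares` and `CountingStream` are hypothetical toy subclasses added for illustration, and this is not the implementation used by PyTorch or by this project.

```python
from typing import Any, Iterator


class Dataset:
    """Map-style base class: subclasses implement __getitem__ (and usually __len__)."""

    def __getitem__(self, index: int) -> Any:
        raise NotImplementedError


class IterableDataset(Dataset):
    """Iterable-style base class: subclasses implement __iter__."""

    def __iter__(self) -> Iterator[Any]:
        raise NotImplementedError


class RangeSquares(Dataset):
    """Hypothetical toy dataset, used only to exercise the map-style contract."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index * index


class CountingStream(IterableDataset):
    """Hypothetical toy stream, used only to exercise the iterable-style contract."""

    def __init__(self, limit: int) -> None:
        self.limit = limit

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.limit))
```

Neither class imports a framework, so any backend-specific piece (such as a PyTorch DataLoader wrapper) could sit in an optional extra on top of them.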

Additional context

If made framework agnostic:

  • It could be used with tf.data pipelines, which work well with both JAX and TensorFlow.
  • It would fit naturally into keras.utils.Sequence, so it could also be used with Keras 3, which is compatible with the TF, JAX and PyTorch backends.
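As a rough sketch of the two integration points above, the adapter below follows the batch-indexing protocol that keras.utils.Sequence expects (`__len__` returns the number of batches, `__getitem__` returns one batch), and the generator is the kind of callable that could be handed to `tf.data.Dataset.from_generator`. It is deliberately duck-typed and imports neither Keras nor TensorFlow; `BatchedSequence` and `sample_generator` are hypothetical names, and real code would subclass the Keras class and supply an `output_signature`.

```python
import math
from typing import Any, Iterator, List, Sequence


class BatchedSequence:
    """Duck-typed stand-in for the keras.utils.Sequence protocol (assumption:
    only __len__ and __getitem__ are needed for this illustration)."""

    def __init__(self, samples: Sequence[Any], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.samples = samples
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, batch_index: int) -> List[Any]:
        if not 0 <= batch_index < len(self):
            raise IndexError(batch_index)
        start = batch_index * self.batch_size
        stop = min(start + self.batch_size, len(self.samples))
        return [self.samples[i] for i in range(start, stop)]


def sample_generator(samples: Sequence[Any]) -> Iterator[Any]:
    """Yield samples one at a time; wrap in a zero-argument callable
    (e.g. lambda: sample_generator(ds)) before passing to from_generator."""
    for i in range(len(samples)):
        yield samples[i]
```

Any map-style dataset that supports `len()` and integer indexing could be passed as `samples`, which is why the underlying dataset does not need to know about either framework.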

I would also be happy to help with this if you think it is a potential future direction!

Labels: enhancement (New feature or request)