Best way to integrate ray as Distributed Backend #1325
Replies: 13 comments
-
Alrighty, after a bunch more reading it turned out that tweaking pytorch_lightning.Trainer and creating a Ray wrapper was easier than expected. You can see a basic API here; it doesn't yet fit the current pytorch_lightning coding style though (I'm more used to component-based development than to mixins). The changes can be seen here. Basically, I added a check for SLURM wherever the code was making use of a SLURM-based environment variable, then created a two-stage wrapper: a "RemoteRayTrainer" which runs on the remote nodes and sets a "LIGHTNING_X" equivalent wherever a SLURM variable was read.

One big-ish problem I see is that the @ray.remote decorator didn't like being used dynamically, which means I couldn't automatically set the resource options (e.g. the number of GPUs) at runtime. This is enough for my own purposes if it works, but as I said I'd love to upstream this if desired. Feel free to request changes at will :-)
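For illustration, here is a minimal sketch of that environment-variable fallback idea, assuming made-up variable names (`LIGHTNING_NODEID`, `SLURM_NODEID`) rather than the ones actually used in the fork:

```python
import os


def resolve_env(name: str, default: str = "0") -> str:
    """Look up a cluster-environment value, preferring a LIGHTNING_* override
    (which the Ray wrapper would export on each remote node) and falling back
    to the SLURM_* variable that Lightning originally read."""
    for candidate in (f"LIGHTNING_{name}", f"SLURM_{name}"):
        value = os.environ.get(candidate)
        if value is not None:
            return value
    return default


# The remote wrapper could export e.g. os.environ["LIGHTNING_NODEID"] = str(node_rank),
# and the trainer then resolves it the same way it used to resolve SLURM_NODEID:
node_rank = int(resolve_env("NODEID", default="0"))
```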
-
Hey, this is great! I'd be happy to move this upstream into Ray too, if we agree on a design. You can use `.options()` to override the decorator's settings at call time:

```python
@ray.remote(num_gpus=1, max_calls=1, num_return_vals=2)
def f():
    return 1, 2

g = f.options(num_gpus=2, max_calls=None)
g.remote()  # uses 2 gpus
```

Reference: https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote

Happy to chat more offline!
-
this is awesome! would love upstream support for our lightning users!
-
Ray announced RaySGD
-
A small update on this: my forked repo now contains a version of the raytrainer that can directly connect to a Ray cluster and use both DDP and DDP2 to train a model. I still have to see how it interacts with callbacks. Thanks to @richardliaw for the pointer to `.options`.
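For anyone curious, a rough sketch (not the actual code from the fork) of connecting to an existing Ray cluster and sizing the remote training tasks at call time via `.options`; `train_on_node`, `num_nodes`, and `gpus_per_node` are made-up names:

```python
import ray


@ray.remote
def train_on_node(node_rank: int, world_size: int):
    # Placeholder for the per-node DDP/DDP2 work: set up the process group,
    # build the model, run trainer.fit(...), etc.
    return f"node {node_rank}/{world_size} done"


def run_distributed(num_nodes: int = 2, gpus_per_node: int = 1):
    # Connect to an already-running Ray cluster instead of starting a local one.
    ray.init(address="auto")

    # .options overrides the decorator's resource requests at call time,
    # which sidesteps the problem of configuring @ray.remote dynamically.
    task = train_on_node.options(num_gpus=gpus_per_node)
    futures = [task.remote(rank, num_nodes) for rank in range(num_nodes)]
    return ray.get(futures)
```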
-
hey @williamFalcon, @Borda, are you all still interested in this? any idea of what's needed to make this happen? would be more than happy to help out if given some pointers!
-
hey! thanks for asking. Right now we have a feature freeze until 1.0.0 (coming in the next few months); we just want to stabilize the package before adding new things :) We’ll ping you once that’s ready to discuss how to do this.
-
Hi @williamFalcon and co! Highly interested in this as a Ray user myself. Is there any movement in this direction, assuming that version 1.0.0 is arriving soon?
-
We’re almost done with 1.0. We can look into this afterwards! Thanks for mentioning this.
-
@amogkam has been working on an integration.
-
Hi @williamFalcon, any update on this? We'd be more than happy to explore an integration!
-
Hi! This integration looks amazing!
-
Hi all, we just finished implementing a Ray backend for distributed PyTorch Lightning training here: https://github.com/ray-project/ray_lightning_accelerators. The package introduces two new PyTorch Lightning accelerators, for DDP and Horovod training on Ray, for quick and easy distributed training. It also integrates with Ray Tune for distributed hyperparameter tuning. Please check it out; we'd love to hear any feedback 🙂
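A usage sketch of what plugging one of these accelerators into a `Trainer` might look like; the import path, class name (`RayAccelerator`), and constructor arguments below are assumptions, so please check the repo README for the actual API:

```python
import pytorch_lightning as pl
# Hypothetical import and class name; see the repo README for the real one.
from ray_lightning import RayAccelerator

model = MyLightningModule()  # placeholder for your existing LightningModule

trainer = pl.Trainer(
    max_epochs=10,
    # Distribute training across 4 Ray workers with 1 GPU each (illustrative).
    accelerator=RayAccelerator(num_workers=4, use_gpu=True),
)
trainer.fit(model)
```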
-
I want to use Ray as a distributed backend, mainly to make use of its autoscaling capabilities. Since I'd like to upstream my solution afterwards, and also get some feedback, I'd like to check with y'all first.
After reading the docs on both sides and the current implementation of LightningDistributedDataParallel, I think there are a few main options:
Currently I'm leaning towards option 1. Options 2 or 3 would give ray.tune's hyperparameter search for free, but they seem like quite a lot of work.
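As a rough illustration of the "hyperparameter search for free" point, a minimal Ray Tune sketch around a Lightning-style training function; `run_training` and the config keys are made up:

```python
from ray import tune


def train_fn(config):
    # Build and fit the LightningModule here using config["lr"], config["batch_size"], ...
    # then report the metric that Tune should optimize.
    val_loss = run_training(lr=config["lr"], batch_size=config["batch_size"])  # placeholder
    tune.report(val_loss=val_loss)


analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=8,                    # number of trials to run
    resources_per_trial={"gpu": 1},   # Ray can autoscale nodes to satisfy this
)
print(analysis.get_best_config(metric="val_loss", mode="min"))
```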