Best way to integrate ray as Distributed Backend #1325
Replies: 13 comments
-
Alrighty, after a bunch more reading it turned out that tweaking pytorch_lightning.Trainer and creating a Ray wrapper was easier than expected. You can see a basic API here; it doesn't yet fit the current pytorch_lightning coding style though (I'm more used to component-based development than to mixins). The changes can be seen here. Basically, I added a check for SLURM wherever the code was making use of a SLURM-based environment variable, then created a two-stage wrapper: a "RemoteRayTrainer" which runs on the remote nodes and sets a "LIGHTNING_X" equivalent wherever a SLURM variable was read.

One big-ish problem I see is that the @ray.remote decorator didn't like being used dynamically, which means I couldn't automatically set the resource options (e.g. the number of GPUs) at runtime. This is enough for my own purposes if it works, but as I said I'd love to upstream this if desired. Feel free to request changes at will :-)
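For illustration, here is a minimal sketch of that environment-variable fallback idea, assuming made-up variable names (`LIGHTNING_NODEID`, `SLURM_NODEID`) rather than the ones actually used in the fork:

```python
import os


def resolve_env(name: str, default: str = "0") -> str:
    """Look up a cluster-environment value, preferring a LIGHTNING_* override
    (which the Ray wrapper would export on each remote node) and falling back
    to the SLURM_* variable that Lightning originally read."""
    for candidate in (f"LIGHTNING_{name}", f"SLURM_{name}"):
        value = os.environ.get(candidate)
        if value is not None:
            return value
    return default


# The remote wrapper could export e.g. os.environ["LIGHTNING_NODEID"] = str(node_rank),
# and the trainer then resolves it the same way it used to resolve SLURM_NODEID:
node_rank = int(resolve_env("NODEID", default="0"))
```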
-
Hey, this is great! I'd be happy to move this upstream into Ray too, if we agree on a design. You can use `.options()` to override the decorator's settings at call time:

```python
@ray.remote(num_gpus=1, max_calls=1, num_return_vals=2)
def f():
    return 1, 2

g = f.options(num_gpus=2, max_calls=None)
g.remote()  # uses 2 gpus
```

Reference: https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote

Happy to chat more offline!
-
this is awesome! would love upstream support for our lightning users!
-
Ray announced RaySGD
-
A small update on this: my forked repo now contains a version of the raytrainer that can directly connect to a Ray cluster and use both DDP and DDP2 to train a model. I still have to see how it interacts with callbacks. Thanks to @richardliaw for the pointer to `.options`.
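For anyone curious, a rough sketch (not the actual code from the fork) of connecting to an existing Ray cluster and sizing the remote training tasks at call time via `.options`; `train_on_node`, `num_nodes`, and `gpus_per_node` are made-up names:

```python
import ray


@ray.remote
def train_on_node(node_rank: int, world_size: int):
    # Placeholder for the per-node DDP/DDP2 work: set up the process group,
    # build the model, run trainer.fit(...), etc.
    return f"node {node_rank}/{world_size} done"


def run_distributed(num_nodes: int = 2, gpus_per_node: int = 1):
    # Connect to an already-running Ray cluster instead of starting a local one.
    ray.init(address="auto")

    # .options overrides the decorator's resource requests at call time,
    # which sidesteps the problem of configuring @ray.remote dynamically.
    task = train_on_node.options(num_gpus=gpus_per_node)
    futures = [task.remote(rank, num_nodes) for rank in range(num_nodes)]
    return ray.get(futures)
```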
-
hey @williamFalcon, @Borda, are you all still interested in this? any idea of what's needed to make this happen? would be more than happy to help out if given some pointers!
-
hey! thanks for asking. Right now we have a feature freeze until 1.0.0 (coming in the next few months); we just want to stabilize the package before adding new things :) We’ll ping you once that’s ready to discuss how to do this.
-
Hi @williamFalcon and co! Highly interested in this as a Ray user myself. Is there any movement in this direction, assuming that version 1.0.0 is arriving soon?
-
We’re almost done with 1.0. We can look into this afterwards! Thanks for mentioning this.
-
@amogkam has been working on an integration.
-
Hi @williamFalcon, any update on this? We'd be more than happy to explore an integration!
-
Hi! This integration looks amazing!
-
Hi all, we just finished implementing a Ray backend for distributed PyTorch Lightning training here: https://github.com/ray-project/ray_lightning_accelerators. The package introduces two new PyTorch Lightning accelerators, for DDP and Horovod training on Ray, for quick and easy distributed training. It also integrates with Ray Tune for distributed hyperparameter tuning. Please check it out; we'd love to hear any feedback 🙂
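A usage sketch of what plugging one of these accelerators into a `Trainer` might look like; the import path, class name (`RayAccelerator`), and constructor arguments below are assumptions, so please check the repo README for the actual API:

```python
import pytorch_lightning as pl
# Hypothetical import and class name; see the repo README for the real one.
from ray_lightning import RayAccelerator

model = MyLightningModule()  # placeholder for your existing LightningModule

trainer = pl.Trainer(
    max_epochs=10,
    # Distribute training across 4 Ray workers with 1 GPU each (illustrative).
    accelerator=RayAccelerator(num_workers=4, use_gpu=True),
)
trainer.fit(model)
```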
-
I want to use Ray as a distributed backend, mainly to make use of its autoscaling capabilities. Since I'd like to upstream my solution afterwards, and also get some feedback, I'd like to check with y'all first.
After reading the docs on both sides and the current implementation of LightningDistributedDataParallel, I think there are a few main options:
Currently I'm leaning towards option 1. Options 2 or 3 would give ray.tune's hyperparameter search for free, but they seem like quite a lot of work.
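As a rough illustration of the "hyperparameter search for free" point, a minimal Ray Tune sketch around a Lightning-style training function; `run_training` and the config keys are made up:

```python
from ray import tune


def train_fn(config):
    # Build and fit the LightningModule here using config["lr"], config["batch_size"], ...
    # then report the metric that Tune should optimize.
    val_loss = run_training(lr=config["lr"], batch_size=config["batch_size"])  # placeholder
    tune.report(val_loss=val_loss)


analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=8,                    # number of trials to run
    resources_per_trial={"gpu": 1},   # Ray can autoscale nodes to satisfy this
)
print(analysis.get_best_config(metric="val_loss", mode="min"))
```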