The raytf framework provides a simple interface for distributed training on Ray, covering TensorFlow/PyTorch/MXNet. Currently only TensorFlow is supported; the others will be added later.
Only tested with Python 3.6.
- Install the latest ray version:
pip install ray
- Install the latest raytf:
pip install raytf
- Git clone this project:
git clone https://github.com/zuston/raytf.git
- Enter the example folder and run the Python script, as shown below (if you do not have a Ray cluster running yet, see the note after this list).
cd raytf
cd example
python mnist.py
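These steps assume a Ray cluster is reachable. If you just want to try the example on a single machine, you can first bring up a one-node cluster with Ray's own CLI (a generic Ray sketch, not specific to this project; whether the example attaches to it or starts its own local instance depends on how the script calls ray.init()):
ray start --head    # start a single-node Ray cluster on this machine
python mnist.py     # run the example
ray stop            # shut the local node down when finished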
from raytf.raytf_driver import Driver

# When running on a single local machine for debugging, uncomment:
# import ray
# ray.init()

tf_cluster = Driver.build(
    resources={
        'ps': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 2},
        'worker': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 6},
        'chief': {'cores': 2, 'memory': 2, 'gpu': 2, 'instances': 1}
    },
    event_log='/tmp/opal/4',
    resources_allocation_timeout=10
)

# `process` is the user-defined training function (see example/mnist.py).
tf_cluster.start(model_process=process, args=None)
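The `process` callable passed to `tf_cluster.start` is user-defined; the repository's `example/mnist.py` contains a working version. Below is a hypothetical, minimal sketch of such a function. It assumes the training code uses the TensorFlow Estimator API (the only mode this project has been tested with, see the notes at the end) and that raytf exposes the cluster specification to each role via the standard `TF_CONFIG` environment variable, which `tf.estimator` reads automatically; the exact signature raytf expects may differ.
import tensorflow as tf

def process():
    # Hypothetical signature: the arguments raytf actually passes may differ,
    # so treat this as an illustration rather than the library's contract.
    def input_fn():
        features = {'x': [[1.0], [2.0], [3.0], [4.0]]}
        labels = [[0.0], [1.0], [2.0], [3.0]]
        return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

    # tf.estimator picks up the distributed cluster spec from TF_CONFIG,
    # which is assumed to be populated for the ps/worker/chief roles.
    estimator = tf.estimator.LinearRegressor(
        feature_columns=[tf.feature_column.numeric_column('x')])
    tf.estimator.train_and_evaluate(
        estimator,
        tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
        tf.estimator.EvalSpec(input_fn=input_fn, steps=10))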
This training code attaches to an existing on-prem Ray cluster. For debugging, you can call ray.init() to start a local Ray cluster instead.
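For reference, the two modes map onto the standard Ray initialization calls (a sketch using Ray's public API; the cluster address depends on your deployment):
import ray

# Attach to an already-running (on-prem) Ray cluster:
ray.init(address='auto')

# Or start a throwaway local Ray instance for debugging:
# ray.init()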
When you specify event_log in the TensorFlow builder, a sidecar TensorBoard is started on one worker.
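If you prefer to inspect the event logs yourself, the standard TensorBoard CLI can be pointed at the same directory (assuming TensorBoard is installed where the logs are accessible; the path below matches the event_log used in the example above):
tensorboard --logdir /tmp/opal/4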
Gang scheduling is supported. In addition, raytf lets you configure how long to wait for resources, as shown in the code above; the resources_allocation_timeout option is specified in seconds.
Requirement: python -m pip install twine
python setup.py bdist_wheel --universal
python -m pip install xxxxxx.whl
twine upload dist/*
- To solve the problem of importing Python modules on a Ray on-prem cluster, this project requires Ray 1.5+; refer to this RFC (ray-project/ray#14019).
- This project has only been tested with TensorFlow Estimator training.