Implement distributed training using horovod #1865
Open
I implemented distributed training using Horovod, similar to the implementation already merged into DeepSpeech.
A while ago I opened discussion #1849 asking whether this feature is wanted, but there hasn't been an answer yet.
I tried to keep the changes as minimal as possible, so it is still possible to run the undistributed code path. I also noticed a slight performance improvement from Horovod on one of our IBM machines with 6 V100 cards.
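To illustrate the "minimal changes, undistributed still works" approach, here is a rough sketch of the usual Horovod integration pattern. This is not the actual diff in this PR; the helper names (`world_size`, `rank`) and the learning-rate value are hypothetical, and the optimizer wrapping is shown as the standard `hvd.DistributedOptimizer` pattern from the Horovod docs.

```python
# Sketch only: minimal Horovod hooks with a fallback so the
# single-process (undistributed) path keeps working unchanged.
try:
    import horovod.tensorflow as hvd  # optional dependency
    hvd.init()
    _HVD = True
except ImportError:
    _HVD = False

def world_size():
    # Number of training processes (1 when Horovod is absent).
    return hvd.size() if _HVD else 1

def rank():
    # This process's rank (0 when Horovod is absent).
    return hvd.rank() if _HVD else 0

# Common Horovod practice: scale the learning rate by the worker count.
base_lr = 0.0001  # hypothetical value for illustration
lr = base_lr * world_size()

# In the training setup, the optimizer would then be wrapped, e.g.:
#   optimizer = tf.train.AdamOptimizer(lr)
#   if _HVD:
#       optimizer = hvd.DistributedOptimizer(optimizer)
# and checkpoints/logs written only on rank() == 0.
```

Launching across the 6 GPUs would then be something like `horovodrun -np 6 python DeepSpeech.py ...`, while a plain `python DeepSpeech.py` keeps running exactly as before.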
I didn't add any CI because I don't have any knowledge of it.
If you need any help with Horovod, don't hesitate to ask.