
[REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training (#913) #944

Merged: 5 commits into pytorch on Jul 26, 2021

Conversation

@breakds (Contributor) commented on Jul 23, 2021

Motivation

The effort of #913 requires updating the entry-point script alf/bin/train.py. Those updates will rearrange some of the code logic to minimize duplication. I think it is good practice to split the work and submit the refactor first, so that the change lands incrementally, with the following benefits:

  1. Since this is mainly a refactoring PR, it is much easier to review (and to revert if needed).
  2. The DDP project may take a few days to finish, and if someone else has to update the original train.py in the meantime, the risk of merge conflicts increases. Submitting this PR first minimizes that risk.

How is it refactored?

  1. Abstracted out the functions that do setup work such as logging and snapshot.
  2. Added a new flag --distributed which takes an enum value. The multi-gpu option is currently disabled explicitly.
  3. Converted train_eval into training_worker so that it does the training/evaluation job as a single process.
  4. Later, multi-gpu mode will just run several processes via multiprocessing, where each process runs training_worker (see the sketch after this list).
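
Below is a minimal sketch of the structure this refactor aims for. It is hedged: apart from the --distributed flag and training_worker named above, the helper names, flag values, and setup details are assumptions for illustration, not the exact code of this PR.

```python
from absl import app, flags, logging

FLAGS = flags.FLAGS

# New enum-valued flag; 'multi-gpu' is recognized but explicitly
# rejected until the DDP work of #913 lands. The value names here
# are assumptions.
flags.DEFINE_enum('distributed', 'none', ['none', 'multi-gpu'],
                  'Whether to run training on multiple GPUs.')


def _setup_logging(log_dir):
    # Hypothetical abstracted setup helper (logging, snapshot, ...),
    # shared by the single-GPU and future multi-GPU code paths.
    logging.get_absl_handler().use_absl_log_file(log_dir=log_dir)


def training_worker(rank=0, world_size=1):
    # Former train_eval: runs the whole training/evaluation job as a
    # single process. In multi-GPU mode, each spawned process would
    # run this function with its own rank.
    _setup_logging(log_dir='.')
    # ... construct environment/algorithm and run train/eval ...


def main(_):
    if FLAGS.distributed == 'multi-gpu':
        # Multi-GPU is deliberately disabled in this PR.
        raise RuntimeError('DDP is currently unavailable.')
    training_worker()


if __name__ == '__main__':
    app.run(main)
```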

How is it tested?

  1. Disable multi-gpu - Verified that when multi-gpu is specified via --distributed, the script prints a message telling the user that DDP is currently unavailable.
  2. No regression - Verified that when running in single-GPU mode by not specifying --distributed (i.e. the same behavior as before), ac_breakout trains successfully and the score reaches 50 within 4k training steps.
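
To illustrate step 4 of the refactor, the future multi-gpu mode would amount to spawning one training_worker per GPU. The following is a sketch under the assumption that torch.multiprocessing.spawn is the mechanism used; the eventual #913 implementation may differ.

```python
import torch
import torch.multiprocessing as mp

from alf.bin.train import training_worker  # assumed import path


def run_multi_gpu():
    # One process per visible GPU; mp.spawn invokes
    # training_worker(rank, world_size), passing the process index
    # as the first positional argument.
    world_size = torch.cuda.device_count()
    mp.spawn(training_worker, args=(world_size, ), nprocs=world_size)
```

The "Specify authoritative url and port" commit listed below presumably sets the rendezvous address that such spawned processes would use to form a DDP process group.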

@breakds (Contributor, Author) commented on Jul 25, 2021

All comments resolved. PTAL.

@breakds merged commit 704e5c9 into pytorch on Jul 26, 2021.
@breakds deleted the PR_distributed_train_py branch on Jul 26, 2021 at 23:46.
pd-perry pushed a commit to pd-perry/alf referencing this pull request on Dec 11, 2021:
[REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training (HorizonRobotics#913) (HorizonRobotics#944)

* [REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training

* Address Wei's comments

* Address Haonan's comments

* Specify authoritative url and port as well

* Remove unused Optional typing