
[REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training (#913) #944

Merged: 5 commits into pytorch on Jul 26, 2021

Conversation

@breakds (Contributor) commented on Jul 23, 2021

Motivation

The effort of #913 requires updating the entry-point script alf/bin/train.py. Those updates will rearrange some of the code logic to minimize duplication. I think it is good practice to split the work and submit the refactor first, so that the change lands incrementally, with the following benefits:

  1. Since this is mainly a refactoring PR, it is much easier to review (and to revert if needed).
  2. The DDP project may take a few days to finish, and if someone else has to update the original train.py in the meantime, the risk of merge conflicts increases. Submitting this PR first minimizes that risk.

How is it refactored?

  1. Abstracted out the functions that do setup work such as logging and snapshot.
  2. Added a new flag --distributed which takes an enum value. The multi-gpu option is currently disabled explicitly.
  3. Converted train_eval into training_worker so that it does the training/evaluation job as a single process.
  4. Later, multi-gpu mode will just run several processes via multiprocessing, where each process runs training_worker (see the sketch after this list).
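
Below is a minimal sketch of the structure this refactor aims for. It is hedged: apart from the --distributed flag and training_worker named above, the helper names, flag values, and setup details are assumptions for illustration, not the exact code of this PR.

```python
from absl import app, flags, logging

FLAGS = flags.FLAGS

# New enum-valued flag; 'multi-gpu' is recognized but explicitly
# rejected until the DDP work of #913 lands. The value names here
# are assumptions.
flags.DEFINE_enum('distributed', 'none', ['none', 'multi-gpu'],
                  'Whether to run training on multiple GPUs.')


def _setup_logging(log_dir):
    # Hypothetical abstracted setup helper (logging, snapshot, ...),
    # shared by the single-GPU and future multi-GPU code paths.
    logging.get_absl_handler().use_absl_log_file(log_dir=log_dir)


def training_worker(rank=0, world_size=1):
    # Former train_eval: runs the whole training/evaluation job as a
    # single process. In multi-GPU mode, each spawned process would
    # run this function with its own rank.
    _setup_logging(log_dir='.')
    # ... construct environment/algorithm and run train/eval ...


def main(_):
    if FLAGS.distributed == 'multi-gpu':
        # Multi-GPU is deliberately disabled in this PR.
        raise RuntimeError('DDP is currently unavailable.')
    training_worker()


if __name__ == '__main__':
    app.run(main)
```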

How is it tested?

  1. Disable multi-gpu - Verified that when multi-gpu is specified via --distributed, the script prints a message telling the user that DDP is currently unavailable.
  2. No regression - Verified that when running in single-GPU mode by not specifying --distributed (i.e. the same behavior as before), ac_breakout trains successfully and the score reaches 50 within 4k training steps.
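
To illustrate step 4 of the refactor, the future multi-gpu mode would amount to spawning one training_worker per GPU. The following is a sketch under the assumption that torch.multiprocessing.spawn is the mechanism used; the eventual #913 implementation may differ.

```python
import torch
import torch.multiprocessing as mp

from alf.bin.train import training_worker  # assumed import path


def run_multi_gpu():
    # One process per visible GPU; mp.spawn invokes
    # training_worker(rank, world_size), passing the process index
    # as the first positional argument.
    world_size = torch.cuda.device_count()
    mp.spawn(training_worker, args=(world_size, ), nprocs=world_size)
```

The "Specify authoritative url and port" commit listed below presumably sets the rendezvous address that such spawned processes would use to form a DDP process group.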

@breakds (Contributor, Author) commented on Jul 25, 2021

All comments resolved. PTAL.

@breakds merged commit 704e5c9 into pytorch on Jul 26, 2021.
@breakds deleted the PR_distributed_train_py branch on Jul 26, 2021 at 23:46.
pd-perry pushed a commit to pd-perry/alf referencing this pull request on Dec 11, 2021:
[REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training (HorizonRobotics#913) (HorizonRobotics#944)

* [REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training

* Address Wei's comments

* Address Haonan's comments

* Specify authoritative url and port as well

* Remove unused Optional typing