
Introduce PerProcessContext to adjust DDP per process behavior (#913) #955

Merged 1 commit on Aug 3, 2021

Conversation

@breakds (Contributor) commented Aug 2, 2021

Motivation

  • To make DDP as transparent to the user as possible, we should not expect the user to manually update a configuration to achieve parity between DDP mode and non-DDP mode. More specifically, if a single-process run uses 64 parallel environments, a dual-process run should reduce the per-process number of parallel environments accordingly (to 32).
  • Similar to having only process 0 (the master process) write checkpoints, we want only process 0 to write the summary. This is done by enabling summary only when the process rank is 0 (or -1) when setting up PolicyTrainer.

Therefore, this PR addresses the combined problem of #953 and #954.

Solution

Introduce a global singleton called PerProcessContext. As its name suggests, this context maintains properties that are private to each process. In the current implementation, PerProcessContext allows accessing the rank (process ID) and the total number of processes from anywhere.
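A minimal sketch of what such a process-private singleton might look like (the names and details below are illustrative assumptions, not the actual ALF implementation):

```python
import threading


class PerProcessContext:
    """Process-private context exposing the DDP rank and world size.

    Illustrative sketch only; ALF's real PerProcessContext may differ.
    A rank of -1 means "not running under DDP".
    """

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so every call returns the same instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    inst._rank = -1
                    inst._num_processes = 1
                    cls._instance = inst
        return cls._instance

    def set_distributed(self, rank, num_processes):
        """Record this process's rank and the total process count."""
        self._rank = rank
        self._num_processes = num_processes

    @property
    def rank(self):
        return self._rank

    @property
    def num_processes(self):
        return self._num_processes
```

Because `PerProcessContext()` always returns the same object, any code deep in the call stack can read the rank and process count without the trainer threading a context argument through every layer.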

Design Choices:

  1. Decided to use a global singleton to avoid having to pass the context down to the very bottom of the call stack, as discussed with @emailweixu in #953 (Adjust number of environments in DDP mode based on number of processes).
  2. Decided to disable summary by hijacking run_under_record_context as a whole, instead of just hijacking _cond, because run_under_record_context initializes the writer and writes the global step before the first call to _cond, and we would like to avoid that as well.
  3. Decided to transparently update the number of environments per process in the config, instead of providing a new configuration item, unlike what was suggested by @le-horizon in #953. Both approaches have their benefits, but having to update the config to run distributed training is extra work for the user, and a new configuration item may further hurt the readability of the config file. As long as the training/optimization is the same, DDP and non-DDP runs shouldn't differ in their training results.
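The combined effect of these choices can be sketched as a small helper (a hypothetical illustration assuming the semantics described above, not the actual ALF code):

```python
def adjust_for_ddp(num_parallel_environments, rank, num_processes):
    """Return (per-process env count, whether to enable summary).

    Hypothetical helper illustrating the PR's behavior: the configured
    environment count is split evenly across DDP processes (e.g. 64
    environments over 2 processes -> 32 each), and summary writing is
    enabled only for rank 0, or rank -1 (non-DDP mode).
    """
    per_process_envs = num_parallel_environments // num_processes
    enable_summary = rank in (0, -1)
    return per_process_envs, enable_summary
```

For example, with 64 configured environments and two processes, rank 0 gets 32 environments with summary enabled, while rank 1 gets 32 environments with summary disabled.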

Testing

  1. Ran training without DDP to confirm that the summary and the number of environments are the same as before.
  2. Ran training with DDP in dual-process mode to confirm that
  • the number of environments is halved per process, and
  • only one events file is generated for the summary.

@breakds force-pushed the PR_breakds_per_process_context branch from 79939e1 to 97e2e78 on August 2, 2021 21:16
@emailweixu (Contributor) commented:

Should we add some unittest to ensure DDP can run correctly? (perhaps in a different PR). A possible place for this is alf/bin/train_play_test.py.

@breakds (Contributor, Author) commented Aug 2, 2021:

> Should we add some unittest to ensure DDP can run correctly? (perhaps in a different PR). A possible place for this is alf/bin/train_play_test.py.

Agree. Created #956 for this.

@le-horizon (Contributor) left a comment:

LG

@breakds breakds merged commit 774f51f into pytorch Aug 3, 2021
@breakds breakds deleted the PR_breakds_per_process_context branch August 3, 2021 00:14