Multi-GPU Training with DDP #1096

breakds opened this issue Nov 30, 2021 · 14 comments

breakds commented Nov 30, 2021

This is a follow-up to #913

Motivation

Add full support for multi-process and multi-GPU training in alf with PyTorch's DDP.

Goals

While achieving the main goal above, we should also make sure that the following specific use cases are considered.

  • In a composite algorithm such as PPG, all the networks involved in training should be properly distributed.
  • If evaluation is turned on, it should only be turned on for the process with rank = 0.
  • There can be parameters that are not updated via backward and the optimizer (e.g. the target updater in SAC). Make sure that the behavior is consistent with the non-distributed version.
  • There can be other variables that affect training but are not synchronized via DDP (e.g. batch normalization statistics). This can introduce inconsistency with the non-distributed version. Investigate whether such inconsistency can be a problem.
  • Find a reasonable way to adapt the metrics.
  • Find a reasonable way to reinterpret termination conditions such as num_env_steps.
  • Resolve the problem that defunct zombie processes are left behind when training is terminated by SIGINT.
  • Run on 4-8 GPUs on the cluster.

Blockers and Issues:

  • DDP + PPG + MetaDrive with the default configuration may get stuck, and the performance is poor. However, it is verified that DDP + PPG + Procgen reaches performance parity (and is even slightly better).
breakds self-assigned this Nov 30, 2021

breakds commented Nov 30, 2021

What exactly should DDP wrap?

Essentially, a DDP module wraps a set of parameters and a function f (which serves as the forward() of the DDP module). After the wrapping, when forward() (i.e. f) is called, all results and intermediate results that depend on the parameters are marked, and autograd hooks are injected for those results.

Later, when backward() is called on those results, the hooks invoke the reducer (gradient synchronization across the subprocesses).

Note that backward() on those results can be called either directly or indirectly (i.e. as a result of calling backward() on values that depend on them). This means that as long as all the to-be-updated parameters are used in f in loss(f(x)), we only need to wrap f (as opposed to having to wrap loss()).
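
As a concrete illustration, here is a minimal sketch (made-up module and shapes, not ALF code) of wrapping only f while keeping loss() as a plain function. It assumes the script is launched as one process per worker with the usual RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT environment variables set:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')  # uses the env:// defaults

f = nn.Linear(8, 2)   # stands in for the network whose parameters are trained
ddp_f = DDP(f)        # wrap only f; the loss stays an ordinary function

def loss(y):
    return (y ** 2).mean()

x = torch.randn(16, 8)
# The autograd hooks injected during ddp_f's forward() also fire when backward()
# reaches them indirectly through loss(), so f's gradients are still all-reduced.
loss(ddp_f(x)).backward()
```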

breakds commented Nov 30, 2021

Can we have more than one DDP-wrapped module in a distributed training?

The answer is yes. It works in theory, and I coded an experiment to verify it. It is worth noting that with more than one DDP-wrapped module, the calling order must be exactly the same across the subprocesses. Because of how DDP works, if the order differs, the reducer of module A in process 1 might be waiting for its counterpart in process 2, while in process 2 the reducer of module B is waiting for its counterpart in process 1 - effectively a textbook example of deadlock.
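
Here is a minimal sketch (made-up modules, not ALF code, again assuming the usual process-group environment variables are set) of the ordering requirement with two DDP-wrapped modules:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')

module_a = DDP(nn.Linear(8, 4))
module_b = DDP(nn.Linear(4, 2))

x = torch.randn(16, 8)
# Safe: every rank runs A's forward then B's forward, so during backward the two
# reducers fire in the same order on every process.
module_b(module_a(x)).mean().backward()
# Unsafe: if one rank ran A before B while another ran B before A (or skipped one
# of them), each reducer would wait for a peer that never arrives - the deadlock
# described above.
```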

breakds commented Dec 1, 2021

While working on enabling the @data_distributed decorator for the off-policy branch, I hit a blocker: at the initial sync, an exception is raised complaining "Tensors must be CUDA and dense".

After some digging I found that the problem comes from the fact that when DDP starts to sync (reduce), it syncs the buffers of the wrapped module as well. All the offending buffers are within the replay buffer. I am working on a generic way to exclude them before the module is wrapped by DDP.

One of the problems is that the replay buffer can be found in named_buffers() but not in the state_dict() of the wrapped module. Investigating the reason now.
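
For reference, a minimal sketch of one possible generic exclusion, under the assumption that the offending buffer names all contain _replay_buffer; wrap_without_replay_buffer is a hypothetical helper, and _set_params_and_buffers_to_ignore_for_model is a private PyTorch API (an alternative is simply to keep the replay buffer outside the wrapped module):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_without_replay_buffer(module: nn.Module) -> DDP:
    # Buffers that live inside the replay buffer are CPU-side tensors, which is
    # what triggers "Tensors must be CUDA and dense" during the initial sync.
    ignored = [name for name, _ in module.named_buffers()
               if '_replay_buffer' in name]
    # Tell DDP not to broadcast/sync these buffers (private helper in recent
    # PyTorch versions), then wrap as usual.
    DDP._set_params_and_buffers_to_ignore_for_model(module, ignored)
    return DDP(module)
```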

breakds commented Dec 1, 2021

After explicitly filtering out _replay_buffer from named_buffers(), I was able to successfully train ppo_cart_pole with DDP.

[figure: ddp_ppo_cart_pole]

The training time is pretty much halved (although training does not account for much of each iteration under this setting and in this project).

[figure: ddp_ppo_cart_pole_time]

(Dark blue is DDP, with 2 GPUs.)

breakds commented Dec 3, 2021

With some hacks I was able to run PPG with DDP on two 3080s. Below is a comparison of the same setup trained on:

  1. Dark blue - a single 3090 with an Intel CPU
  2. Light blue - DDP on dual 3080s with an AMD Threadripper

[figure: ddp_ppg_procgen_return]

Note that the DDP version did better when looking at the return-by-env-steps graph:

[figure: ddp_ppg_return_by_env_steps]

Also, the time consumed is less than on the single 3090:

[figure: ddp_ppg_return_time]

It is actually not 2x but 1.5x faster. I think one of the factors is that a 3090 performs better than a single 3080.

Another reason could be that in this hacky version I had to let DDP figure out which parameters are "unused", which adds overhead. I am still working on removing those hacks.

breakds commented Dec 4, 2021

I got stuck on how find_unused_parameters works for DDP, which is crucial for PPG. I suspect there are bugs in find_unused_parameters, or hidden assumptions that I am not aware of. Will need to run more experiments.

The reason we need it for PPG is that the auxiliary output of PPG's network is not used in the policy phase update, only in the auxiliary phase update. The corresponding parameters therefore become "unused", and DDP does not like that, since it waits for hooks to be called on all parameters.
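
A minimal sketch of the situation (hypothetical DualHeadNet, not PPG's actual network, and assuming the usual process-group environment variables are set): during a policy-phase forward/backward the auxiliary head produces no gradients, so DDP has to be constructed with find_unused_parameters=True to avoid waiting on those hooks forever.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')

class DualHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 16)
        self.policy_head = nn.Linear(16, 4)
        self.aux_head = nn.Linear(16, 1)   # only used in the auxiliary phase

    def forward(self, x, aux_phase: bool):
        h = self.encoder(x)
        return self.aux_head(h) if aux_phase else self.policy_head(h)

# Without find_unused_parameters=True, the policy-phase backward would leave
# aux_head's gradient hooks unfired and DDP would keep waiting for them.
net = DDP(DualHeadNet(), find_unused_parameters=True)
net(torch.randn(16, 8), aux_phase=False).mean().backward()
```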

breakds commented Dec 6, 2021

The above issues can be resolved by #1114 and #1117

breakds commented Dec 7, 2021

With DDP turned on, PPG + MetaDrive can get stuck arbitrarily after several iterations (or several hundred iterations). To make sure that DDP is causing the problem, I also ran another training without DDP, and the result looks good. See below for the comparison.

[figure: ddp_ppg_metadrive_issue]

  1. It occurs to me that maybe the "getting stuck" problem is caused by an insanely long episode, so that the MetaDrive simulator somehow gets stuck. Adding a TimeLimit wrapper may help.
  2. Also, we expect the same or similar training dynamics with DDP turned on and off. Such a dramatic difference indicates that there is some inconsistency, and I need to find out why.

breakds commented Dec 8, 2021

See the red line below: when the auxiliary phase is turned off (effectively PPO), the getting-stuck problem did not reproduce and the training dynamics seemed normal (it is not as efficient as PPG, which is a fact we already know in general).

[figure: ddp_ppg_metadrive_issue_1]

breakds commented Dec 8, 2021

....
INFO:absl:[rank = 0] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 0]
INFO:absl:[rank = 1] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 1]
INFO:absl:[rank = 0] End u = 0, b = 945
INFO:absl:[rank = 1] End u = 0, b = 945
INFO:absl:[rank = 0] Run _update() of b0/960, u = 1
Perform _compute_train_info_and_loss_info with [rank 0]

See below for an explanation of the above debugging log.

Further debugging shows that when it gets stuck, it is inside the _update() of PPGAuxAlgorithm (i.e. the auxiliary phase update). We are supposed to complete 6 updates (u goes from 0 to 5) per process (there are 2 processes, rank 0 and rank 1). Everything was fine until the first update completed on both processes. In the next update (u = 1), only the process with rank = 0 called _compute_train_info_and_loss_info; the one with rank = 1 did not. Because DDP needs to synchronize once both processes have finished calling this function, it waits forever.

breakds commented Dec 8, 2021

Was able to pinpoint the problem at

    experience = alf.nest.map_structure(lambda x: x[indices], experience)

This is where one of the processes got stuck, which is outside the DDP-wrapped code. It is consistently reproducible on 2 different machines.

The above code comes from https://github.com/HorizonRobotics/alf/blob/pytorch/alf/algorithms/algorithm.py#L1372

Both experience and indices are on the CPU, so this is a CPU operation.

breakds commented Dec 8, 2021

This one seems related: https://discuss.pytorch.org/t/training-get-stuck-at-some-iteration-step/48329
There does not seem to be a solution yet.

breakds commented Dec 9, 2021

Latest experiment result: after moving the shuffle to per mini_batch, it worked around the previous stuck point. However, it then freezes when calling the DDP-wrapped function _compute_train_info_and_loss_info.

breakds commented Dec 9, 2021

Can rule out find_unused_parameters as the cause. I tried a hack that worked around find_unused_parameters, and the problem persists.
