Multi-GPU Training with DDP #1096

breakds opened this issue Nov 30, 2021 · 14 comments

breakds commented Nov 30, 2021

This is a follow-up to #913

Motivation

Add full support for multi-process and multi-GPU training in alf with PyTorch's DDP.

Goals

While achieving the main goal above, we should also make sure that the following specific use cases are considered.

  • In a composite algorithm such as PPG, all the networks involved in training should be properly distributed.
  • If evaluation is turned on, it should only be turned on for the process with rank = 0.
  • There can be parameters that are not updated via backward and the optimizer (e.g. the target updater in SAC). Make sure that the behavior is consistent with the non-distributed version.
  • There can be other variables that affect training but are not synchronized via DDP (e.g. batch normalization statistics). This can introduce inconsistency with the non-distributed version. Investigate whether such inconsistency can be a problem.
  • Find a reasonable way to adapt the metrics.
  • Find a reasonable way to reinterpret termination conditions such as num_env_steps.
  • Resolve the problem that defunct zombie processes are left behind when training is terminated by SIGINT.
  • Run on 4-8 GPUs on the cluster.

Blockers and Issues:

  • DDP + PPG + MetaDrive with the default configuration may get stuck, and the performance is poor. However, it is verified that DDP + PPG + Procgen reaches performance parity (and is even slightly better).
breakds self-assigned this Nov 30, 2021

breakds commented Nov 30, 2021

What exactly should DDP wrap?

Essentially, a DDP module wraps a set of parameters and a function f (which serves as the forward() of the DDP module). After the wrapping, when forward() (i.e. f) is called, all results and intermediate results that depend on the parameters are marked, and autograd hooks are injected for those results.

Later, when backward() is called on those results, the hooks invoke the reducer (gradient synchronization across the subprocesses).

Note that backward() on those results can be called either directly or indirectly (i.e. as a result of calling backward() on values that depend on them). This means that as long as all the to-be-updated parameters are used in f in loss(f(x)), we only need to wrap f (as opposed to having to wrap loss()).
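
As a concrete illustration, here is a minimal sketch (made-up module and shapes, not ALF code) of wrapping only f while keeping loss() as a plain function. It assumes the script is launched as one process per worker with the usual RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT environment variables set:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')  # uses the env:// defaults

f = nn.Linear(8, 2)   # stands in for the network whose parameters are trained
ddp_f = DDP(f)        # wrap only f; the loss stays an ordinary function

def loss(y):
    return (y ** 2).mean()

x = torch.randn(16, 8)
# The autograd hooks injected during ddp_f's forward() also fire when backward()
# reaches them indirectly through loss(), so f's gradients are still all-reduced.
loss(ddp_f(x)).backward()
```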

breakds commented Nov 30, 2021

Can we have more than one DDP-wrapped module in a distributed training?

The answer is yes. It works in theory, and I coded an experiment to verify it. It is worth noting that with more than one DDP-wrapped module, the calling order must be exactly the same across the subprocesses. Because of how DDP works, if the order differs, the reducer of module A in process 1 might be waiting for its counterpart in process 2, while in process 2 the reducer of module B is waiting for its counterpart in process 1 - effectively a textbook example of deadlock.
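
Here is a minimal sketch (made-up modules, not ALF code, again assuming the usual process-group environment variables are set) of the ordering requirement with two DDP-wrapped modules:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')

module_a = DDP(nn.Linear(8, 4))
module_b = DDP(nn.Linear(4, 2))

x = torch.randn(16, 8)
# Safe: every rank runs A's forward then B's forward, so during backward the two
# reducers fire in the same order on every process.
module_b(module_a(x)).mean().backward()
# Unsafe: if one rank ran A before B while another ran B before A (or skipped one
# of them), each reducer would wait for a peer that never arrives - the deadlock
# described above.
```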

breakds commented Dec 1, 2021

While working on enabling the @data_distributed decorator for the off-policy branch, I hit a blocker: at the initial sync, an exception is raised complaining "Tensors must be CUDA and dense".

After some digging I found that the problem comes from the fact that when DDP starts to sync (reduce), it syncs the buffers of the wrapped module as well. All the offending buffers are within the replay buffer. I am working on a generic way to exclude them before the module is wrapped by DDP.

One of the problems is that the replay buffer can be found in named_buffers() but not in the state_dict() of the wrapped module. Investigating the reason now.
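
For reference, a minimal sketch of one possible generic exclusion, under the assumption that the offending buffer names all contain _replay_buffer; wrap_without_replay_buffer is a hypothetical helper, and _set_params_and_buffers_to_ignore_for_model is a private PyTorch API (an alternative is simply to keep the replay buffer outside the wrapped module):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_without_replay_buffer(module: nn.Module) -> DDP:
    # Buffers that live inside the replay buffer are CPU-side tensors, which is
    # what triggers "Tensors must be CUDA and dense" during the initial sync.
    ignored = [name for name, _ in module.named_buffers()
               if '_replay_buffer' in name]
    # Tell DDP not to broadcast/sync these buffers (private helper in recent
    # PyTorch versions), then wrap as usual.
    DDP._set_params_and_buffers_to_ignore_for_model(module, ignored)
    return DDP(module)
```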

breakds commented Dec 1, 2021

After explicitly filtering out _replay_buffer from named_buffers(), I was able to successfully train ppo_cart_pole with DDP.

[figure: ddp_ppo_cart_pole]

The training time is pretty much halved (although training does not account for much of each iteration under this setting and in this project).

[figure: ddp_ppo_cart_pole_time]

(Dark blue is DDP, with 2 GPUs.)

breakds commented Dec 3, 2021

With some hacks I was able to run PPG with DDP on two 3080s. Below is a comparison of the same setup trained on:

  1. Dark blue - a single 3090 with an Intel CPU
  2. Light blue - DDP on dual 3080s with an AMD Threadripper

[figure: ddp_ppg_procgen_return]

Note that the DDP version did better when looking at the return-by-env-steps graph:

[figure: ddp_ppg_return_by_env_steps]

Also, the time consumed is less than on the single 3090:

[figure: ddp_ppg_return_time]

It is actually not 2x but 1.5x faster. I think one of the factors is that a 3090 performs better than a single 3080.

Another reason could be that in this hacky version I had to let DDP figure out which parameters are "unused", which adds overhead. I am still working on removing those hacks.

breakds commented Dec 4, 2021

I got stuck on how find_unused_parameters works for DDP, which is crucial for PPG. I suspect there are bugs in find_unused_parameters, or hidden assumptions that I am not aware of. Will need to run more experiments.

The reason we need it for PPG is that the auxiliary output of PPG's network is not used in the policy phase update, only in the auxiliary phase update. The corresponding parameters therefore become "unused", and DDP does not like that, since it waits for hooks to be called on all parameters.
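
A minimal sketch of the situation (hypothetical DualHeadNet, not PPG's actual network, and assuming the usual process-group environment variables are set): during a policy-phase forward/backward the auxiliary head produces no gradients, so DDP has to be constructed with find_unused_parameters=True to avoid waiting on those hooks forever.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='gloo')

class DualHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 16)
        self.policy_head = nn.Linear(16, 4)
        self.aux_head = nn.Linear(16, 1)   # only used in the auxiliary phase

    def forward(self, x, aux_phase: bool):
        h = self.encoder(x)
        return self.aux_head(h) if aux_phase else self.policy_head(h)

# Without find_unused_parameters=True, the policy-phase backward would leave
# aux_head's gradient hooks unfired and DDP would keep waiting for them.
net = DDP(DualHeadNet(), find_unused_parameters=True)
net(torch.randn(16, 8), aux_phase=False).mean().backward()
```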

breakds commented Dec 6, 2021

The above issues can be resolved by #1114 and #1117

breakds commented Dec 7, 2021

With DDP turned on, PPG + MetaDrive can get stuck arbitrarily after several iterations (or several hundred iterations). To make sure that DDP is causing the problem, I also ran another training without DDP, and the result looks good. See below for the comparison.

[figure: ddp_ppg_metadrive_issue]

  1. It occurs to me that maybe the "getting stuck" problem is caused by an insanely long episode, so that the MetaDrive simulator somehow gets stuck. Adding a TimeLimit wrapper may help.
  2. Also, we expect the same or similar training dynamics with DDP turned on and off. Such a dramatic difference indicates that there is some inconsistency, and I need to find out why.

breakds commented Dec 8, 2021

See the red line below: when the auxiliary phase is turned off (effectively PPO), the getting-stuck problem did not reproduce and the training dynamics seemed normal (it is not as efficient as PPG, which is a fact we already know in general).

[figure: ddp_ppg_metadrive_issue_1]

breakds commented Dec 8, 2021

....
INFO:absl:[rank = 0] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 0]
INFO:absl:[rank = 1] Run _update() of b945/960, u = 0
Perform _compute_train_info_and_loss_info with [rank 1]
INFO:absl:[rank = 0] End u = 0, b = 945
INFO:absl:[rank = 1] End u = 0, b = 945
INFO:absl:[rank = 0] Run _update() of b0/960, u = 1
Perform _compute_train_info_and_loss_info with [rank 0]

See below for an explanation of the above debugging log.

Further debugging shows that when it gets stuck, it is inside the _update() of PPGAuxAlgorithm (i.e. the auxiliary phase update). We are supposed to complete 6 updates (u goes from 0 to 5) per process (there are 2 processes, rank 0 and rank 1). Everything was fine until the first update completed on both processes. In the next update (u = 1), only the process with rank = 0 called _compute_train_info_and_loss_info; the one with rank = 1 did not. Because DDP needs to synchronize once both processes have finished calling this function, it waits forever.

breakds commented Dec 8, 2021

Was able to pinpoint the problem at

    experience = alf.nest.map_structure(lambda x: x[indices], experience)

This is where one of the processes got stuck, which is outside the DDP-wrapped code. It is consistently reproducible on 2 different machines.

The above code comes from https://github.com/HorizonRobotics/alf/blob/pytorch/alf/algorithms/algorithm.py#L1372

Both experience and indices are on the CPU, so this is a CPU operation.

breakds commented Dec 8, 2021

This one seems related: https://discuss.pytorch.org/t/training-get-stuck-at-some-iteration-step/48329
There does not seem to be a solution yet.

breakds commented Dec 9, 2021

Latest experiment result: after moving the shuffle to per mini_batch, it worked around the previous stuck point. However, it then freezes when calling the DDP-wrapped function _compute_train_info_and_loss_info.

breakds commented Dec 9, 2021

Can rule out find_unused_parameters as the cause. I tried a hack that worked around find_unused_parameters, and the problem persists.
