Support Distributed Training in alf #913

Closed · 2 tasks done
breakds opened this issue Jun 28, 2021 · 42 comments
Assignees: breakds
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

breakds commented Jun 28, 2021

As discussed with @emailweixu, it would be nice to have alf support multi-GPU training. The goals are:

  • As a starter project, help me gain enough knowledge about the repository itself
  • Be able to train models on machines with multiple GPUs (single machine)
breakds added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Jun 28, 2021
breakds self-assigned this on Jun 28, 2021

hnyu commented Jun 28, 2021

@breakds FYI, two reference papers I came across a while ago (RL scenario):

https://openreview.net/pdf?id=H1gX8C4YPr
https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

Although they were proposed for multi-machine training, our single-machine multi-GPU setting is a simpler special case.

Or refer to PyTorch's official multi-GPU support (general DL scenario):
https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html


breakds commented Jun 29, 2021

Thanks @hnyu for the references!


breakds commented Jul 8, 2021

The main idea of the first paper (DD-PPO) is to early-stop the slow simulations during rollout with a batched environment (potentially distributed over different machines in a cluster), and to use the full experience from some of the environments and partial experience from the early-stopped ones during training in each iteration. I think we can borrow these ideas in the near future.

As the first step, I will look into how PyTorch's DataParallel is implemented and use it (or a similar technique) to enable, on a single machine with multiple GPUs,

  1. Network(s) forward evaluation during rollout
  2. Network(s) forward/backward evaluation during training

in each training iteration.


breakds commented Jul 13, 2021

Currently I am hitting two problems with DataParallel:

  1. When tried on a simple backward() operation, the DataParallel version (2 GPUs) takes 1.8 seconds while the single-GPU version only takes 0.5 seconds, and I haven't figured out why. I think this at least suggests that the overhead of DataParallel is pretty significant. (See the timing note after this list.)
  2. Directly applying DataParallel to our network crashes. This is another thing that I am working on.
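
A quick note on timing CUDA code (a minimal sketch with a placeholder model and shapes, not the actual experiment): kernel launches are asynchronous, so wall-clock timing without torch.cuda.synchronize() can mostly measure launch overhead rather than compute. Something like the following keeps the comparison fair:

import time

import torch
import torch.nn as nn

# Placeholder model and batch; substitute the real network under test.
model = nn.DataParallel(nn.Linear(1024, 1024).cuda())
x = torch.rand(25600, 1024, device='cuda')

torch.cuda.synchronize()  # make sure pending work is done before starting the clock
start = time.time()
for _ in range(100):
    model(x).sum().backward()
torch.cuda.synchronize()  # wait for all kernels to finish before reading the clock
print(f'{time.time() - start:.3f} seconds for 100 forward/backward passes')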


hnyu commented Jul 13, 2021

  1. When tried on a simple backward() operation, the DataParallel version (2 GPUs) takes 1.8 seconds while the single-GPU version only takes 0.5 seconds, and I haven't figured out why. I think this at least suggests that the overhead of DataParallel is pretty significant.

I think multi-gpu only makes sense for a large mini-batch with intensive computation. What is your setup?


breakds commented Jul 13, 2021

I think multi-gpu only makes sense for a large mini-batch with intensive computation. What is your setup?

Yep I think that is what happened. I was testing the forward and backward of a network like this:

import torch.nn as nn


class Network(nn.Module):
    def __init__(self, input_size, output_size):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, 384)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(384, 64)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(64, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h = self.fc1(x)
        h = self.relu1(h)
        h = self.fc2(h)
        h = self.relu2(h)
        h = self.fc3(h)
        h = self.relu3(h)
        h = self.fc4(h)
        h = self.sigmoid(h)
        return h

I realized that this network is probably too small, because even when a batch of 25600 is passed in, the consumed time pretty much does not change.

I am now trying to fix issue No. 2 so that I can experiment on an actual network that is used in alf.


hnyu commented Jul 13, 2021

Our expected scenario for multi-GPU is image inputs with a large batch size, so you could try dummy image inputs instead.

Besides running time, another scenario is to split SGD memory consumption across multiple cards, if one card is not enough.


breakds commented Jul 14, 2021

That makes a lot of sense. Thanks for the suggestions and clarification!


breakds commented Jul 14, 2021

I was using ActorDistributionNetwork with a batch of randomly generated images to run the experiment, and got

Traceback (most recent call last):
  File "/nix/store/4s0h5aawbap3xhldxhcijvl26751qrjr-python3-3.8.9/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nix/store/4s0h5aawbap3xhldxhcijvl26751qrjr-python3-3.8.9/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/breakds/projects/alf/alf/bin/experiment/dp_network_experiment.py", line 42, in <module>
    action_distribution, actor_state = actor_network(observation, state=())
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    return self.gather(outputs, self.output_device)
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 180, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 76, in gather
    res = gather_map(outputs)
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 71, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/nix/store/1nhxgafz45v9sivabxw0aqr0dvpyw1nc-python3-3.8.9-env/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 71, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: 'Categorical' object is not iterable

With some investigation, I realized that it fails because DataParallel internally does not know how to combine Categorical objects, which are the output of ActorDistributionNetwork. DataParallel works in two steps:

  1. Scatter (think of map)
  2. Gather (think of reduce)

This problem happens at the last step, "Gather". I will use a slightly modified network to work around this and continue the experiment (see the sketch below).
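
For the record, one possible shape of that workaround (a sketch only; the wrapper name is mine, and the assumption is that the wrapped network returns a Categorical plus a state): have forward() return plain tensors such as the logits, which gather() knows how to concatenate, and rebuild the distribution on the output device afterwards.

import torch.nn as nn
import torch.distributions as td


class LogitsOnlyWrapper(nn.Module):
    """Hypothetical wrapper so DataParallel's gather step only sees tensors."""

    def __init__(self, net):
        super().__init__()
        self._net = net

    def forward(self, observation, state=()):
        dist, new_state = self._net(observation, state=state)
        # Categorical logits are plain tensors, which gather() can concatenate.
        return dist.logits, new_state

# After the gather, rebuild the distribution on the output device:
#   logits, state = parallel_net(observation, state=())
#   dist = td.Categorical(logits=logits)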

However, the final solution should make multi-GPU as transparent as possible so that it is convenient to use.

Directly applying DataParallel may not be the solution we are looking for, partly because of the above issue. This is something to think about later.


breakds commented Jul 14, 2021

After slightly modifying the ActorDistributionNetwork (for experiment purposes), I was able to run DataParallel with 2 x 3080:

import torch
import torch.nn as nn
import alf
from alf.networks import ActorDistributionNetwork
from alf.tensor_specs import BoundedTensorSpec
import functools
import time

if __name__ == '__main__':
    alf.set_default_device('cuda')

    CONV_LAYER_PARAMS = ((32, 8, 4), (64, 4, 2), (64, 3, 1))

    actor_network_cls = functools.partial(
        ActorDistributionNetwork,
        fc_layer_params=(512, ),
        conv_layer_params=CONV_LAYER_PARAMS)

    actor_network = nn.DataParallel(actor_network_cls(
        input_tensor_spec=BoundedTensorSpec(
            shape=(4, 150, 150), dtype=torch.float32, minimum=0., maximum=1.),
        action_spec=BoundedTensorSpec(
            shape=(), dtype=torch.int64, minimum=0, maximum=3)))

    start_time = time.time()
    for i in range(1000):
        observation = torch.rand(640, 4, 150, 150)
        action_distribution, actor_state = actor_network(observation, state=())
    print(f'{time.time() - start_time} seconds elapsed')

I can see the load being distributed to the 2 cards (as well as the memory). However, compared to running the same piece of code on a single 3080 without DataParallel:

  1. The total memory consumption across both cards is significantly greater than that of the single-card, non-DataParallel version
  2. The non-DataParallel version took 6 seconds to finish on a single card, while the DataParallel version took about 1 minute

This almost renders DataParallel unusable. I will continue investigating to see why this odd behavior exists, and will discuss with people who have more experience with this tomorrow.


hnyu commented Jul 14, 2021

The inefficiency of DataParallel seems unreasonable. There must be something wrong going on.


breakds commented Jul 14, 2021

The inefficiency of DataParallel seems unreasonable. There must be something wrong going on.

Or maybe this is by design. I can try to look into where the time is being spent.

emailweixu commented:

According to https://pytorch.org/tutorials/intermediate/ddp_tutorial.html, DataParallel might be even slower than DistributedDataParallel


breakds commented Jul 14, 2021

According to https://pytorch.org/tutorials/intermediate/ddp_tutorial.html, DataParallel might be even slower than DistributedDataParallel

Yep, I can see that the GIL issue makes sense. DistributedDataParallel is even harder to integrate; if we are willing to spend more effort, it would probably be better to roll our own solution that suits us better.

@hnyu and I chatted about this today, and I agree with Haonan that we might want to adjust our goal and go for a slightly more complicated (i.e., it might require a structural update) customized solution. We can chat more about this tomorrow.
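
For reference, the usual single-machine DistributedDataParallel skeleton looks roughly like the sketch below (the model, loop, backend, and port number are placeholder assumptions, not alf's actual setup):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'  # any free port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(64, 4).to(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(10):
        loss = model(torch.rand(128, 64, device=rank)).mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        opt.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)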

breakds changed the title from "Multi-GPU support in alf" to "Support Distributed Training in alf" on Jul 14, 2021

breakds commented Jul 15, 2021

  1. Successfully ran DDP on the 2-GPU machine with ActorDistributionNetwork. Preliminary results show about a 25% performance improvement vs. the non-parallel version (this is only a single data point, because it is from that one 2-GPU machine).
  2. Experimented with running DDP in alf. There are definitely a lot of caveats. I am very close to getting ac_breakout running, but at this moment I need to resolve "Decouple environment creation and configuration file parsing" (#930) first.

breakds added commits that referenced this issue (Jul 19–22, 2021):

This is part of the effort to address #913. A sub-task requires extracting the worker logic out of the class (for some reason keeping it inside prevents `multiprocessing` from working correctly). Without this change, `multiprocessing.Process` will just get stuck on `start()`.

This is part of the effort to unblock #913. Two reasons for this change:

1. `worker` does not rely on `ProcessEnvironment` at all, so it is cleaner to make it independent of `ProcessEnvironment`.
2. If it stays as a member method of `ProcessEnvironment`, `multiprocessing.Process` will get stuck on `start()` if the parent process is also a `multiprocessing.Process`, for reasons unknown (I tried investigating but haven't figured it out).

breakds commented Jul 22, 2021

It turns out that my hypothesis that GPU 0 and GPU 1 run sequentially was wrong. I made 2 mistakes in my toy example with ActorDistributionNetwork:

  1. I did not call backward(), so the 2 processes were never synchronized after initialization.
  2. I did not run enough iterations; it turns out that device 1 takes a few seconds to warm up (not sure why).

The good news is that after the fix, the toy example

    start_time = time.time()
    for i in range(2500):
        observation = torch.rand(batch_size, 4, 84, 84, device=rank)
        action_distribution, actor_state = actor_network(observation, state=())
        action = action_distribution.sample()
        reward = torch.rand(batch_size, device=rank)
        loss = - torch.mean(action_distribution.log_prob(action) * reward)
        loss.backward()
        if i % 100 == 0:
            print(f'iteration {i} - {time.time() - start_time} seconds elapsed on device {rank}')
    print(f'{time.time() - start_time} seconds elapsed on device {rank}')

proves that when batch_size is big enough (in my comparison, the batch size is divided by the number of GPUs for the multi-GPU run), multi-GPU has some gain in terms of elapsed time:

Single GPU: 31.069265604019165 seconds elapsed on device 0

Double GPU:
24.037397384643555 seconds elapsed on device 1
24.038329362869263 seconds elapsed on device 0

for batch size = 1024 per batch. The bad news is that this does not explain why running alf.bin.train results in only one process progressing. Keep investigating.


hnyu commented Jul 22, 2021

I think for this toy example, you can also try more complex CNN architectures (like ResnetEncodingNetwork) to see the gain.


breakds commented Jul 22, 2021

It turns out the problem is still that one of the two processes throws an exception, but the exception is not observed.

About "the exception is not observed"

Actually, I researched (experimented with) how exceptions in nested subprocesses work yesterday, and thought I had solved this problem. Sadly, there are still certain cases where such an exception is raised silently. Normally I would expect it to show up in the terminal, because I explicitly catch it and print it in the offending process.

About the exception itself

The exception itself looks like this:

File "/home/breakds/projects/alf/alf/utils/tensor_utils.py", line 87, in tensor_extend_zero
    return torch.cat((x, torch.zeros(1, *x.shape[1:], dtype=x.dtype)))
RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0

After moving to DDP, we would like newly created tensors to be placed on a process-dependent default device (identified by the rank). I have patched quite a few places with to(rank), but it turns out that there are too many occurrences to cover (and I want to get this working before committing to making them clean and nice). Will have to go back and research the process-dependent default device.

Update

torch.cuda.set_device(rank) seems to do the trick (see the sketch below). But the behavior is still that only device 0's process progresses. Keep investigating.
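
For reference, a tiny sketch of that pattern (assuming each worker receives its rank): after torch.cuda.set_device(rank), tensors created on the generic 'cuda' device land on that process's card, so most call sites do not need an explicit .to(rank).

import torch


def setup_worker_device(rank):
    # Make cuda:<rank> the default CUDA device for this process.
    torch.cuda.set_device(rank)


setup_worker_device(1)
x = torch.zeros(3, device='cuda')
print(x.device)  # prints cuda:1 (on a machine with at least 2 GPUs)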


breakds commented Jul 22, 2021

I think for this toy example, you can also try more complex CNN architectures (like ResnetEncodingNetwork) to see the gain.

Acknowledged. Will try it later. Thanks!


breakds commented Jul 22, 2021

More updates: with some other small problems fixed, I am now able to train with 2 GPUs under the DDP wrapper:

INFO:absl:[rank=0] None -> ac_breakout: 79 time=2.391 throughput=107.07
INFO:absl:[rank=1] None -> ac_breakout: 79 time=2.346 throughput=109.14
INFO:absl:[rank=1] None -> ac_breakout: 85 time=0.185 throughput=1385.38
INFO:absl:[rank=0] None -> ac_breakout: 85 time=0.191 throughput=1341.97
INFO:absl:[rank=0] None -> ac_breakout: 89 time=2.386 throughput=107.31
INFO:absl:[rank=1] None -> ac_breakout: 89 time=2.463 throughput=103.93
INFO:absl:[rank=1] None -> ac_breakout: 95 time=0.177 throughput=1444.35
INFO:absl:[rank=0] None -> ac_breakout: 95 time=0.178 throughput=1436.04
INFO:absl:[rank=0] None -> ac_breakout: 99 time=2.333 throughput=109.75
INFO:absl:[rank=1] None -> ac_breakout: 99 time=2.446 throughput=104.67
INFO:absl:[rank=0] None -> ac_breakout: 105 time=0.165 throughput=1550.85
INFO:absl:[rank=1] None -> ac_breakout: 105 time=0.169 throughput=1518.86
INFO:absl:[rank=0] None -> ac_breakout: 109 time=2.363 throughput=108.32
INFO:absl:[rank=1] None -> ac_breakout: 109 time=2.405 throughput=106.46
INFO:absl:[rank=0] None -> ac_breakout: 115 time=0.177 throughput=1446.16
INFO:absl:[rank=1] None -> ac_breakout: 115 time=0.176 throughput=1453.41

The synchronization is there too. The trained result cannot be played yet (which is expected); I will take a closer look at the checkpoints. Meanwhile, I will start to think about a cleaner implementation.


breakds commented Jul 23, 2021

Outline of the plan for next steps, after a discussion on 2021/07/23:

TODO Productionize DDP over ALF [3/11]

  1. Update =train.py=
  2. How does DDP guarantee that all optimizer.step() calls are synchronized?
  • Per the documentation of DDP: parameters are never broadcast between processes; the module performs an all-reduce step on gradients and assumes that they will be modified by the optimizer in all processes in the same way.
  • This means that we might want to broadcast parameters after optimizer.step() in one process, instead of having to run optimizer.step() in multiple processes doing (supposedly) exactly the same work.
  3. Figure out how DDP establishes backward callbacks. This proves whether wrapping the algorithm is enough.
  4. Only the rank 0 process writes checkpoints (see the sketch after this list)
  5. Can we still run =play.py=?
  6. Does the replay buffer synchronize over DDP?
    • Does this affect the prioritized replay buffer?
  7. Tensorboard: have any metrics changed their meaning?
  8. How is batch norm affected? Do we have to use SyncBatchNorm as mentioned in the source code of DDP?
  9. Debugging and =Ctrl-C= handling
  10. Figure out why starting environments takes longer in DDP mode
  11. Figure out why DDP makes each iteration slower
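
For the checkpoint item above, a common pattern (a sketch only, assuming the default process group is initialized; this is not alf's actual Checkpointer API) is to let only rank 0 touch the file system and have the other ranks wait at a barrier:

import torch
import torch.distributed as dist


def save_checkpoint(ddp_model, path):
    if dist.get_rank() == 0:
        # With DDP the wrapped module lives under the `module` attribute.
        torch.save(ddp_model.module.state_dict(), path)
    # Make sure no process races ahead before the checkpoint is on disk.
    dist.barrier()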


breakds commented Jul 24, 2021

Update

I found that after training for 10 minutes, the processes start to behave as if they are not synchronized. I think I have some misunderstanding of how DDP works.


breakds commented Jul 24, 2021

Had a discussion with @emailweixu while reading the DDP code, and we figured out why the above approach (directly wrapping Algorithm) does not work.

How DDP works

The steps below demonstrate how DDP works in one iteration, assuming m is the original module and

w = DDP(m)

is m with the DDP wrapper.

  1. w hijacks m's forward()
  2. When output = w.forward() is called, it will call m.forward() under the hood
  3. However, w.forward() does something extra: it registers a _DDPSink on the returned output
  4. Later, when output.backward() is called, it will call _DDPSink's backward callback
  5. That backward callback in turn injects a "reducer callback" at the end of the current computation graph's backward pass
  6. Therefore, when the whole backward() is done, the "reducer callback" is invoked and performs the synchronization

This explains why wrapping Algorithm won't work: we are not calling Algorithm's forward(). In fact, it does not even have a backward().

The next idea to try is to wrap whatever produces train_info, making train_info part of the output of a forward(). This tricks DDP into doing what we want.

breakds added a commit that referenced this issue Jul 26, 2021
[REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training (#913) (#944)

* [REFACTOR] train.py to consolidate common logic for both single GPU and multi GPU training

* Address Wei's comments

* Address Haonan's comments

* Specify authoritative url and port as well

* Remove unused Optional typing

breakds commented Jul 27, 2021

Now, with a DDP wrapper applied to the unroll() of RLAlgorithm, training runs successfully on 2 GPUs:

class RLAlgorithm:
    def activate_ddp(self, rank: int):
        self.__dict__['_unroll_performer'] = DDP(UnrollPerformer(self), device_ids=[rank])

Note that UnrollPerformer is itself an nn.Module whose forward() wraps unroll(); a rough sketch is below.
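
A rough sketch of the idea (the actual UnrollPerformer in alf may differ; the signature of unroll() here is an assumption):

import torch.nn as nn


class UnrollPerformer(nn.Module):
    """Expose RLAlgorithm.unroll() as forward() so DDP can hook its outputs."""

    def __init__(self, algorithm):
        super().__init__()
        # Registering the algorithm as a submodule lets DDP find its parameters.
        self.inner_algorithm = algorithm

    def forward(self, unroll_length):
        # Whatever unroll() returns (e.g. the experience and train_info) becomes
        # the DDP-tracked output, so the later backward() triggers the all-reduce.
        return self.inner_algorithm.unroll(unroll_length)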

Verified that both GPUs are being utilized, and they are synchronized:

  1. Corresponding iterations finished at almost the same time for both processes
  2. If I manually pause one process, the other pauses at the same training iteration


breakds commented Jul 27, 2021

The remaining problem is that when DDP is turned on, the time consumed by each training iteration increases significantly. On the same physical machine, a single non-DDP process takes around 130 ms per iteration, while dual-process DDP takes about 2.5 seconds (2500 ms) per iteration.

Preliminary investigation found that backward() (where the synchronization is supposed to happen) only increased from 11 ms to 15 ms per iteration, which is pretty insignificant. unroll() (which is wrapped by DDP) consumes about the same amount of time as well. Now working on finding what explains the huge difference in time consumption.


breakds commented Jul 27, 2021

I was able to hunt down the cause further. The major contributor to the more-than-10x increase in time consumption is

        self.summarize_train(experience, train_info, loss_info, params)

In particular, I think this is because

            obs = alf.nest.find_field(experience, "observation")

has a different shape. Comparing the dual-GPU DDP version and the single-GPU, single-process version, the shape of obs[0] is, respectively (I am using a batch of 32 environments):

dual: (8, 32, 12, 210, 160)
single: (8, 32, 4, 84, 84)

Apparently some of the transformations were not applied in the dual-GPU version; they are supposed to downsample the observation from (12, 210, 160) to (4, 84, 84). Keep debugging.


hnyu commented Jul 27, 2021

dual: (8, 32, 12, 210, 160)
single: (8, 32, 4, 84, 84)

To me this looks more like a bug in obtaining the input tensors. Usually we don't have a "downsampling" transformer in ALF; the env is directly responsible for resizing images. So probably you are using two different envs/wrappers. Also, the number of image channels is usually 3, or with FrameStacker, 3n.


breakds commented Jul 27, 2021

Thanks for the help, Haonan. I am slowly digging into that. Let me check the environment.


breakds commented Jul 27, 2021

With some more debugging, I found that the problem is due to "failing to apply DMAtariPreprocessing".

In atari_conf.py, suite_gym.load is configured with

alf.config(
    'suite_gym.load',
    gym_env_wrappers=[gym_wrappers.DMAtariPreprocessing],
    # Default max episode steps for all games
    #
    # Per DQN paper setting, 18000 frames assuming frameskip = 4
    max_episode_steps=4500)

With the Python debugger, I can see that:

  1. In the single-process, single-GPU setting, both DMAtariPreprocessing and 4500 are correctly passed in
  2. In the dual-process, dual-GPU setting (DDP), neither of them is set

So this is likely some configuration loading problem. I will need to read more on this to understand what's happening here.


breakds commented Jul 27, 2021

After a few hours of poking around and investigating, I finally found why the configuration is not respected. I'll summarize my discovery here:

  1. The main process starts one training_worker() per GPU (two in total)
  2. Each training worker then starts a bunch of ProcessEnvironments, each in its own subprocess
  3. Each ProcessEnvironment creates the environment in its own subprocess with suite_gym.load()

Note that there are 2 levels of subprocesses here. In order for a subprocess to inherit _CONFIG_TREE (or any other module-level variable) from the parent process, it needs to be started with start_method = fork, because fork (and seemingly only fork) copies the memory of allocated objects from the parent process.

In the single-GPU setup this is not a problem, because there is only 1 level of subprocesses and the top-level process starts the environment processes with the default start method, which is fork on Linux.

However, in the dual-GPU/dual-process setup, the 2 training_worker processes have to be started with spawn in order for DDP to work. This also implicitly sets the default start method for their subprocesses to spawn. The environment processes (grandchild processes) started by training_worker therefore lose the inheritance of _CONFIG_TREE.

The solution is simple once we figured out the above (a small illustration of the fork/spawn difference follows the snippet below):

        ctx = multiprocessing.get_context('fork')
        self._process = ctx.Process(...)
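
A small, self-contained illustration of the difference (the CONFIG dict here is a hypothetical stand-in for a module-level registry like _CONFIG_TREE): a value mutated in the parent after import is visible in a forked child, but not in a spawned one, because spawn re-imports the module from scratch.

import multiprocessing

CONFIG = {}  # stands in for a module-level registry such as _CONFIG_TREE


def show(tag):
    print(tag, CONFIG)


if __name__ == '__main__':
    CONFIG['max_episode_steps'] = 4500  # mutation happens only in the parent

    for method in ('fork', 'spawn'):
        ctx = multiprocessing.get_context(method)
        p = ctx.Process(target=show, args=(method,))
        p.start()
        p.join()
    # Output: fork {'max_episode_steps': 4500}
    #         spawn {}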


breakds commented Jul 27, 2021

With the above problem fixed, the training for ac_breakout seems good now.

[screenshot: train_ddp training curves]

I haven't started working on the shenanigans of checkpoints/summaries/metrics, so the curves might look a bit messy, but it looks similar to the performance of a single GPU, within a similar time. (Note that I am running 32 environments for each process in the dual-process, dual-GPU setup.)

breakds added a commit that referenced this issue Jul 30, 2021
* Add UnrollPerformer as the module being wrapped by DistributedDataParallel

* Enable DDP for on policy RLTrainer

breakds commented Aug 2, 2021

By turning on profiling = True in TrainerConfig, I have collected the profiling metrics for the ac_breakout training:

  1. Single Process
    • _train_iter_on_policy: 0.283 per call
      • unroll: 0.216 per call
      • train_from_unroll: 0.065 per call
  2. Dual Process DDP (Measuring the master process)
    • _train_iter_on_policy: 0.220 per call
      • _unroll: 0.159 per call
      • train_from_unroll: 0.060 per call

So the DDP version is indeed faster, but not by a large margin: it uses about 25% less time to train the same number of iterations, which aligns with the numbers above.


breakds commented Aug 2, 2021

However, I discovered another problem: when DDP is on, the log files are generated but nothing is written to them. Will need to look at this as well.


breakds commented Nov 30, 2021

On-policy algorithms can now enjoy DDP. The next step is to add full support for off-policy algorithms as well.
Closing this issue in favor of #1096.

breakds closed this as completed on Nov 30, 2021
pd-perry pushed commits to pd-perry/alf that referenced this issue on Dec 11, 2021 (HorizonRobotics#938, HorizonRobotics#939, HorizonRobotics#944, HorizonRobotics#951), mirroring the commits referenced above.