Training with fc and multi-gpu is much slower than single gpu #12577
Replies: 6 comments
-
Thanks for submitting the issue.
-
Did you scale the batch size linearly with the number of GPUs?
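For context, with the Module/symbol API the batch size given to the data iterator is the global batch, and it is sliced across the GPU contexts, so keeping batch_size=8192 while going from 1 to 8 GPUs shrinks each card's slice to 1024. A minimal sketch of scaling the global batch linearly instead (sizes below are placeholders, not from the original run):

```python
import mxnet as mx

# Per-GPU batch stays constant; the global batch grows with the number of GPUs.
num_gpus = 8
per_gpu_batch = 8192
ctx = [mx.gpu(i) for i in range(num_gpus)]
global_batch = per_gpu_batch * num_gpus        # 65536 for 8 GPUs

# Dummy data just to make the sketch self-contained.
train_iter = mx.io.NDArrayIter(
    data=mx.nd.random.uniform(shape=(global_batch * 4, 128)),
    label=mx.nd.random.uniform(shape=(global_batch * 4, 1)),
    batch_size=global_batch, shuffle=True)

# mod = mx.mod.Module(net, context=ctx)
# mod.fit(train_iter, ...)
```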
-
@eric-haibin-lin
-
Hi @liu6381810, it seems like there's an overhead to using multiple GPUs, and one possible source is the transfer of gradients between GPUs. Are you using an AWS EC2 p3.16xlarge instance for this, or do you have your own server here? Check
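One concrete thing to check is the kvstore type passed to Module.fit, since it decides where gradients are aggregated. A short sketch of the options (note that 'nccl' only exists in builds compiled with NCCL support):

```python
import mxnet as mx

# 'device' aggregates gradients on the GPUs themselves; one card does extra
# work, which matches seeing higher memory use and utilization on one GPU.
# 'local' aggregates on the CPU instead.
# 'nccl' uses NCCL all-reduce and usually scales better for dense gradients.
kv = mx.kvstore.create('device')      # try 'local' or 'nccl' for comparison
print(kv.type)

# Pass it (or just the string) to Module.fit:
# mod.fit(train_iter, kvstore=kv, ...)
```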
-
@liu6381810 Did the previous suggestions help you in this case?
-
@liu6381810 - Waiting for your updates.
-
Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist of the essential information needed for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in whatever form you believe is best.
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
Training with multi-GPU is much slower than with a single GPU.
Environment info (Required)
Ubuntu 16.04
Python 2.7
latest MXNet
I use the symbol API, not Gluon
8x V100
Details:
Actually I need to train an FM (factorization machine) for a recommender system, and I don't follow the sparse example in MXNet; I use an Embedding layer to do this.

I find that with one V100 and batch_size=8192 the speed is 400k+ samples per second, but if I switch to 8 V100s with batch_size=8192 (which means every V100 gets batch_size=1024) I only get 18k+ samples per second. So 8 V100s are much slower than one V100. I don't think the limit is data IO, because I can get 400k+ samples per second with a single V100.

Also, there is one GPU whose memory usage is larger than the other 7 GPUs', and its utilization is also higher. I think that's because my kvstore is set to 'device', so this GPU has to collect the gradients and run the update.

I also used 40 FC layers to fit data generated by y = a*x + b, and 2 GPUs are again slower than 1 GPU with the same batch size (see the sketch below).

So what is the reason for this? I would appreciate any advice!
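For reference, a minimal sketch of that kind of toy benchmark (layer width, data size, and hyperparameters are placeholders, not the original script). With 40 small FullyConnected layers each kernel is tiny, so synchronizing 40+ gradient arrays between GPUs every batch can easily outweigh the saved compute:

```python
import time
import mxnet as mx
import numpy as np

# Synthetic y = a*x + b data (dimensions and coefficients are arbitrary placeholders).
n, dim, batch_size = 200000, 64, 8192
a = np.random.uniform(size=(dim, 1)).astype('float32')
x = np.random.uniform(size=(n, dim)).astype('float32')
y = x.dot(a) + 0.5

# 40 stacked FullyConnected layers ending in a regression output.
net = mx.sym.Variable('data')
for i in range(40):
    net = mx.sym.FullyConnected(net, num_hidden=64, name='fc%d' % i)
net = mx.sym.FullyConnected(net, num_hidden=1, name='fc_out')
net = mx.sym.LinearRegressionOutput(net, label=mx.sym.Variable('softmax_label'))

# Time one epoch on 1 GPU vs 2 GPUs with the same global batch size.
for ctx in ([mx.gpu(0)], [mx.gpu(0), mx.gpu(1)]):
    train_iter = mx.io.NDArrayIter(x, y, batch_size=batch_size, shuffle=True)
    mod = mx.mod.Module(net, context=ctx)
    start = time.time()
    mod.fit(train_iter, kvstore='device', eval_metric='mse', optimizer='sgd',
            optimizer_params={'learning_rate': 0.001}, num_epoch=1)
    print('%d GPU(s): %.2f s/epoch' % (len(ctx), time.time() - start))
```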
Below is my symbol
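A minimal sketch of an FM-style symbol built from Embedding layers, in the spirit described above; num_features, num_fields, and the embedding size k below are placeholders, not the values from the original model:

```python
import mxnet as mx

def fm_symbol(num_features=1000000, num_fields=30, k=16):
    # data holds integer feature ids, shape (batch, num_fields);
    # labels are 0/1 targets, shape (batch, 1).  All sizes are placeholders.
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')

    # First-order (linear) term: a 1-d embedding per feature id (global bias omitted).
    w = mx.sym.Embedding(data, input_dim=num_features, output_dim=1, name='fm_w')
    linear = mx.sym.sum(w, axis=(1, 2))                    # (batch,)

    # Second-order term: 0.5 * sum_k[(sum_i v_ik)^2 - sum_i v_ik^2]
    v = mx.sym.Embedding(data, input_dim=num_features, output_dim=k, name='fm_v')
    sum_sq = mx.sym.square(mx.sym.sum(v, axis=1))          # (batch, k)
    sq_sum = mx.sym.sum(mx.sym.square(v), axis=1)          # (batch, k)
    pairwise = 0.5 * mx.sym.sum(sum_sq - sq_sum, axis=1)   # (batch,)

    score = mx.sym.reshape(linear + pairwise, shape=(-1, 1))
    return mx.sym.LogisticRegressionOutput(score, label=label, name='out')

sym = fm_symbol()
```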