Training with fc and multi-gpu is much slower than single gpu #12577
Replies: 6 comments
-
Thanks for submitting the issue.
-
Did you scale the batch size linearly with the number of GPUs?
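For context, with the Module/symbol API the batch size given to the data iterator is the global batch, and it is sliced across the GPU contexts, so keeping batch_size=8192 while going from 1 to 8 GPUs shrinks each card's slice to 1024. A minimal sketch of scaling the global batch linearly instead (sizes below are placeholders, not from the original run):

```python
import mxnet as mx

# Per-GPU batch stays constant; the global batch grows with the number of GPUs.
num_gpus = 8
per_gpu_batch = 8192
ctx = [mx.gpu(i) for i in range(num_gpus)]
global_batch = per_gpu_batch * num_gpus        # 65536 for 8 GPUs

# Dummy data just to make the sketch self-contained.
train_iter = mx.io.NDArrayIter(
    data=mx.nd.random.uniform(shape=(global_batch * 4, 128)),
    label=mx.nd.random.uniform(shape=(global_batch * 4, 1)),
    batch_size=global_batch, shuffle=True)

# mod = mx.mod.Module(net, context=ctx)
# mod.fit(train_iter, ...)
```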
-
@eric-haibin-lin
-
Hi @liu6381810, it seems like there's an overhead to using multiple GPUs, and one possible source is the transfer of gradients between GPUs. Are you using an AWS EC2 p3.16xlarge instance for this, or do you have your own server here? Check
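One concrete thing to check is the kvstore type passed to Module.fit, since it decides where gradients are aggregated. A short sketch of the options (note that 'nccl' only exists in builds compiled with NCCL support):

```python
import mxnet as mx

# 'device' aggregates gradients on the GPUs themselves; one card does extra
# work, which matches seeing higher memory use and utilization on one GPU.
# 'local' aggregates on the CPU instead.
# 'nccl' uses NCCL all-reduce and usually scales better for dense gradients.
kv = mx.kvstore.create('device')      # try 'local' or 'nccl' for comparison
print(kv.type)

# Pass it (or just the string) to Module.fit:
# mod.fit(train_iter, kvstore=kv, ...)
```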
-
@liu6381810 Did the previous suggestions help you in this case?
-
@liu6381810 - Waiting for your updates.
-
Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist of the essential information needed for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in whatever form you believe is best.
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
Training with multi-GPU is much slower than with a single GPU.
Environment info (Required)
Ubuntu 16.04
Python 2.7
latest MXNet
I use the symbol API, not Gluon
8x V100
Details:
Actually I need to train an FM (factorization machine) for a recommender system, and I don't follow the sparse example in MXNet; I use an Embedding layer to do this.

I find that with one V100 and batch_size=8192 the speed is 400k+ samples per second, but if I switch to 8 V100s with batch_size=8192 (which means every V100 gets batch_size=1024) I only get 18k+ samples per second. So 8 V100s are much slower than one V100. I don't think the limit is data IO, because I can get 400k+ samples per second with a single V100.

Also, there is one GPU whose memory usage is larger than the other 7 GPUs', and its utilization is also higher. I think that's because my kvstore is set to 'device', so this GPU has to collect the gradients and run the update.

I also used 40 FC layers to fit data generated by y = a*x + b, and 2 GPUs are again slower than 1 GPU with the same batch size (see the sketch below).

So what is the reason for this? I would appreciate any advice!
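For reference, a minimal sketch of that kind of toy benchmark (layer width, data size, and hyperparameters are placeholders, not the original script). With 40 small FullyConnected layers each kernel is tiny, so synchronizing 40+ gradient arrays between GPUs every batch can easily outweigh the saved compute:

```python
import time
import mxnet as mx
import numpy as np

# Synthetic y = a*x + b data (dimensions and coefficients are arbitrary placeholders).
n, dim, batch_size = 200000, 64, 8192
a = np.random.uniform(size=(dim, 1)).astype('float32')
x = np.random.uniform(size=(n, dim)).astype('float32')
y = x.dot(a) + 0.5

# 40 stacked FullyConnected layers ending in a regression output.
net = mx.sym.Variable('data')
for i in range(40):
    net = mx.sym.FullyConnected(net, num_hidden=64, name='fc%d' % i)
net = mx.sym.FullyConnected(net, num_hidden=1, name='fc_out')
net = mx.sym.LinearRegressionOutput(net, label=mx.sym.Variable('softmax_label'))

# Time one epoch on 1 GPU vs 2 GPUs with the same global batch size.
for ctx in ([mx.gpu(0)], [mx.gpu(0), mx.gpu(1)]):
    train_iter = mx.io.NDArrayIter(x, y, batch_size=batch_size, shuffle=True)
    mod = mx.mod.Module(net, context=ctx)
    start = time.time()
    mod.fit(train_iter, kvstore='device', eval_metric='mse', optimizer='sgd',
            optimizer_params={'learning_rate': 0.001}, num_epoch=1)
    print('%d GPU(s): %.2f s/epoch' % (len(ctx), time.time() - start))
```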
Below is my symbol
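A minimal sketch of an FM-style symbol built from Embedding layers, in the spirit described above; num_features, num_fields, and the embedding size k below are placeholders, not the values from the original model:

```python
import mxnet as mx

def fm_symbol(num_features=1000000, num_fields=30, k=16):
    # data holds integer feature ids, shape (batch, num_fields);
    # labels are 0/1 targets, shape (batch, 1).  All sizes are placeholders.
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')

    # First-order (linear) term: a 1-d embedding per feature id (global bias omitted).
    w = mx.sym.Embedding(data, input_dim=num_features, output_dim=1, name='fm_w')
    linear = mx.sym.sum(w, axis=(1, 2))                    # (batch,)

    # Second-order term: 0.5 * sum_k[(sum_i v_ik)^2 - sum_i v_ik^2]
    v = mx.sym.Embedding(data, input_dim=num_features, output_dim=k, name='fm_v')
    sum_sq = mx.sym.square(mx.sym.sum(v, axis=1))          # (batch, k)
    sq_sum = mx.sym.sum(mx.sym.square(v), axis=1)          # (batch, k)
    pairwise = 0.5 * mx.sym.sum(sum_sq - sq_sum, axis=1)   # (batch,)

    score = mx.sym.reshape(linear + pairwise, shape=(-1, 1))
    return mx.sym.LogisticRegressionOutput(score, label=label, name='out')

sym = fm_symbol()
```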