Slow speed on distributed training #10247
Replies: 2 comments
-
To clarify, I'm able to achieve the result below with the CIFAR10 dataset and resnet-152 under the same configuration. Single node with 1 GPU: 980 samples/sec
-
@nswamy @sandeep-krishnamurthy please tag this - "Python", "GPU", "Question", "Distributed"
-
Hi, I'm trying to scale out to multiple nodes with 4 GPUs each, after getting an almost linear speed-up on a single node with 4 GPUs. The dataset used was ImageNet 2012, trained with resnet-50 based on the example provided by MXNet.
I launch the training with the command below. I try to fully utilize GPU memory by controlling the batch size, and I use the maximum number of threads for data decoding; after running the test-io benchmark, each node is able to decode approximately 1000 samples/sec with 15 threads (a rough standalone decoding check is sketched after the command):
python ../../tools/launch.py -n 2 -H hosts python train_imagenet.py --network resnet --num-layers 50 --gpus 0,1 --batch-size 180 --data-nthreads 15 --top-k 5 --data-train ./data/train_data.rec --data-val ./data/val_data.rec --num-epochs 1 --kv-store dist_device_sync
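For reference, this is the kind of standalone decoding check I mean. It is only a minimal sketch, not the exact test-io run; the record path, image shape, batch size, and thread count are assumptions copied from the command above.

```python
# Minimal sketch of a raw decoding-throughput check (assumptions: same .rec file,
# 3x224x224 images, batch size 180, 15 decode threads, as in the command above).
import time
import mxnet as mx

data_iter = mx.io.ImageRecordIter(
    path_imgrec="./data/train_data.rec",   # same file as --data-train
    data_shape=(3, 224, 224),              # standard ImageNet input size
    batch_size=180,                        # matches --batch-size
    preprocess_threads=15,                 # matches --data-nthreads
    shuffle=False,
)

samples = 0
start = time.time()
for batch in data_iter:
    samples += batch.data[0].shape[0]
    if samples >= 20 * 180:                # ~20 batches is enough for an estimate
        break
print("decode throughput: %.1f samples/sec" % (samples / (time.time() - start)))
```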
Results:
Single node with 1 GPU: 122 samples/sec
Single node with 2 GPUs: 234 samples/sec
Single node with 3 GPUs: 353 samples/sec
Single node with 4 GPUs: 477 samples/sec
Two nodes with 1 GPU each: 111 samples/sec on each machine = 222 samples/sec
Two nodes with 2 GPUs each: 220 samples/sec on each machine = 440 samples/sec
Two nodes with 3 GPUs each: 300 samples/sec on each machine = 600 samples/sec
Two nodes with 4 GPUs each: 355 samples/sec on each machine = 710 samples/sec
From the results above, I just want to understand why there is such a large drop in scaling efficiency when training on two nodes with 4 GPUs each.
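For scale, here is a back-of-envelope estimate of the per-worker network traffic, under my own assumptions: roughly 25.5M fp32 parameters for resnet-50, one gradient push plus one weight pull per step with dist_device_sync, and --batch-size 180 being the whole per-node batch in the example script.

```python
# Rough estimate of per-worker parameter-server traffic under dist_device_sync
# (assumptions: ~25.5M fp32 resnet-50 parameters, push gradients + pull weights
# once per step, measured per-node throughput taken from the results above).
PARAMS = 25.5e6            # approx. resnet-50 parameter count
BYTES_PER_PARAM = 4        # fp32
BATCH_PER_NODE = 180       # --batch-size (shared by all GPUs on the node)

for gpus, per_node_rate in [(1, 111), (2, 220), (3, 300), (4, 355)]:
    steps_per_sec = per_node_rate / BATCH_PER_NODE
    # one push (gradients) + one pull (weights) ~ 2x the model size per step
    gbits = 2 * PARAMS * BYTES_PER_PARAM * steps_per_sec * 8 / 1e9
    print("%d GPUs/node: ~%.1f Gbit/s per worker" % (gpus, gbits))
```

With 4 GPUs per node this already comes out to a few Gbit/s in each direction per worker, so on a 1-10 GbE link it may be the gradient synchronization, rather than the GPUs, that caps the two-node throughput.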