Slow speed on distributed training #10247
Replies: 2 comments
-
To clarify, I'm able to achieve the result below with the CIFAR10 dataset and resnet-152 under the same configuration. Single node with 1 GPU: 980 samples/sec
-
@nswamy @sandeep-krishnamurthy please tag this - "Python", "GPU", "Question", "Distributed"
-
Hi, I'm trying to scale out to multiple nodes with 4 GPUs each, after getting an almost linear speed-up on a single node with 4 GPUs. The dataset used was ImageNet 2012, trained with resnet-50 based on the example provided by MXNet.
I launch the training with the command below. I try to fully utilize GPU memory by controlling the batch size, and I use the maximum number of threads for data decoding; after running the test-io benchmark, each node is able to decode approximately 1000 samples/sec with 15 threads (a rough standalone decoding check is sketched after the command):
python ../../tools/launch.py -n 2 -H hosts python train_imagenet.py --network resnet --num-layers 50 --gpus 0,1 --batch-size 180 --data-nthreads 15 --top-k 5 --data-train ./data/train_data.rec --data-val ./data/val_data.rec --num-epochs 1 --kv-store dist_device_sync
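For reference, this is the kind of standalone decoding check I mean. It is only a minimal sketch, not the exact test-io run; the record path, image shape, batch size, and thread count are assumptions copied from the command above.

```python
# Minimal sketch of a raw decoding-throughput check (assumptions: same .rec file,
# 3x224x224 images, batch size 180, 15 decode threads, as in the command above).
import time
import mxnet as mx

data_iter = mx.io.ImageRecordIter(
    path_imgrec="./data/train_data.rec",   # same file as --data-train
    data_shape=(3, 224, 224),              # standard ImageNet input size
    batch_size=180,                        # matches --batch-size
    preprocess_threads=15,                 # matches --data-nthreads
    shuffle=False,
)

samples = 0
start = time.time()
for batch in data_iter:
    samples += batch.data[0].shape[0]
    if samples >= 20 * 180:                # ~20 batches is enough for an estimate
        break
print("decode throughput: %.1f samples/sec" % (samples / (time.time() - start)))
```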
Results:
Single node with 1 GPU: 122 samples/sec
Single node with 2 GPUs: 234 samples/sec
Single node with 3 GPUs: 353 samples/sec
Single node with 4 GPUs: 477 samples/sec
Two nodes with 1 GPU each: 111 samples/sec on each machine = 222 samples/sec
Two nodes with 2 GPUs each: 220 samples/sec on each machine = 440 samples/sec
Two nodes with 3 GPUs each: 300 samples/sec on each machine = 600 samples/sec
Two nodes with 4 GPUs each: 355 samples/sec on each machine = 710 samples/sec
From the results above, I just want to understand why there is such a large drop in scaling efficiency when training on two nodes with 4 GPUs each.
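For scale, here is a back-of-envelope estimate of the per-worker network traffic, under my own assumptions: roughly 25.5M fp32 parameters for resnet-50, one gradient push plus one weight pull per step with dist_device_sync, and --batch-size 180 being the whole per-node batch in the example script.

```python
# Rough estimate of per-worker parameter-server traffic under dist_device_sync
# (assumptions: ~25.5M fp32 resnet-50 parameters, push gradients + pull weights
# once per step, measured per-node throughput taken from the results above).
PARAMS = 25.5e6            # approx. resnet-50 parameter count
BYTES_PER_PARAM = 4        # fp32
BATCH_PER_NODE = 180       # --batch-size (shared by all GPUs on the node)

for gpus, per_node_rate in [(1, 111), (2, 220), (3, 300), (4, 355)]:
    steps_per_sec = per_node_rate / BATCH_PER_NODE
    # one push (gradients) + one pull (weights) ~ 2x the model size per step
    gbits = 2 * PARAMS * BYTES_PER_PARAM * steps_per_sec * 8 / 1e9
    print("%d GPUs/node: ~%.1f Gbit/s per worker" % (gpus, gbits))
```

With 4 GPUs per node this already comes out to a few Gbit/s in each direction per worker, so on a 1-10 GbE link it may be the gradient synchronization, rather than the GPUs, that caps the two-node throughput.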