Replies: 17 comments
-
@mxnet-label-bot add [question]
-
The code from the NVIDIA MLPerf submission assumes it is run inside the container from NGC (https://ngc.nvidia.com) and will not work (yet) with upstream MXNet. Please consult https://ngc.nvidia.com/catalog/containers/nvidia:mxnet for information on how to download and run the NGC MXNet container.
-
@ptrendx I know the NGC container works, but I want to run without a container. I have now figured out that NGC MXNet changed the Convolution operator by adding more parameters: cudnn_algo_verbose, cudnn_algo_fwd, cudnn_algo_bwd_data, cudnn_algo_bwd_filter, and cudnn_tensor_core_only. But they are not available in the MXNet repo. Do you know whether the MXNet repo has any plan to integrate these changes made by NVIDIA? Thanks.
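Since those five keyword arguments exist only in the NGC build, one workaround for running a model exported from the container on upstream MXNet is to strip them from the serialized symbol before loading it. Below is a minimal sketch; `strip_ngc_args` is a hypothetical helper, and it assumes the standard `"nodes"`/`"attrs"` layout that `mx.sym.Symbol.tojson()` produces:

```python
import json

# NGC-container-only Convolution parameters that upstream MXNet rejects
# (names taken from the error messages discussed in this thread).
NGC_ONLY_ARGS = {
    "cudnn_algo_verbose",
    "cudnn_algo_fwd",
    "cudnn_algo_bwd_data",
    "cudnn_algo_bwd_filter",
    "cudnn_tensor_core_only",
}

def strip_ngc_args(symbol_json):
    """Remove NGC-only attrs from a serialized MXNet symbol (JSON string)."""
    graph = json.loads(symbol_json)
    for node in graph.get("nodes", []):
        attrs = node.get("attrs", {})
        for key in NGC_ONLY_ARGS & set(attrs):
            del attrs[key]
    return json.dumps(graph)
```

After cleaning, the JSON can be loaded with `mx.sym.load_json()` as usual; whether the resulting model trains identically is a separate question, since those flags control cuDNN algorithm selection.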
-
We (the DL Framework team at NVIDIA) are working to upstream all performance changes, with multiple PRs already issued (and several more to go); see e.g. #13346, #13749, #13471.
-
@ptrendx Thanks for this information. If those additional arguments are only for debugging purposes, then I can remove them from the benchmark code to match the APIs in the existing MXNet repo.
-
Hi @ptrendx Since I had performance issues when running on bare metal, I also started using the NGC container ngc18.11_mxnet to run the MLPerf MXNet ResNet-50 benchmark on our servers, but it cannot converge. Each server has 4 V100-SXM2 32 GB GPUs. I ran the benchmark on two servers and set the parameters the same as for DGX-1:
Here the batch size is 208 per GPU, so the global batch size is 208*8 = 1664, which is the same batch size DGX-1 used in the published MLPerf result. But the model cannot reach the target accuracy of 74.9% even with 100 epochs; the evaluation accuracy is 74.35% after 100 epochs (see the following figure). DGX-1 reached 75.22% after only 62 epochs. So could you give some guidance on how to choose the parameters so the model converges, and converges faster? Thanks.
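For reference, learning rate and global batch size are usually tied together with the linear-scaling heuristic (Goyal et al., "Accurate, Large Minibatch SGD"). The sketch below is not the MLPerf script's exact schedule (which also uses warmup); the function name and the 0.1-at-256 reference pair are illustrative assumptions:

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, num_gpus):
    """Linear LR scaling: lr grows proportionally with the global batch.
    (base_lr, base_batch) is the reference pair, e.g. 0.1 at batch 256."""
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_batch

# The setup above: 208 per GPU on 2 nodes x 4 GPUs = global batch 1664,
# the same as DGX-1's 208 x 8, so the lr should match DGX-1's as well.
lr = scaled_lr(0.1, 256, 208, 8)
```

The point of the heuristic is that if the global batch matches DGX-1's, the learning rate (and warmup) should match too, regardless of how the GPUs are spread across nodes.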
-
Hi @ptrendx, could you give some guidance on the question in my previous post? Also, what is "--dali-nvjpeg-memory-padding" used for? Will it have an impact on the model accuracy?
-
Oh, sorry, I completely missed that comment. How do you prepare your training dataset? The way we did it in our submission is shown here: https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5#prepare-dataset I don't see anything obviously wrong with the options you set. How does it look when running on a single machine (basically halve the lr parameter and run on 4 GPUs)? About the DALI options: they do not matter for convergence; they are there to avoid memory reallocations during training.
-
Hi @ptrendx, I used the same commands as in the NVIDIA GitHub repo to create the training dataset, but still could not reproduce NVIDIA's results. The training on a single node (4 GPUs) works well: the model reached 75.21% accuracy within 62 epochs. Is it possible to reproduce the same result as on NVIDIA's DGX-1? With a single DGX-1, --kv-store=device; but when multiple nodes are used, --kv-store=horovod and MPI is also used. The random seed is also different. If we want to reproduce DGX-1's result, should we use the same random seed as DGX-1 for all 8 MPI processes (2 nodes)? I know some cuDNN APIs also give non-deterministic results, so I am not sure whether this will work, but I will try. So the general question is: how do we reproduce the single-node result on multiple nodes? Here I mean almost exactly the same result; a small floating-point difference is acceptable.
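Bit-for-bit reproduction across kvstore modes is generally out of reach even with identical seeds, because device kvstore and Horovod allreduce sum gradients in different orders, and floating-point addition is not associative. A tiny self-contained demonstration:

```python
# Floating-point addition is not associative, so a different gradient
# reduction order (device kvstore vs. Horovod allreduce) yields slightly
# different sums even from identical inputs.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
print(left == right)  # False
```

This is why "almost exactly the same, up to small floating-point differences" is the right target: the per-step differences are tiny, but they compound over thousands of SGD updates.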
-
Hmm, this is strange. On Monday/Tuesday I am travelling, but I will try to reproduce your results on Wednesday and see what the issue is. Horovod and the device kvstore should give similar results (not exactly the same, because of the addition order in the reduction), and both should definitely converge.
-
Thanks very much for helping to check this issue! I really want to know why the result is not reproducible.
-
Hi @renganxu, could you post here the exact command lines you tried (so the
-
Hi @ptrendx, since my system could not install Docker, I converted the MXNet Docker container to a Singularity container and used that instead. My detailed command is:
The commands in the file container_cmd.sh are as follows:
The option "-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512" was added to the MPI command because without it there is the error: "A process failed to create a queue pair. This usually means either the device has run out of queue pairs (too many connections) or there are insufficient resources available to allocate a queue pair (out of memory). The latter can happen if either 1) insufficient memory is available, or 2) no more physical memory can be registered with the device." Those environment variables were added because NVIDIA set them when creating the Docker container. I am also trying the code in https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5. I noticed this implementation is slightly different from the implementation in the MLPerf repo. I will let you know whether it works or not.
-
Hmmm, I'm not sure how Singularity works; are you sure that Horovod sees all 8 ranks there (since you run mpirun on singularity and not on the actual train_imagenet.py script)? I could easily see it not converging if, e.g., Horovod sees only a subset of ranks (which would make the learning rate too high). For example, how many copies of lines like this (with the same batch number) do you see?
There should be only 1 copy of each such line (so only 1 Batch [0-20], Batch [20-40], etc.) per epoch.
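A quick way to run this check is to count each progress marker in the combined log; a duplicated marker would mean two ranks believe they own the same shard. This is a hedged sketch; the regex assumes MXNet's usual Speedometer format ("Epoch[0] Batch [0-20] ...") and may need adjusting to the exact log output:

```python
import re
from collections import Counter

def duplicated_progress_lines(log_text):
    """Return progress markers appearing more than once. With Horovod
    seeing all ranks, each 'Epoch[e] Batch [a-b]' should appear once."""
    pattern = re.compile(r"Epoch\[\d+\] Batch \[\d+-\d+\]")
    counts = Counter(pattern.findall(log_text))
    return {marker: n for marker, n in counts.items() if n > 1}
```

An empty result from the whole training log is consistent with Horovod seeing all ranks; any entry with a count above 1 points at overlapping workers.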
-
Please print the number of Horovod ranks. It should be 8.
-
Hi @ptrendx, Horovod sees all 8 ranks: with 1 node the speed is ~5000 images/sec, and with 2 nodes it becomes ~10000 images/sec. I added my log file at https://gist.github.com/renganxu/f68c4c680ad7e016bea8ee981f72c60c. Yes, there is only one copy of each training step, but there are two evaluation copies because two nodes are used. Could you help check what the issue is? I found that the ResNet-50 implementation in NVIDIA DeepLearningExamples can converge successfully on 2 nodes, and up to 8 nodes (32 V100s). That implementation uses the parameter-server distribution model. I can try to implement it with Horovod.
-
Hmmm, your accuracy seems low from the start; maybe there is something wrong with the initial parameter broadcast? Could you set the seeds to be the same (just hardcode them) for all ranks in https://github.com/mlperf/results/blob/master/v0.5.0/nvidia/submission/code/image_classification/mxnet/train_imagenet.py#L109-L114?
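A minimal sketch of what "hardcode the seeds" could look like. The helper name and seed value are illustrative; in train_imagenet.py one would also seed MXNet's and NumPy's RNGs, which are left as comments here so the snippet stays self-contained:

```python
import random

FIXED_SEED = 12345  # same literal on every rank, no rank-dependent offset

def hardcode_seeds(seed=FIXED_SEED):
    """Seed the RNGs identically on all ranks so every worker draws the
    same initial parameters, sidestepping a broken parameter broadcast."""
    random.seed(seed)
    # Where MXNet/NumPy are available, also seed them, e.g.:
    # np.random.seed(seed); mx.random.seed(seed)
```

If convergence improves with identical seeds on all ranks, that would point at the Horovod broadcast of initial parameters as the culprit.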
-
Description
There was an error "mxnet.base.MXNetError: Cannot find argument 'cudnn_algo_verbose'" when I ran the ResNet-50 model from Image Classification in the MLPerf benchmark.
Environment info (Required)
Package used (Python/R/Scala/Julia): Python
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): gcc 7.2.0
MXNet commit hash: f95e794
Build config:
Error Message:
Minimum reproducible example
The MXNet implementation of the image classification model ResNet-50 in MLPerf:
https://github.com/mlperf/results/tree/master/v0.5.0/nvidia/submission/code/image_classification/mxnet
Steps to reproduce
First install the dependency nvidia-dali:
Then run the benchmark:
What have you tried to solve it?