[Efficientdet/TF2] Training script to 4 GPUs systems #1323

Open
vasanth1986 opened this issue Jul 4, 2023 · 0 comments
Labels: enhancement (New feature or request)

@vasanth1986

TensorFlow2/Detection/Efficientdet: please help me modify train.py (or the other *.py files) so that training runs on 4-GPU systems.

The convergence script (convergence-AMP-8xA100-80G.sh) runs without any issue on 8-GPU systems. However, I get "Missing ranks" warnings followed by a Horovod internal error when I run the same script modified to set CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs).
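
For context, the change is essentially restricting the run to 4 GPUs; it amounts to roughly the sketch below. This is only an illustration: the actual mpirun flags and train.py arguments in convergence-AMP-8xA100-80G.sh may differ, and the -np 4 launcher change shown here is an assumption.

# Sketch of a 4-GPU launch; exact launcher flags and train.py arguments
# in convergence-AMP-8xA100-80G.sh may differ from what is shown here.
export CUDA_VISIBLE_DEVICES=0,1,2,3     # expose only 4 of the 8 GPUs

# Horovod runs one rank per visible GPU, so the rank count passed to the
# launcher presumably has to drop from 8 to 4 as well; a mismatch between
# ranks and visible GPUs is one possible cause of the stall/NCCL error below.
mpirun -np 4 --allow-run-as-root --bind-to socket \
    python3 train.py <same train.py arguments as in the 8-GPU script>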

Error Message:

[2023-06-27 10:46:24.525097: W /opt/tensorflow/horovod-source/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
1: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
2: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
Traceback (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0' defined at (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1384, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1021, in train_function
    return step_function(self, iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1010, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1000, in run_step
    outputs = model.train_step(data)
  File "/workspace/effdet-tf2/utils/train_lib.py", line 388, in train_step
    scaled_gradients = tape.gradient(scaled_loss, trainable_vars)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 774, in gradient
    return self._allreduce_grads(gradients, sources)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in allreduce_grads
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in <listcomp>
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 253, in _allreduce_cond
    return tf.cond(tf.logical_and(
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 248, in allreduce_fn
    return allreduce(tensor, *args, process_set=process_set, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 125, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py", line 123, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
File "", line 107, in horovod_allreduce
_Node: 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0'
ncclCommInitRank failed: internal error
[[{{node DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0}}]] [Op:_inference_train_function_174717]

vasanth1986 added the enhancement (New feature or request) label on Jul 4, 2023