[Efficientdet/TF2] Training script to 4 GPUs systems #1323

Open
vasanth1986 opened this issue Jul 4, 2023 · 0 comments
Labels: enhancement (New feature or request)

@vasanth1986

TensorFlow2/Detection/Efficientdet: please help me modify train.py (or the other *.py files) so that training runs on 4-GPU systems.

The convergence script (convergence-AMP-8xA100-80G.sh) runs without any issue on 8-GPU systems. However, I get "Missing ranks" warnings followed by a Horovod internal error when I run the same script modified to set CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs).
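
For context, the change is essentially restricting the run to 4 GPUs; it amounts to roughly the sketch below. This is only an illustration: the actual mpirun flags and train.py arguments in convergence-AMP-8xA100-80G.sh may differ, and the -np 4 launcher change shown here is an assumption.

# Sketch of a 4-GPU launch; exact launcher flags and train.py arguments
# in convergence-AMP-8xA100-80G.sh may differ from what is shown here.
export CUDA_VISIBLE_DEVICES=0,1,2,3     # expose only 4 of the 8 GPUs

# Horovod runs one rank per visible GPU, so the rank count passed to the
# launcher presumably has to drop from 8 to 4 as well; a mismatch between
# ranks and visible GPUs is one possible cause of the stall/NCCL error below.
mpirun -np 4 --allow-run-as-root --bind-to socket \
    python3 train.py <same train.py arguments as in the 8-GPU script>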

Error Message:

[2023-06-27 10:46:24.525097: W /opt/tensorflow/horovod-source/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
1: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
2: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
Traceback (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0' defined at (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1384, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1021, in train_function
    return step_function(self, iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1010, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1000, in run_step
    outputs = model.train_step(data)
  File "/workspace/effdet-tf2/utils/train_lib.py", line 388, in train_step
    scaled_gradients = tape.gradient(scaled_loss, trainable_vars)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 774, in gradient
    return self._allreduce_grads(gradients, sources)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in allreduce_grads
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in <listcomp>
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 253, in _allreduce_cond
    return tf.cond(tf.logical_and(
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 248, in allreduce_fn
    return allreduce(tensor, *args, process_set=process_set, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 125, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py", line 123, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
File "", line 107, in horovod_allreduce
_Node: 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0'
ncclCommInitRank failed: internal error
[[{{node DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0}}]] [Op:_inference_train_function_174717]

vasanth1986 added the enhancement (New feature or request) label on Jul 4, 2023