When I try to start Spark with RAPIDS, the executors consistently lose their connection. The log shows "command exited with code 1," followed by repeated reconnection attempts.
Additionally, when using spark-submit, I encounter a warning regarding GPU overrides:
After this warning, more executors are lost.
nvidia-smi:
versions:
Hello,
Can you check the executor logs? They often contain details about why the executors were lost. Depending on the resource manager you are using, you can usually see them in the Spark UI under the "Executors" tab. If you are running on something like Kubernetes, you may have to run kubectl logs on the pod, or, if you are aggregating logs somewhere, look there.
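If the cluster happens to be running in Spark standalone mode (an assumption, since the resource manager is not stated in this thread), the executor logs are also written on each worker under the worker's work directory, laid out as `<work dir>/<app-id>/<executor-id>/stderr`. A minimal Python sketch for pulling the last lines of every executor's stderr is shown here; the function name `tail_executor_stderr` and the path passed to it are hypothetical, so adjust them to your installation.

```python
from pathlib import Path


def tail_executor_stderr(work_dir: str, lines: int = 20) -> dict:
    """Map each executor stderr file under `work_dir` to its last `lines` lines."""
    tails = {}
    # Assumed standalone layout: <work_dir>/<app-id>/<executor-id>/stderr
    for log in sorted(Path(work_dir).glob("*/*/stderr")):
        # errors="replace" keeps the scan going if a log has odd bytes in it
        text = log.read_text(errors="replace")
        tails[str(log)] = text.splitlines()[-lines:]
    return tails
```

Running it on a worker against that worker's work directory and printing the result should surface the exception or native error that made the executor exit, which the driver-side "exited with code 1" message does not include.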
My issue has been resolved. I reinstalled both CUDA and the NVIDIA driver. In addition, the users on each node need to be members of the video group.
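For anyone hitting the same thing, here is a rough Python sketch for checking the group membership on a node. It assumes a Linux node and that the relevant group is literally named `video`, as in the fix above; the helper name `user_in_group` is made up for illustration.

```python
import grp
import pwd


def user_in_group(username: str, group_name: str = "video") -> bool:
    """Return True if `username` has `group_name` as a primary or supplementary group."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return False  # the group does not exist on this node
    if username in group.gr_mem:
        return True  # listed as a supplementary member
    try:
        # Fall back to checking the user's primary group
        return pwd.getpwnam(username).pw_gid == group.gr_gid
    except KeyError:
        return False  # unknown user
```

Call it with the account that launches the Spark worker or executor processes. If it returns False, adding the user with `sudo usermod -aG video <user>` and starting a fresh login session for that user is the usual way to apply the change.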