-
Thanks for your question and for providing details about the issue you are encountering. We will investigate the exception and get back to you with next steps. One question: in what environment are you running Spark (e.g. YARN)? The CPU out-of-memory error is happening in off-heap memory, which is controlled by the
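For reference, the off-heap-related settings in a typical spark-rapids deployment look something like the sketch below. The setting names come from the standard Spark and spark-rapids configuration documentation; the sizes are placeholders, not recommendations, and which setting applies here depends on your deployment:

```properties
# Standard Spark off-heap memory settings (placeholder sizes)
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=4g

# spark-rapids pinned host memory pool (placeholder size)
spark.rapids.memory.pinnedPool.size=2g
```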
-
Looking at the code, I think we must have a bug of some kind where we are throwing the wrong type of exception (CPU OOM vs. GPU OOM). The reason I say this is that it is the retry block that explicitly says it does not support split-and-retry. Can you share what type of GPU you are running on? Also, are there other processes that might be sharing that same GPU? I ask because the processing you are doing does not look like it should cause any OOM issues, either CPU or GPU. I cannot see the entire query, so it is a little hard to tell, but what I can see is a coalesce batches followed by a sort, which then goes into a columnar-to-row transition, some processing on the CPU, and then a row-to-columnar transition followed by a shuffle. All of that code has been extensively tested; if we run out of memory the performance may not be great, but we should not crash. I'll try to reproduce this locally and see what I can come up with.
-
The exception is thrown from the executor, but the executor memory should be enough (50 GB). Besides, I used the `top` command to monitor memory usage on all nodes, and usage never went above 50%. How do I prevent this OOM error, and is there any way or tool that can help me find the root cause?
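Since `top` only gives coarse point-in-time samples, one way to narrow down where the pressure is (off-heap host memory vs. GPU memory) is to log both over the life of the job. This is just a minimal sketch, assuming a Linux node with `/proc/meminfo` and, optionally, `nvidia-smi` on the PATH:

```python
import shutil
import subprocess
import time


def host_mem_mib():
    """Return (used, total) host memory in MiB from /proc/meminfo (Linux)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in KiB
    total = info["MemTotal"] // 1024
    used = (info["MemTotal"] - info["MemAvailable"]) // 1024
    return used, total


def gpu_mem_mib():
    """Return nvidia-smi's memory report, or None when no GPU tooling is present."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for _ in range(3):  # a few samples, one second apart
        used, total = host_mem_mib()
        print(f"host: {used} MiB used of {total} MiB")
        gpu = gpu_mem_mib()
        print(f"gpu:  {gpu if gpu is not None else 'nvidia-smi not found'}")
        time.sleep(1)
```

Running this (or a cron/loop variant) on each executor node alongside the job gives a timeline you can line up against the exception timestamp, which is more reliable than eyeballing `top`.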