Inquiry and Suggestions Regarding OpenBLAS Code Flow with OpenMP #4418
Comments
Regarding your questions:

1. The code involving max_parallel_number is only a stopgap "solution" (from #1536) for the case where an OpenMP program makes two or more concurrent calls to BLAS functions and the calls to exec_blas would end up using the same queue buffers. For calls from within an OpenMP parallel region, OpenBLAS will currently use a single thread only (see the omp_in_parallel() conditional in common_thread.h). Generally, the OpenMP code in OpenBLAS for the most part dates back to the original GotoBLAS2 of 15 years ago, so it tends not to use more recent OpenMP features. I am currently trying to unravel all this, but it is turning into a bigger task than I anticipated...
2. Unfortunately it is not quite clear to me what you are trying to achieve by using fewer threads inside the level3 code; possibly you are accidentally creating a situation where part of the code still expects all threads to be in use? At least I am not aware of deadlocks in "normal" operation.
3. The pthreads code already uses a configurable THREAD_TIMEOUT for putting idle threads to sleep; the OpenMP server code appears to rely on the OMP_WAIT_POLICY of the OpenMP implementation to do the right thing. The busy-waiting inside inner_thread is a feature inherited from GotoBLAS2 (for which almost no documentation of implementation details exists).
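(For illustration, the kind of check referred to by the omp_in_parallel() conditional looks roughly like the sketch below. This is a minimal sketch of the pattern only, not the actual code in common_thread.h.)

```c
#include <omp.h>

/* Sketch of the pattern only: fall back to a single thread when the
 * caller is already inside an OpenMP parallel region, so that the library
 * does not try to spawn its own team of worker threads on top of it. */
static int choose_num_threads(int requested)
{
    if (omp_in_parallel())
        return 1;          /* nested call: run this BLAS call serially */
    return requested;      /* top-level call: use the requested count  */
}
```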
Hi @martin-frbg
It is entirely possible that you know more about the topic than I do, and as mentioned the core parts here stem from a time when large, fast multiprocessor machines barely existed. (The pthreads build was not thread safe at all back then; I suspect OpenMP was mostly used as an opaque way of getting around that.)
FYI it appears that taskyield is an empty stub in GCC's libgomp - the situation appears to be better with LLVM's libomp, at least from about 2020 onwards, but I have not checked in depth whether it is implemented as anything fundamentally better than the current YIELDING.
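For what it's worth, swapping the raw busy-wait for a cooperative taskyield would look roughly like the sketch below. This is only an illustration of the idea, not a patch against blas_server_omp.c, and as noted above GCC's libgomp reportedly implements taskyield as an empty stub, so there it may behave exactly like spinning.

```c
#include <omp.h>

/* Illustrative only: spin on a completion flag, but offer the OpenMP
 * runtime a chance to schedule other tasks on this thread while waiting. */
static void wait_for_completion(volatile int *done)
{
    while (!*done) {
        #pragma omp taskyield   /* may be a no-op, depending on the runtime */
    }
}
```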
While I'm still looking into this, unfortunately I have not seen any gains from your suggestions - though perhaps I am doing something wrong. Did you try with omp_set_lock() yourself? The busy-waiting is a complicated topic; we've discussed and experimented with various alternatives for years (e.g. #1051) without conclusive results, and I notice the BLIS project is stuck in the same predicament (e.g. their issue #604). Lastly, if you want to restrict the number of threads for a given BLAS call, you'd need to do that at the interface stage (or by invoking openblas_set_num_threads() in the calling program) to keep the thread count information consistent across the library.
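For reference, capping the thread count from the calling program would look roughly like this. A minimal sketch: the gemm call and matrix sizes are placeholders, and the explicit extern declaration is only shown in case the header in use does not already declare openblas_set_num_threads().

```c
#include <cblas.h>

/* openblas_set_num_threads() is OpenBLAS's runtime knob for the thread
 * count; declared explicitly here in case the header does not expose it. */
extern void openblas_set_num_threads(int num_threads);

void small_gemm(const double *A, const double *B, double *C, int n)
{
    openblas_set_num_threads(4);   /* restrict threads before the BLAS call */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
```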
I'm told that this busy-wait/spin-lock is also causing an issue with #4577, where they are trying to plug in alternative threading backends like TBB. Fundamentally, if you are using a backend like TBB or Cilk or partr, you give it tasks that could be executed in parallel, but you have no control over whether they actually are executed in parallel. So you cannot use a spin-lock in the tasks that assumes they are running in parallel. That means that, in that context, they will need to replace the spin-lock with something else.
For performance, it might be possible to pass additional hints to the runtime scheduler of TBB (or whatever) that certain tasks should be scheduled onto certain threads. But fundamentally, you might lose some parallel performance as a tradeoff for attaining better composability.
I guess we could create a separate "spin-lock" function that would have to be overridden in the callback from #4577. Not sure if it is possible to determine at runtime what the threading backend is (or even just if it is TBB), but there is probably no way to come up with a backend-agnostic synchronization mechanism(?). |
You could have a function like:

```c
if (waitgroup_callback != NULL) {
    waitgroup_callback(groupinfo);
} else {
    /* busy-wait */
}
```

where there would be some API to set the callback.
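A hypothetical shape for that registration API could be the following; every name here is invented for illustration and none of it exists in OpenBLAS today.

```c
/* Hypothetical: a callback that an alternative threading backend (TBB,
 * partr, ...) could install to replace the built-in busy-wait. */
typedef void (*waitgroup_fn)(void *groupinfo);

static waitgroup_fn waitgroup_callback = NULL;

/* Invented setter name, for illustration only. */
void openblas_set_waitgroup_callback(waitgroup_fn fn)
{
    waitgroup_callback = fn;
}
```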
Hello,
I've been delving into the OpenBLAS codebase, specifically focusing on the gemm_driver function in level3_thread.c with the USE_OPENMP=1 flag enabled. I've come across a section in the code where a lock is taken in the Pthreads and Win32 backends before initializing and assigning the blas queue, and the lock is released only after the exec_blas call has completed. My current understanding is that when No_of_threads < nthreads (the number of queue partitions defined in level3_thread.c), I encounter a deadlock within the inner_threads function.

I have a few questions and want to seek suggestions from the community:
1. I noticed that OpenMP locking mechanisms like omp_set_lock are not used; instead, busy-waiting is implemented inside the exec_blas function in blas_server_omp.c using max_parallel_number. Could you please provide insights into this choice? (A minimal omp_set_lock example is included at the end of this post for reference.)
2. The inner_thread function appears to encounter a deadlock when used with fewer threads. I tested this in OpenMP by creating a parallel region inside exec_blas at line 437 of blas_server_omp.c with a fixed number of threads less than nthreads. Could you shed some light on the reasons behind this behavior?
3. Considering the busy-waiting in the code, have there been considerations for putting threads to sleep or employing other synchronization methods to enhance efficiency? (One such alternative is sketched at the end of this post.)
I would appreciate clarification on these points.
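For context on question 1, this is what using the standard OpenMP lock API looks like in general. It is a generic, self-contained example rather than anything taken from OpenBLAS.

```c
#include <omp.h>

/* Generic illustration of omp_set_lock/omp_unset_lock around a shared
 * resource; not OpenBLAS code. */
static omp_lock_t queue_lock;

void locked_update(void)
{
    omp_init_lock(&queue_lock);

    #pragma omp parallel
    {
        omp_set_lock(&queue_lock);     /* blocks until the lock is free */
        /* ... touch shared queue state here ... */
        omp_unset_lock(&queue_lock);
    }

    omp_destroy_lock(&queue_lock);
}
```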
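As background for question 3, one generic way to put waiting threads to sleep instead of spinning is a condition-variable-based wait group. The sketch below is illustrative only, with invented names, and is not a change taken from or proposed against the OpenBLAS sources.

```c
#include <pthread.h>

/* Illustrative wait group: waiters sleep on a condition variable instead
 * of busy-waiting; workers signal when the last work item completes. */
typedef struct {
    pthread_mutex_t lock;       /* e.g. PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t  cond;       /* e.g. PTHREAD_COND_INITIALIZER  */
    int             remaining;  /* outstanding work items */
} waitgroup_t;

void waitgroup_wait(waitgroup_t *wg)
{
    pthread_mutex_lock(&wg->lock);
    while (wg->remaining > 0)
        pthread_cond_wait(&wg->cond, &wg->lock);   /* sleeps, no spinning */
    pthread_mutex_unlock(&wg->lock);
}

void waitgroup_done(waitgroup_t *wg)
{
    pthread_mutex_lock(&wg->lock);
    if (--wg->remaining == 0)
        pthread_cond_broadcast(&wg->cond);
    pthread_mutex_unlock(&wg->lock);
}
```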