-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread callback for OpenMP backend #4770
Comments
This is hopefully fixed by the one-liner in #4775 (though I still need to track down the source of the spurious queue entry) |
Thanks. I just tested the fix but now the program hangs at the first step of the callback loop. Here's the backtrace:
And some additional info
(Side question: Is it expected that even when setting a sequential callback, OpenBLAS still creates that many threads ?) |
Strange, I ran several tests with and without valgrind and did not encounter any hang yesterday. Is this still with the unmodified testcase that you posted above ? I guess there is a small chance that this may be a "generic" lockup from combining OpenMP and non-OpenMP threading which don't know of each other's existence. (And my understanding is that in this iteration of the user-defined threading callback, the OpenBLAS mechanisms for obtaining the initial thread count are unchanged. @shivammonaka can you comment on this issue ? |
Yes with the exact same reproducer, with a sequential loop or with an OpenMP based parallel loop. In addition, it works if I set |
I cannot reproduce the problem (on x86_64 hardware of various sizes up to 64 cores and with multiple compiler versions) with my PR in place. Did you do a full rebuild (i.e. |
Sorry, there was a slight difference with the original reproducer. I explicitely set the number of openmp threads for the for loop: void omp_cb (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
#pragma omp parallel for num_threads(4) // <-- here
for(int i = 0; i < numjobs; i++)
{
printf("thread: %d, i: %d\n", omp_get_thread_num(), i);
void *element_adrr = (void *) (((char *)jobdata) + ((unsigned) i)*jobdata_elsize);
dojob(i, element_adrr, dojob_data);
}
return;
} Then if I set |
Hi @martin-frbg, @jeremiedbb , Sorry for the late reply. I was preoccupied with some other stuff. Let me explain my understanding of this error Question : Why does this happens : - Answer :
I am currently working on fixing this thing only. I want to architecturally change the execution of the algorithm so that
I have a solution ready for OpenMP backend but currently its in testing phase. I will try to get it out ASAP. |
Seems to be related to requesting more OMP threads than there are iterations in the for loop, possibly the two independent threading models each having an idle thread and both waiting for each other to make the first move... |
It actually requires Number of OMP Threads on caller side >= Number of iteration in the for loop. BLAS wants each iteration of the loop to be executed by a different thread otherwise it goes into infinite waiting. Sadly, That's how |
Hi,
I'm trying to leverage #4577 in a project (scikit-learn) that has a mix of OpenMP and OpenBLAS built with the pthreads threading layer to make OpenBLAS use the OpenMP threadpool. Ideally we'd use an OpenBLAS built with the OpenMP threading layer but it's outside of our control because it comes from a dependency.
I naively tried the example callback presented in #4577 (comment), but I can't make it work. Here's a simple reproducer, just a gemm:
test.c
Compile command:
It just results in a segfault at the first step of the loop. Note that it still segfaults if I remove the omp pragma and just use a sequential loop in the callback.
Any help would be greatly appreciated.
The text was updated successfully, but these errors were encountered: