
Support arbitrary numbers of threads for memory allocation. #1739

Open
oon3m0oo wants to merge 2 commits into develop from dynamic

Conversation

@oon3m0oo
Contributor

When I originally refactored memory.c to reduce locking, I made the (incorrect) assumption that all threads were managed by OpenBLAS. The recent issues we've seen (#1735) show that, really, any caller can create its own threads and call into OpenBLAS; they don't all come from blas_server. Thus we have to support an arbitrary number of threads that can come in at any time. The original implementation (before my changes) dealt with this by having a single allocation table, and everyone had to take a lock to access and update it, which was expensive. Moving to thread-local allocation tables was much faster, but now we have to deal with the fact that thread-local storage might not be cleaned up.

This change gives each thread its own local allocation table and does away with the global table completely. We clean up allocations using pthreads' key destructor and Win32's DLL_THREAD_DETACH.

This change also removes compiler TLS, which, in the end, wasn't really worth it given the issues with the glibc implementation (#1720). The overall performance impact was < 1% anyway, and removing it also simplifies the code.
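A minimal sketch of the pthreads side of that cleanup mechanism (the table type and size are hypothetical; on Win32 the same cleanup would hang off DllMain's DLL_THREAD_DETACH case):

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_BUFFERS 64                 /* hypothetical table size */

/* Hypothetical per-thread allocation table: one slot per buffer. */
typedef struct { void *slots[NUM_BUFFERS]; } alloc_table_t;

static pthread_key_t local_storage_key;

/* Key destructor: runs automatically when a thread exits, so tables
   belonging to threads OpenBLAS never managed are still reclaimed. */
static void blas_memory_cleanup(void *arg) {
  alloc_table_t *table = (alloc_table_t *)arg;
  for (int i = 0; i < NUM_BUFFERS; i++)
    free(table->slots[i]);             /* stand-in for the real release */
  free(table);
}

static alloc_table_t *get_memory_table(void) {
  alloc_table_t *table = pthread_getspecific(local_storage_key);
  if (table == NULL) {                 /* first call from this thread */
    table = calloc(1, sizeof(*table));
    pthread_setspecific(local_storage_key, table);
  }
  return table;
}

/* Registers the destructor; must run once before any thread calls
   get_memory_table (the initialization ordering discussed below). */
static void blas_tls_init(void) {
  pthread_key_create(&local_storage_key, blas_memory_cleanup);
}
```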

@oon3m0oo
Contributor Author

Hmm... I'm not sure why Windows suddenly can't find ::Tls*.

@martin-frbg
Collaborator

Checking if this is a problem with the CI - local Linux builds show no errors so far.

@martin-frbg
Collaborator

I wonder if the Windows build issue comes simply from using a C++ idiom in a plain C file (which would imply the problem was present in your earlier code as well, just masked by the default use of compiler TLS).

@oon3m0oo
Contributor Author

oon3m0oo commented Aug 17, 2018

Aha, you're probably right. I've removed the ::; let's see what happens.

In the meantime I can reproduce the test failures and am actively debugging.

@oon3m0oo
Contributor Author

Yep, that was it. I've fixed the tests, too. The failures are make timeouts (45 minutes). Is there a way I can restart them?

@martin-frbg
Collaborator

Not sure if you as the submitter already have the right to restart individual jobs, or whether you would need to set up a free account with Travis (you should see a symbol with two concentric circles to the right of the time column in the Travis jobs table if you can restart jobs by clicking it). I have now restarted the three jobs in question, but the issue with the timeouts could only be resolved by changing the limit in the .travis.yml file. However, these three jobs used to take around twenty minutes like the others, so "make timeout" here probably means a lockup in the tests. (We will see...)

@oon3m0oo
Contributor Author

I've tried all three locally and everything was fine... not sure what's happening with Travis.

@martin-frbg
Collaborator

I cannot reproduce this either. The Travis status blog has a few recent entries about jobs that require "sudo: true" intermittently failing or hanging. It is just worrying that these three jobs should fail repeatedly...

@oon3m0oo
Contributor Author

Yeah, it is. Is there any way to get more verbose output of those jobs?

@martin-frbg
Collaborator

I could remove the settings for "quiet make" from the .travis.yml, but, ironically, Travis used to abort jobs for creating too much output without it.

@martin-frbg
Collaborator

Trying some things in my fork now. But without "QUIET", Travis tends to kill the job as it exceeds 4MB of output (mostly from warnings in the LAPACK part)...

@brada4
Contributor

brada4 commented Aug 17, 2018

Travis also kills the absolutely silent jobs...
make -s ?
I think it barfs out less than a megabyte now.

@martin-frbg
Collaborator

Turns out it is the fork safety utest that is hanging.

@brada4
Contributor

brada4 commented Aug 18, 2018

The Linux timeout command could do it; there is no POSIX equivalent, but that particular test is Linux-only anyway...

@martin-frbg
Collaborator

@brada4 this does not answer the question why the fork test fails with this change.

@brada4
Contributor

brada4 commented Aug 18, 2018

Just to fix up the CI.

@martin-frbg
Collaborator

martin-frbg commented Aug 18, 2018

The hang is reproducible with (a VM running) the same version of Ubuntu locally, so it is probably related to either glibc 2.15 or gcc 4.6.3. gdb shows one thread in __libc_thread_freeres() called from clone() via start_thread, with the other thread in pthread_join called from blas_thread_shutdown at line 982 of blas_server.c. Not sure yet why this would happen only after this change, when the only relevant difference appears to be in blas_memory_cleanup, which happens after the call to blas_thread_shutdown.

@martin-frbg
Collaborator

helgrind sees a race between calls to blas_memory_alloc from memory.c:1071 (reading with no lock held, called from dgemm, gemm.c:403) and memory.c:1097 (writing with a lock held), and another race between get_memory_table at memory.c:489 and pthread_key_create called from memory.c:1031.

@martin-frbg
Collaborator

Seems access to the newly introduced local_storage_key variable is racy, but adding another lock to protect it does not help.
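For reference, the standard race-free pattern for creating such a key is pthread_once; a minimal sketch (not necessarily what either branch ended up doing):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t local_storage_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Simplified destructor; the real one releases every tracked buffer. */
static void blas_memory_cleanup(void *table) { free(table); }

static void make_key(void) {
  pthread_key_create(&local_storage_key, blas_memory_cleanup);
}

/* Every thread funnels through pthread_once, so no thread can read
   local_storage_key before pthread_key_create has completed. */
static void blas_memory_init(void) {
  pthread_once(&key_once, make_key);
}
```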

@martin-frbg
Collaborator

martin-frbg commented Aug 19, 2018

I believe I fixed the race conditions in the "mem" branch of my fork (at the expense of another mutex lock), but the issue with travis looks like a glibc < 2.23 bug. If I move the CI to Ubuntu xenial, the utest passes.
The question now becomes how to handle these older systems: an #if !defined(__GLIBC_PREREQ) || __GLIBC_PREREQ(2,23) guard for the "new" code is easy enough, but what should the #else branch fall back to: the pre-0.3.1 memory.c, or the 0.3.2 memory.c without compiler TLS and with an increased MAX_ALLOCATING_THREADS?
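As a sketch, such a gate would likely need nested #ifs rather than the one-liner, since an undefined function-like macro (which __GLIBC_PREREQ is on non-glibc systems) cannot appear directly in a single #if expression; FALLBACK_MEMORY_C is a hypothetical switch name:

```c
/* Any glibc header pulls in <features.h>, which defines __GLIBC_PREREQ;
   on other C libraries the macro simply stays undefined. */
#include <stdlib.h>

/* The nesting matters: macro expansion of the controlling expression
   happens before && / || short-circuiting, so a bare
   "!defined(__GLIBC_PREREQ) || __GLIBC_PREREQ(2,23)" would be a
   preprocessor syntax error where the macro is undefined. */
#if defined(__GLIBC_PREREQ)
#if !__GLIBC_PREREQ(2, 23)
#define FALLBACK_MEMORY_C 1
#endif
#endif

#ifdef FALLBACK_MEMORY_C
/* old code path: pre-0.3.1 memory.c, or 0.3.2 without compiler TLS */
#else
/* new code path: per-thread tables with TLS key destructors */
#endif
```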

@oon3m0oo
Contributor Author

I've added a tiny bit of locking that ought to resolve things... we'll see. I suppose it depends on what the issue with glibc is. The change in logic here might be enough to avoid the issue altogether (crosses fingers).

@martin-frbg
Collaborator

My impression was that it was somehow related to using malloc in the threads, but I do not have the bug IDs handy now. We'll see what Travis says, and I'll also give it a spin with the Ubuntu VM later when I have time.

@martin-frbg
Collaborator

Does not look good, unfortunately: I see local segfaults (at memory.c:1128) even on relatively current systems. I must admit I am beginning to think we should revert to the 0.3.0 memory.c for 0.3.3 to give users a stable release to work with, and reintroduce the rewrite once it is stabilized and tested.

@oon3m0oo
Contributor Author

@martin-frbg what tests are you running?

@martin-frbg
Collaborator

A normal build (on openSUSE Leap 42.2, gcc 4.8.5, glibc 2.22) was enough to trigger a segfault in the sblat2 test with your latest version...

@oon3m0oo
Contributor Author

All I have is 2.24, but I can't reproduce any crashes. Everything seems to run fine, which makes all of this more complicated. I'm trying to see if there's a good way for me to get a setup where I can reproduce this, but if you have any ideas, I'd be grateful.

@martin-frbg
Collaborator

martin-frbg commented Aug 20, 2018

VirtualBox with a bunch of VMs from osboxes.org or similar? Also, running valgrind --tool=helgrind (or just valgrind in its default memcheck mode) on the BLAS tests and/or the utest binary is often helpful.

@martin-frbg
Collaborator

BTW, the sblat2 test crashes with alloc_table=0x0 (making alloc_table[0] an illegal access) at line 1128 of memory.c.

@martin-frbg
Collaborator

Seems to me the crucial problem is that your revised code is missing a call to blas_tls_init() at the start of blas_memory_alloc() (or the static int tls_initialize there is not working as intended for whatever reason). Net effect seems to be that get_memory_table gets invoked with an undefined local_storage_key.

@oon3m0oo
Contributor Author

oon3m0oo commented Aug 22, 2018

Actually, it gets called from blas_memory_init, which is called from blas_memory_alloc. And somehow that line is gone; I'll add it back in.
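For reference, the intended call order is roughly the following (a compilable sketch using the function names from this discussion, with stubbed bodies; the bug was that the blas_memory_init() call had dropped out of blas_memory_alloc(), leaving local_storage_key undefined):

```c
#include <stddef.h>

struct alloc_t;  /* opaque per-slot record */

/* Stub standing in for the real per-thread table lookup. */
static struct alloc_t **get_memory_table(void) { return NULL; }

/* In the real code this creates local_storage_key exactly once
   (see the pthread_once sketch above). */
static void blas_memory_init(void) {}

void *blas_memory_alloc(int procpos) {
  blas_memory_init();                  /* the call that had gone missing */
  struct alloc_t **alloc_table = get_memory_table();
  (void)alloc_table; (void)procpos;
  /* ... find a free slot in alloc_table and return its buffer ... */
  return NULL;                         /* placeholder in this sketch */
}
```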

oon3m0oo force-pushed the dynamic branch 2 times, most recently from e9ced31 to 0b0177b on August 22, 2018
@martin-frbg
Collaborator

martin-frbg commented Aug 22, 2018

Seems the timeouts/lock-ups in the "fork" utest are still there on older systems. My quick test (with the Ubuntu CI setup updated to avoid the perceived glibc "fork test" problem) passed everything except the mingw cross-build. (Not sure yet what happened there; it could be the same glibc problem.) Still not quite happy with putting this in 0.3.3; at least the previous version did not lock up on older systems.

@oon3m0oo
Contributor Author

Hmm... I have to admit it is rather frustrating that I can run something locally and have everything pass and run fine (no segfaults), and then end up with failures on Travis. I don't envy the huge portability effort involved in pulling this project together. I do wish that the logs at least had more details in the case of failure (it seems like only one job gives verbose output).

@martin-frbg
Collaborator

I have hacked my .travis.yml into your branch now, but the problem with the old Ubuntu "Precise" version will probably need to be debugged in a local installation. I believe the earlier problems with segfaults were a direct consequence of invoking undefined behaviour (by omitting TLS key initialization in that one code path), so they did not affect all Linux flavors in the CI and may not have occurred on whatever you use for local testing.

@martin-frbg
Collaborator

Checking "my" version locally, I "only" get the fork hang on the old Ubuntu systems, but not the segfaults of "your" version. Note the version in my fork has added safeguards against alloc_table itself being NULL, where your original only checked alloc_table[position]...
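The shape of that safeguard, as a sketch (blas_memory_free_slot is a hypothetical name; the point is the extra check on alloc_table itself, which the original code skipped before indexing alloc_table[position]):

```c
#include <stdlib.h>

static void blas_memory_free_slot(void **alloc_table, int position) {
  if (alloc_table == NULL) return;           /* TLS never initialized: bail */
  if (alloc_table[position] == NULL) return; /* slot already free */
  free(alloc_table[position]);               /* stand-in for the real release */
  alloc_table[position] = NULL;
}
```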

@martin-frbg
Collaborator

I have now put my modified version of the code into #1742, which would again make the original memory.c the default unless compiled with -DUSE_TLS (and on a system with a sufficiently capable libc).
I think this should provide a reliable solution for 0.3.3, although it puts the burden of choosing on the end user (or distribution packager) for now. Once the remaining issues are resolved, we can just change the default setting of USE_TLS.
