Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect synchronization in private server #739

Merged
merged 4 commits into from
Jan 26, 2022

Conversation

pavelkumbrasev
Copy link
Contributor

Signed-off-by: pavelkumbrasev [email protected]

Description

Patch fixes incorrect happens-before between worker thread and finalize.
There was data race between commit_wait on worker thread and thread that calls finalize -> start_shutdown for worker thread.
Because there was no happens-before shutdowning thread reads incorrect private_worker data and that leads to hang on join() of private_worker.

Current patch fixes test_global_control for #771 issue.

  • - git commit message contains appropriate signed-off-by string (see CONTRIBUTING.md for details)

Type of change

  • bug fix - change which fixes an issue
  • new feature - change which adds functionality
  • tests - change in tests
  • infrastructure - change in infrastructure and CI
  • documentation - documentation update

Tests

  • added - required for new features and for some bug fixes
  • not needed

Documentation

  • updated in # - add PR number
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

@alexey-katranov @phprus @anton-potapov

Other information

@phprus
Copy link
Contributor

phprus commented Jan 19, 2022

Test test_global_control hangs again (first hang after 1200 runs) :(

Commit: 0ef8048
Backtrace for test_global_control (release build with LTO):

phprus@mbp release % ps ax | grep test
31860 s011  S+     0:04.12 ./appleclang_13.0_cxx17_64_release/test_global_control
31876 s014  R+     0:00.00 grep test
phprus@mbp release % lldb --attach-pid 31860
(lldb) process attach --pid 31860
Process 31860 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00000001b079370c libsystem_kernel.dylib`__ulock_wait + 8
libsystem_kernel.dylib`__ulock_wait:
->  0x1b079370c <+8>:  b.lo   0x1b079372c               ; <+40>
    0x1b0793710 <+12>: pacibsp
    0x1b0793714 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1b0793718 <+20>: mov    x29, sp
Target 0: (test_global_control) stopped.

Executable module set to "/Users/phprus/Devel/oneapi-src/tmp/pr739/oneTBB-0ef804884856e791fe7bb173078a92daf3929f76/build/release/appleclang_13.0_cxx17_64_release/test_global_control".
Architecture set to: arm64e-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001b079370c libsystem_kernel.dylib`__ulock_wait + 8
    frame #1: 0x00000001b07cf574 libsystem_pthread.dylib`_pthread_join + 452
    frame #2: 0x00000001b072873c libc++.1.dylib`std::__1::thread::join() + 36
    frame #3: 0x000000010290632c test_global_control`DOCTEST_ANON_FUNC_51() + 108
    frame #4: 0x0000000102903694 test_global_control`main + 11016
    frame #5: 0x0000000102ced0f4 dyld`start + 520
(lldb) thread select 2
* thread #2
    frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
libsystem_kernel.dylib`__semwait_signal:
->  0x1b0794ebc <+8>:  b.lo   0x1b0794edc               ; <+40>
    0x1b0794ec0 <+12>: pacibsp
    0x1b0794ec4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1b0794ec8 <+20>: mov    x29, sp
(lldb) bt
* thread #2
  * frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
    frame #1: 0x00000001b069fd88 libsystem_c.dylib`nanosleep + 216
    frame #2: 0x00000001b0728820 libc++.1.dylib`std::__1::this_thread::sleep_for(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&) + 84
    frame #3: 0x0000000102919c08 test_global_control`TestBlockingTerminateNS::ExceptionTest2::Body::operator()(int) const + 168
    frame #4: 0x0000000102919948 test_global_control`tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<int>, tbb::detail::d1::parallel_for_body_wrapper<TestBlockingTerminateNS::ExceptionTest2::Body, int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) + 1204
    frame #5: 0x0000000102a9f104 libtbb.12.6.dylib`tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) + 1028
    frame #6: 0x000000010291915c test_global_control`tbb::detail::d1::task_arena_function<TestBlockingTerminateNS::ExceptionTest2::operator()()::'lambda'(), void>::operator()() const + 276
    frame #7: 0x0000000102a8f458 libtbb.12.6.dylib`tbb::detail::r1::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) + 312
    frame #8: 0x0000000102909344 test_global_control`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void utils::NativeParallelFor<int, DOCTEST_ANON_FUNC_51()::$_4>(int, DOCTEST_ANON_FUNC_51()::$_4 const&)::'lambda'()> >(void*) + 432
    frame #9: 0x00000001b07cd240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) thread select 3
* thread #3
    frame #0: 0x00000001b0791990 libsystem_kernel.dylib`semaphore_wait_trap + 8
libsystem_kernel.dylib`semaphore_wait_trap:
->  0x1b0791990 <+8>: ret

libsystem_kernel.dylib`semaphore_wait_signal_trap:
    0x1b0791994 <+0>: mov    x16, #-0x25
    0x1b0791998 <+4>: svc    #0x80
    0x1b079199c <+8>: ret
(lldb) bt
* thread #3
  * frame #0: 0x00000001b0791990 libsystem_kernel.dylib`semaphore_wait_trap + 8
    frame #1: 0x0000000102a9c0fc libtbb.12.6.dylib`tbb::detail::r1::rml::private_worker::thread_routine(void*) + 680
    frame #2: 0x00000001b07cd240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) thread select 4
* thread #4
    frame #0: 0x00000001b0791990 libsystem_kernel.dylib`semaphore_wait_trap + 8
libsystem_kernel.dylib`semaphore_wait_trap:
->  0x1b0791990 <+8>: ret

libsystem_kernel.dylib`semaphore_wait_signal_trap:
    0x1b0791994 <+0>: mov    x16, #-0x25
    0x1b0791998 <+4>: svc    #0x80
    0x1b079199c <+8>: ret
(lldb) bt
* thread #4
  * frame #0: 0x00000001b0791990 libsystem_kernel.dylib`semaphore_wait_trap + 8
    frame #1: 0x0000000102a9c0fc libtbb.12.6.dylib`tbb::detail::r1::rml::private_worker::thread_routine(void*) + 680
    frame #2: 0x00000001b07cd240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) thread select 5
* thread #5
    frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
libsystem_kernel.dylib`__semwait_signal:
->  0x1b0794ebc <+8>:  b.lo   0x1b0794edc               ; <+40>
    0x1b0794ec0 <+12>: pacibsp
    0x1b0794ec4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1b0794ec8 <+20>: mov    x29, sp
(lldb) bt
* thread #5
  * frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
    frame #1: 0x00000001b069fd88 libsystem_c.dylib`nanosleep + 216
    frame #2: 0x00000001b0728820 libc++.1.dylib`std::__1::this_thread::sleep_for(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&) + 84
    frame #3: 0x0000000102919c08 test_global_control`TestBlockingTerminateNS::ExceptionTest2::Body::operator()(int) const + 168
    frame #4: 0x0000000102919948 test_global_control`tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<int>, tbb::detail::d1::parallel_for_body_wrapper<TestBlockingTerminateNS::ExceptionTest2::Body, int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) + 1204
    frame #5: 0x0000000102a8d56c libtbb.12.6.dylib`tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) + 1444
    frame #6: 0x0000000102a99a9c libtbb.12.6.dylib`tbb::detail::r1::market::process(rml::job&) + 52
    frame #7: 0x0000000102a9bf60 libtbb.12.6.dylib`tbb::detail::r1::rml::private_worker::thread_routine(void*) + 268
    frame #8: 0x00000001b07cd240 libsystem_pthread.dylib`_pthread_start + 148
(lldb) thread select 6
* thread #6
    frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
libsystem_kernel.dylib`__semwait_signal:
->  0x1b0794ebc <+8>:  b.lo   0x1b0794edc               ; <+40>
    0x1b0794ec0 <+12>: pacibsp
    0x1b0794ec4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1b0794ec8 <+20>: mov    x29, sp
(lldb) bt
* thread #6
  * frame #0: 0x00000001b0794ebc libsystem_kernel.dylib`__semwait_signal + 8
    frame #1: 0x00000001b069fd88 libsystem_c.dylib`nanosleep + 216
    frame #2: 0x00000001b0728820 libc++.1.dylib`std::__1::this_thread::sleep_for(std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > const&) + 84
    frame #3: 0x0000000102919c08 test_global_control`TestBlockingTerminateNS::ExceptionTest2::Body::operator()(int) const + 168
    frame #4: 0x0000000102919948 test_global_control`tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<int>, tbb::detail::d1::parallel_for_body_wrapper<TestBlockingTerminateNS::ExceptionTest2::Body, int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) + 1204
    frame #5: 0x0000000102a8d56c libtbb.12.6.dylib`tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) + 1444
    frame #6: 0x0000000102a99a9c libtbb.12.6.dylib`tbb::detail::r1::market::process(rml::job&) + 52
    frame #7: 0x0000000102a9bf60 libtbb.12.6.dylib`tbb::detail::r1::rml::private_worker::thread_routine(void*) + 268
    frame #8: 0x00000001b07cd240 libsystem_pthread.dylib`_pthread_start + 148
(lldb)

Output:

[doctest] doctest version is "2.4.7"
[doctest] run with "--help" for options










^C===============================================================================
/Users/phprus/Devel/oneapi-src/tmp/pr739/oneTBB-0ef804884856e791fe7bb173078a92daf3929f76/test/tbb/test_global_control.cpp:239:
TEST CASE:  prolong lifetime advanced

/Users/phprus/Devel/oneapi-src/tmp/pr739/oneTBB-0ef804884856e791fe7bb173078a92daf3929f76/test/tbb/test_global_control.cpp:239: FATAL ERROR: test case CRASHED: SIGINT - Terminal interrupt signal

===============================================================================
[doctest] test cases: 3 | 2 passed | 1 failed | 1 skipped
[doctest] assertions: 9 | 9 passed | 0 failed |
[doctest] Status: FAILURE!

@phprus
Copy link
Contributor

phprus commented Jan 19, 2022

Test test_tbb_fork work without errors and hang (7000+ runs).

@phprus
Copy link
Contributor

phprus commented Jan 19, 2022

Test test_global_control debug build:

[doctest] doctest version is "2.4.7"
[doctest] run with "--help" for options
Assertion pred() failed (located in the enforce function, line in file: 173)
===============================================================================
/Users/phprus/Devel/oneapi-src/tmp/pr739/oneTBB-0ef804884856e791fe7bb173078a92daf3929f76/test/tbb/test_global_control.cpp:250:
TEST CASE:  prolong lifetime multiple wait

/Users/phprus/Devel/oneapi-src/tmp/pr739/oneTBB-0ef804884856e791fe7bb173078a92daf3929f76/test/tbb/test_global_control.cpp:250: FATAL ERROR: test case CRASHED: SIGABRT - Abort (abnormal termination) signal

===============================================================================
[doctest] test cases:  4 |  3 passed | 1 failed | 0 skipped
[doctest] assertions: 58 | 58 passed | 0 failed |
[doctest] Status: FAILURE!
zsh: abort      ./appleclang_13.0_cxx17_64_debug/test_global_control

Assert:
https://github.com/oneapi-src/oneTBB/blob/0ef804884856e791fe7bb173078a92daf3929f76/src/tbb/market.h#L171-L174

@pavelkumbrasev
Copy link
Contributor Author

@phprus Thanks a lot for testing! Now it's a different problem actually, so a least one I fixed.

@phprus
Copy link
Contributor

phprus commented Jan 20, 2022

@pavelkumbrasev Do I need to create a new issue?

@pavelkumbrasev
Copy link
Contributor Author

@pavelkumbrasev Do I need to create a new issue?

It would be great.

@phprus
Copy link
Contributor

phprus commented Jan 21, 2022

@pavelkumbrasev,
Issue #745

@pavelkumbrasev pavelkumbrasev force-pushed the dev/pavelkumbrasev/fix_race_in_private_server branch from 0ef8048 to a22919b Compare January 26, 2022 05:22
@pavelkumbrasev
Copy link
Contributor Author

@aleksei-fedotov @anton-potapov

@pavelkumbrasev pavelkumbrasev changed the title WIP: Fix incorrect synchronization in private server Fix incorrect synchronization in private server Jan 26, 2022
@alexey-katranov alexey-katranov merged commit 525a388 into master Jan 26, 2022
@alexey-katranov alexey-katranov deleted the dev/pavelkumbrasev/fix_race_in_private_server branch January 26, 2022 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests test_global_control, test_tbb_fork sporadically hangs on Apple Silicon
4 participants