Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogManager::disk_thread run closures in bthread after the log manager shutdown, to avoid deadlock #242

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

PFZheng
Copy link
Collaborator

@PFZheng PFZheng commented Nov 4, 2020

In the situation, the closure may hold the last reference of NodeImpl, and done->Run() triggers the
the destructor of NodeImpl. In the destructor, NodeImpl joins the disk thread to terminate, but
the done->Run() itself is excuted in the disk thread. The disk thread can never be terminated!

… shutdown, to avoid deadlock.

In the situation, the closure may hold the last reference of NodeImpl, and done->Run() triggers the
the destructor of NodeImpl. In the destructor, NodeImpl joins the disk thread to terminate, but
the done->Run() itself is excuted in the disk thread. The disk thread can never be terminated!
@PFZheng PFZheng changed the title LogManager::disk_thread run closures in bthread after the log manager… LogManager::disk_thread run closures in bthread after the log manager shutdown, to avoid deadlock Nov 4, 2020
@PFZheng PFZheng linked an issue Nov 4, 2020 that may be closed by this pull request
@PFZheng PFZheng requested a review from Edward-xk November 4, 2020 02:32
@chenzhangyi
Copy link
Collaborator

能构造ut去复现么?

@PFZheng
Copy link
Collaborator Author

PFZheng commented Nov 10, 2020

能构造ut去复现么?

可以的,我加一下

@zhongleihe
Copy link

#0 0x00007fc4614a488b in jump_stack (to=, from=) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/stack_inl.h:132
#1 bthread::TaskGroup::sched_to (pg=pg@entry=0x7fbcdd7ab1c8, next_meta=next_meta@entry=0x7fc35084b008) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/task_group.cpp:604
#2 0x00007fc4614a6023 in sched_to (next_tid=, pg=0x7fbcdd7ab1c8) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/task_group_inl.h:78
#3 bthread::TaskGroup::sched (pg=pg@entry=0x7fbcdd7ab1c8) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/task_group.cpp:562
#4 0x00007fc4614b65eb in bthread::butex_wait (arg=0x7fbd22d46f88, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/butex.cpp:659
#5 0x00007fc4614a030f in bthread::ExecutionQueueBase::join (id=) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/execution_queue.cpp:225
#6 0x00007fc462574d45 in execution_queue_joinbraft::LogManager::StableClosure* (id=...) at /usr/include/bthread/execution_queue_inl.h:381
#7 braft::LogManager::stop_disk_thread (this=this@entry=0x7fc138fdfe08) at /mraft/src/braft/log_manager.cpp:142
#8 0x00007fc4625765d9 in braft::LogManager::~LogManager (this=0x7fc138fdfe08, __in_chrg=) at /mraft/src/braft/log_manager.cpp:124
#9 0x00007fc46259f6b6 in braft::NodeImpl::~NodeImpl (this=0x7fbc24c71008, __in_chrg=) at /mraft/src/braft/node.cpp:218
#10 0x00007fc46259fc71 in braft::NodeImpl::~NodeImpl (this=0x7fbc24c71008, __in_chrg=) at /mraft/src/braft/node.cpp:254
#11 0x00007fc4625ab6ec in DeleteInternal (x=0x7fbc24c71008) at /usr/include/butil/memory/ref_counted.h:191
#12 Destruct (x=0x7fbc24c71008) at /usr/include/butil/memory/ref_counted.h:154
#13 Release (this=0x7fbc24c71010) at /usr/include/butil/memory/ref_counted.h:182
#14 ~FollowerStableClosure (this=0x7fc087f331c8, __in_chrg=) at /mraft/src/braft/node.cpp:2683
#15 braft::FollowerStableClosure::~FollowerStableClosure (this=0x7fc087f331c8, __in_chrg=) at /mraft/src/braft/node.cpp:2685
#16 0x00007fc4625ac784 in braft::FollowerStableClosure::Run (this=0x7fc087f331c8) at /mraft/src/braft/node.cpp:2678
#17 0x00007fc46257f396 in braft::AppendBatcher::flush (this=this@entry=0x7fbcdd7ab620) at /mraft/src/braft/log_manager.cpp:654
#18 0x00007fc4625775e8 in braft::LogManager::disk_thread (meta=0x7fc138fdfe08, iter=...) at /mraft/src/braft/log_manager.cpp:782
#19 0x00007fc4614a066d in bthread::ExecutionQueueBase::_execute (this=this@entry=0x7fbd2abad308, head=0x7fc102cef788, high_priority=high_priority@entry=false, niterated=niterated@entry=0x0)
at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/execution_queue.cpp:272
#20 0x00007fc4614a2150 in bthread::ExecutionQueueBase::_execute_tasks (arg=) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/execution_queue.cpp:151
#21 0x00007fc4614a630a in bthread::TaskGroup::task_runner (skip_remained=) at /opt/hezhonglei/mraftdevelop/mraft/deps/incubator-brpc/src/bthread/task_group.cpp:296
#22 0x00007fc4614b0981 in bthread_make_fcontext () from /lib64/libbrpc.so
大量并发场景下死锁了

@zouyonghao
Copy link
Contributor

这个bug有可能在example的例子里面复现么

@GOGOYAO
Copy link

GOGOYAO commented Feb 3, 2022

这个pr还不能合入?

@@ -816,6 +840,7 @@ void LogManager::set_applied_id(const LogId& applied_id) {
void LogManager::shutdown() {
std::unique_lock<raft_mutex_t> lck(_mutex);
_stopped = true;
CHECK_EQ(0, bthread::execution_queue_execute(_disk_queue, new ShutdownClosure));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的fix是不是还是有问题:
如果在 shutdown 之前有日志添加进来,在 shudown 之后没有新的日志添加进来。
那么 shutdown 之前的日志(最后一条),依然有可能在 disk_thread 里面执行,导致 自己 join 自己

@@ -600,6 +619,11 @@ int LogManager::disk_thread(void* meta,
ret = log_manager->_log_storage->reset(rc->next_log_index());
break;
}
ShutdownClosure* sc = dynamic_cast<ShutdownClosure*>(done);
if (sc) {
ret = log_manager->_shutdown = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里应该不需要 ret 返回值

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodeImpl析构低概率死锁
7 participants