Liveness / performance bug in PooledThreadExecutor #2744
Labels
bug
This issue is a bug.
p2
This is a standard priority issue
pending-release
This issue will be fixed by an approved PR that hasn't been released yet.
Describe the bug
My team noticed a performance / "liveness" bug when using the PooledThreadExecutor to handle lots of requests to S3. Our repro is reliable, but unfortunately too complex include with this report, so a description will have to suffice.
We observed lots of "producer" threads waiting for the SDK to complete requests that they had added to the PooledThreadExecutor's queue, while almost all of the PooledThreadExecutor's "consumer" ThreadTasks were idle, waiting at this line.
The simplest description of the bug is that function
ThreadTask::MainTaskRunner()
doesn't use or hold the same mutex, between checkingm_executor.HasTasks()
and when it waits onm_executor.m_sync.WaitOne()
. This allows a concurrent "producer" thread to add a task to the queue, afterThreadTask::MainTaskRunner()
has concluded that the queue is empty, but beforeThreadTask::MainTaskRunner()
waits on theSemaphore
.In such a case, the queue has a task to be performed, but the
ThreadTask
does not realize this.Expected Behavior
When pushing a task to the PooledThreadExecutor's queue, if a
ThreadTask
is available, then it should execute that task.Current Behavior
PooledThreadExecutor's queue is non-empty, but
ThreadTask
s remain idle.In my team's reopro, overall throughput drops dramatically, with hundreds of threads all waiting, ultimately, on the PooledThreadExecutor's Semaphore. (Eventually, in our repro, the "producer" threads fill their queues, so they end up blocked on the
ThreadTask
s, which are not executing the work on the PooledThreadExecutor's queue.)Reproduction Steps
Not feasible to repro at scale, here.
Possible Solution
Function
ThreadTask::MainTaskRunner()
needs to hold the mutex until it waits on theSemaphore
. Unfortunately, the SDK'sSemaphore
class doesn't take a mutex parameter, so the function would probably have to usestd::condition_variable
, instead. (Note that this code already usesstd::lock_guard
andstd::mutex
, so also usingstd::unique_lock
andstd::condition_variable
should not be a problem.)Possible solution, something like this (with corresponding changes made elsewhere):
Additional Information/Context
No response
AWS CPP SDK version used
1.11.x
Compiler and Version used
gcc 8
Operating System and version
Linux 5.4.258
The text was updated successfully, but these errors were encountered: