[Bug](profile) move watcher.stop() into locked code block #62683
BiteTheDDDDt wants to merge 1 commit into apache:branch-4.0
Conversation
cherry-picked from apache#56462 (master commit 83c7020).

Without taking `_task_lock`, `_watcher.stop()` in `Dependency::set_ready()` races with `start_watcher()` called inside `is_blocked_by()`: the latter may observe `_ready == false` (it acquires `_task_lock` and reads `_ready` before `set_ready()` can flip it) and re-start the stopwatch right after `set_ready()` stopped it. After the race, nothing stops the watcher again, so the `watcher_elapse_time()` reported in `close()` reflects the operator's entire lifetime instead of the actual blocked duration. This inflates the `WaitForDependency[*]Time` and `WaitForRuntimeFilter` counters, which in production were observed at ~12s while real per-RF wait times were only tens of milliseconds.

Fix: stop the watcher inside the `_task_lock` critical section so it is strictly mutually exclusive with `start_watcher()`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
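The locked ordering this backport establishes can be sketched as below. This is a simplified stand-alone model, not the actual Doris classes: the names `_task_lock`, `_watcher`, `set_ready()`, `is_blocked_by()`, and `watcher_elapse_time()` follow the PR, but the stopwatch and everything else are illustrative.

```cpp
#include <chrono>
#include <mutex>

// Stand-in for the blocked-time stopwatch (hypothetical, for illustration).
class StopWatch {
public:
    void start() {
        if (!_running) { _start = Clock::now(); _running = true; }
    }
    void stop() {
        if (_running) { _elapsed += Clock::now() - _start; _running = false; }
    }
    // Like watcher_elapse_time(): includes the in-flight interval if running.
    std::chrono::nanoseconds elapsed() const {
        auto e = _elapsed;
        if (_running) e += Clock::now() - _start;
        return e;
    }
private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point _start{};
    std::chrono::nanoseconds _elapsed{0};
    bool _running = false;
};

class Dependency {
public:
    // After the fix: stop() happens inside the same critical section that
    // publishes _ready, so it cannot interleave with start_watcher() in
    // is_blocked_by() below.
    void set_ready() {
        std::lock_guard<std::mutex> lk(_task_lock);
        _watcher.stop();
        _ready = true;
    }
    // is_blocked_by(): check _ready and (re)start the watcher atomically.
    bool is_blocked_by() {
        std::lock_guard<std::mutex> lk(_task_lock);
        if (_ready) return false;
        _watcher.start();  // can no longer race with the stop() above
        return true;
    }
    std::chrono::nanoseconds watcher_elapse_time() const {
        return _watcher.elapsed();
    }
private:
    std::mutex _task_lock;
    StopWatch _watcher;
    bool _ready = false;
};
```

With the pre-fix ordering (stop() before taking the lock), `is_blocked_by()` could observe `_ready == false` after the stop, restart the watcher, and nothing would ever stop it again.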
run buildall
Pull request overview
Backports the fix from #56462 to prevent a race where a dependency watcher can be re-started between stop() and publishing _ready = true, inflating profile wait-time counters.
Changes:
- Move `_watcher.stop()` under `_task_lock` in `Dependency::set_ready()`.
- Ensure the watcher stop happens before setting `_ready = true` while holding the mutex.
PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #56462
Problem Summary:
Backport of #56462 to this branch.
`Dependency::set_ready()` previously called `_watcher.stop()` outside `_task_lock`. A concurrent `is_blocked_by()` (which acquires `_task_lock`, checks `_ready`, and calls `start_watcher()`) could re-start the watcher right after `stop()` but before `_ready = true` was published; nothing stops it again. As a result, `watcher_elapse_time()` accumulates wall-clock time from the first block until the operator closes, making the `WaitForDependency`/`WaitForRuntimeFilter` profile counters appear hugely inflated (e.g. ~12s on a 20s query while each individual RF `WaitTime` is only tens of ms).

Move `_watcher.stop()` inside the `_task_lock` block, before setting `_ready = true`, matching the master fix.

Release note
None
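The pre-fix interleaving described in the problem summary can be forced deterministically by splitting the old `set_ready()` into its two unsynchronized halves. This is a hypothetical minimal model, not the actual Doris code: the watcher is reduced to a boolean flag, and all names besides `is_blocked_by` are invented for the demonstration.

```cpp
// Hypothetical model of the pre-fix race: the two halves of the old
// set_ready() are exposed separately so the bad interleaving can be forced.
struct RacyDependency {
    bool ready = false;
    bool watcher_running = true;  // started when the task first blocked

    // Pre-fix set_ready(), step 1: stop the watcher OUTSIDE the lock.
    void set_ready_stop_watcher() { watcher_running = false; }
    // Pre-fix set_ready(), step 2: publish readiness under the lock.
    void set_ready_publish() { ready = true; }

    // is_blocked_by(): sees ready == false and restarts the watcher.
    bool is_blocked_by() {
        if (ready) return false;
        watcher_running = true;  // the racing re-start
        return true;
    }
};

// Force the interleaving stop -> is_blocked_by() -> publish. Returns true
// when the watcher is left running forever, which is the inflated
// WaitForDependency / WaitForRuntimeFilter symptom the PR fixes.
inline bool race_leaves_watcher_running() {
    RacyDependency dep;
    dep.set_ready_stop_watcher();  // set_ready() stops the watcher...
    dep.is_blocked_by();           // ...racer observes ready==false, restarts it
    dep.set_ready_publish();       // readiness is published too late
    return dep.watcher_running;
}
```

Moving the stop under the lock makes steps 1 and 2 a single critical section, so `is_blocked_by()` can only run entirely before (watcher restarted, then stopped again) or entirely after (sees `ready == true`, never restarts).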
Check List (For Author)