[Bug](profile) move watcher.stop() into locked code block #62683

Open
BiteTheDDDDt wants to merge 1 commit into apache:branch-4.0 from BiteTheDDDDt:fix/dep-watcher-race-4.0

@BiteTheDDDDt
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #56462

Problem Summary:

Backport of #56462 to this branch.

Dependency::set_ready() previously called _watcher.stop() outside _task_lock. A concurrent is_blocked_by() (which acquires _task_lock, checks _ready, and calls start_watcher()) could re-start the watcher right after stop() but before _ready = true was published; nothing stops it again. As a result watcher_elapse_time() accumulates wall-clock time from the first block until the operator closes, making WaitForDependency / WaitForRuntimeFilter profile counters appear hugely inflated (e.g. ~12s on a 20s query while each individual RF WaitTime is only tens of ms).

Move _watcher.stop() inside the _task_lock block, before setting _ready = true, matching the master fix.
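
The ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Doris source: `Stopwatch`, `watcher_running()`, and `demo_fixed_order()` are hypothetical names, `_ready` is modeled as a plain `bool` guarded by the mutex, and the real `set_ready()` and `is_blocked_by()` may do more work than shown. Only the relative order of `_watcher.stop()` and `_ready = true` under `_task_lock` is taken from the PR description.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the watcher; not thread-safe by itself,
// so every access below happens while holding _task_lock.
class Stopwatch {
public:
    void start() {
        if (!_running) {
            _begin = std::chrono::steady_clock::now();
            _running = true;
        }
    }
    void stop() {
        if (_running) {
            _total += std::chrono::steady_clock::now() - _begin;
            _running = false;
        }
    }
    bool running() const { return _running; }
    int64_t elapsed_ns() const {
        auto total = _total;
        if (_running) total += std::chrono::steady_clock::now() - _begin;
        return std::chrono::duration_cast<std::chrono::nanoseconds>(total).count();
    }

private:
    std::chrono::steady_clock::time_point _begin;
    std::chrono::steady_clock::duration _total{0};
    bool _running = false;
};

// Simplified model of the dependency described in the PR.
class Dependency {
public:
    // Checks _ready and starts the watcher in the same critical section.
    bool is_blocked_by() {
        std::lock_guard<std::mutex> guard(_task_lock);
        if (_ready) return false;
        _watcher.start();
        return true;
    }

    // Fixed order: stop the watcher and publish _ready under one lock,
    // so is_blocked_by() cannot run between the two steps.
    void set_ready() {
        std::lock_guard<std::mutex> guard(_task_lock);
        _watcher.stop();
        _ready = true;
    }

    bool watcher_running() {
        std::lock_guard<std::mutex> guard(_task_lock);
        return _watcher.running();
    }

    int64_t watcher_elapse_time() {
        std::lock_guard<std::mutex> guard(_task_lock);
        return _watcher.elapsed_ns();
    }

private:
    std::mutex _task_lock;
    bool _ready = false;
    Stopwatch _watcher;
};

// Small self-check: once set_ready() returns, the watcher stays stopped.
inline void demo_fixed_order() {
    Dependency dep;
    assert(dep.is_blocked_by());
    dep.set_ready();
    assert(!dep.watcher_running());
    assert(!dep.is_blocked_by());
    assert(!dep.watcher_running());
}
```

Because both methods take the same mutex, a caller of `is_blocked_by()` either runs entirely before `set_ready()` (its watcher start is then stopped) or entirely after it (it sees `_ready == true` and never starts the watcher).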

Release note

None

Check List (For Author)

cherry-picked from apache#56462 (master commit 83c7020).

Without taking _task_lock, _watcher.stop() in Dependency::set_ready()
races with start_watcher() called inside is_blocked_by(): the latter
may observe _ready==false (it acquires _task_lock and reads _ready
before set_ready() can flip it) and re-start the stopwatch right after
set_ready() stopped it. After the race nothing stops the watcher again
and watcher_elapse_time() reported in close() reflects the operator's
entire lifetime instead of the actual blocked duration. This inflates
WaitForDependency[*]Time and WaitForRuntimeFilter counters, which in
production were observed to be ~12s while real per-RF wait times were
only tens of milliseconds.

Fix: stop the watcher inside the _task_lock critical section so it is
strictly mutually exclusive with start_watcher().
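
To make the interleaving in this commit message concrete, it can be replayed step by step on a single thread. The model below is purely illustrative: `ToyDependency`, `stop_watcher()`, `publish_ready()`, and the two replay functions are hypothetical names, the lock is omitted because the steps run sequentially, and the watcher is reduced to a boolean "running" flag.

```cpp
#include <cassert>

// Hypothetical single-threaded model of the state involved in the race.
struct ToyDependency {
    bool ready = false;
    bool watcher_running = false;

    // Models is_blocked_by(): check ready, then start the watcher.
    bool is_blocked_by() {
        if (ready) return false;
        watcher_running = true;
        return true;
    }
    void stop_watcher() { watcher_running = false; }
    void publish_ready() { ready = true; }
};

// Old order: stop() ran outside the lock, so is_blocked_by() could slip in
// between stop() and the publication of ready.
inline bool old_order_leaves_watcher_running() {
    ToyDependency dep;
    dep.is_blocked_by();  // first block starts the watcher
    dep.stop_watcher();   // set_ready(): stop, not yet ready
    dep.is_blocked_by();  // concurrent caller sees ready == false, restarts
    dep.publish_ready();  // set_ready(): ready published too late
    return dep.watcher_running;
}

// Fixed order: stop + publish form one critical section, so the concurrent
// is_blocked_by() can only run before both steps or after both steps.
inline bool fixed_order_leaves_watcher_running(bool checker_runs_first) {
    ToyDependency dep;
    dep.is_blocked_by();
    if (checker_runs_first) dep.is_blocked_by();
    dep.stop_watcher();
    dep.publish_ready();
    if (!checker_runs_first) dep.is_blocked_by();
    return dep.watcher_running;
}

// Self-check covering the racy replay and both legal fixed-order schedules.
inline void check_replays() {
    assert(old_order_leaves_watcher_running());
    assert(!fixed_order_leaves_watcher_running(true));
    assert(!fixed_order_leaves_watcher_running(false));
}
```

In the old-order replay the watcher is still running after ready is published, which corresponds to the elapsed time continuing to grow until the operator closes. In both fixed-order schedules it ends up stopped.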

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 13:14
@BiteTheDDDDt
Contributor Author

run buildall

@Thearas
Contributor

Thearas commented Apr 21, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Backports the fix from #56462 to prevent a race where a dependency watcher can be re-started between stop() and publishing _ready = true, inflating profile wait-time counters.

Changes:

  • Move _watcher.stop() under _task_lock in Dependency::set_ready().
  • Ensure watcher stop happens before setting _ready = true while holding the mutex.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.01% (19271/36352) |
| Line Coverage | 36.19% (179647/496422) |
| Region Coverage | 32.81% (139459/425110) |
| Branch Coverage | 33.73% (60443/179223) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 71.35% (25392/35586) |
| Line Coverage | 54.08% (268000/495536) |
| Region Coverage | 51.55% (221362/429386) |
| Branch Coverage | 53.04% (95403/179854) |

Contributor

@HappenLee HappenLee left a comment

LGTM

github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) on May 5, 2026
@github-actions
Contributor

github-actions Bot commented May 5, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions Bot commented May 5, 2026

PR approved by anyone and no changes requested.

Labels

approved (Indicates a PR has been approved by one committer.), reviewed
