Limit badger gc concurrency to 1 to avoid panic #14340

Merged: 10 commits merged into elastic:main on Oct 21, 2024

Conversation

@carsonip (Member) commented Oct 11, 2024

Motivation/summary

Badger GC will panic when run concurrently, and 2 TBS processors may run concurrently during a hot reload. Make the TBS processor concurrency-safe by protecting badger GC with a mutex.
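For illustration only, a minimal Go sketch of the idea; the mutex name, the wrapper function, and the badger/v2 import are assumptions, and the PR's final code uses a capacity-1 channel instead (see the review comments below). Only badger's DB.RunValueLogGC is real API; it allows a single GC at a time, which is why the call must be serialized.

// Hypothetical sketch, not the PR's code: serialize badger value-log GC
// across all TBS processor instances behind one package-level mutex.
// Assumes: import ("sync"; badger "github.com/dgraph-io/badger/v2")
var gcMu sync.Mutex

func runValueLogGC(db *badger.DB, discardRatio float64) error {
	gcMu.Lock()
	defer gcMu.Unlock()
	// badger allows only one value-log GC at a time; concurrent requests are rejected.
	return db.RunValueLogGC(discardRatio)
}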

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

See unit test

Alternatively, trigger a slow EA reload somehow (e.g. by modifying code locally) and observe that, even though 2 TBS processors are running concurrently, GC never runs concurrently.

Related issues

Fixes #14305

mergify bot (Contributor) commented Oct 11, 2024

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8.<d> is the label to automatically backport to the 8.<d> branch, where <d> is the minor version digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

mergify bot (Contributor) commented Oct 11, 2024

backport-8.x has been added to help with the transition to the new 8.x branch.
If you don't need it, please use the backport-skip label.

@mergify (bot) added the backport-8.x label (Automated backport to the 8.x branch with mergify) on Oct 11, 2024
@carsonip (Member, Author) left a comment:

It appears that pubsub, e.g. readSubscriberPosition and writeSubscriberPosition, needs to change too.

mergify bot (Contributor) commented Oct 14, 2024

This pull request is now in conflict. Could you fix it @carsonip? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-tbs-in-hot-reload upstream/fix-tbs-in-hot-reload
git merge upstream/main
git push upstream fix-tbs-in-hot-reload

@@ -558,7 +575,13 @@ func (p *Processor) Run() error {
return nil
}

// subscriberPositionFileMutex protects the subscriber file from concurrent RW, in case of hot reload.
var subscriberPositionFileMutex sync.Mutex
@carsonip (Member, Author) commented:

[to reviewer] This ended up as a mutex over the subscriber file only, not the subscriber goroutine. Although this means possibly duplicated work (e.g. searching in ES) during the overlap, any position written to the subscriber file is a position that has been processed. Running 2 subscriber goroutines concurrently does not present any correctness issues.
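For context, a rough sketch (not the PR's actual code) of how the two file helpers might take the mutex; the signatures, the plain os calls, and the 0o644 mode are illustrative assumptions:

// Assumes: import ("os"; "sync")
// subscriberPositionFileMutex protects the subscriber position file from
// concurrent read/write while two processors overlap during a hot reload.
var subscriberPositionFileMutex sync.Mutex

func readSubscriberPosition(path string) ([]byte, error) {
	subscriberPositionFileMutex.Lock()
	defer subscriberPositionFileMutex.Unlock()
	return os.ReadFile(path)
}

func writeSubscriberPosition(path string, pos []byte) error {
	subscriberPositionFileMutex.Lock()
	defer subscriberPositionFileMutex.Unlock()
	return os.WriteFile(path, pos, 0o644)
}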

select {
case <-p.stopping:
	return nil
case gcCh <- struct{}{}:
@carsonip (Member, Author) commented:

[to reviewer] Used a channel here instead of a sync.Mutex, just to avoid blocking goroutine shutdown in case Stop is called. I cannot imagine a case where mu.Lock() would block the shutdown, but this errs on the safe side.
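In other words, the channel acts as a try-able mutex. A minimal sketch of the pattern follows; treating gcCh as a shared package-level channel, and the db handle and 0.5 discard ratio, are assumptions for illustration:

// Capacity 1: at most one value-log GC across all processor instances.
var gcCh = make(chan struct{}, 1)

select {
case <-p.stopping:
	return nil // shutting down: give up instead of waiting for the other GC
case gcCh <- struct{}{}:
	// acquired: no other TBS processor is running value-log GC
}
if err := db.RunValueLogGC(0.5); err != nil { // placeholder call and discard ratio
	// ErrNoRewrite / ErrRejected handling omitted in this sketch
}
<-gcCh // release so a later GC attempt can proceed

Unlike mu.Lock(), the select can be abandoned the moment p.stopping is closed, so a long-running GC in the other processor cannot delay shutdown.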

@carsonip changed the title from "Add mutex to avoid concurrent gc during hot reload" to "Forbid concurrent badger gc runs" on Oct 16, 2024
@carsonip carsonip marked this pull request as ready for review October 16, 2024 16:45
@carsonip carsonip requested a review from a team as a code owner October 16, 2024 16:45
@carsonip carsonip requested review from axw and 1pkg October 16, 2024 16:45
rubvs previously approved these changes Oct 16, 2024
1pkg previously approved these changes Oct 16, 2024
@carsonip added the backport-8.16 label (Automated backport with mergify) on Oct 17, 2024
@carsonip carsonip dismissed stale reviews from 1pkg and rubvs via b744adf October 17, 2024 16:25
@carsonip carsonip requested review from rubvs and 1pkg October 17, 2024 16:26
@@ -668,6 +669,31 @@ func TestStorageGC(t *testing.T) {
t.Fatal("timed out waiting for value log garbage collection")
}

func TestStorageGCConcurrency(t *testing.T) {
@carsonip (Member, Author) commented:

[to reviewer] Added this test to reproduce the issue. I don't know of a simpler way to reproduce it than setting a short GC interval and sleeping for a second.
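Not the actual TestStorageGCConcurrency code, but a sketch of the shape such a check can take: two goroutines (standing in for two TBS processors) contend for the capacity-1 channel, and an atomic counter asserts that the GC critical section is never entered twice. Assumes imports sync, sync/atomic, testing, and time; all names are hypothetical.

func TestGCGuardSketch(t *testing.T) {
	var inGC int32
	gcCh := make(chan struct{}, 1)
	gc := func() {
		gcCh <- struct{}{}        // acquire
		defer func() { <-gcCh }() // release (runs after the decrement below)
		if atomic.AddInt32(&inGC, 1) != 1 {
			t.Error("value-log GC entered concurrently")
		}
		time.Sleep(10 * time.Millisecond) // stand-in for badger value-log GC work
		atomic.AddInt32(&inGC, -1)
	}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two goroutines ~ two processors during hot reload
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 5; j++ {
				gc()
			}
		}()
	}
	wg.Wait()
}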

1pkg previously approved these changes Oct 18, 2024
mergify bot (Contributor) commented Oct 21, 2024

This pull request is now in conflict. Could you fix it @carsonip? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-tbs-in-hot-reload upstream/fix-tbs-in-hot-reload
git merge upstream/main
git push upstream fix-tbs-in-hot-reload

@carsonip carsonip requested a review from 1pkg October 21, 2024 08:06
@carsonip carsonip enabled auto-merge (squash) October 21, 2024 08:07
@carsonip changed the title from "Forbid concurrent badger gc runs" to "Limit badger gc concurrency to 1 even when 2 TBS processors are active" on Oct 21, 2024
@carsonip changed the title from "Limit badger gc concurrency to 1 even when 2 TBS processors are active" to "Limit badger gc concurrency to 1 to avoid panic" on Oct 21, 2024
@carsonip carsonip enabled auto-merge (squash) October 21, 2024 08:10
@carsonip carsonip merged commit 43e968f into elastic:main Oct 21, 2024
15 checks passed
mergify bot pushed a commit that referenced this pull request Oct 21, 2024
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make TBS processor concurrent-safe by protecting badger gc using a mutex.

(cherry picked from commit 43e968f)
mergify bot pushed a commit that referenced this pull request Oct 21, 2024
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make TBS processor concurrent-safe by protecting badger gc using a mutex.

(cherry picked from commit 43e968f)
mergify bot added a commit that referenced this pull request Oct 21, 2024
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make TBS processor concurrent-safe by protecting badger gc using a mutex.

(cherry picked from commit 43e968f)

Co-authored-by: Carson Ip <[email protected]>
mergify bot added a commit that referenced this pull request Oct 21, 2024
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make TBS processor concurrent-safe by protecting badger gc using a mutex.

(cherry picked from commit 43e968f)

Co-authored-by: Carson Ip <[email protected]>
Labels
backport-8.x (Automated backport to the 8.x branch with mergify), backport-8.16 (Automated backport with mergify)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TBS: "Value log GC request rejected" error
4 participants