Limit badger gc concurrency to 1 to avoid panic #14340
Conversation
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
It appears that pubsub, e.g. readSubscriberPosition and writeSubscriberPosition, needs to change too.
This pull request is now in conflicts. Could you fix it @carsonip? 🙏
@@ -558,7 +575,13 @@ func (p *Processor) Run() error {
	return nil
}

// subscriberPositionFileMutex protects the subscriber file from concurrent RW, in case of hot reload.
var subscriberPositionFileMutex sync.Mutex
[to reviewer] This ended up as a mutex over the subscriber file only, but not the subscriber goroutine. Although this means possibly duplicate work (e.g. searching in ES) during the overlap, any position written to the subscriber file is a position that has been processed. Running 2 subscriber goroutines concurrently does not present any correctness issues.
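For context, here is a minimal sketch of how such a file-level mutex could also cover the pubsub position helpers mentioned earlier (readSubscriberPosition / writeSubscriberPosition). The function signatures, JSON layout and file handling below are illustrative assumptions, not the actual apm-server implementation.

package sampling

import (
	"encoding/json"
	"os"
	"sync"
)

// subscriberPositionFileMutex protects the subscriber position file from
// concurrent reads/writes, e.g. when two processors overlap during a hot reload.
var subscriberPositionFileMutex sync.Mutex

// readSubscriberPosition loads the persisted subscriber position, returning an
// empty position if the file does not exist yet. (Illustrative only.)
func readSubscriberPosition(path string) (map[string]interface{}, error) {
	subscriberPositionFileMutex.Lock()
	defer subscriberPositionFileMutex.Unlock()
	pos := make(map[string]interface{})
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return pos, nil
	} else if err != nil {
		return nil, err
	}
	return pos, json.Unmarshal(data, &pos)
}

// writeSubscriberPosition persists the subscriber position; concurrent callers
// within the process are serialized by the mutex.
func writeSubscriberPosition(path string, pos map[string]interface{}) error {
	subscriberPositionFileMutex.Lock()
	defer subscriberPositionFileMutex.Unlock()
	data, err := json.Marshal(pos)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}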
select {
case <-p.stopping:
	return nil
case gcCh <- struct{}{}:
[to reviewer] Used a channel here instead of a sync.Mutex, just to avoid blocking the goroutine shutdown in case Stop is called. I cannot imagine a case where mu.Lock() would block the shutdown, but this errs on the safe side.
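To illustrate the channel-as-semaphore pattern being discussed, here is a simplified sketch. The package-level gcCh, the Processor fields and the GC loop are assumptions for illustration, not the exact code in this PR; only badger's RunValueLogGC and ErrNoRewrite come from the library itself.

package sampling

import (
	"time"

	"github.com/dgraph-io/badger/v2"
)

// gcCh has capacity 1 and is package-level, so at most one badger value log GC
// can run at a time even when two Processor instances overlap during a reload.
var gcCh = make(chan struct{}, 1)

type Processor struct {
	db       *badger.DB
	stopping chan struct{}
}

// runValueLogGCLoop periodically runs badger value log GC. Acquiring the token
// via select (rather than mu.Lock) keeps the loop responsive to Stop: if
// p.stopping is closed while waiting for the slot, the goroutine exits instead
// of blocking shutdown.
func (p *Processor) runValueLogGCLoop(interval time.Duration, discardRatio float64) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-p.stopping:
			return nil
		case <-ticker.C:
			select {
			case <-p.stopping:
				return nil
			case gcCh <- struct{}{}: // acquire the single GC slot
			}
			err := p.db.RunValueLogGC(discardRatio)
			<-gcCh // release the slot
			if err != nil && err != badger.ErrNoRewrite {
				return err
			}
		}
	}
}

With a plain sync.Mutex, the same shutdown path would block on Lock() until the other processor's GC finished.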
@@ -668,6 +669,31 @@ func TestStorageGC(t *testing.T) {
	t.Fatal("timed out waiting for value log garbage collection")
}

func TestStorageGCConcurrency(t *testing.T) {
[to reviewer] Added this test to reproduce the issue. I don't know a simpler way to reproduce it other than setting a short GC interval and sleeping for a second.
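Roughly what such a test could look like; newTestConfig, the StorageGCInterval field, and the constructor/Stop signatures below are hypothetical stand-ins for the real test helpers in this package.

package sampling_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"

	"github.com/elastic/apm-server/x-pack/apm-server/sampling"
)

func TestStorageGCConcurrency(t *testing.T) {
	// Use a very short GC interval so that both processors attempt value log
	// GC repeatedly while they are alive.
	config := newTestConfig(t) // hypothetical test helper
	config.StorageGCInterval = 10 * time.Millisecond

	// Start two processors sharing the same storage, mimicking the overlap
	// that happens during a hot reload.
	for i := 0; i < 2; i++ {
		p, err := sampling.NewProcessor(config)
		require.NoError(t, err)
		go p.Run()
		defer p.Stop(context.Background())
	}

	// Let the two GC loops overlap for a while; before the fix this panicked
	// inside badger, with GC serialized it should pass.
	time.Sleep(time.Second)
}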
This pull request is now in conflicts. Could you fix it @carsonip? 🙏
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make TBS processor concurrent-safe by protecting badger gc using a mutex. (cherry picked from commit 43e968f) Co-authored-by: Carson Ip <[email protected]>
Motivation/summary
Badger GC will panic when run concurrently. 2 TBS processors may run concurrently during a hot reload. Make the TBS processor concurrency-safe by protecting badger GC with a mutex.
Checklist
For functional changes, consider:
How to test these changes
See unit test
Alternatively, trigger a slow EA reload somehow (e.g. by modifying code locally) and observe that even though 2 TBS processors run concurrently, GC never runs concurrently.
Related issues
Fixes #14305