Block-builder: pull jobs from scheduler #10118
…der-scheduler-kafka-flush
- accidentally wasn't calling scheduler.assignJob in the RPC.
- Add logging for jobQueue operations.
- Experimentally tear into blockbuilder to begin consuming jobs from scheduler.
🔥 Overall, looks good to me. I left two small comments, but I don't have strong feelings about them.
// FIXME: I'm currently trying to understand why uncommenting this line causes tests to fail when all run together:
// t.Parallel()
I think the ReachHighWatermarkBeforeLastCycleSection test can be flaky because of the require.Eventually. Maybe consider bumping its waiting duration to 20s?
Or this could be a side effect of the panic in #10391; I originally noticed the panic when I was looking at this test.
You're onto something. When I run the entire file's tests at once with this t.Parallel uncommented, I get this test and a random other one failing with messages like:
Error Trace: /Users/davidgrant/dev/mimir/pkg/blockbuilder/blockbuilder_test.go:469
Error: Condition never satisfied
Test: TestBlockBuilder_ReachHighWatermarkBeforeLastCycleSection
Messages: expected kafka commits
I took a bunch of traces, and it turns out that when you crank up the parallelism, all file I/O and other syscalls get monumentally slowed down on OS X, but not on Linux. So I think this is fine for now.
Nice work! Only small comments.
For future: I see a lot of code duplication in the test code, both in new and old tests, mainly in setting up kafka, BB, and something around adding samples. We should be able to deduplicate a lot of it so that tests are easier to read.
if _, err := b.consumeJob(ctx, key, spec); err != nil {
	level.Error(b.logger).Log("msg", "failed to consume job", "job_id", key.Id, "epoch", key.Epoch, "err", err)
	continue
}
It would be nice to report back to the scheduler that the job failed. But I guess the scheduler will know when a job has not received an update for some (short) time?
Yeah, I think I recall discussing this in a design document. Scheduler will know when it has failed as it won't receive an update within X seconds. We can always enhance this by adding failure info to the UpdateJob RPC. Initially I'm just keeping it barebones.
What this PR does
This adds a "pull" mode to block-builder so that if it is configured with a scheduler at startup, it will live its life in pull-mode, obtaining and completing jobs from a block-builder-scheduler service.
Pull-mode tests are largely duplicated from existing tests. This is temporary as the prior tests will be deleted once we're relying on the scheduler.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.