Block-builder: pull jobs from scheduler #10118

Merged: 79 commits, Jan 18, 2025

Conversation

seizethedave
Contributor

@seizethedave seizethedave commented Dec 4, 2024

What this PR does

This adds a "pull" mode to block-builder so that if it is configured with a scheduler at startup, it will live its life in pull-mode, obtaining and completing jobs from a block-builder-scheduler service.

Pull-mode tests are largely duplicated from existing tests. This is temporary as the prior tests will be deleted once we're relying on the scheduler.
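
To illustrate the pull-mode lifecycle described above, here is a minimal sketch of a loop that obtains a job, consumes it, and reports completion. It is not the code from this PR: `SchedulerClient`, `GetJob`, `CompleteJob`, `JobKey`, `JobSpec`, and `fakeScheduler` are hypothetical names chosen for the example, and the real block-builder-scheduler client may look quite different.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// JobKey and JobSpec are hypothetical stand-ins for the scheduler's job types.
type JobKey struct {
	ID    string
	Epoch int64
}

type JobSpec struct {
	Topic string
}

// SchedulerClient is an assumed interface, not the real scheduler client.
type SchedulerClient interface {
	GetJob(ctx context.Context) (JobKey, JobSpec, error)
	CompleteJob(key JobKey) error
}

// runPullMode pulls jobs until the context is cancelled or GetJob fails.
// A job that fails to consume is logged and skipped; nothing is reported
// back to the scheduler, which is assumed to notice the missing update.
func runPullMode(ctx context.Context, s SchedulerClient, consume func(context.Context, JobKey, JobSpec) error) int {
	completed := 0
	for ctx.Err() == nil {
		key, spec, err := s.GetJob(ctx)
		if err != nil {
			return completed
		}
		if err := consume(ctx, key, spec); err != nil {
			fmt.Printf("failed to consume job %s (epoch %d): %v\n", key.ID, key.Epoch, err)
			continue
		}
		if err := s.CompleteJob(key); err != nil {
			fmt.Printf("failed to complete job %s: %v\n", key.ID, err)
			continue
		}
		completed++
	}
	return completed
}

// fakeScheduler hands out a fixed list of jobs and records completions.
type fakeScheduler struct {
	pending []JobKey
	done    []JobKey
}

func (f *fakeScheduler) GetJob(_ context.Context) (JobKey, JobSpec, error) {
	if len(f.pending) == 0 {
		return JobKey{}, JobSpec{}, errors.New("no more jobs")
	}
	key := f.pending[0]
	f.pending = f.pending[1:]
	return key, JobSpec{Topic: "example-topic"}, nil
}

func (f *fakeScheduler) CompleteJob(key JobKey) error {
	f.done = append(f.done, key)
	return nil
}

// runDemo feeds two jobs through the loop; the first one fails to consume.
func runDemo() (int, []JobKey) {
	s := &fakeScheduler{pending: []JobKey{{ID: "job-1", Epoch: 1}, {ID: "job-2", Epoch: 1}}}
	consume := func(_ context.Context, key JobKey, _ JobSpec) error {
		if key.ID == "job-1" {
			return errors.New("simulated failure")
		}
		return nil
	}
	n := runPullMode(context.Background(), s, consume)
	return n, s.done
}

func main() {
	n, done := runDemo()
	fmt.Println("completed:", n, "jobs:", done)
}
```

In this sketch the failed job is never completed, so only the second job reaches the scheduler's completion list.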

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

- accidentally wasn't calling scheduler.assignJob in the RPC.
- Add logging for jobQueue operations.
- Experimentally tear into blockbuilder to begin consuming jobs from scheduler.
@seizethedave seizethedave marked this pull request as ready for review January 3, 2025 00:48
@seizethedave seizethedave requested a review from a team as a code owner January 3, 2025 00:48
Contributor

@narqo narqo left a comment

🔥 Overall, looks good to me. I left two small comments, but I don't have strong feelings about them.

pkg/blockbuilder/blockbuilder.go (outdated, resolved)
Comment on lines 491 to 492
// FIXME: I'm currently trying to understand why uncommenting this line causes tests to fail when all run together:
// t.Parallel()
Contributor

I think the ReachHighWatermarkBeforeLastCycleSection test can be flaky because of the require.Eventually. Maybe consider bumping its waiting duration to 20s?

Or this could be a side effect of the panic in #10391. I originally noticed the panic when I was looking at this test.
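
For context on why a longer wait can help, here is a minimal sketch of how a polling assertion behaves. The `eventually` helper below is a simplified stand-in written for this example using only the standard library; it is not testify's implementation of `require.Eventually`, although it takes the same `waitFor` and `tick` durations.

```go
package main

import (
	"fmt"
	"time"
)

// eventually polls condition every tick until it returns true or waitFor
// has elapsed. It reports whether the condition was ever satisfied.
func eventually(condition func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if condition() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// trueAfter returns a condition that becomes true on its nth call,
// standing in for work that needs several polling rounds to finish.
func trueAfter(n int) func() bool {
	calls := 0
	return func() bool {
		calls++
		return calls >= n
	}
}

// demoEventually checks the same kind of condition with a tiny wait
// budget and with a generous one.
func demoEventually() (short, long bool) {
	short = eventually(trueAfter(5), 0, time.Millisecond)
	long = eventually(trueAfter(5), 2*time.Second, time.Millisecond)
	return short, long
}

func main() {
	short, long := demoEventually()
	fmt.Println("short budget satisfied:", short)
	fmt.Println("long budget satisfied:", long)
}
```

The condition itself is identical in both calls; only the wait budget decides whether the check reports "Condition never satisfied", which is why slow I/O under heavy parallelism can turn a passing test into a failing one.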

Contributor Author

You're onto something. When I run the entire file's tests at once with this t.Parallel uncommented, I get this test and a random other one failing with messages like:

        	Error Trace:	/Users/davidgrant/dev/mimir/pkg/blockbuilder/blockbuilder_test.go:469
        	Error:      	Condition never satisfied
        	Test:       	TestBlockBuilder_ReachHighWatermarkBeforeLastCycleSection
        	Messages:   	expected kafka commits

I took a bunch of traces, and it turns out that when you crank up the parallelism, all file I/O and other syscalls get monumentally slowed down on OS X, but not on Linux. So I think this is fine for now.

Member

@codesome codesome left a comment

Nice work! Only small comments.

For the future: I see a lot of code duplication in the test code, in both new and old tests, mainly in setting up Kafka and BB and in the code that adds samples. We should be able to deduplicate a lot of it so that the tests are easier to read.

Comment on lines +199 to +202
    if _, err := b.consumeJob(ctx, key, spec); err != nil {
        level.Error(b.logger).Log("msg", "failed to consume job", "job_id", key.Id, "epoch", key.Epoch, "err", err)
        continue
    }
Member

It would be nice to report back to the scheduler that the job failed. But I guess the scheduler will know when a job has not received an update for some (short) time?

Contributor Author

Yeah, I think I recall discussing this in a design document. Scheduler will know when it has failed as it won't receive an update within X seconds. We can always enhance this by adding failure info to the UpdateJob RPC. Initially I'm just keeping it barebones.
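
The timeout-based detection described here could look roughly like the sketch below. This is an assumption about how a scheduler might track liveness, not the scheduler's actual code: `jobLeases`, `update`, and `expired` are hypothetical names, and the 10-second timeout is an arbitrary value standing in for the "X seconds" mentioned above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// jobLeases is a hypothetical scheduler-side structure that records when
// each assigned job last received an update from its worker.
type jobLeases struct {
	timeout    time.Duration
	lastUpdate map[string]time.Time
}

func newJobLeases(timeout time.Duration) *jobLeases {
	return &jobLeases{timeout: timeout, lastUpdate: make(map[string]time.Time)}
}

// update records that jobID was heard from at time now.
func (l *jobLeases) update(jobID string, now time.Time) {
	l.lastUpdate[jobID] = now
}

// expired returns, in sorted order, the IDs of jobs whose last update is
// older than the timeout, and forgets them so they could be reassigned.
func (l *jobLeases) expired(now time.Time) []string {
	var out []string
	for id, last := range l.lastUpdate {
		if now.Sub(last) > l.timeout {
			out = append(out, id)
			delete(l.lastUpdate, id)
		}
	}
	sort.Strings(out)
	return out
}

// demoLeases assigns two jobs, refreshes only one of them, and then asks
// which leases have expired 12 seconds after assignment.
func demoLeases() []string {
	start := time.Unix(0, 0)
	leases := newJobLeases(10 * time.Second)
	leases.update("job-a", start)
	leases.update("job-b", start)
	leases.update("job-a", start.Add(8*time.Second)) // job-a is still making progress
	return leases.expired(start.Add(12 * time.Second))
}

func main() {
	fmt.Println("expired jobs:", demoLeases())
}
```

Passing the current time in explicitly keeps the sketch deterministic. The worker never has to report failure; a job that stops sending updates simply ages out.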

pkg/blockbuilder/config_test.go (outdated, resolved)
pkg/blockbuilder/blockbuilder_test.go (resolved)
@seizethedave seizethedave merged commit 49bc44d into main Jan 18, 2025
28 checks passed
@seizethedave seizethedave deleted the davidgrant/block-builder-pull-mode branch January 18, 2025 00:13
3 participants