Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Use job group id in front end and driver queries #13995

Closed
wants to merge 21 commits into from

Conversation

jigold
Copy link
Contributor

@jigold jigold commented Nov 9, 2023

Stacked on #13985. This PR just threads job_group_id everywhere such as the canceller, scheduler, etc.

@jigold jigold force-pushed the thread-job-group-server branch from b858d07 to 5749bc6 Compare November 30, 2023 17:38
@jigold jigold force-pushed the thread-job-group-server branch from 5749bc6 to b16b283 Compare November 30, 2023 18:25
@jigold
Copy link
Contributor Author

jigold commented Nov 30, 2023

This is ready for review. Remember it's stacked on the last database PR.

Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! These changes feel very routine, I just had a handful of quite small questions and suggestions.

FROM batches
INNER JOIN jobs ON batches.id = jobs.batch_id
FROM job_groups
LEFT JOIN batches ON batches.id = job_groups.batch_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This is guaranteed to find a matching row in batches right? Any reason it's a left join not an INNER JOIN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually try and use left joins rather than inner joins because the errors will be obvious if a row is not present when we expect it to be (NULL values will cause downstream errors). With an inner join, we could silently skip rows. But we do have all of the foreign key constraints so you're right. Let me know if you have a preference. Mine is not particularly strong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually try and use left joins rather than inner joins because the errors will be obvious if a row is not present when we expect it to be (NULL values will cause downstream errors)

This sounds very reasonable! I certainly errors and alerts to silent failures, can just keep this as is

if not batch['cancelled']:
async for record in self.db.select_and_fetchall(
if not job_group['cancelled']:
async for record in self.db.select_and_fetchall( # FIXME: Do we need a different index?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an answer to this question? If not, it might be better to instead make a github issue that requires an investigation into this query plan and can be closed by confirming that no additional index is needed or by providing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think we do need to add a new index and modify the last ORDER BY clause to use the job group. Let's add that in #13985.

@@ -39,26 +40,27 @@ async def notify_batch_job_complete(db: Database, client_session: httpx.ClientSe
job_groups_n_jobs_in_complete_states.n_succeeded,
job_groups_n_jobs_in_complete_states.n_failed,
job_groups_n_jobs_in_complete_states.n_cancelled
FROM batches
FROM job_groups
LEFT JOIN batches ON job_groups.batch_id = batches.id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here on the left vs inner join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, let's just keep as is

@@ -436,6 +439,7 @@ async def job_config(app, record, attempt_id):
return {
'batch_id': batch_id,
'job_id': job_id,
'job_group_id': job_group_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the job_group_id is both in this dict and in the job_spec field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's in the job_spec field in this PR. I intended to pop that value from the job spec regardless as it could be relative in #14018. Maybe this change should come later once we want the job_group_id to be available as an environment variable on the worker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to remove this functionality for now from this PR as I have a different PR later on that puts the right job group id in the spec when creating jobs.

@@ -446,14 +450,15 @@ async def job_config(app, record, attempt_id):
}


async def mark_job_errored(app, batch_id, job_id, attempt_id, user, format_version, error_msg):
async def mark_job_errored(app, batch_id, job_id, attempt_id, job_group_id, user, format_version, error_msg):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it's more consistent and therefore less error-prone to order these inputs batch_id -> job_group_id -> job_id -> attempt_id

format_version = BatchFormatVersion(record['format_version'])

id = (batch_id, job_id)

try:
body = await job_config(app, record, attempt_id)
body = await job_config(app, record, attempt_id, job_group_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since job_group_id is already in record, I don't think we need to add it as an extra input.

LEFT JOIN job_groups_cancelled ON batches.id = job_groups_cancelled.id
WHERE batches.`state` = 'running'
FROM jobs
INNER JOIN job_groups ON job_groups.batch_id = jobs.batch_id AND job_groups.job_group_id = jobs.job_group_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you swapped the order of the tables in the join? (batches join jobs vs jobs join job_groups). I'm not sure it makes any difference because it's an inner join but it stuck out to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason other than it could have been a rebase oversight as the conflicts were a bit gnarly at one point. What makes the most sense to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a mild preference for not making changes that don't really change anything, so let's just make this FROM job_groups

JSON_OBJECTAGG(resources.resource, quantity * GREATEST(COALESCE(rollup_time - start_time, 0), 0)) as resources
FROM attempt_resources
INNER JOIN attempts
ON attempts.batch_id = attempt_resources.batch_id AND
attempts.job_id = attempt_resources.job_id AND
attempts.attempt_id = attempt_resources.attempt_id
LEFT JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar question here as before, I would assume this would be an inner join. Shouldn't affect correctness but I don't know what the idiomatic thing to do here is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, we'll keep as is

''',
(user, batch_id, update_id),
(user, batch_id, ROOT_JOB_GROUP_ID, update_id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding: will this always remain ROOT_JOB_GROUP_ID because we only allow updates if the whole batch, i.e. the root job group, is not cancelled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct.

@@ -23,5 +23,5 @@ async def test_invariants():

data = await retry_transient_errors(session.get_read_json, url, headers=headers)

assert data['check_incremental_error'] is None, data
assert data['check_resource_aggregation_error'] is None, data
assert data['check_incremental_error'] == 'None', data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this changed to strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because an actual exception is not JSON serializable making this functionality useless for figuring out what went wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha, nice catch! In that case, it's a little pedantic but I'd prefer to make more use of the type system and have an Optional[str] where no exception does show up as None and if there is an exception we serialize that as a string. The string 'None' feels off to me.

Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple small changes left in the comments

@jigold jigold force-pushed the thread-job-group-server branch from 7e514e2 to 853d949 Compare January 12, 2024 16:13
@jigold
Copy link
Contributor Author

jigold commented Feb 1, 2024

Closing in favor of #14170

@jigold jigold closed this Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants