-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Use job group id in front end and driver queries #13995
Conversation
c7a014a
to
250a1f6
Compare
9266bf5
to
48a3233
Compare
48a3233
to
339931d
Compare
18f0c7a
to
1b37ed1
Compare
b858d07
to
5749bc6
Compare
5749bc6
to
b16b283
Compare
This is ready for review. Remember it's stacked on the last database PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! These changes feel very routine, I just had a handful of quite small questions and suggestions.
FROM batches | ||
INNER JOIN jobs ON batches.id = jobs.batch_id | ||
FROM job_groups | ||
LEFT JOIN batches ON batches.id = job_groups.batch_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is guaranteed to find a matching row in batches
right? Any reason it's a left join not an INNER JOIN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually try and use left joins rather than inner joins because the errors will be obvious if a row is not present when we expect it to be (NULL values will cause downstream errors). With an inner join, we could silently skip rows. But we do have all of the foreign key constraints so you're right. Let me know if you have a preference. Mine is not particularly strong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually try and use left joins rather than inner joins because the errors will be obvious if a row is not present when we expect it to be (NULL values will cause downstream errors)
This sounds very reasonable! I certainly errors and alerts to silent failures, can just keep this as is
if not batch['cancelled']: | ||
async for record in self.db.select_and_fetchall( | ||
if not job_group['cancelled']: | ||
async for record in self.db.select_and_fetchall( # FIXME: Do we need a different index? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an answer to this question? If not, it might be better to instead make a github issue that requires an investigation into this query plan and can be closed by confirming that no additional index is needed or by providing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I think we do need to add a new index and modify the last ORDER BY
clause to use the job group. Let's add that in #13985.
@@ -39,26 +40,27 @@ async def notify_batch_job_complete(db: Database, client_session: httpx.ClientSe | |||
job_groups_n_jobs_in_complete_states.n_succeeded, | |||
job_groups_n_jobs_in_complete_states.n_failed, | |||
job_groups_n_jobs_in_complete_states.n_cancelled | |||
FROM batches | |||
FROM job_groups | |||
LEFT JOIN batches ON job_groups.batch_id = batches.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here on the left vs inner join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, let's just keep as is
batch/batch/driver/job.py
Outdated
@@ -436,6 +439,7 @@ async def job_config(app, record, attempt_id): | |||
return { | |||
'batch_id': batch_id, | |||
'job_id': job_id, | |||
'job_group_id': job_group_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come the job_group_id
is both in this dict and in the job_spec
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's in the job_spec
field in this PR. I intended to pop that value from the job spec regardless as it could be relative in #14018. Maybe this change should come later once we want the job_group_id
to be available as an environment variable on the worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to remove this functionality for now from this PR as I have a different PR later on that puts the right job group id in the spec when creating jobs.
batch/batch/driver/job.py
Outdated
@@ -446,14 +450,15 @@ async def job_config(app, record, attempt_id): | |||
} | |||
|
|||
|
|||
async def mark_job_errored(app, batch_id, job_id, attempt_id, user, format_version, error_msg): | |||
async def mark_job_errored(app, batch_id, job_id, attempt_id, job_group_id, user, format_version, error_msg): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it's more consistent and therefore less error-prone to order these inputs batch_id -> job_group_id -> job_id -> attempt_id
batch/batch/driver/job.py
Outdated
format_version = BatchFormatVersion(record['format_version']) | ||
|
||
id = (batch_id, job_id) | ||
|
||
try: | ||
body = await job_config(app, record, attempt_id) | ||
body = await job_config(app, record, attempt_id, job_group_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since job_group_id
is already in record
, I don't think we need to add it as an extra input.
batch/batch/driver/main.py
Outdated
LEFT JOIN job_groups_cancelled ON batches.id = job_groups_cancelled.id | ||
WHERE batches.`state` = 'running' | ||
FROM jobs | ||
INNER JOIN job_groups ON job_groups.batch_id = jobs.batch_id AND job_groups.job_group_id = jobs.job_group_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come you swapped the order of the tables in the join? (batches join jobs vs jobs join job_groups). I'm not sure it makes any difference because it's an inner join but it stuck out to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason other than it could have been a rebase oversight as the conflicts were a bit gnarly at one point. What makes the most sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a mild preference for not making changes that don't really change anything, so let's just make this FROM job_groups
JSON_OBJECTAGG(resources.resource, quantity * GREATEST(COALESCE(rollup_time - start_time, 0), 0)) as resources | ||
FROM attempt_resources | ||
INNER JOIN attempts | ||
ON attempts.batch_id = attempt_resources.batch_id AND | ||
attempts.job_id = attempt_resources.job_id AND | ||
attempts.attempt_id = attempt_resources.attempt_id | ||
LEFT JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar question here as before, I would assume this would be an inner join. Shouldn't affect correctness but I don't know what the idiomatic thing to do here is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, we'll keep as is
''', | ||
(user, batch_id, update_id), | ||
(user, batch_id, ROOT_JOB_GROUP_ID, update_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding: will this always remain ROOT_JOB_GROUP_ID
because we only allow updates if the whole batch, i.e. the root job group, is not cancelled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct.
batch/test/test_invariants.py
Outdated
@@ -23,5 +23,5 @@ async def test_invariants(): | |||
|
|||
data = await retry_transient_errors(session.get_read_json, url, headers=headers) | |||
|
|||
assert data['check_incremental_error'] is None, data | |||
assert data['check_resource_aggregation_error'] is None, data | |||
assert data['check_incremental_error'] == 'None', data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come this changed to strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because an actual exception is not JSON serializable making this functionality useless for figuring out what went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha, nice catch! In that case, it's a little pedantic but I'd prefer to make more use of the type system and have an Optional[str]
where no exception does show up as None
and if there is an exception we serialize that as a string. The string 'None'
feels off to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple small changes left in the comments
7e514e2
to
853d949
Compare
Closing in favor of #14170 |
Stacked on #13985. This PR just threads job_group_id everywhere such as the canceller, scheduler, etc.