[IMP] runbot: avoid concurrent update when updating build status #692

Xavier-Do · 2022-11-30T11:17:41Z

A common issue on runbot.odoo.com are concurrent updates, mainly for build with many subuilds. This is mainly visible during the nightly builds, since there are a lot of builds spread on almost all host. Because of that, a build that should take a few seconds can take multiple minutes to be able to write the status. This phenomenon became more and more visible with the number of hosts in the infrastructure.

This is not an issue most of the time since Split config have a limited number of subuilds (~10) and they don't usually write status at the same time. They also are quite long build meaning that the concurrency is not so visible. This became a problem only with the nightly and multibuilds.

A common example to reproduce the issue is when trying to reproduce a random error: Using a custom config, a database is restored and a test that may last a few seconds is started. If the complete tests takes ~30 seconds, for some of the subuild it can take multiple minutes because of the scheduler retrying to write the result/state.

Example:

2023-03-01 23:00:03,155 ERROR odoo.sql_db: bad query: UPDATE "runbot_build" SET "global_state"='waiting',"write_date"='2023-03-01T23:00:02.682585'::timestamp,"write_uid"=1 WHERE id IN (24549289, 24549938, 24550022)
ERROR: could not serialize access due to concurrent update

The issue is that multiple sub-build finishing at the same time will write at the exact same time on the same record, creating a restart of the scheduler loop for almost all of them.

One of the cause of this issue is that global_state and global_result are store computed fields. In 15.0 the compute field will write on the record even if the value didn't change as soon as the compute was triggered. This fields needs to be stored
since it is displayed on the main page and it wouldn't be a good idea to fetch all sub-builds to have this information for display. This is also usefull to be able to search on this.

We don't really need and synchronous update of a global_state, especially if the update is done quite quickly. The first idea of this pull request is to give the responsibility of the global_state/global_result management to the parent build.

Only one host can write on a build, the builds update his own state by checking children states.
Not using a compute field will also help avoiding concurrent update since we won't write every time. This could be enough as a solution, since the concurrent update should only happen once, but we would like to avoid having any.

The main downside of this solution is that a build will need to check child states quite often.

if a runbot is down, or slow because of a docker image update, the state may take a while to be updated.
we need to read all children quite often.

The intermediate solution of not writing computed value every-time would be also enough. This means that we may have at most one concurrent update per child per change of result/state. When right now in the worst case it could be the factorial of number of children. But this solution revealed not being safe in all case because not triggering the update may cause inconsistency if the state could have been different with other childs state. See test_local_status_update demonstrating this issue. The last assertion was not working in the previous version: The two child will compute the parent state at the same time and think that the state is still waiting (other child looks testing).

The final chosen solution uses a queue to avoid reading to much on children.

One of the most problematic concurrency issue is when multiple children tries to write the state/result on the parent concurrently. Multiple partial solutions are possible here: - avoid to write if the state doesn't changes - avoid to write on a build belonging to another host A maybe overkill solution would be to add a message queue for an host, signaling that one of the child state changed. An intermediate solution would be to let the host check the state of the children while there are some of them, and update the local build state assynchronously himself. We can actualy use the 'waiting' global state to now if we need to continue to check the build state and result. While a build is not done or running, we need to check all children result and state on case they were updated. One corner case is when rebuilding a child: a new child is added but the parent is maybe not in the 'waiting' global state anymore. If this is the case, we need to recusivelly change the state of the parents to waiting so that they will update again.

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 2 times, most recently from 8a1eb9a to c790750 Compare December 1, 2022 10:25

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 6 times, most recently from 64c167b to af7b527 Compare December 26, 2022 11:20

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 5 times, most recently from 11110ec to c2a53eb Compare March 2, 2023 09:01

Xavier-Do marked this pull request as ready for review March 2, 2023 09:03

Xavier-Do requested a review from d-fence March 2, 2023 09:03

Xavier-Do changed the title ~~15.0 less concurrent update xdo~~ [IMP] runbot: avoid concurrent update when updating build status Mar 2, 2023

C3POdoo requested a review from a team March 2, 2023 09:04

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 4 times, most recently from 96ca0ff to 2c2c41a Compare March 5, 2023 13:05

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 9 times, most recently from 21d3296 to b18d187 Compare March 17, 2023 10:02

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 21 times, most recently from 326db6f to 4958eac Compare March 23, 2023 11:41

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 2 times, most recently from efb1515 to 83488f9 Compare April 11, 2023 11:30

Xavier-Do added 2 commits April 11, 2023 13:31

[IMP] runbot: add test for build concurency

aaade72

Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch from 83488f9 to bbe3369 Compare April 11, 2023 11:31

Xavier-Do changed the base branch from 15.0 to 16.0 June 22, 2023 13:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[IMP] runbot: avoid concurrent update when updating build status #692

[IMP] runbot: avoid concurrent update when updating build status #692

Xavier-Do commented Nov 30, 2022 •

edited

Loading

[IMP] runbot: avoid concurrent update when updating build status #692

Are you sure you want to change the base?

[IMP] runbot: avoid concurrent update when updating build status #692

Conversation

Xavier-Do commented Nov 30, 2022 • edited Loading

Xavier-Do commented Nov 30, 2022 •

edited

Loading