Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMP] runbot: avoid concurrent update when updating build status #692

Open
wants to merge 2 commits into
base: 16.0
Choose a base branch
from

Conversation

Xavier-Do
Copy link
Contributor

@Xavier-Do Xavier-Do commented Nov 30, 2022

A common issue on runbot.odoo.com are concurrent updates, mainly for build with many subuilds. This is mainly visible during the nightly builds, since there are a lot of builds spread on almost all host. Because of that, a build that should take a few seconds can take multiple minutes to be able to write the status. This phenomenon became more and more visible with the number of hosts in the infrastructure.

This is not an issue most of the time since Split config have a limited number of subuilds (~10) and they don't usually write status at the same time. They also are quite long build meaning that the concurrency is not so visible. This became a problem only with the nightly and multibuilds.

A common example to reproduce the issue is when trying to reproduce a random error: Using a custom config, a database is restored and a test that may last a few seconds is started. If the complete tests takes ~30 seconds, for some of the subuild it can take multiple minutes because of the scheduler retrying to write the result/state.

Example:

2023-03-01 23:00:03,155 ERROR odoo.sql_db: bad query: UPDATE "runbot_build" SET "global_state"='waiting',"write_date"='2023-03-01T23:00:02.682585'::timestamp,"write_uid"=1 WHERE id IN (24549289, 24549938, 24550022)
ERROR: could not serialize access due to concurrent update

The issue is that multiple sub-build finishing at the same time will write at the exact same time on the same record, creating a restart of the scheduler loop for almost all of them.

One of the cause of this issue is that global_state and global_result are store computed fields. In 15.0 the compute field will write on the record even if the value didn't change as soon as the compute was triggered. This fields needs to be stored
since it is displayed on the main page and it wouldn't be a good idea to fetch all sub-builds to have this information for display. This is also usefull to be able to search on this.

We don't really need and synchronous update of a global_state, especially if the update is done quite quickly. The first idea of this pull request is to give the responsibility of the global_state/global_result management to the parent build.

  • Only one host can write on a build, the builds update his own state by checking children states.
  • Not using a compute field will also help avoiding concurrent update since we won't write every time. This could be enough as a solution, since the concurrent update should only happen once, but we would like to avoid having any.

The main downside of this solution is that a build will need to check child states quite often.

  • if a runbot is down, or slow because of a docker image update, the state may take a while to be updated.
  • we need to read all children quite often.

The intermediate solution of not writing computed value every-time would be also enough. This means that we may have at most one concurrent update per child per change of result/state. When right now in the worst case it could be the factorial of number of children. But this solution revealed not being safe in all case because not triggering the update may cause inconsistency if the state could have been different with other childs state. See test_local_status_update demonstrating this issue. The last assertion was not working in the previous version: The two child will compute the parent state at the same time and think that the state is still waiting (other child looks testing).

The final chosen solution uses a queue to avoid reading to much on children.

@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 2 times, most recently from 8a1eb9a to c790750 Compare December 1, 2022 10:25
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 6 times, most recently from 64c167b to af7b527 Compare December 26, 2022 11:20
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 5 times, most recently from 11110ec to c2a53eb Compare March 2, 2023 09:01
@Xavier-Do Xavier-Do marked this pull request as ready for review March 2, 2023 09:03
@Xavier-Do Xavier-Do requested a review from d-fence March 2, 2023 09:03
@Xavier-Do Xavier-Do changed the title 15.0 less concurrent update xdo [IMP] runbot: avoid concurrent update when updating build status Mar 2, 2023
@C3POdoo C3POdoo requested a review from a team March 2, 2023 09:04
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 4 times, most recently from 96ca0ff to 2c2c41a Compare March 5, 2023 13:05
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 9 times, most recently from 21d3296 to b18d187 Compare March 17, 2023 10:02
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 21 times, most recently from 326db6f to 4958eac Compare March 23, 2023 11:41
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch 2 times, most recently from efb1515 to 83488f9 Compare April 11, 2023 11:30
One of the most problematic concurrency issue is when multiple children
tries to write the state/result on the parent concurrently.

Multiple partial solutions are possible here:
- avoid to write if the state doesn't changes
- avoid to write on a build belonging to another host

A maybe overkill solution would be to add a message queue for an host,
signaling that one of the child state changed.

An intermediate solution would be to let the host check the state of the
children while there are some of them, and update the local build state
assynchronously himself.

We can actualy use the 'waiting' global state to now if we need to
continue to check the build state and result.

While a build is not done or running, we need to check all children
result and state on case they were updated.

One corner case is when rebuilding a child: a new child is added
but the parent is maybe not in the 'waiting' global state anymore.
If this is the case, we need to recusivelly change the state of the
parents to waiting so that they will update again.
@Xavier-Do Xavier-Do force-pushed the 15.0-less-concurrent-update-xdo branch from 83488f9 to bbe3369 Compare April 11, 2023 11:31
@Xavier-Do Xavier-Do changed the base branch from 15.0 to 16.0 June 22, 2023 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant