[IMP] runbot: avoid concurrent update when updating build status #692
A common issue on runbot.odoo.com is concurrent updates, mainly for builds with many sub-builds. This is mostly visible during the nightly builds, since a lot of builds are then spread across almost all hosts. Because of that, a build that should take a few seconds can take multiple minutes before it manages to write its status. This phenomenon became more and more visible as the number of hosts in the infrastructure grew.
This is not an issue most of the time, since Split configs have a limited number of sub-builds (~10) that usually don't write their status at the same time. They are also quite long builds, so the concurrency is less visible. It became a problem only with the nightly and multibuilds.
A common way to reproduce the issue is to try to reproduce a random error: using a custom config, a database is restored and a test that may last a few seconds is started. If the complete test takes ~30 seconds, some of the sub-builds can take multiple minutes because the scheduler keeps retrying to write the result/state.
Example:
The issue is that multiple sub-builds finishing at the same time will write to the same record at the exact same time, causing a restart of the scheduler loop for almost all of them.
One of the causes of this issue is that global_state and global_result are stored computed fields. In 15.0, a computed field writes on the record as soon as the compute is triggered, even if the value didn't change. These fields need to be stored since they are displayed on the main page, and it wouldn't be a good idea to fetch all sub-builds to get this information for display. Storing them is also useful to be able to search on them.
We don't really need a synchronous update of global_state, especially if the update is done quite quickly. The first idea of this pull request is to give the responsibility for global_state/global_result management to the parent build.
The main downside of this solution is that a build will need to check child states quite often.
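As an illustration of the parent-side approach described above, here is a minimal sketch in plain Python (no Odoo ORM). The `Build` class, the state and result names, and `refresh_global` are hypothetical and only loosely modelled on the description; this is not the runbot implementation. The parent derives its global fields from its children and only records a change when the value actually differs.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names: states in which a build is considered unfinished.
UNFINISHED = {"pending", "testing", "waiting"}
# Hypothetical result severity, from best to worst.
RESULT_ORDER = ["ok", "warn", "ko"]


@dataclass
class Build:
    local_state: str = "pending"   # state of this build's own steps
    local_result: str = "ok"
    global_state: str = "pending"  # state including all children
    global_result: str = "ok"
    children: List["Build"] = field(default_factory=list)


def refresh_global(build: Build) -> bool:
    """Recompute global fields from children; return True if they changed."""
    for child in build.children:
        refresh_global(child)

    state = build.local_state
    # A build that finished its own steps keeps waiting for its children.
    if state in ("running", "done") and any(
        child.global_state in UNFINISHED for child in build.children
    ):
        state = "waiting"

    # The global result is the worst result among the build and its children.
    results = [build.local_result] + [c.global_result for c in build.children]
    result = max(results, key=RESULT_ORDER.index)

    changed = (state, result) != (build.global_state, build.global_result)
    if changed:  # only "write" when something actually changed
        build.global_state = state
        build.global_result = result
    return changed
```

In this sketch only the parent ever assigns its own global fields, so children never compete to write on the parent. The cost is the one named above: the parent has to call `refresh_global` regularly.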
The intermediate solution of not writing the computed value every time would also reduce the problem: there would be at most one concurrent update per child per change of result/state, whereas right now, in the worst case, it could be the factorial of the number of children. But this solution turned out not to be safe in all cases, because not triggering the update may cause an inconsistency if the state would have been different given the other children's states. See `test_local_status_update`, which demonstrates this issue. The last assertion did not pass in the previous version: the two children compute the parent state at the same time and both think that the state is still waiting (the other child still looks like it is testing).
The final chosen solution uses a queue to avoid reading too much on children.
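The race and the queue idea can be simulated in a few lines of plain Python. This is a hedged sketch, not the code of this pull request: `Parent`, `racy_finish`, `queued_finish` and `process_queue` are hypothetical names, and the "transactions" are simulated by taking a snapshot of the child states before anything is committed.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Parent:
    child_states: Dict[str, str]
    global_state: str = "waiting"
    # Hypothetical append-only queue of "child changed" notifications.
    queue: Deque[str] = field(default_factory=deque)


def state_from(child_states: Dict[str, str]) -> str:
    """Parent is done only when every child is done."""
    done = all(state == "done" for state in child_states.values())
    return "done" if done else "waiting"


def racy_finish(parent: Parent, finishing: List[str]) -> str:
    """Children finish concurrently: each sees only its own change."""
    snapshot = dict(parent.child_states)  # read before any commit is visible
    computed = []
    for name in finishing:
        view = dict(snapshot, **{name: "done"})
        computed.append(state_from(view))
    for name in finishing:                # all commits land afterwards
        parent.child_states[name] = "done"
    parent.global_state = computed[-1]    # last writer wins, with a stale value
    return parent.global_state


def queued_finish(parent: Parent, finishing: List[str]) -> None:
    """Children only update themselves and enqueue a notification."""
    for name in finishing:
        parent.child_states[name] = "done"
        parent.queue.append(name)         # no write on the parent state


def process_queue(parent: Parent) -> bool:
    """Run by the parent alone: recompute once from committed child states."""
    if not parent.queue:
        return False                      # nothing changed, no children read
    parent.queue.clear()
    parent.global_state = state_from(parent.child_states)
    return True
```

With two children finishing together, `racy_finish` leaves the parent in `waiting` even though both children are done, which is the inconsistency described above. With the queue, the parent reads its children only when a notification is pending and computes the state from committed values.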