This repository has been archived by the owner on Jul 25, 2024. It is now read-only.

ci/tasks.py: offload testjob post processing to its own task #1115

Merged
merged 1 commit into from
Dec 12, 2023

Conversation

chaws
Collaborator

@chaws chaws commented Dec 7, 2023

The reason for having this is for deployments of SQUAD on auto-scalable systems such as Kubernetes. When the load in SQUAD is high, Kubernetes creates new replicas of workers to consume from the queue.

When the load is back to low, Kubernetes starts trimming workers no longer being used. There is a very specific corner case with this approach though.

When Kubernetes trims a worker, it sends SIGTERM to it and waits 30s by default for the worker to terminate on its own. In Linaro's deployment of SQUAD, there is a particular kind of test job that comes from Android CTS/VTS. These jobs are huge and take a lot more than 30s to finish. If the worker has not finished by the 30s mark, Kubernetes sends SIGKILL to it and it dies abruptly, causing inconsistencies.

Yes, we can increase the 30s timeout, but if SQUAD is under heavy load, a worker may still fail to terminate within the longer timeout, which causes the same inconsistency.
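The timing argument above can be written down as a small model. This is only an illustrative sketch: the function name and the job durations are made up for the example, and the 30s default is the grace period described above.

```python
def shutdown_outcome(remaining_job_seconds, grace_period_seconds=30):
    """Model a worker that receives SIGTERM while a job is still running.

    Hypothetical helper for illustration only, not part of SQUAD.
    The worker exits cleanly if the job ends within the grace period;
    otherwise it is assumed to be SIGKILLed mid-job.
    """
    if remaining_job_seconds <= grace_period_seconds:
        return "finished"
    return "killed"


# A short job fits in the default grace period.
print(shutdown_outcome(10))
# A long post-processing job (duration invented for the example) does not.
print(shutdown_outcome(600))
# Raising the grace period helps only while jobs stay below it.
print(shutdown_outcome(600, grace_period_seconds=300))
```

Whatever grace period is chosen, a job that runs longer than it is still killed, which is why a longer timeout alone does not remove the corner case.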

The solution for this problem is the creation of a new queue called 'ci_fetch_postprocess'. Deployments under heavy load should then create a separate kind of worker for this queue that is never scaled down and does not auto-scale, thus eliminating the problem completely.

Tasks in 'ci_fetch_postprocess' are the plugin tasks, which are the cause of the issue.
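As a rough sketch of how such a split could look, the table below has the shape of Celery's `task_routes` setting. Only the queue name 'ci_fetch_postprocess' comes from this PR; the task names, the 'ci_fetch' queue name, and the `queue_for` helper are assumptions made for illustration and may not match SQUAD's code.

```python
# Routing table in the shape of Celery's `task_routes` setting.
# Task names and the 'ci_fetch' queue are hypothetical placeholders.
TASK_ROUTES = {
    "squad.ci.tasks.fetch": {"queue": "ci_fetch"},
    "squad.ci.tasks.postprocess": {"queue": "ci_fetch_postprocess"},
}

# Celery publishes to a queue named "celery" when no route matches.
DEFAULT_QUEUE = "celery"


def queue_for(task_name, routes=TASK_ROUTES, default=DEFAULT_QUEUE):
    """Return the queue a task would be sent to under this routing table.

    Hypothetical helper that only mimics an exact-name lookup.
    """
    return routes.get(task_name, {}).get("queue", default)


print(queue_for("squad.ci.tasks.fetch"))
print(queue_for("squad.ci.tasks.postprocess"))
```

With a split like this, the auto-scaled workers would consume only the short-lived queues, while a fixed-size worker that is never scaled down consumes 'ci_fetch_postprocess', for example by passing that queue name to the Celery worker's `-Q` option.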

@chaws chaws merged commit 061ec99 into Linaro:master Dec 12, 2023
7 checks passed
@chaws chaws deleted the fix-postprocess-sync branch December 12, 2023 14:46
@chaws chaws mentioned this pull request Dec 20, 2023