ci/tasks.py: offload testjob post processing to its own task #1115

chaws · 2023-12-07T14:32:19Z

This change targets deployments of SQUAD on auto-scaling systems such as Kubernetes. When the load on SQUAD is high, Kubernetes creates new worker replicas to consume from the queue.

When the load drops again, Kubernetes starts trimming workers that are no longer in use. This approach has a very specific corner case, though.

When Kubernetes trims a worker, it sends SIGTERM to it and waits 30s by default for the worker to terminate on its own. In Linaro's deployment of SQUAD, there is a particular kind of test job that comes from Android CTS/VTS. These jobs are huge and take far longer than 30s to finish. If the worker has not finished by the 30s mark, Kubernetes sends it SIGKILL and it dies abruptly, causing inconsistencies.

Yes, we can increase the 30s timeout, but when SQUAD is under heavy load, a longer timeout can still lead to inconsistencies if the worker does not terminate within it.

The solution to this problem is a new queue called 'ci_fetch_postprocess'. Deployments under heavy load should then create a separate kind of worker for it, one that never dies and does not auto-scale, thus eliminating the problem completely.

The tasks in 'ci_fetch_postprocess' are the plugin ones, which are the culprits of the issue.
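As a rough illustration of the routing idea, here is a minimal sketch of a router function in the shape Celery accepts for its `task_routes` setting, assuming the workers are Celery workers. The task name in `POSTPROCESS_TASKS` is hypothetical and only stands in for the plugin tasks; the queue name is the one described above. This is not the code from this PR.

```python
# Queue name taken from the PR description.
POSTPROCESS_QUEUE = "ci_fetch_postprocess"

# Hypothetical task name, used for illustration only; the real task
# names in ci/tasks.py may differ.
POSTPROCESS_TASKS = {"squad.ci.tasks.postprocess_testjob"}


def route_task(name, args=None, kwargs=None, options=None, task=None, **kw):
    """Send plugin post-processing tasks to their own queue.

    Returning None tells Celery to fall back to its default routing,
    so every other task keeps going to the auto-scaled workers.
    """
    if name in POSTPROCESS_TASKS:
        return {"queue": POSTPROCESS_QUEUE}
    return None
```

With a router like this in place, a deployment would presumably run a fixed-size worker that consumes only from that queue (Celery's `-Q ci_fetch_postprocess` worker option) and keep it out of the autoscaler, so a scale-down never interrupts a long-running post-processing task.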


chaws merged commit 061ec99 into Linaro:master Dec 12, 2023
7 checks passed

chaws deleted the fix-postprocess-sync branch December 12, 2023 14:46

chaws mentioned this pull request Dec 20, 2023

Fix postprocess sync #1117

Merged
