[CI] Investigating CI Runners #2652
Comments
@soheilshahrouz I know you would find this interesting.
Thanks @AlexandreSinger. Rebalancing is good to shorten the long pole, so PRs complete CI faster.
The self-hosted runners are down again. I have been looking into the runs that are failing and I am noticing that some jobs are requesting servers that do not exist (for example, trying to call self-hosted machine 200+, which does not exist as far as I can tell). I have been looking around Google and found the following post: One thing I noticed is that the person in that post changed the runner group of their self-hosted machine and it fixed the problem. The runner group is what gives the machines their numbers. @vaughnbetz if this persists into tomorrow, you and I can look into this. I cannot do it on my end since I do not have permission on the VTR repo (also, this is probably something we want to do carefully; while it may be easy to remove the runners from the group, I worry that adding them back may require access to the machines themselves).
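As a side note, here is a minimal sketch of how the runners registered to the repo and their status could be listed through the GitHub REST API, for whoever does have access. The `GH_TOKEN` environment variable and the assumption that the token has sufficient permission on the repo are placeholders; I have not run this against VTR myself.

```python
# Sketch: list the self-hosted runners GitHub actually knows about for the repo,
# along with their status and labels, to see which machine numbers exist.
import os
import requests

REPO = "verilog-to-routing/vtr-verilog-to-routing"
headers = {
    "Authorization": f"Bearer {os.environ['GH_TOKEN']}",  # assumed token with repo admin access
    "Accept": "application/vnd.github+json",
}

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/runners",
    headers=headers,
    params={"per_page": 100},
)
resp.raise_for_status()

for runner in resp.json()["runners"]:
    labels = ", ".join(label["name"] for label in runner["labels"])
    print(f"{runner['name']:30s} status={runner['status']:8s} "
          f"busy={runner['busy']} labels=[{labels}]")
```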
My running theory about what may be going on is that the runner version is so far behind that it is beginning to have compatibility issues with GitHub. For the last month or so, we have not been able to read the logs visually within the GitHub UI. We kind of ignored this issue since the logs were still accessible through the test settings; however, this may have been an indication that something was wrong with the runners. I wonder if GitHub does not expect people to be on the runner version we may be using, and we may now be facing a full deprecation. This may explain why the self-hosted runners did not work for a couple of days, then worked again for a couple of days, and are now not working again. Perhaps behind the scenes GitHub is making changes and only testing them on recent runner versions. I am still not sure which version of the GitHub runners we are currently using; all that I know is that our version of the GitHub runners must be less than v2.308.0 (since that is the most recent version which produced the error we saw when upgrading the actions in a previous issue).
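For what it's worth, here is a quick sketch of how the gap could be gauged against the latest published release of the runner software. The v2.308.0 figure is just the upper bound inferred above; which version the machines actually run is still unknown.

```python
# Sketch: compare the suspected upper bound on our runner version against the
# latest release of github.com/actions/runner, to gauge how far behind we may be.
import requests

resp = requests.get("https://api.github.com/repos/actions/runner/releases/latest")
resp.raise_for_status()
latest = resp.json()["tag_name"]   # e.g. "v2.319.1"

suspected_max = "v2.308.0"         # upper bound inferred from the earlier upgrade error

def as_tuple(tag: str) -> tuple[int, ...]:
    """Turn a tag like 'v2.308.0' into a comparable (2, 308, 0) tuple."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

print(f"Latest published runner release: {latest}")
if as_tuple(suspected_max) < as_tuple(latest):
    print("Our runners appear to be at least several releases behind the latest.")
```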
Thanks. @hzeller @mithro @QuantamHD: we're looking to update the self-hosted runners in the Google Cloud to a later image / GitHub Actions runner version. However, we're not sure where the image is stored or how to update it. Help would be much appreciated!
The self-hosted runners were down for the last couple of days and have only now come back up. I wanted to investigate any anomalies in the logs of the CI runs to see if there are any issues in the test cases we are running which may have caused it.
The motivation behind this investigation is this message produced by the CI when the self-hosted runners were not working:
I went through the logs of the last working nightly test on the master branch ( https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/9932067866 ) and here are the results for the jobs run on the self-hosted runners (this data was collected from the figures at the bottom of the logs). I also collected their run times since I thought they may be valuable.
The biggest thing that catches my eye is that the RAM usage for some of the tests is very close to (what I think is) the capacity of the machine (125 GB). This is caused by each job using 16 cores to run its tests. I doubt this is what caused the problem, since we still have some headroom.
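To make the headroom concern concrete, a rough back-of-the-envelope check; the peak_ram_gb number below is a hypothetical placeholder, not a value taken from the nightly logs.

```python
# Rough headroom check for one self-hosted machine, assuming ~125 GB of RAM and
# a job running its tests on 16 cores. peak_ram_gb is a hypothetical example.
MACHINE_RAM_GB = 125
PARALLEL_TESTS = 16

peak_ram_gb = 110  # hypothetical peak usage observed for one job

headroom_gb = MACHINE_RAM_GB - peak_ram_gb
per_test_budget_gb = MACHINE_RAM_GB / PARALLEL_TESTS

print(f"Headroom at peak: {headroom_gb} GB "
      f"({headroom_gb / MACHINE_RAM_GB:.0%} of the machine)")
print(f"Average budget per parallel test: {per_test_budget_gb:.1f} GB")
```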
I also noticed that a few tests take noticeably longer than the others. Just something to note down.
My biggest concern is that, since some of these jobs are so close to the limit, changes people make locally in their PRs while developing may cause the CI to have issues. For example, if someone accidentally introduced a memory leak while developing and pushed the code without testing locally, it may bring down the CI. This does not appear to be what happened here, since the last run of the CI succeeded without such issues.
I wanted to raise this investigation as an issue to see what people think.