CI Monitoring Runbook
TVM's CI fails on `main` and on PRs with unrelated changes all the time. The purpose of the on-call rotation is to get these failures triaged and managed quickly, which improves the developer experience for TVM. This page documents what to do while you are "on-call".
Rotation Schedule (issue #11462)
These are mostly suggestions and need not be followed to the letter. If you discover adjustments or useful information during your rotation, please add them to this page.
- During each weekday of your rotation, check the CI status of that day's TVM commits: https://github.com/apache/tvm/commits/main (failures are also posted to the `#tvm-ci-failures` channel on Discord)
  - If there are failures, investigate each one by clicking through to the logs
  - Every failure on `main` should get an issue filed or be linked to an existing issue
    - If the failure is due to a test, file an issue via the link in the failing PR run and look in the git blame history for that test to find people to tag. All flaky test issues should have at least 1 person tagged.
      - The "Report flaky test shortcut" link shows up at the end of unit test steps.
      - Disable the flaky tests in code and submit a PR. Make sure to link the issue in the PR message.

            import pytest

            @pytest.mark.skip(reason="Disabled due to flakiness <issue link>")
            def test_some_test_case():
                ...
    - If the failure comes from an interaction between multiple commits, find the offending commit and submit a PR to revert it. This can happen on occasion due to merge races.

          git revert <commit hash that broke CI>
          git commit --amend  # prepend '[skip ci]' to the commit title before you submit the PR
    - If the failure is due to a CI problem (e.g. a timeout, network problems, failure outside of a build or test), file an issue with `[ci]` prefixed to the title. This will auto-tag all CI maintainers.
- Once you are done triaging a failure, post a short summary in reply to the automated message in `#tvm-ci-failures` on Discord
- It is not your job to fix the failures! You can if you want, but the only expectation is triaging
- This is not a real on-call; if you can't check on some day, it's fine to skip it and check both days' commits the next day. If you're going to be out for a week, arrange in `#tvm-ci-failures` to swap rotations with someone. If you find the workload too much, you can resign from the rotation or scale back your participation.
- Rotations are a week long in order to make it simpler to know if you're on or not and to give people time to work out their own flows.
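The triage decisions in the list above can be summarized as a small lookup. The following is only an illustrative sketch: `FailureKind`, `Triage`, `triage`, and `skip_ci_title` are hypothetical names made up for this example and are not part of TVM or its CI tooling. The sketch maps each failure category to the follow-up the runbook asks for, and shows the `[ci]` and `[skip ci]` title prefixes being applied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureKind(Enum):
    """Hypothetical labels for the three failure categories in the runbook."""
    FLAKY_TEST = "flaky test"
    BAD_COMMIT = "bad commit"
    CI_INFRA = "CI infrastructure"


@dataclass
class Triage:
    kind: FailureKind
    issue_title: str
    next_steps: List[str]


def triage(kind: FailureKind, summary: str) -> Triage:
    """Map a failure category to the follow-up the runbook asks for."""
    if kind is FailureKind.FLAKY_TEST:
        return Triage(kind, summary, [
            "file an issue via the 'Report flaky test shortcut' link",
            "tag at least one person found through git blame",
            "submit a PR that skips the test and links the issue",
        ])
    if kind is FailureKind.BAD_COMMIT:
        return Triage(kind, summary, [
            "find the offending commit",
            "submit a revert PR with '[skip ci]' prepended to the title",
        ])
    # CI problems get a '[ci]' title prefix so that CI maintainers are auto-tagged.
    return Triage(kind, f"[ci] {summary}", [
        "file the issue with the '[ci]' prefix",
    ])


def skip_ci_title(title: str) -> str:
    """Prepend '[skip ci]' to a commit title unless it is already there."""
    return title if title.startswith("[skip ci]") else f"[skip ci] {title}"
```

For example, `triage(FailureKind.CI_INFRA, "Docker pull timeout").issue_title` yields `[ci] Docker pull timeout`, and `skip_ci_title('Revert "Add foo"')` yields `[skip ci] Revert "Add foo"`.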
Interested in helping out? Join the Discord and announce your interest to the team in the `#tvm-ci-failures` channel.
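As a companion to the daily check described in the runbook, the sketch below lists failing commit statuses for recent commits on `main` using the public GitHub REST API. It assumes that TVM's CI results are visible as commit statuses on each commit; results reported through GitHub Actions are exposed via a separate check-runs endpoint that this sketch does not query. The helper names (`get_json`, `failing_statuses`, `report_recent_failures`) are made up for this example.

```python
import json
import urllib.request

API = "https://api.github.com/repos/apache/tvm"


def get_json(url: str):
    """Fetch a GitHub REST API URL and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def failing_statuses(combined: dict) -> list:
    """Return (context, target_url) pairs for statuses that failed or errored."""
    return [
        (status["context"], status.get("target_url"))
        for status in combined.get("statuses", [])
        if status["state"] in ("failure", "error")
    ]


def report_recent_failures(count: int = 10) -> None:
    """Print failing statuses for the most recent commits on main."""
    commits = get_json(f"{API}/commits?sha=main&per_page={count}")
    for commit in commits:
        sha = commit["sha"]
        # Use only the first line of the commit message as the title.
        title = commit["commit"]["message"].split("\n", 1)[0]
        combined = get_json(f"{API}/commits/{sha}/status")
        for context, url in failing_statuses(combined):
            print(f"{sha[:10]} {title}\n  {context}: {url}")
```

Calling `report_recent_failures()` makes one request for the commit list and one per commit. Unauthenticated requests to the GitHub API are rate limited, so keep `count` small or add an authorization header if you run this often.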