CI Monitoring Runbook

driazati edited this page Jun 22, 2022 · 18 revisions

TVM's CI frequently fails on main and on PRs for reasons unrelated to the changes being tested. The purpose of the on-call rotation is to get these failures triaged and managed quickly, which improves the developer experience for TVM. This page documents what to do while you are "on-call".

Rotation Schedule (issue #11462)

#tvm-ci-failures Discord

Monitoring

These are mostly suggestions and need not be followed to the letter. If you discover adjustments or useful information during your rotation, please add them to this page.

  • On each weekday of your rotation, check the CI status of that day's TVM commits: https://github.com/apache/tvm/commits/main (failures are also posted to the #tvm-ci-failures channel on Discord)
    • If there are failures, investigate each one by clicking through to the logs

    • If the failure is due to a test, file an issue via the link in the failing PR run and look in the git blame history for that test to find people to tag. All flaky test issues should have at least 1 person tagged.

      • The "Report flaky test shortcut" link shows up at the end of unit test steps.

      • Disable the flaky tests in code and submit a PR. Make sure to link the issue in the PR message

        import pytest

        @pytest.mark.skip(reason="Disabled due to flakiness <issue link>")
        def test_some_test_case():
            ...
    • If the same failure shows up across multiple commits on main, find the offending commit and submit a PR to revert it. This happens occasionally because of merge races.

      git revert <commit hash that broke CI>
      git commit --amend  # prepend '[skip ci]' to the commit title before you submit the PR
    • If the failure is due to a CI problem (e.g. a timeout, network problems, failure outside of a build or test), file an issue with [ci] prefixed to the title. This will auto-tag all CI maintainers.

  • Once you are done triaging a failure, post a short summary as a reply to the automated message in #tvm-ci-failures on Discord
  • It is not your job to fix the failures! You can if you want, but the only expectation is triaging
  • This is not a real on-call. If you can't check on some day, it's fine to skip it and check both days on the next. If you're going to be out for a week, arrange in #tvm-ci-failures to swap rotations with someone. If you find the workload too much, you can resign from the rotation or scale back your participation.
  • Rotations are a week long in order to make it simpler to know if you're on or not and give people time to work out their own flows.
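
When a failure persists across several commits on main, the "find the offending commit" step amounts to walking back from the newest commit until you reach one whose CI passed. The sketch below illustrates that reasoning only. It assumes you have already noted each commit's CI result by hand from the commits page; the function name `find_offending_commit` and the commit hashes are hypothetical and are not part of any TVM tooling.

```python
from typing import Optional, Sequence, Tuple


def find_offending_commit(history: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Return the oldest commit in the current unbroken run of CI failures.

    `history` holds (commit_hash, ci_passed) pairs ordered newest first,
    the same order as the GitHub commits page. Returns None if the newest
    commit passed, i.e. main is not currently broken.
    """
    offending = None
    for commit_hash, ci_passed in history:
        if ci_passed:
            # First passing commit: everything older is outside the failing run.
            break
        offending = commit_hash
    return offending


# Hypothetical hashes, newest first.
history = [("c3", False), ("c2", False), ("c1", True), ("c0", True)]
print(find_offending_commit(history))  # prints c2
```

The result is only a candidate for `git revert`. A flaky test can produce the same pattern, and if every commit in the list failed, the real culprit may be older than the history you collected, so confirm in the logs that the commits share the same failure before reverting.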

Joining

Interested in helping out? Join the Discord and announce your interest to the team in the #tvm-ci-failures channel.
