-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Docs] Document CI System #14656
[Docs] Document CI System #14656
Changes from 10 commits
c350cc6
ab1958b
18c9689
045f884
c1862b1
2922717
cb494c9
3561795
692ae3e
bddda15
58fafe4
4cbaa0b
85e4a29
61a4a05
dcd1b17
8719841
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,244 @@ | ||||||||||
# The CI Service in Hail Batch | ||||||||||
|
||||||||||
Hail Batch includes a CI service which has three functional purposes: | ||||||||||
|
||||||||||
- Runs tests against pull requests | ||||||||||
- Merges PRs to the `main` branch | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
- Deploys services to the live infrastructure | ||||||||||
|
||||||||||
## Structure and Operation | ||||||||||
|
||||||||||
The CI service itself is deployed as a Kubernetes service in the Hail Batch cluster. See | ||||||||||
[architecture diagram](../Hail%20Batch%20Architectural%20Diagram.png). | ||||||||||
|
||||||||||
As part of its configuration, CI must be configured with an access token allowing it to operate on | ||||||||||
behalf of a github account called hail-ci-robot. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. repetition of "configure"
Suggested change
|
||||||||||
|
||||||||||
### CI Update Loop | ||||||||||
|
||||||||||
The core CI update loop operates over a set of "watched branches" which each track three types of state flag which can | ||||||||||
be marked as dirty: | ||||||||||
- The state of github ("`github_changed`") | ||||||||||
- The state of the deployed system ("`batch_changed`") | ||||||||||
- The state of the watched branch itself (for example - its current commit) ("`state_changed`") | ||||||||||
|
||||||||||
In hail-is, there is exactly one watched branch, which is `hail-is/hail:main`. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would make more sense to describe the deployment model before mentioning the organisation. Reading this, i was like "what does hail-is have to do with this". Alternatively, say something along the lines of "the default-namespace deployment of ci currently tracks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I don't particularly like documenting the current values of config variables as these go stale quickly. I'd link to the file where interested readers can find current values. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, I can move it and reword it. I want to strike a balance between "documenting something very general purpose" and "explaining how it is being used by us". I'll try again There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. This document will be useful for understanding how our CI deployment works on |
||||||||||
|
||||||||||
At a configurable frequency, the CI service will mark every flag for each of its watched branches as dirty, which | ||||||||||
will trigger a re-evaluation of the system state. | ||||||||||
|
||||||||||
In hail-is, this polling happens every 5 minutes. | ||||||||||
|
||||||||||
### Github Configuration | ||||||||||
|
||||||||||
To make CI more responsive it can be configured to receive webhook event triggers from the github repository. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, this is an accident of trying to be both a general purpose doc ("can be") vs how we use it ("is") |
||||||||||
|
||||||||||
These are configured manually within the github repository itself and not managed by terraform or deploy scripts. | ||||||||||
|
||||||||||
- For hail-is, the webhook target is: `ci.hail.is/github_callback` | ||||||||||
- For hail-is, webhook callbacks are configured to happen on changes to the following: | ||||||||||
- Pull request reviews | ||||||||||
- Pull requests | ||||||||||
- Pushes | ||||||||||
|
||||||||||
Depending on the type of webhook received, and the branch which it is targeting, the CI service will | ||||||||||
potentially dirty one of its watched branches' state flags, meaning that CI will prioritize re-evaluating | ||||||||||
the state of the branch in question. | ||||||||||
|
||||||||||
## Running tests against pull requests | ||||||||||
|
||||||||||
When a PR which targets a watched branch is created, updated or closed, the github_changed flag will be marked as dirty. | ||||||||||
|
||||||||||
During its update loop, CI will potentially decide to run tests against all PRs targetting a watched branch: | ||||||||||
|
||||||||||
- Authorization | ||||||||||
- If the author is in the [`ci/ci/constants`](../../../ci/ci/constants.py) list of known developers, it can be tested. | ||||||||||
- If the github SHA for the commit has been registered, it can be tested. | ||||||||||
- Otherwise, it will be ignored. | ||||||||||
- Check for validity | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reworded |
||||||||||
- The PR must be against a watched branch, or else it wouldn't be considered. | ||||||||||
- If the current commit has already had tests run against it, no further action is taken | ||||||||||
- If the PR is marked with the "do_not_test" label, no further action is taken | ||||||||||
- Run tests | ||||||||||
- The CI service will run tests against the PR (see below) | ||||||||||
- Report results back to Github using the Checks interface | ||||||||||
|
||||||||||
### Running Tests | ||||||||||
|
||||||||||
The process of running tests goes like: | ||||||||||
|
||||||||||
- CI will create a pending Check against the PR in question | ||||||||||
- CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials | ||||||||||
This comment was marked as resolved.
Sorry, something went wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 I've mentioned this now |
||||||||||
- Tasks in the batch are defined in `build.yaml` in the top repo directory. CI generates jobs in the batch from the steps defined in there. | ||||||||||
- The batch contains jobs which will: | ||||||||||
- Clone the appropriate branch | ||||||||||
- Squash and rebase against `main` | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. more pedantry
Suggested change
|
||||||||||
- Build a new set of docker images for the updated services. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It also builds and tests images for hail query |
||||||||||
- Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster | ||||||||||
- These namespaces are named like `"pr-<PR_NUMBER>-<CI-NAMESPACE>-<RANDOM>"` | ||||||||||
- Where `CI-NAMESPACE` is the namespace where CI is running (usually `default`). | ||||||||||
- Run a series of tests against the services | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It also stands up a private instance of the batch services and tests against those! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh i see you mentioned that below. nice. |
||||||||||
- Each test: | ||||||||||
- Submits a specially crafted batch to the newly deployed batch namespace | ||||||||||
- Checks that the results of running the batch are what we would expect | ||||||||||
- The CI service polls the batch which it submits for overall success or failure. | ||||||||||
- Once the batch has either succeeded or failed, CI uses that result to report status back to GitHub | ||||||||||
|
||||||||||
Examples of CI test runs can be seen by searching through the production batch log, as long as you are a member | ||||||||||
of the `ci` billing project. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1`. | ||||||||||
|
||||||||||
#### CI Testing Timeline | ||||||||||
|
||||||||||
The image below shows the CI testing timeline: | ||||||||||
|
||||||||||
```mermaid | ||||||||||
sequenceDiagram | ||||||||||
participant Github | ||||||||||
box Kubernetes (PROD: 'default' namespace) | ||||||||||
participant CI as PROD CI Service (default namespace) | ||||||||||
participant PB as PROD Batch (default namespace) | ||||||||||
end | ||||||||||
box GCE (PROD VMs) | ||||||||||
participant PVM as PROD VMs | ||||||||||
end | ||||||||||
box Kubernetes (CI: 'test-xyz' namespaces) | ||||||||||
participant CIB as CI Batch (test-xyz namespaces) | ||||||||||
end | ||||||||||
box GCE (CI VMs) | ||||||||||
participant CIVM as CI VMs | ||||||||||
end | ||||||||||
|
||||||||||
Github->>CI: Notify PR created/updated | ||||||||||
|
||||||||||
CI->>Github: Register github check (pending) | ||||||||||
CI->>PB: Submit CI test batch | ||||||||||
activate PB | ||||||||||
PB->>PVM: Submit build jobs | ||||||||||
activate PVM | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB->>PVM: Submit deploy jobs | ||||||||||
activate PVM | ||||||||||
PVM->>CIB: Deploy batch service | ||||||||||
activate CIB | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB->>PVM: Submit test jobs | ||||||||||
activate PVM | ||||||||||
PVM->>CIB: Submit test batches | ||||||||||
CIB->>CIVM: Submit test batch jobs | ||||||||||
activate CIVM | ||||||||||
CIVM-->>CIB: Validate results | ||||||||||
deactivate CIVM | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB->>PVM: Submit cleanup jobs | ||||||||||
activate PVM | ||||||||||
PVM->>CIB: Clean up service | ||||||||||
deactivate CIB | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB-->>CI: Detects completion | ||||||||||
CI->>Github: Update github check | ||||||||||
``` | ||||||||||
|
||||||||||
## Merging PRs to the `main` branch | ||||||||||
|
||||||||||
When a PR state changes, it will cause the `github_changed` flag to become dirty. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this the |
||||||||||
|
||||||||||
During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
be merged. | ||||||||||
|
||||||||||
Readiness is determined by github status checks. The following conditions must be met: | ||||||||||
|
||||||||||
- The PR must be against the `main` branch | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergability in the first place... so I'll remove this line |
||||||||||
- The PR must have passed all tests in GCP | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||
- The PR must be approved | ||||||||||
|
||||||||||
The control flow from final approval to CI merging a PRs looks like: | ||||||||||
|
||||||||||
- The PR's state will change in github (either a check changes to SUCCESS, or a review is approved) | ||||||||||
- The github webhook callback will cause the `github_changed` flag will be marked as dirty for the `WatchedBranch` | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The watched branch which is the target of the PR? |
||||||||||
- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) scans all PRs against the branch and updates state that helps determine mergeability. | ||||||||||
- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) iterates again to merge all mergeable PRs, in priority order | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a subtlety here which is being glossed over. As I understand (without having inspected the code, so correct me if I'm wrong), we keep a PR's check set to SUCCESS even if the successful CI run was on top of an outdated main (i.e. another PR has merged since this PR was last tested). Rerunning the tests of all PRs every time any PR merges would be wasteful, but we don't want to merge a PR unles it has a successful CI run on top of the current main. My mental model is that CI chooses one PR to be the next merge candidate, and reruns the tests if the last run is out of date, then merges only if that suceeds. But I don't know for sure if that's right. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, interesting. I didn't see evidence for this in the code but it seems logical. I also wonder whether this is why we sometimes see PRs get "stuck" with pending checks. Perhaps CI is just choosing to delay re-running until the PR is back on top of its mergeable list 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did a deep dive into this today, and you are indeed correct. I've added a flowchart showing the full set of paths through the update loop, which might be overkill in the end, but was a useful exercise for me in the process of making it. What we seem to do is -
So... I think it's possible that we might kick off a "merge candidate" test and then decide to merge some other PR while those tests are running (if its tests were already up to date) |
||||||||||
|
||||||||||
## Deploying services to the live infrastructure | ||||||||||
|
||||||||||
When a PR is merged into the `main` branch, a webhook will trigger. The CI service will set its `github_changed` flag. | ||||||||||
|
||||||||||
During its update loop, the CI service will determine that the SHA of the `WatchedBranch` has changed and trigger | ||||||||||
a deployment. | ||||||||||
|
||||||||||
- Create a new batch job to: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this is also |
||||||||||
- Build various components and service images | ||||||||||
- Run various pre-deployment tests | ||||||||||
- Deploy to the `default` (ie prod) namespace | ||||||||||
- Run various post-deployment tests | ||||||||||
- A handful of final actions | ||||||||||
- eg rolling out artifact registry cleanup policies, amongst many other things | ||||||||||
|
||||||||||
Note: It's not actually quite as waterfall-y as this. In fact the jobs are all running in a hail | ||||||||||
batch, and each package being built and service being deployed has its own path through the DAG. So it's quite possible | ||||||||||
that services are deploy/test-ing in parallel, and that the deploy for one service might happen before the test for | ||||||||||
another has completed. | ||||||||||
|
||||||||||
This should all be fine, because everything was previously tested as part of the PR approval process. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should emphasize here that a PR is only merged (really, squashed and rebased) if the exact commit that is added to main, with the exact same SHA, has passed CI. |
||||||||||
|
||||||||||
Examples of CI deploy runs can be seen by searching through the production batch log, as long as you have developer | ||||||||||
permissions. For example: `/batches?q=user+%3D+ci%0D%0Adeploy+%3D+1` | ||||||||||
|
||||||||||
#### CI Deploy Timeline | ||||||||||
|
||||||||||
The image below shows the CI deployment timeline: | ||||||||||
|
||||||||||
```mermaid | ||||||||||
sequenceDiagram | ||||||||||
participant Github | ||||||||||
box Kubernetes (PROD: 'default' namespace) | ||||||||||
participant CI as PROD CI Service (default namespace) | ||||||||||
participant PB as PROD Batch (default namespace) | ||||||||||
end | ||||||||||
box GCE (PROD VMs) | ||||||||||
participant PVM as PROD VMs | ||||||||||
end | ||||||||||
|
||||||||||
Github->>CI: Notify 'main' branch updated | ||||||||||
CI->>PB: Submit CI deploy batch | ||||||||||
activate PB | ||||||||||
PB->>PVM: Submit build jobs | ||||||||||
activate PVM | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB->>PVM: Submit deploy jobs | ||||||||||
activate PVM | ||||||||||
PVM->>PB: Redeploy batch service | ||||||||||
PVM->>CI: Redeploy CI service | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB->>PVM: Submit test jobs | ||||||||||
activate PVM | ||||||||||
PVM->>PB: Submit test batches | ||||||||||
PB->>PVM: Submit test batch jobs | ||||||||||
activate PVM | ||||||||||
PVM-->>PB: Validate results | ||||||||||
deactivate PVM | ||||||||||
PVM-->>PB: Done | ||||||||||
deactivate PVM | ||||||||||
PB-->>CI: Detects completion | ||||||||||
``` | ||||||||||
|
||||||||||
## Issues | ||||||||||
|
||||||||||
### Hanging PR state | ||||||||||
|
||||||||||
It is sometimes possible for a PR status to be stuck in pending even though the test batch has completed | ||||||||||
This can be fixed by pushing a new commit to the branch, but it's strange that regular polling does not catch | ||||||||||
these instances. | ||||||||||
|
||||||||||
## Open Questions | ||||||||||
|
||||||||||
1. Sometimes in logs we see logs like `"update github br-hail-ci-test-ci-test-<RANDOM>-main"` for various random branch names. | ||||||||||
What does this mean? | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could these branches be the results of squashing and rebasing a PR before running CI? |
||||||||||
2. Do we really deploy first / test second? And do we really deploy Batch and CI using jobs that are already running in | ||||||||||
Batch and CI? Do the services get shut down and reconnect to existing instances of the deploy jobs started by the | ||||||||||
previous version? | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't quite understand the question. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is boarderline pedantry, but I don't think of "Hail Batch" being anything other than the batch service and the python client library. the batch service + ci etc are generally referred to as the "Services".