[Docs] Document CI System #14656

Merged (16 commits) on Sep 26, 2024
244 changes: 244 additions & 0 deletions dev-docs/services/ci/README.md
# The CI Service in Hail Batch

Hail Batch includes a CI service which has three functional purposes:

**Member:**
This is borderline pedantry, but I don't think of "Hail Batch" as being anything other than the batch service and the python client library. The batch service + CI etc. are generally referred to as the "Services".

Suggested change:

    - # The CI Service in Hail Batch
    - Hail Batch includes a CI service which has three functional purposes:
    + # The CI Service
    + The CI service has three functional purposes:


- Runs tests against pull requests
- Merges PRs to the `main` branch

**Member:**
Suggested change:

    - - Merges PRs to the `main` branch
    + - Merges PRs to the `main` branch when all checks have passed

- Deploys services to the live infrastructure

## Structure and Operation

The CI service itself is deployed as a Kubernetes service in the Hail Batch cluster. See
[architecture diagram](../Hail%20Batch%20Architectural%20Diagram.png).

As part of its configuration, CI must be configured with an access token allowing it to operate on
behalf of a github account called hail-ci-robot.

**Member:**
repetition of "configure"

Suggested change:

    - As part of its configuration, CI must be configured with an access token allowing it to operate on
    - behalf of a github account called hail-ci-robot.
    + CI must be configured with an access token allowing it to operate on
    + behalf of the 'hail-ci-robot' github account.


### CI Update Loop

The core CI update loop operates over a set of "watched branches", each of which tracks three state flags that can
be marked as dirty:
- The state of github ("`github_changed`")
- The state of the deployed system ("`batch_changed`")
- The state of the watched branch itself (for example - its current commit) ("`state_changed`")

In hail-is, there is exactly one watched branch, which is `hail-is/hail:main`.

**Member:**
I think it would make more sense to describe the deployment model before mentioning the organisation. Reading this, I was like "what does hail-is have to do with this". Alternatively, say something along the lines of "the default-namespace deployment of ci currently tracks hail-is/hail:main".


**Member:**
FWIW, I don't particularly like documenting the current values of config variables as these go stale quickly. I'd link to the file where interested readers can find current values.


**Collaborator Author (@cjllanwarne, Aug 15, 2024):**
Hmm, I can move it and reword it. I want to strike a balance between "documenting something very general purpose" and "explaining how it is being used by us". I'll try again


**Collaborator:**
I agree. This document will be useful for understanding how our CI deployment works on hail-is/hail:main. I think whether some behavior is hard-coded or controlled by configuration is of secondary importance. If some config variable changes frequently, then agreed we should avoid talking about its current setting, but I don't think that's the case here.


At a configurable frequency, the CI service will mark every flag for each of its watched branches as dirty, which
will trigger a re-evaluation of the system state.

In hail-is, this polling happens every 5 minutes.
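
As a rough illustration of this loop, here is a minimal sketch using made-up names (`WatchedBranchSketch`, `polling_loop`) rather than the real classes in `ci/ci/github.py`:

```python
import asyncio
from dataclasses import dataclass


# Illustrative sketch only: these names do not mirror the real WatchedBranch
# implementation in ci/ci/github.py.
@dataclass
class WatchedBranchSketch:
    name: str
    github_changed: bool = False
    batch_changed: bool = False
    state_changed: bool = False

    async def update(self) -> None:
        # Re-evaluate whichever parts of the system were marked dirty, then reset the flags.
        if self.github_changed or self.batch_changed or self.state_changed:
            print(f're-evaluating {self.name}')
        self.github_changed = self.batch_changed = self.state_changed = False


async def polling_loop(branches: list, interval_s: int = 300) -> None:
    """Every interval_s seconds (5 minutes in hail-is), dirty every flag on every
    watched branch so the next update pass re-checks everything."""
    while True:
        for wb in branches:
            wb.github_changed = wb.batch_changed = wb.state_changed = True
            await wb.update()
        await asyncio.sleep(interval_s)
```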

### Github Configuration

To make CI more responsive it can be configured to receive webhook event triggers from the github repository.

**Member:**
Suggested change:

    - To make CI more responsive it can be configured to receive webhook event triggers from the github repository.
    + To make CI more responsive it is configured to receive webhook event triggers from the github repository.


**Collaborator Author:**
Again, this is an accident of trying to be both a general-purpose doc ("can be") and a description of how we use it ("is").


These are configured manually within the github repository itself and not managed by terraform or deploy scripts.

- For hail-is, the webhook target is: `ci.hail.is/github_callback`
- For hail-is, webhook callbacks are configured to happen on changes to the following:
- Pull request reviews
- Pull requests
- Pushes

Depending on the type of webhook received, and the branch which it is targeting, the CI service will
potentially dirty one of its watched branches' state flags, meaning that CI will prioritize re-evaluating
the state of the branch in question.
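
A minimal sketch of that dispatch, with hypothetical names (`dirty_flags`, `handle_github_event`); the real handler sits behind the CI service's `/github_callback` endpoint:

```python
# Hypothetical sketch of mapping webhook events onto dirty flags; the real
# logic lives behind the CI service's /github_callback endpoint.
dirty_flags = {
    'hail-is/hail:main': {'github_changed': False, 'batch_changed': False, 'state_changed': False},
}


def handle_github_event(event_type: str, target_branch: str) -> None:
    flags = dirty_flags.get(target_branch)
    if flags is None:
        return  # the event does not target a watched branch; ignore it
    if event_type in ('pull_request', 'pull_request_review', 'push'):
        # The next update pass will prioritize re-evaluating this branch.
        flags['github_changed'] = True
```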

## Running tests against pull requests

When a PR which targets a watched branch is created, updated or closed, the `github_changed` flag will be marked as dirty.

During its update loop, CI decides whether to run tests against each PR targeting a watched branch (see the sketch after this list):

- Authorization
  - If the author is in the [`ci/ci/constants`](../../../ci/ci/constants.py) list of known developers, it can be tested.
  - If the github SHA for the commit has been registered, it can be tested.
  - Otherwise, it will be ignored.
- Check for validity

  - The PR must be against a watched branch, or else it wouldn't be considered.
  - If the current commit has already had tests run against it, no further action is taken
  - If the PR is marked with the "do_not_test" label, no further action is taken
- Run tests
  - The CI service will run tests against the PR (see below)
  - Report results back to Github using the Checks interface
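
A sketch of this decision logic, assuming hypothetical names; `AUTHORIZED_USERS` stands in for the list in [`ci/ci/constants`](../../../ci/ci/constants.py):

```python
# Illustrative only: the real checks live in ci/ci/github.py.
AUTHORIZED_USERS = {'some-known-developer'}   # stand-in for ci/ci/constants.py
DO_NOT_TEST_LABEL = 'do_not_test'


def should_test_pr(author: str,
                   sha: str,
                   labels: set,
                   registered_shas: set,
                   already_tested_shas: set) -> bool:
    if author not in AUTHORIZED_USERS and sha not in registered_shas:
        return False  # unknown author and unregistered SHA: ignore the PR
    if sha in already_tested_shas:
        return False  # this commit already has test results
    if DO_NOT_TEST_LABEL in labels:
        return False  # the PR has opted out of testing
    return True
```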

### Running Tests

The process of running tests is as follows (a simplified sketch follows this list):

- CI will create a pending Check against the PR in question
- CI will generate a new batch and submit it against the production Hail Batch service using its own service credentials

- Tasks in the batch are defined in `build.yaml` in the top repo directory. CI generates jobs in the batch from the steps defined in there.
- The batch contains jobs which will:
  - Clone the appropriate branch
  - Squash and rebase against `main`

**Member:**
more pedantry

Suggested change:

    - - Squash and rebase against `main`
    + - Squash and rebase onto `main`

  - Build a new set of docker images for the updated services.

**Member:**
It also builds and tests images for hail query

  - Deploy the batch suite of k8s services into one of many CI-owned namespaces in the Hail Batch cluster
    - These namespaces are named like `"pr-<PR_NUMBER>-<CI-NAMESPACE>-<RANDOM>"`
    - Where `CI-NAMESPACE` is the namespace where CI is running (usually `default`).
  - Run a series of tests against the services

**Member:**
It also stands up a private instance of the batch services and tests against those!


**Member:**
Oh I see you mentioned that below. Nice.

  - Each test:
    - Submits a specially crafted batch to the newly deployed batch namespace
    - Checks that the results of running the batch are what we would expect
- The CI service polls the batch which it submits for overall success or failure.
- Once the batch has either succeeded or failed, CI uses that result to report status back to GitHub
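
A simplified sketch of this submit/poll/report cycle; every helper below is a stub standing in for the real Batch client and GitHub Checks calls:

```python
import asyncio

# Illustrative stubs only; the real CI service uses the Batch client and the
# GitHub Checks API for these steps.


async def create_check(sha: str, status: str) -> None:
    print(f'check for {sha}: {status}')


async def submit_test_batch(pr_number: int, sha: str) -> int:
    # In the real service this generates jobs from build.yaml and submits
    # them to the production Batch deployment under CI's own credentials.
    return 12345  # pretend batch id


async def wait_for_batch(batch_id: int) -> str:
    await asyncio.sleep(0)  # the real service polls until the batch finishes
    return 'success'


async def test_pr(pr_number: int, sha: str) -> None:
    await create_check(sha, 'pending')                 # pending check on the PR
    batch_id = await submit_test_batch(pr_number, sha)
    result = await wait_for_batch(batch_id)            # overall success or failure
    await create_check(sha, f'completed ({result})')   # report back to GitHub


if __name__ == '__main__':
    asyncio.run(test_pr(1, 'abc123'))
```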

Examples of CI test runs can be seen by searching through the production batch log, as long as you are a member
of the `ci` billing project. For example: `/batches?q=user+%3D+ci%0D%0Atest+%3D+1`.
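
For reference, that URL-encoded query decodes to the two search terms `user = ci` and `test = 1`:

```python
from urllib.parse import unquote_plus

print(unquote_plus('user+%3D+ci%0D%0Atest+%3D+1'))
# user = ci
# test = 1
```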

#### CI Testing Timeline

The diagram below shows the CI testing timeline:

```mermaid
sequenceDiagram
participant Github
box Kubernetes (PROD: 'default' namespace)
participant CI as PROD CI Service (default namespace)
participant PB as PROD Batch (default namespace)
end
box GCE (PROD VMs)
participant PVM as PROD VMs
end
box Kubernetes (CI: 'test-xyz' namespaces)
participant CIB as CI Batch (test-xyz namespaces)
end
box GCE (CI VMs)
participant CIVM as CI VMs
end

Github->>CI: Notify PR created/updated

CI->>Github: Register github check (pending)
CI->>PB: Submit CI test batch
activate PB
PB->>PVM: Submit build jobs
activate PVM
PVM-->>PB: Done
deactivate PVM
PB->>PVM: Submit deploy jobs
activate PVM
PVM->>CIB: Deploy batch service
activate CIB
PVM-->>PB: Done
deactivate PVM
PB->>PVM: Submit test jobs
activate PVM
PVM->>CIB: Submit test batches
CIB->>CIVM: Submit test batch jobs
activate CIVM
CIVM-->>CIB: Validate results
deactivate CIVM
PVM-->>PB: Done
deactivate PVM
PB->>PVM: Submit cleanup jobs
activate PVM
PVM->>CIB: Clean up service
deactivate CIB
PVM-->>PB: Done
deactivate PVM
PB-->>CI: Detects completion
CI->>Github: Update github check
```

## Merging PRs to the `main` branch

When a PR state changes, it will cause the `github_changed` flag to become dirty.

**Collaborator:**
Is this the github_changed flag of the target branch (if that is a watched branch)?


During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to

**Collaborator:**
Suggested change:

    - During its update loop, the CI service will determine whether any PRs targeting its watched branches are elibible to
    + During its update loop, the CI service will determine whether any PRs targeting its watched branches are eligible to

be merged.

Readiness is determined by github status checks. The following conditions must be met:

- The PR must be against the `main` branch

**Collaborator:**
Is it main, or any watched branch? Just since you started out making the distinction carefully, we should maintain it throughout.


**Collaborator Author:**
Any watched branch, but this is automatically true or else we wouldn't be considering the PR for mergeability in the first place... so I'll remove this line.

- The PR must have passed all tests in GCP

**Member:**
Suggested change:

    - - The PR must have passed all tests in GCP
    + - The PR must have passed all required checks

- The PR must be approved

The control flow from final approval to CI merging a PR looks like:

- The PR's state will change in github (either a check changes to SUCCESS, or a review is approved)
- The github webhook callback will cause the `github_changed` flag to be marked as dirty for the `WatchedBranch`

**Collaborator:**
The watched branch which is the target of the PR?

- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) scans all PRs against the branch and updates state that helps determine mergeability.
- The `WatchedBranch`'s `_update` method in [`github.py`](../../../ci/ci/github.py) iterates again to merge all mergeable PRs, in priority order

**Collaborator:**
There's a subtlety here which is being glossed over. As I understand (without having inspected the code, so correct me if I'm wrong), we keep a PR's check set to SUCCESS even if the successful CI run was on top of an outdated main (i.e. another PR has merged since this PR was last tested). Rerunning the tests of all PRs every time any PR merges would be wasteful, but we don't want to merge a PR unless it has a successful CI run on top of the current main. My mental model is that CI chooses one PR to be the next merge candidate, and reruns the tests if the last run is out of date, then merges only if that succeeds. But I don't know for sure if that's right.


**Collaborator Author:**
Hmm, interesting. I didn't see evidence for this in the code but it seems logical.

I also wonder whether this is why we sometimes see PRs get "stuck" with pending checks. Perhaps CI is just choosing to delay re-running until the PR is back on top of its mergeable list 🤔


**Collaborator Author:**
I did a deep dive into this today, and you are indeed correct. I've added a flowchart showing the full set of paths through the update loop, which might be overkill in the end, but was a useful exercise for me in the process of making it.

What we seem to do is:

- During `heal`:
  - Nominate a current merge candidate
    - If its tests are out of date, re-run them
- During `try_to_merge`:
  - Iterate over all PRs
    - Test `is_mergeable`
    - Merge if mergeable (and return after the first success)

So... I think it's possible that we might kick off a "merge candidate" test and then decide to merge some other PR while those tests are running (if its tests were already up to date)
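
A minimal sketch of the flow described in this comment, with illustrative names that do not match the real `WatchedBranch` code:

```python
# Illustrative sketch of the heal / try_to_merge flow described above; the
# real implementation lives in ci/ci/github.py and differs in detail.


def is_mergeable(pr: dict, current_main_sha: str) -> bool:
    # Approved, checks passed, and the successful run was on top of current main.
    return (pr['approved']
            and pr['checks_passed']
            and pr['tested_against_sha'] == current_main_sha)


def heal(prs: list, current_main_sha: str) -> None:
    """Nominate one merge candidate and re-run its tests if they are stale."""
    candidates = [pr for pr in prs if pr['approved'] and pr['checks_passed']]
    if not candidates:
        return
    candidate = max(candidates, key=lambda pr: pr['priority'])
    if candidate['tested_against_sha'] != current_main_sha:
        print(f're-testing PR {candidate["number"]}')  # placeholder for resubmitting the test batch


def try_to_merge(prs: list, current_main_sha: str) -> None:
    """Merge the first mergeable PR, in priority order, then stop."""
    for pr in sorted(prs, key=lambda pr: pr['priority'], reverse=True):
        if is_mergeable(pr, current_main_sha):
            print(f'merging PR {pr["number"]}')        # placeholder for the squash-and-merge call
            return
```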


## Deploying services to the live infrastructure

When a PR is merged into the `main` branch, a webhook fires and the CI service sets its `github_changed` flag.

During its update loop, the CI service will determine that the SHA of the `WatchedBranch` has changed and trigger
a deployment.

- Create a new batch job to:

**Collaborator:**
I believe this is also build.yaml, just with a configuration variable set to make it deploy?

  - Build various components and service images
  - Run various pre-deployment tests
  - Deploy to the `default` (i.e. prod) namespace
  - Run various post-deployment tests
  - A handful of final actions
    - e.g. rolling out artifact registry cleanup policies, amongst many other things

Note: It's not actually quite as waterfall-y as this. In fact the jobs all run in a single hail
batch, and each package being built and service being deployed has its own path through the DAG. So it's quite possible
that services are deploying and testing in parallel, and that the deploy for one service might happen before the tests for
another have completed.

This should all be fine, because everything was previously tested as part of the PR approval process.
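
As a rough sketch of this trigger (illustrative names only; per the review comment above, the real deploy batch is generated from `build.yaml`):

```python
# Illustrative only: the real logic is part of the WatchedBranch update loop,
# and the deploy batch is generated from build.yaml rather than hard-coded.


def maybe_deploy(branch_name: str, last_deployed_sha: str) -> str:
    current_sha = get_branch_sha(branch_name)          # stand-in for a GitHub API call
    if current_sha != last_deployed_sha:
        submit_deploy_batch(branch_name, current_sha)  # build, test and deploy DAG
        return current_sha
    return last_deployed_sha


def get_branch_sha(branch_name: str) -> str:
    return '0' * 40  # stub


def submit_deploy_batch(branch_name: str, sha: str) -> None:
    print(f'deploying {branch_name} at {sha}')
```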

**Collaborator:**
Should emphasize here that a PR is only merged (really, squashed and rebased) if the exact commit that is added to main, with the exact same SHA, has passed CI.


Examples of CI deploy runs can be seen by searching through the production batch log, as long as you have developer
permissions. For example: `/batches?q=user+%3D+ci%0D%0Adeploy+%3D+1`

#### CI Deploy Timeline

The diagram below shows the CI deployment timeline:

```mermaid
sequenceDiagram
participant Github
box Kubernetes (PROD: 'default' namespace)
participant CI as PROD CI Service (default namespace)
participant PB as PROD Batch (default namespace)
end
box GCE (PROD VMs)
participant PVM as PROD VMs
end

Github->>CI: Notify 'main' branch updated
CI->>PB: Submit CI deploy batch
activate PB
PB->>PVM: Submit build jobs
activate PVM
PVM-->>PB: Done
deactivate PVM
PB->>PVM: Submit deploy jobs
activate PVM
PVM->>PB: Redeploy batch service
PVM->>CI: Redeploy CI service
PVM-->>PB: Done
deactivate PVM
PB->>PVM: Submit test jobs
activate PVM
PVM->>PB: Submit test batches
PB->>PVM: Submit test batch jobs
activate PVM
PVM-->>PB: Validate results
deactivate PVM
PVM-->>PB: Done
deactivate PVM
PB-->>CI: Detects completion
```

## Issues

### Hanging PR state

It is sometimes possible for a PR status to be stuck in pending even though the test batch has completed.
This can be fixed by pushing a new commit to the branch, but it's strange that regular polling does not catch
these instances.

## Open Questions

1. Sometimes we see log entries like `"update github br-hail-ci-test-ci-test-<RANDOM>-main"` for various random branch names.
What does this mean?

**Collaborator:**
Could these branches be the results of squashing and rebasing a PR before running CI?

2. Do we really deploy first / test second? And do we really deploy Batch and CI using jobs that are already running in
Batch and CI? Do the services get shut down and reconnect to existing instances of the deploy jobs started by the
previous version?

**Member:**
Don't quite understand the question.
There's one deploy job at a time that deploys already tested artefacts (albeit, tested in a different namespace).
k8s handles updating the various pods with their new images.
Sometimes, the services are brought down depending on database migrations. I think that's defined in build.yaml.
