
# [CI] Update the documentation based on recent changes #446

Open: wants to merge 1 commit into `main`.

**premerge/architecture.md** (16 changes: 11 additions & 5 deletions)
@@ -41,8 +41,12 @@ Our self hosted runners come in two flavors:

## GCP runners - Architecture overview

Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
The cluster has 3 pools:
We have two clusters to compose a high availability setup. The description
below describes an individual cluster, but they are largely identical.
Any relevant differences are explicitly enumerated.

Our runners are hosted on GCP Kubernetes clustesr, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
Suggested change (review comment from a project member):
- Before: Our runners are hosted on GCP Kubernetes clustesr, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
- After: Our runners are hosted on GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).

The clusters have 3 pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows
@@ -52,10 +56,12 @@ services required to manage the premerge infra (controller, listeners,
monitoring). Today, this pool has three `e2-highcpu-4` machine.

**llvm-premerge-linux** is a auto-scaling pool with large `n2-standard-64`
VMs. This pool runs the Linux workflows.
VMs. This pool runs the Linux workflows. In the US West cluster, the machines
are `n2d-standard-64` due to quota limitations.

**llvm-premerge-windows** is a auto-scaling pool with large `n2-standard-64`
VMs. Similar to the Linux pool, but this time it runs Windows workflows.
**llvm-premerge-windows** is a auto-scaling pool with large `n2-standard-32`
VMs. Similar to the Linux pool, but this time it runs Windows workflows. In the
US West cluster, the machines are `n2d-standard-32` due to quota limitations.
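
As a quick orientation for the pool layout described above, here is a minimal sketch of how one cluster's node pools could be listed; the cluster name and zone are assumptions borrowed from cluster-management.md and may not match the actual deployment.

```bash
# Hypothetical inspection of one cluster's node pools. The cluster name and
# zone are assumptions taken from cluster-management.md; adjust as needed.
gcloud container node-pools list \
  --cluster=llvm-premerge-prototype \
  --zone=us-central1-a
```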

### Service pool: llvm-premerge-linux-service

**premerge/cluster-management.md** (48 changes: 31 additions & 17 deletions)
@@ -31,15 +31,19 @@ worst case cause an inconsistent state.

The main part you want too look into is `Menu > Kubernetes Engine > Clusters`.

Currently, we have 3 clusters:
Currently, we have 4 clusters:
- `llvm-premerge-checks`: the cluster hosting BuildKite Linux runners.
- `windows-cluster`: the cluster hosting BuildKite Windows runners.
- `llvm-premerge-prototype`: the cluster for those GCP hoster runners.
- `llvm-premerge-prototype`: The first cluster for GCP hosted runners.
- `llvm-premerge-cluster-us-west`: The second cluster for GCP hosted runners.

Yes, it's called `prototype`, but that's the production cluster.
We should rename it at some point.
Yes, one is called `prototype`, but that's the production cluster.
We should rename it at some point. We have two clusters for GCP hosted runners
to form a high availability setup. They both load balance, and if one fails
then the other will pick up the work. This also enables seamless migrations
and upgrades.
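
For reference, a hedged way to confirm the cluster inventory from the command line, assuming gcloud is authenticated against the premerge GCP project:

```bash
# Hypothetical check: list the GKE clusters described above. Assumes gcloud is
# authenticated and configured for the premerge GCP project.
gcloud container clusters list
```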

To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is
To add a VM to a cluster, the VM has to come from a `pool`. A `pool` is
a group of nodes within a cluster that all have the same configuration.

For example:
@@ -88,16 +92,21 @@ To apply any changes to the cluster:
## Setting the cluster up for the first time

```
terraform apply -target google_container_node_pool.llvm_premerge_linux_service
terraform apply -target google_container_node_pool.llvm_premerge_linux
terraform apply -target google_container_node_pool.llvm_premerge_windows
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows
terraform apply
```

Setting the cluster up for the first time is more involved as there are certain
resources where terraform is unable to handle explicit dependencies. This means
that we have to set up the GKE cluster before we setup any of the Kubernetes
resources as otherwise the Terraform Kubernetes provider will error out.
resources as otherwise the Terraform Kubernetes provider will error out. This
needs to be done for both clusters before running the standard
`terraform apply`.
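
Before the final `terraform apply`, a hedged sanity check that the targeted node pools actually landed in state can save a confusing provider error; this is a sketch, not part of the documented procedure.

```bash
# Hypothetical sanity check: confirm the node pools for both cluster modules
# are present in the Terraform state before the full apply.
terraform state list | grep google_container_node_pool
```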

## Upgrading/Resetting Github ARC

@@ -119,16 +128,21 @@ queued on the Github side. Running build jobs will complete after the helm chart
are uninstalled unless they are forcibly killed. Note that best practice dictates
the helm charts should just be uninstalled rather than also setting `maxRunners`
to zero beforehand as that can cause ARC to accept some jobs but not actually
execute them which could prevent failover in HA cluster configurations.
execute them which could prevent failover in a HA cluster configuration like
ours.
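
Before uninstalling, it can be useful to see whether any runners are still busy. A minimal sketch, assuming ARC's standard custom resources are installed and that kubectl points at the cluster being drained:

```bash
# Hypothetical pre-check before uninstalling the helm charts. Assumes ARC's
# standard CRDs (AutoscalingRunnerSet, EphemeralRunner) and that kubectl is
# pointed at the cluster whose charts are about to be removed.
kubectl get autoscalingrunnersets --all-namespaces
kubectl get ephemeralrunners --all-namespaces
```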

### Uninstalling the Helm Charts

For the example commands below we will be modifying the cluster in
`us-central1-a`. You can replace `module.premerge_cluster_us_central` with
`module.premerge_cluster_us_west` to switch which cluster you are working on.

To begin, start by uninstalling the helm charts by using resource targetting
on a kubernetes destroy command:

```bash
terraform destroy -target helm_release.github_actions_runner_set_linux
terraform destroy -target helm_release.github_actions_runner_set_windows
terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_linux
terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_windows
```

These should complete, but if they do not, we are still able to get things
@@ -139,8 +153,8 @@ manually delete them with `kubectl delete`. Follow up the previous terraform
commands by deleting the kubernetes namespaces all the resources live in:

```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_windows_runners
```

If things go smoothly, these should complete quickly. If they do not complete,
@@ -184,17 +198,17 @@ version upgrades however.

Start by destroying the helm chart:
```bash
terraform destroy -target helm_release.github_actions_runner_controller
terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_controller
```

Then delete the namespace to ensure there are no dangling resources
```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_controller
terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_controller
```
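
As with the runner namespaces earlier, a destroy that hangs can be followed up with `kubectl delete`. A hedged sketch, assuming the controller namespace carries the same name as the Terraform resource targeted above:

```bash
# Hypothetical fallback if the namespace destroy hangs. The namespace name is
# an assumption mirroring the kubernetes_namespace resource targeted above.
kubectl get namespace llvm-premerge-controller
kubectl delete namespace llvm-premerge-controller
```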

### Bumping the Version Number

This is not necessary only for bumping the version of ARC. This involves simply
This is necessary only for bumping the version of ARC. This involves simply
updating the version field for the `helm_release` objects in `main.tf`. Make sure
to commit the changes and push them to `llvm-zorg` to ensure others working on
the terraform configuration have an up to date state when they pull the repository.
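
When picking the new version number, one hedged way to see what chart versions exist is to query the registry directly; this assumes the `helm_release` objects pull the ARC charts from GitHub's OCI registry, which is where upstream ARC publishes them.

```bash
# Hypothetical version lookup. Assumes the ARC charts are pulled from GitHub's
# OCI registry (ghcr.io), as upstream ARC documentation describes.
helm show chart \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```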
**premerge/monitoring.md** (2 changes: 1 addition & 1 deletion)
@@ -3,7 +3,7 @@
Presubmit monitoring is provided by Grafana.
The dashboard link is [https://llvm.grafana.net/dashboards](https://llvm.grafana.net/dashboards).

Grafana pulls its data from 2 sources: the GCP Kubernetes cluster & GitHub.
Grafana pulls its data from 2 sources: the GCP Kubernetes clusters & GitHub.
Grafana instance access is restricted, but there is a publicly visible dashboard:
- [Public dashboard](https://llvm.grafana.net/public-dashboards/21c6e0a7cdd14651a90e118df46be4cc)
