Add Graceful Recovery Baseline Test #1111

Merged 16 commits on Oct 11, 2023
182 changes: 182 additions & 0 deletions tests/graceful-recovery/graceful-recovery.md
# Graceful recovery from restarts

This document describes how we test graceful recovery from restarts of NGINX Gateway Fabric (NGF).

<!-- TOC -->
- [Graceful recovery from restarts](#graceful-recovery-from-restarts)
- [Goal](#goal)
- [Test Environment](#test-environment)
- [Steps](#steps)
- [Setup](#setup)
- [Run the tests](#run-the-tests)
- [Restart nginx-gateway container](#restart-nginx-gateway-container)
- [Restart NGINX container](#restart-nginx-container)
- [Restart Node with draining](#restart-node-with-draining)
- [Restart Node without draining](#restart-node-without-draining)
<!-- TOC -->

## Goal

Ensure that NGF can recover gracefully from container failures without any user intervention.

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
- Node: e2-medium (2 vCPU, 4GB memory)
- A Kind cluster (used for the Node restart tests)

## Steps

### Setup

1. Set up a GKE cluster.
2. Clone the repo and change into the nginx-gateway-fabric directory.
3. Check out the latest tag (unless you are installing the edge version from the main branch).
4. In `deploy/manifests/nginx-gateway.yaml`, change `runAsNonRoot` from `true` to `false`.
   This allows us to insert our ephemeral container as root, which is required to restart the nginx-gateway container.
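This edit can be scripted. The sketch below is an illustration, not part of the repo: it writes a tiny stand-in snippet so the `sed` invocation can be tried anywhere; point `MANIFEST` at your checked-out `deploy/manifests/nginx-gateway.yaml` instead.

```shell
# Sketch: flip runAsNonRoot from true to false.
# MANIFEST would normally be deploy/manifests/nginx-gateway.yaml; a small
# stand-in snippet is written here so the commands can be tried anywhere.
MANIFEST=${MANIFEST:-/tmp/nginx-gateway.yaml}
printf 'securityContext:\n  runAsNonRoot: true\n' > "$MANIFEST"
sed -i.bak 's/runAsNonRoot: true/runAsNonRoot: false/' "$MANIFEST"
grep 'runAsNonRoot' "$MANIFEST"
```

The `-i.bak` form keeps a backup of the original manifest, which is convenient for restoring the setting after the test.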
5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
6. In a separate terminal, track the NGF logs.

```console
kubectl -n nginx-gateway logs -f deploy/nginx-gateway
```

7. In a separate terminal, track the NGINX container logs.

```console
kubectl -n nginx-gateway logs -f <NGF_POD> -c nginx
```

8. In a separate terminal, exec into the NGINX container inside the NGF Pod.

```console
kubectl exec -it -n nginx-gateway <NGF_POD> --container nginx -- sh
```

9. In a different terminal, deploy the
[https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.conf` to verify
    that the configuration and version were correctly updated.
11. Send traffic through the example application and ensure it is working correctly.
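For reference, a traffic check against the https-termination example can look like the following (hostname, port, and paths are taken from that example; `GW_IP` is the LoadBalancer IP, and `--insecure` is required because the example uses a self-signed certificate):

```console
curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/coffee --insecure
curl --resolve cafe.example.com:443:$GW_IP https://cafe.example.com:443/tea --insecure
```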

### Run the tests

#### Restart nginx-gateway container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. Insert ephemeral container in NGF Pod.

```console
kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
```

3. Kill the nginx-gateway process with a SIGKILL signal (its command should start with `/usr/bin/gateway`).

```console
kill -9 <nginx-gateway_PID>
```
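Finding the right PID can be scripted. The helper below is hypothetical (not part of the repo): it picks the first process whose command line contains a pattern and sends it SIGKILL. With the busybox `ps` inside the ephemeral container, flag support may differ, so plain `ps` plus manual inspection is a fallback.

```shell
# Hypothetical helper: SIGKILL the first process whose command line
# contains the given pattern, excluding this shell and the awk pipeline.
kill_by_pattern() {
    pid=$(ps ax -o pid=,args= | awk -v pat="$1" -v self="$$" \
        'index($0, pat) && !/awk/ && $1 != self {print $1; exit}')
    [ -n "$pid" ] && kill -9 "$pid"
}
# Usage inside the debug container: kill_by_pattern /usr/bin/gateway
```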

4. Check for errors in the NGF and NGINX container logs.
5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
6. Open up the NGF and NGINX container logs and check for errors.
7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
8. Send traffic through the example application and ensure it is working correctly.
9. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
3. Send traffic to the deleted routes and ensure it no longer flows.
4. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart NGINX container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. If the terminal inside the NGINX container is no longer running, exec back into the NGINX container.
3. Inside the NGINX container, kill the nginx master process with a SIGKILL signal
   (its command should start with `nginx: master process`).

```console
kill -9 <nginx-master_PID>
```
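To find the master PID, one option is to list processes inside the NGINX container (exact flags depend on the `ps` implementation shipped in the image):

```console
ps -ef | grep 'nginx: master'
```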

4. When NGINX container is back up, ensure traffic flows through the example application correctly.
5. Open up the NGINX container logs and check for errors.
6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
7. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
3. Send traffic to the deleted routes and ensure it no longer flows.
4. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart Node with draining

1. Switch to a one-Node Kind cluster. You can run `make create-kind-cluster` from the main directory.
2. Run steps 4-11 of the [Setup](#setup) section above using
[this guide](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/running-on-kind.md) for running on Kind.
3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
4. Drain the Node to evict its Pods.

```console
kubectl drain kind-control-plane --ignore-daemonsets --delete-local-data
```

5. Delete the Node.

```console
kubectl delete node kind-control-plane
```

6. Restart the Docker container.

```console
docker restart kind-control-plane
```
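After the restart, one way to confirm that the Node has rejoined the cluster and the NGF Pod is running again is to watch until both report Ready:

```console
kubectl get nodes -w
kubectl -n nginx-gateway get pods -w
```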

7. Open up both NGF and NGINX container logs and check for errors.
8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
9. Send traffic through the example application and ensure it is working correctly.
10. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
3. Send traffic to the deleted routes and ensure it no longer flows.
4. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart Node without draining

1. Repeat the above test, skipping steps 4-5 (draining and deleting the Node).
142 changes: 142 additions & 0 deletions tests/graceful-recovery/results/1.0.0/1.0.0.md
# Results for v1.0.0

<!-- TOC -->
- [Results for v1.0.0](#results-for-v100)
- [Versions](#versions)
- [Tests](#tests)
- [Restart nginx-gateway container](#restart-nginx-gateway-container)
- [Restart NGINX container](#restart-nginx-container)
- [Restart Node with draining](#restart-node-with-draining)
- [Restart Node without draining](#restart-node-without-draining)
- [Future Improvements](#future-improvements)
<!-- TOC -->


## Versions

NGF version:

```text
commit: 72b6c6ef8915c697626eeab88fdb6a3ce15b8da0
date: 2023-10-02T13:13:08Z
version: edge
```

with NGINX:

```text
nginx/1.25.2
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.49-linuxkit-pr
```


Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"28",
GitVersion:"v1.28.0",
GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
GoVersion:"go1.20.7", Compiler:"gc",
Platform:"linux/arm64"}
```

## Tests

### Restart nginx-gateway container
Passed the test with no errors.

### Restart NGINX container
The NGF Pod was unable to recover after sending a SIGKILL signal to the NGINX master process.
The following appeared in the NGINX logs:

```text
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/10/10 22:46:54 [emerg] 141#141: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:46:54 [notice] 141#141: try again to bind() after 500ms
2023/10/10 22:46:54 [emerg] 141#141: still could not bind()
```

Issue Filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108


### Restart Node with draining
Passed the test with no errors.

### Restart Node without draining
In the majority of runs, the NGF Pod was unable to recover after `docker restart kind-control-plane`.

The following appeared in the NGINX logs:

```text
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/10/10 22:57:05 [notice] 140#140: try again to bind() after 500ms
2023/10/10 22:57:05 [emerg] 140#140: still could not bind()
```

The following appeared in the NGF logs:

```text
{"level":"info","ts":"2023-10-10T22:57:05Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"b3fbf98d906f60ce66d70d7a2373c4b12b7d5606","date":"2023-10-10T22:02:06Z"}
Error: failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
Usage:
gateway static-mode [flags]

Flags:
-c, --config string The name of the NginxGateway resource to be used for this controller's dynamic configuration. Lives in the same Namespace as the controller. (default "")
--gateway string The namespaced name of the Gateway resource to use. Must be of the form: NAMESPACE/NAME. If not specified, the control plane will process all Gateways for the configured GatewayClass. However, among them, it will choose the oldest resource by creation timestamp. If the timestamps are equal, it will choose the resource that appears first in alphabetical order by {namespace}/{name}.
--health-disable Disable running the health probe server.
--health-port int Set the port where the health probe server is exposed. Format: [1024 - 65535] (default 8081)
-h, --help help for static-mode
--leader-election-disable Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources.
--leader-election-lock-name string The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock")
--metrics-disable Disable exposing metrics in the Prometheus format.
--metrics-port int Set the port where the metrics are exposed. Format: [1024 - 65535] (default 9113)
--metrics-secure-serving Enable serving metrics via https. By default metrics are served via http. Please note that this endpoint will be secured with a self-signed certificate.
--update-gatewayclass-status Update the status of the GatewayClass resource. (default true)

Global Flags:
--gateway-ctlr-name string The name of the Gateway controller. The controller name must be of the form: DOMAIN/PATH. The controller's domain is 'gateway.nginx.org' (default "")
--gatewayclass string The name of the GatewayClass resource. Every NGINX Gateway Fabric must have a unique corresponding GatewayClass resource. (default "")

failed to start control loop: cannot create and register metrics collectors: cannot create NGINX status metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
```

Note that the test occasionally passed, with the NGF Pod recovering gracefully.

Related to this issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108

## Future Improvements

- None