Add results of graceful recovery test #1341

Merged: 3 commits, Dec 6, 2023
49 changes: 19 additions & 30 deletions tests/graceful-recovery/graceful-recovery.md
@@ -34,18 +34,18 @@ Ensure that NGF can recover gracefully from container failures without any user
3. Check out the latest tag (unless you are installing the edge version from the main branch).
4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container.
-5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
+5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/site/content/installation/installing-ngf/manifests.md)
to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
6. In a separate terminal track NGF logs.

```console
-kubectl -n nginx-gateway logs -f deploy/nginx-gateway
+kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx-gateway
```

7. In a separate terminal track NGINX container logs.

```console
-kubectl -n nginx-gateway logs -f <NGF_POD> -c nginx
+kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx
```

8. In a separate terminal Exec into the NGINX container inside the NGF pod.
@@ -56,9 +56,7 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan

9. In a different terminal, deploy the
[https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
-10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.config` to see
-if the configuration and version were correctly updated.
-11. Send traffic through the example application and ensure it is working correctly.
+10. Send traffic through the example application and ensure it is working correctly.
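
The traffic check in the last setup step can be scripted. The sketch below is only an illustration: it assumes the https-termination example serves the hostname `cafe.example.com` with a self-signed certificate, and that `GW_IP` and `GW_HTTPS_PORT` hold the address and port of the LoadBalancer Service. The function names are hypothetical and not part of NGF.

```shell
# Hypothetical helper: succeed only for a 2xx HTTP status code.
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the HTTP status code returned for a path.
# Assumes GW_IP and GW_HTTPS_PORT are set by the caller, and that the
# example uses a self-signed certificate (hence --insecure).
# curl prints 000 when the connection itself fails.
request_status() {
  curl --silent --insecure --output /dev/null --write-out '%{http_code}' \
    --max-time 5 \
    --resolve "cafe.example.com:${GW_HTTPS_PORT}:${GW_IP}" \
    "https://cafe.example.com:${GW_HTTPS_PORT}$1"
}
```

For example, `is_success_status "$(request_status /coffee)" && echo "traffic flows"` reports success only when the request is answered with a 2xx code; the `/coffee` path is likewise an assumption about the example application.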

### Run the tests

@@ -80,25 +78,22 @@ if the configuration and version were correctly updated.
4. Check for errors in the NGF and NGINX container logs.
5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
6. Open up the NGF and NGINX container logs and check for errors.
-7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
-8. Send traffic through the example application and ensure it is working correctly.
-9. Check that NGF can still process changes of resources.
+7. Send traffic through the example application and ensure it is working correctly.
+8. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
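
The delete/apply cycle in the steps above can be sketched as a single function. This is a rough illustration rather than part of the test plan: `verify_route_cycle`, `traffic_flows`, and `SETTLE_SECONDS` are hypothetical names, `traffic_flows` must be supplied by the caller as a command that succeeds when the example application answers, and the settle delay is an assumed allowance for NGF to apply each change.

```shell
# Manifest path taken from the steps above.
ROUTES=../../examples/https-termination/cafe-routes.yaml

# Hypothetical sketch: delete the HTTPRoutes, expect traffic to stop,
# re-apply them, expect traffic to flow again.
# traffic_flows is a caller-supplied command (assumption).
verify_route_cycle() {
  kubectl delete -f "$ROUTES" || return 1
  sleep "${SETTLE_SECONDS:-5}"   # assumed time for NGF to update NGINX
  if traffic_flows; then
    echo "FAIL: traffic still flows after deleting routes"
    return 1
  fi
  kubectl apply -f "$ROUTES" || return 1
  sleep "${SETTLE_SECONDS:-5}"
  if ! traffic_flows; then
    echo "FAIL: traffic does not flow after applying routes"
    return 1
  fi
  echo "OK: NGF processed both changes"
}
```

A fixed delay is the simplest choice here; polling `traffic_flows` with a timeout would be less prone to flakiness but is left out to keep the sketch short.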

#### Restart NGINX container

@@ -113,24 +108,21 @@ if the configuration and version were correctly updated.

4. When NGINX container is back up, ensure traffic flows through the example application correctly.
5. Open up the NGINX container logs and check for errors.
-6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-7. Check that NGF can still process changes of resources.
+6. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart Node with draining

@@ -156,26 +148,23 @@ if the configuration and version were correctly updated.
docker restart kind-control-plane
```

-7. Open up both NGF and NGINX container logs and check for errors.
-8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-9. Send traffic through the example application and ensure it is working correctly.
-10. Check that NGF can still process changes of resources.
+7. Check the logs of the old and new NGF and NGINX containers for errors.
+8. Send traffic through the example application and ensure it is working correctly.
+9. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart Node without draining

139 changes: 139 additions & 0 deletions tests/graceful-recovery/results/1.1.0/1.1.0.md
@@ -0,0 +1,139 @@
# Results for v1.1.0

<!-- TOC -->
- [Results for v1.1.0](#results-for-v110)
  - [Summary](#summary)
  - [Versions](#versions)
  - [Tests](#tests)
    - [Restart nginx-gateway container](#restart-nginx-gateway-container)
    - [Restart NGINX container](#restart-nginx-container)
    - [Restart Node with draining](#restart-node-with-draining)
    - [Restart Node without draining](#restart-node-without-draining)
  - [Future Improvements](#future-improvements)
<!-- TOC -->


## Summary

- No new issues since 1.0.
- One new error in the [Restart Node with draining](#restart-node-with-draining) test, but it is not actionable.

## Versions

NGF version:


```text
commit: d6bbdba28a0f9ae3f75864855b76b0fb34bee3e5
date: 2023-12-05T18:43:51Z
version: edge
```

with NGINX:

```text
nginx/1.25.3
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.49-linuxkit-pr
```


Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"28",
GitVersion:"v1.28.0",
GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
GoVersion:"go1.20.7", Compiler:"gc",
Platform:"linux/arm64"}
```

## Tests

### Restart nginx-gateway container

No errors.

### Restart NGINX container

The NGF Pod was unable to recover after sending a SIGKILL signal to the NGINX master process.
The following appeared in the NGINX logs:

```text
2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 22:18:45 [notice] 116#116: try again to bind() after 500ms
```

NGF cannot update NGINX after this and logs the following error:

```json
{
  "level": "error",
  "ts": "2023-12-05T22:19:53Z",
  "logger": "eventLoop.eventHandler",
  "msg": "Failed to update NGINX configuration",
  "batchID": 22,
  "error": "failed to reload NGINX: open /proc/19/task/19/children: no such file or directory",
  "stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
}
```

Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
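
To see at a glance which sockets were affected, the `[emerg]` lines shown above can be reduced to a list of socket paths. The helper below is a minimal sketch with a hypothetical name; it only assumes the log line format shown in this section.

```shell
# Hypothetical helper: read NGINX log lines on stdin and print each
# unix socket path that failed to bind with "Address in use", once.
failed_bind_sockets() {
  grep -F 'failed (98: Address in use)' \
    | sed -n 's/.*bind() to unix:\([^ ]*\) failed.*/\1/p' \
    | sort -u
}
```

It can be fed from the NGINX container logs, for example `kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx | failed_bind_sockets`.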


### Restart Node with draining

Previous NGF container error:

```json
{
  "level": "error",
  "ts": "2023-12-05T21:43:31Z",
  "logger": "eventLoop.eventHandler",
  "msg": "Failed to update NGINX configuration",
  "batchID": 11,
  "error": "failed to reload NGINX: could not get expected config version 7: error getting client: Get \"http://config-version/version\": dial unix /var/run/nginx/nginx-config-version.sock: connect: no such file or directory",
  "stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
}
```

This error is likely due to NGINX terminating during a reload attempt and does not consistently occur on a node restart.

No errors in previous NGINX container.
No errors in new NGF/NGINX containers.
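
Error entries like the one above can be picked out of the NGF logs with a small filter. This is a sketch under the assumption that NGF emits one JSON object per log line with a `level` field, as the entry above suggests; the function name is hypothetical.

```shell
# Hypothetical helper: read NGF log lines on stdin (one JSON object per
# line is assumed) and print only the error-level entries.
# "|| true" keeps the exit status at 0 when no errors are found.
ngf_error_entries() {
  grep -E '"level": ?"error"' || true
}
```

For example, `kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx-gateway | ngf_error_entries` prints nothing when the container logged no errors.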

### Restart Node without draining

In the majority of runs, the NGF Pod was unable to recover after running `docker restart kind-control-plane`.

The following appeared in the NGINX logs:

```text
2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
2023/12/05 21:53:51 [emerg] 29#29: still could not bind()
```

The following appeared in the NGF logs:

```text
failed to start control loop: cannot create nginx metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
```

Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
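
Whether a given run hit this failure can be decided from the NGINX logs alone, since NGINX logs `still could not bind()` once its retries are exhausted. The check below is a minimal sketch with a hypothetical function name.

```shell
# Hypothetical helper: read NGINX log lines on stdin; succeed if NGINX
# exhausted its bind() retries ("still could not bind()").
nginx_gave_up_binding() {
  grep -qF 'still could not bind()'
}
```

Used as `kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx | nginx_gave_up_binding && echo "NGINX did not recover"`, it makes it easier to tally how many restarts out of several attempts failed.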

## Future Improvements

- None