Commit 3cab370

Add results of graceful recovery test (#1341)
Problem: We need to run the graceful recovery tests for release 1.1.

Solution: Run the graceful recovery tests and record the results.
1 parent 44f0fae commit 3cab370

2 files changed, +158 -30 lines changed

tests/graceful-recovery/graceful-recovery.md

+19 -30
@@ -34,18 +34,18 @@ Ensure that NGF can recover gracefully from container failures without any user
 3. Check out the latest tag (unless you are installing the edge version from the main branch).
 4. Go into `deploy/manifests/nginx-gateway.yaml` and change `runAsNonRoot` from `true` to `false`.
 This allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container.
-5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/docs/installation.md)
+5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/site/content/installation/installing-ngf/manifests.md)
 to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
 6. In a separate terminal track NGF logs.

 ```console
-kubectl -n nginx-gateway logs -f deploy/nginx-gateway
+kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx-gateway
 ```

 7. In a separate terminal track NGINX container logs.

 ```console
-kubectl -n nginx-gateway logs -f <NGF_POD> -c nginx
+kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx
 ```

 8. In a separate terminal Exec into the NGINX container inside the NGF pod.
@@ -56,9 +56,7 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan

 9. In a different terminal, deploy the
 [https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
-10. Inside the NGINX container, navigate to `/etc/nginx/conf.d` and check `http.conf` and `config-version.config` to see
-if the configuration and version were correctly updated.
-11. Send traffic through the example application and ensure it is working correctly.
+10. Send traffic through the example application and ensure it is working correctly.

 ### Run the tests

@@ -80,25 +78,22 @@ if the configuration and version were correctly updated.
 4. Check for errors in the NGF and NGINX container logs.
 5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
 6. Open up the NGF and NGINX container logs and check for errors.
-7. Inside the NGINX container, check that `http.conf` was not changed and `config-version.conf` had its version set to `2`.
-8. Send traffic through the example application and ensure it is working correctly.
-9. Check that NGF can still process changes of resources.
+7. Send traffic through the example application and ensure it is working correctly.
+8. Check that NGF can still process changes of resources.
 1. Delete the HTTPRoute resources.

 ```console
 kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
 ```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

 ```console
 kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
 ```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

 #### Restart NGINX container

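The steps in the hunk above repeatedly ask the tester to wait for a container to come back up and then confirm that traffic flows. One way this could be scripted is a small polling helper that wraps whatever probe command the tester already uses. This is only a sketch: `retry_until` and its arguments are hypothetical names and are not part of the NGF repository.

```shell
# Hypothetical helper (not part of the NGF repository): run a probe command
# until it succeeds or the attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
retry_until() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0   # probe succeeded
    fi
    # Sleep only if another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1       # probe never succeeded
}
```

The probe itself is left to the caller, for example the request documented for the https-termination example. For the "ensure traffic does not flow" steps, the same probe would be expected to fail instead.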
@@ -113,24 +108,21 @@ if the configuration and version were correctly updated.

 4. When NGINX container is back up, ensure traffic flows through the example application correctly.
 5. Open up the NGINX container logs and check for errors.
-6. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-7. Check that NGF can still process changes of resources.
+6. Check that NGF can still process changes of resources.
 1. Delete the HTTPRoute resources.

 ```console
 kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
 ```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

 ```console
 kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
 ```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

 #### Restart Node with draining

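Several steps above ask the tester to open the NGINX container logs and check for errors. A minimal sketch of how that check could be made less manual: a filter that reads NGINX error-log lines on stdin and counts those tagged with severity `error`, `crit`, `alert`, or `emerg`. The function name `count_nginx_errors` is hypothetical, and the match is a plain text search, so a message that happens to contain one of these tags in its body would also be counted.

```shell
# Hypothetical helper (not part of the NGF repository): read NGINX error-log
# lines on stdin and print how many carry severity error, crit, alert, or emerg.
count_nginx_errors() {
  # grep -c prints the number of matching lines; it exits with status 1 when
  # that number is 0, so the status is masked to keep "no errors" a success.
  grep -c -E '\[(error|crit|alert|emerg)\]' || true
}
```

It could be fed from the log command already used in this procedure, without `-f` so the stream ends: `kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx | count_nginx_errors`.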
@@ -156,26 +148,23 @@ if the configuration and version were correctly updated.
 docker restart kind-control-plane
 ```

-7. Open up both NGF and NGINX container logs and check for errors.
-8. Exec back into the NGINX container and check that `http.conf` and `config-version.conf` were not changed.
-9. Send traffic through the example application and ensure it is working correctly.
-10. Check that NGF can still process changes of resources.
+7. Check the logs of the old and new NGF and NGINX containers for errors.
+8. Send traffic through the example application and ensure it is working correctly.
+9. Check that NGF can still process changes of resources.
 1. Delete the HTTPRoute resources.

 ```console
 kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
 ```

-2. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-3. Send traffic through the example application using the updated resources and ensure traffic does not flow.
-4. Apply the HTTPRoute resources.
+2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
+3. Apply the HTTPRoute resources.

 ```console
 kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
 ```

-5. Inside the NGINX container, check that `http.conf` and `config-version.conf` were correctly updated.
-6. Send traffic through the example application using the updated resources and ensure traffic flows correctly.
+4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

 #### Restart Node without draining

@@ -0,0 +1,139 @@
+# Results for v1.1.0
+
+<!-- TOC -->
+- [Results for v1.1.0](#results-for-v110)
+- [Summary](#summary)
+- [Versions](#versions)
+- [Tests](#tests)
+- [Restart nginx-gateway container](#restart-nginx-gateway-container)
+- [Restart NGINX container](#restart-nginx-container)
+- [Restart Node with draining](#restart-node-with-draining)
+- [Restart Node without draining](#restart-node-without-draining)
+- [Future Improvements](#future-improvements)
+<!-- TOC -->
+
+
+## Summary
+
+- No new issues since 1.0.
+- One new error in the [Restart Node with draining](#restart-node-with-draining) test, but it is not actionable.
+
+## Versions
+
+NGF version:
+
+
+```text
+commit: d6bbdba28a0f9ae3f75864855b76b0fb34bee3e5
+date: 2023-12-05T18:43:51Z
+version: edge
+```
+
+with NGINX:
+
+```text
+nginx/1.25.3
+built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
+OS: Linux 5.15.49-linuxkit-pr
+```
+
+
+Kubernetes:
+
+```text
+Server Version: version.Info{Major:"1", Minor:"28",
+GitVersion:"v1.28.0",
+GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d",
+GitTreeState:"clean", BuildDate:"2023-08-15T21:26:40Z",
+GoVersion:"go1.20.7", Compiler:"gc",
+Platform:"linux/arm64"}
+```
+
+## Tests
+
+### Restart nginx-gateway container
+
+No errors.
+
+### Restart NGINX container
+
+The NGF Pod was unable to recover after sending a SIGKILL signal to the NGINX master process.
+The following appeared in the NGINX logs:
+
+```text
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-config-version.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/lib/nginx/nginx-500-server.sock failed (98: Address in use)
+2023/12/05 22:18:45 [emerg] 116#116: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 22:18:45 [notice] 116#116: try again to bind() after 500ms
+```
+
+NGF cannot update NGINX after this and logs the following error:
+
+```text
+{
+"level": "error",
+"ts": "2023-12-05T22:19:53Z",
+"logger": "eventLoop.eventHandler",
+"msg": "Failed to update NGINX configuration",
+"batchID": 22,
+"error": "failed to reload NGINX: open /proc/19/task/19/children: no such file or directory",
+"stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
+}
+```
+
+Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
+
+
+### Restart Node with draining
+
+Previous NGF container error:
+
+```json
+{
+"level": "error",
+"ts": "2023-12-05T21:43:31Z",
+"logger": "eventLoop.eventHandler",
+"msg": "Failed to update NGINX configuration",
+"batchID": 11,
+"error": "failed to reload NGINX: could not get expected config version 7: error getting client: Get \"http://config-version/version\": dial unix /var/run/nginx/nginx-config-version.sock: connect: no such file or directory",
+"stacktrace": "github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:116\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"
+}
+```
+
+This error is likely due to NGINX terminating during a reload attempt and does not consistently occur on a node restart.
+
+No errors in previous NGINX container.
+No errors in new NGF/NGINX containers.
+
+### Restart Node without draining
+
+The NGF Pod was unable to recover the majority of times after running `docker restart kind-control-plane`.
+
+The following appeared in the NGINX logs:
+
+```text
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use)
+2023/12/05 21:53:51 [notice] 29#29: try again to bind() after 500ms
+2023/12/05 21:53:51 [emerg] 29#29: still could not bind()
+```
+
+The following appeared in the NGF logs:
+
+```text
+failed to start control loop: cannot create nginx metrics collector: failed to get http://config-status/stub_status: Get "http://config-status/stub_status": dial unix /var/run/nginx/nginx-status.sock: connect: connection refused
+```
+
+Known issue: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108
+
+## Future Improvements
+
+- None
