
Scale test #1115

Merged: 18 commits, Oct 10, 2023
Changes from 1 commit
Restructure to match longevity report
Kate Osborn committed Oct 10, 2023
commit 1f7a27120019dc5daedc3afda174126991a096ba
@@ -1,6 +1,47 @@
# Test Results Summary
# Results for v1.0.0

## Version 1.0
<!-- TOC -->
- [Results for v1.0.0](#results-for-v100)
- [Versions](#versions)
- [Tests](#tests)
- [Scale Listeners](#scale-listeners)
- [Scale HTTPS Listeners](#scale-https-listeners)
- [Scale HTTPRoutes](#scale-httproutes)
- [Scale Upstream Servers](#scale-upstream-servers)
- [Scale HTTP Matches](#scale-http-matches)
<!-- TOC -->

## Versions

NGF version:

```text
commit: "72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"
date: "2023-10-04T22:22:09Z"
version: "edge"
```


with NGINX:

```text
nginx/1.25.2
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.109+
```

Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"27",
GitVersion:"v1.27.6-gke.1248000",
GitCommit:"85a90ed8e702b392003d6757917e4cc167776e03",
GitTreeState:"clean", BuildDate:"2023-09-21T22:16:57Z",
GoVersion:"go1.20.8 X:boringcrypto", Compiler:"gc",
Platform:"linux/amd64"}
```

## Tests

### Scale Listeners

@@ -15,13 +56,13 @@
**Pod Restarts**: None.

**CPU**: Steep linear increase as NGF processed all the Services. Dropped off during scaling of Listeners.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/CPU.png).

**Memory**: Gradual increase in memory. Topped out at 40MiB.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/Memory.png).

**Time To Ready**: Time to ready numbers were consistently under 3s. The 62nd Listener had the longest TTR at 3.02s.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/TTR.png).

### Scale HTTPS Listeners

@@ -36,14 +77,14 @@ See [graph](/tests/scale/results/1.0/TestScale_Listeners/TTR.png).
**Pod Restarts**: None.

**CPU**: Steep linear increase as NGF processed all the Services and Secrets. Dropped off during scaling of Listeners.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/CPU.png).

**Memory**: Mostly linear increase, topping out right under 50MiB.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/Memory.png).

**Time To Ready**: The time to ready numbers were consistent (under 3s) except for one spike of 10s. I believe
this spike was client-side, because the NGF logs indicated that the reload completed in under 3s.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/TTR.png).

### Scale HTTPRoutes

@@ -58,10 +99,10 @@ See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/TTR.png).
**Pod Restarts**: None.

**CPU**: CPU mostly oscillated between 0.04 and 0.06, with several spikes over 0.06.
See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/CPU.png).

**Memory**: Memory usage gradually increased from 25MiB to 150MiB over the course of the test, with some spikes reaching up to
200MiB. See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/Memory.png).
200MiB. See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/Memory.png).

**Time To Ready**: This time to ready graph is unique because there are three plotted lines:

@@ -81,7 +122,7 @@ Related issues:
- https://github.com/nginxinc/nginx-gateway-fabric/issues/1013
- https://github.com/nginxinc/nginx-gateway-fabric/issues/825

See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/TTR.png).

### Scale Upstream Servers

@@ -96,10 +137,10 @@ See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/TTR.png).
**Pod Restarts**: None.

**CPU**: CPU steeply increases as NGF handles all the new Pods. Drops after they are processed.
See [graph](/tests/scale/results/1.0/TestScale_UpstreamServers/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_UpstreamServers/CPU.png).

**Memory**: Memory stays relatively flat and under 40MiB.
See [graph](/tests/scale/results/1.0/TestScale_UpstreamServers/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_UpstreamServers/Memory.png).

### Scale HTTP Matches

103 changes: 73 additions & 30 deletions tests/scale/README.md → tests/scale/scale.md
@@ -1,13 +1,51 @@
# Scale Tests

## Setup
This document describes how we scale test NGF.

<!-- TOC -->
- [Scale Tests](#scale-tests)
- [Goals](#goals)
- [Test Environment](#test-environment)
- [Steps](#steps)
- [Setup](#setup)
- [Run the tests](#run-the-tests)
- [Scale Listeners to Max of 64](#scale-listeners-to-max-of-64)
- [Scale HTTPS Listeners to Max of 64](#scale-https-listeners-to-max-of-64)
- [Scale HTTPRoutes](#scale-httproutes)
- [Scale Upstream Servers](#scale-upstream-servers)
- [Scale HTTP Matches](#scale-http-matches)
- [Analyze](#analyze)
- [Results](#results)
<!-- TOC -->

## Goals

- Measure how NGF performs as the number of Gateway API and referenced core Kubernetes resources is scaled.
- Test the following number of resources:
- Max number of HTTP and HTTPS Listeners (64)
- Max number of Upstream Servers (648)
- Max number of HTTPMatches
- 1000 HTTPRoutes

## Test Environment

For most of the tests, the following cluster will be sufficient:

- A Kubernetes cluster with 4 nodes on GKE
- Node: n2d-standard-8 (8 vCPU, 32GB memory)
- Enabled GKE logging

The Upstream Server scale test requires a bigger cluster to accommodate the large number of Pods. Those cluster details
are listed in the [Scale Upstream Servers](#scale-upstream-servers) test steps.
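
The following is one way to create such a cluster with `gcloud`. It is a minimal sketch: the cluster name is hypothetical, and the zone and GKE version are taken from the cluster details used elsewhere in this guide.

```console
gcloud container clusters create ngf-scale-test \
  --num-nodes 4 \
  --machine-type n2d-standard-8 \
  --zone us-west2-b \
  --cluster-version 1.27.5-gke.200 \
  --logging SYSTEM,WORKLOAD
```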


## Steps

- Create a GKE Cluster using the following details as a guide:
- 4 n2d-standard-8 nodes
- 32 vCPUs
- 128 GB total memory
- us-west2-b
- 1.27.5-gke.200
### Setup

- Install Gateway API Resources:

@@ -29,12 +67,16 @@
- Install Prometheus:

```console
kubectl apply -f prom-clusterrole.yaml
kubectl apply -f manifests/prom-clusterrole.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/prometheus --set useExistingClusterRoleName=prometheus -n prom
```
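
Before continuing, it can help to confirm that the Prometheus server Pod is running in the `prom` namespace used above:

```console
kubectl get pods -n prom
```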

- Create a directory under [results](/tests/scale/results) and name it after the version of NGF you are testing. Then
create a file for the result summary, also named after the NGF version. For
example: [1.0.0.md](/tests/scale/results/1.0.0/1.0.0.md).
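
For example, assuming you are testing version 1.0.0 and running the commands from the `tests/scale` directory, the layout could be created like this:

```console
mkdir -p results/1.0.0
touch results/1.0.0/1.0.0.md
```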

### Run the tests

#### Scale Listeners to Max of 64
@@ -66,7 +108,7 @@ Follow the steps below to run the test:
go test -v -tags scale -run TestScale_Listeners -i 64
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -116,7 +158,7 @@ Follow the steps below to run the test:
go test -v -tags scale -run TestScale_HTTPSListeners -i 64
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -166,7 +208,7 @@ Follow the steps below to run the test:
go test -v -tags scale -timeout 600m -run TestScale_HTTPRoutes -i 1000 -delay 2s
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -201,14 +243,13 @@ Total Resources Created:
- 1 HTTPRoute
- 1 Service, 1 Deployment, 648 Pods

For this test you must use a much bigger cluster in order to create 648 Pods. Use the following cluster details as a
guide:
Test Environment:

- 12 n2d-standard-16
- 192 vCPUs
- 768 GB total memory
- us-west2-b
- 1.27.6-gke.1248000
For this test you must use a much bigger cluster in order to create 648 Pods.

- A Kubernetes cluster with 12 nodes on GKE
- Node: n2d-standard-16 (16 vCPU, 64GB memory)
- Enabled GKE logging

Follow the steps below to run the test:

@@ -225,8 +266,7 @@ Follow the steps below to run the test:
kubectl describe httproute route
```

- Get the start time as a UNIX timestamp and record it in the
results [summary](/tests/scale/results/summary.md#upstream-servers):
- Get the start time as a UNIX timestamp and record it in the results.

```console
date +%s
@@ -265,9 +305,9 @@ Follow the steps below to run the test:
```

- In the terminal where you started the request loop, kill the loop if it's still running and check request.log to see if
any of the requests failed. Record any failures in the [summary](/tests/scale/results/summary.md#upstream-servers).
any of the requests failed. Record any failures in the results file.
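
  A minimal way to spot failures, assuming each line of request.log records the HTTP status of a response:

```console
grep -cv "200" request.log
```

  A count of `0` means every logged request returned a 200.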

- [Collect and Record Metrics](#collecting-metrics). Use the start time and end time you made note of earlier for the
- [Analyze](#analyze) the results. Use the start time and end time you made note of earlier for the
queries. You can calculate the test duration in seconds by subtracting the start time from the end time.
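
For example, with hypothetical start and end timestamps:

```console
START=1696543200   # recorded before the test
END=1696550400     # recorded after the test
echo $((END - START))   # test duration in seconds (7200 here)
```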

- Clean up:
@@ -327,15 +367,15 @@ Follow these steps to run the test:
./wrk -t2 -c10 -d30 http://cafe.example.com -H "header-50: header-50-val"
```

- Copy and paste the results to the [summary](/tests/scale/results/summary.md#scale-http-matches).
- Copy and paste the results into the results file.

- Clean up:

```console
kubectl delete -f manifests/scale-matches.yaml
```

### Collecting Metrics
### Analyze

- Query Prometheus for reload metrics. To access the Prometheus Server, run:

@@ -349,7 +389,7 @@ Follow these steps to run the test:

> Note:
> For the tests that write to a csv file, the `Test Start`, `Test End + 10s`, and `Duration` are at the
> end of the results.csv file in the results/<NGF_VERSION/<TEST_NAME> directory.
> end of the results.csv file in the `results/<NGF_VERSION>/<TEST_NAME>` directory.
> We are using `Test End + 10s` in the Prometheus query to account for the 10s scraping interval.
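
For example, to read those values for the Listeners test, assuming NGF version 1.0.0 and that you run the command from the `tests/scale` directory:

```console
tail -n 5 results/1.0.0/TestScale_Listeners/results.csv
```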

Total number of reloads:
@@ -371,7 +411,7 @@ Follow these steps to run the test:
rate(nginx_gateway_fabric_nginx_reloads_milliseconds_count[<Duration>] @ <Test End + 10s>)
```

Record these numbers in a table in the [results summary](/tests/scale/results/summary.md) doc.
Record these numbers in a table in the results file.

- Take screenshots of memory and CPU usage in GKE Dashboard

@@ -380,7 +420,7 @@ Follow these steps to run the test:

- Convert the `Start Time` and `End Time` UNIX timestamps to a date time using https://www.epochconverter.com/ (a command-line alternative is sketched after this list).
- Create a custom time frame for the graphs in GKE.
- Take a screenshot of the CPU and Memory graphs individually. Store them in the results/<NGF_VERSION>/<TEST_NAME>
- Take a screenshot of the CPU and Memory graphs individually. Store them in the `results/<NGF_VERSION>/<TEST_NAME>`
directory.
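
As a command-line alternative to the converter site, `date` can convert a UNIX timestamp directly (the timestamp shown is hypothetical):

```console
date -u -d @1696550400   # GNU date (Linux)
date -u -r 1696550400    # BSD date (macOS)
```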

- If the test writes time to ready numbers to a csv, create a time to ready graph.
@@ -391,8 +431,11 @@ Follow these steps to run the test:
- Set the Y axis to the Time to Ready column.
- Set the X axis to the number of resources column.
- Label the graph and take a screenshot.
- Store the graph in the results/<TEST_NAME> directory.
- Store the graph in the `results/<NGF_VERSION>/<TEST_NAME>` directory.

- Check for errors or restarts and record in the [results summary](/tests/scale/results/summary.md) doc. File a bug if
there's unexpected errors or restarts.
- Check for errors or restarts and record them in the results file. File a bug if there are unexpected errors or restarts.
- Check the NGINX config and make sure it looks correct. File a bug if there is an issue.
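
One way to inspect the generated configuration, assuming the default `nginx-gateway` namespace and an `nginx` container in the NGF Pod (adjust the names to your deployment):

```console
kubectl exec -n nginx-gateway <NGF_POD_NAME> -c nginx -- nginx -T
```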

### Results

- [1.0.0](/tests/scale/results/1.0.0/1.0.0.md)