
Scale test #1115

Merged: 18 commits, Oct 10, 2023
Changes from 1 commit
Restructure to match longevity report
Kate Osborn committed Oct 10, 2023
commit 1f7a27120019dc5daedc3afda174126991a096ba
@@ -1,6 +1,47 @@
# Test Results Summary
# Results for v1.0.0

## Version 1.0
<!-- TOC -->
- [Results for v1.0.0](#results-for-v100)
- [Versions](#versions)
- [Tests](#tests)
- [Scale Listeners](#scale-listeners)
- [Scale HTTPS Listeners](#scale-https-listeners)
- [Scale HTTPRoutes](#scale-httproutes)
- [Scale Upstream Servers](#scale-upstream-servers)
- [Scale HTTP Matches](#scale-http-matches)
<!-- TOC -->

## Versions

NGF version:

```text
commit: "72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"
date: "2023-10-04T22:22:09Z"
version: "edge"
```


with NGINX:

```text
nginx/1.25.2
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
OS: Linux 5.15.109+
```

Kubernetes:

```text
Server Version: version.Info{Major:"1", Minor:"27",
GitVersion:"v1.27.6-gke.1248000",
GitCommit:"85a90ed8e702b392003d6757917e4cc167776e03",
GitTreeState:"clean", BuildDate:"2023-09-21T22:16:57Z",
GoVersion:"go1.20.8 X:boringcrypto", Compiler:"gc",
Platform:"linux/amd64"}
```

## Tests

### Scale Listeners

@@ -15,13 +56,13 @@
**Pod Restarts**: None.

**CPU**: Steep linear increase as NGF processed all the Services. Dropped off during scaling of Listeners.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/CPU.png).

**Memory**: Gradual increase in memory. Topped out at 40MiB.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/Memory.png).

**Time To Ready**: Time to ready numbers were consistently under 3s. The 62nd Listener had the longest TTR at 3.02s.
See [graph](/tests/scale/results/1.0/TestScale_Listeners/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_Listeners/TTR.png).

### Scale HTTPS Listeners

@@ -36,14 +77,14 @@ See [graph](/tests/scale/results/1.0/TestScale_Listeners/TTR.png).
**Pod Restarts**: None.

**CPU**: Steep linear increase as NGF processed all the Services and Secrets. Dropped off during scaling of Listeners.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/CPU.png).

**Memory**: Mostly linear increase, topping out right under 50MiB.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/Memory.png).

**Time To Ready**: The time to ready numbers were consistent (under 3s) except for one spike of 10s. I believe
this spike was client-side, because the NGF logs indicated that the reload completed in under 3s.
See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPSListeners/TTR.png).

### Scale HTTPRoutes

@@ -58,10 +99,10 @@ See [graph](/tests/scale/results/1.0/TestScale_HTTPSListeners/TTR.png).
**Pod Restarts**: None.

**CPU**: CPU mostly oscillated between 0.04 and 0.06, with several spikes over 0.06.
See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/CPU.png).

**Memory**: Memory usage gradually increased from 25MiB to 150MiB over the course of the test, with some spikes reaching up to
200MiB. See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/Memory.png).
200MiB. See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/Memory.png).

**Time To Ready**: This time to ready graph is unique because there are three plotted lines:

@@ -81,7 +122,7 @@ Related issues:
- https://github.com/nginxinc/nginx-gateway-fabric/issues/1013
- https://github.com/nginxinc/nginx-gateway-fabric/issues/825

See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/TTR.png).
See [graph](/tests/scale/results/1.0.0/TestScale_HTTPRoutes/TTR.png).

### Scale Upstream Servers

@@ -96,10 +137,10 @@ See [graph](/tests/scale/results/1.0/TestScale_HTTPRoutes/TTR.png).
**Pod Restarts**: None.

**CPU**: CPU steeply increases as NGF handles all the new Pods. Drops after they are processed.
See [graph](/tests/scale/results/1.0/TestScale_UpstreamServers/CPU.png).
See [graph](/tests/scale/results/1.0.0/TestScale_UpstreamServers/CPU.png).

**Memory**: Memory stays relatively flat and under 40MiB.
See [graph](/tests/scale/results/1.0/TestScale_UpstreamServers/Memory.png).
See [graph](/tests/scale/results/1.0.0/TestScale_UpstreamServers/Memory.png).

### Scale HTTP Matches

103 changes: 73 additions & 30 deletions tests/scale/README.md → tests/scale/scale.md
@@ -1,13 +1,51 @@
# Scale Tests

## Setup
This document describes how we scale test NGF.

<!-- TOC -->
- [Scale Tests](#scale-tests)
- [Goals](#goals)
- [Test Environment](#test-environment)
- [Steps](#steps)
- [Setup](#setup)
- [Run the tests](#run-the-tests)
- [Scale Listeners to Max of 64](#scale-listeners-to-max-of-64)
- [Scale HTTPS Listeners to Max of 64](#scale-https-listeners-to-max-of-64)
- [Scale HTTPRoutes](#scale-httproutes)
- [Scale Upstream Servers](#scale-upstream-servers)
- [Scale HTTP Matches](#scale-http-matches)
- [Analyze](#analyze)
- [Results](#results)
<!-- TOC -->

## Goals

- Measure how NGF performs as the number of Gateway API and referenced core Kubernetes resources is scaled.
- Test the following number of resources:
- Max number of HTTP and HTTPS Listeners (64)
- Max number of Upstream Servers (648)
- Max number of HTTPMatches
- 1000 HTTPRoutes

## Test Environment

For most of the tests, the following cluster will be sufficient:

- A Kubernetes cluster with 4 nodes on GKE
- Node: n2d-standard-8 (8 vCPU, 32GB memory)
- Enabled GKE logging

The Upstream Server scale test requires a bigger cluster to accommodate the large number of Pods. Those cluster details
are listed in the [Scale Upstream Servers](#scale-upstream-servers) test steps.
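
The following is one way to create such a cluster with `gcloud`. It is a minimal sketch: the cluster name is hypothetical, and the zone and GKE version are taken from the cluster details used elsewhere in this guide.

```console
gcloud container clusters create ngf-scale-test \
  --num-nodes 4 \
  --machine-type n2d-standard-8 \
  --zone us-west2-b \
  --cluster-version 1.27.5-gke.200 \
  --logging SYSTEM,WORKLOAD
```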


## Steps

- Create a GKE Cluster using the following details as a guide:
- 4 n2d-standard-8 nodes
- 32 vCPUs
- 128 GB total memory
- us-west2-b
- 1.27.5-gke.200
### Setup

- Install Gateway API Resources:

@@ -29,12 +67,16 @@
- Install Prometheus:

```console
kubectl apply -f prom-clusterrole.yaml
kubectl apply -f manifests/prom-clusterrole.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/prometheus --set useExistingClusterRoleName=prometheus -n prom
```
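
Before continuing, it can help to confirm that the Prometheus server Pod is running in the `prom` namespace used above:

```console
kubectl get pods -n prom
```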

- Create a directory under [results](/tests/scale/results) and name it after the version of NGF you are testing. Then
create a file for the result summary, also named after the NGF version. For
example: [1.0.0.md](/tests/scale/results/1.0.0/1.0.0.md).
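
For example, assuming you are testing version 1.0.0 and running the commands from the `tests/scale` directory, the layout could be created like this:

```console
mkdir -p results/1.0.0
touch results/1.0.0/1.0.0.md
```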

### Run the tests

#### Scale Listeners to Max of 64
@@ -66,7 +108,7 @@ Follow the steps below to run the test:
go test -v -tags scale -run TestScale_Listeners -i 64
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -116,7 +158,7 @@ Follow the steps below to run the test:
go test -v -tags scale -run TestScale_HTTPSListeners -i 64
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -166,7 +208,7 @@ Follow the steps below to run the test:
go test -v -tags scale -timeout 600m -run TestScale_HTTPRoutes -i 1000 -delay 2s
```

- [Collect and Record Metrics](#collecting-metrics).
- [Analyze](#analyze) the results.

- Clean up:

@@ -201,14 +243,13 @@ Total Resources Created:
- 1 HTTPRoute
- 1 Service, 1 Deployment, 648 Pods

For this test you must use a much bigger cluster in order to create 648 Pods. Use the following cluster details as a
guide:
Test Environment:

- 12 n2d-standard-16
- 192 vCPUs
- 768 GB total memory
- us-west2-b
- 1.27.6-gke.1248000
For this test you must use a much bigger cluster in order to create 648 Pods.

- A Kubernetes cluster with 12 nodes on GKE
- Node: n2d-standard-16 (16 vCPU, 64GB memory)
- Enabled GKE logging

Follow the steps below to run the test:

@@ -225,8 +266,7 @@ Follow the steps below to run the test:
kubectl describe httproute route
```

- Get the start time as a UNIX timestamp and record it in the
results [summary](/tests/scale/results/summary.md#upstream-servers):
- Get the start time as a UNIX timestamp and record it in the results.

```console
date +%s
@@ -265,9 +305,9 @@ Follow the steps below to run the test:
```

- In the terminal where you started the request loop, kill the loop if it's still running and check request.log to see if
any of the requests failed. Record any failures in the [summary](/tests/scale/results/summary.md#upstream-servers).
any of the requests failed. Record any failures in the results file.
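
  A minimal way to spot failures, assuming each line of request.log records the HTTP status of a response:

```console
grep -cv "200" request.log
```

  A count of `0` means every logged request returned a 200.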

- [Collect and Record Metrics](#collecting-metrics). Use the start time and end time you made note of earlier for the
- [Analyze](#analyze) the results. Use the start time and end time you made note of earlier for the
queries. You can calculate the test duration in seconds by subtracting the start time from the end time.
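
For example, with hypothetical start and end timestamps:

```console
START=1696543200   # recorded before the test
END=1696550400     # recorded after the test
echo $((END - START))   # test duration in seconds (7200 here)
```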

- Clean up:
@@ -327,15 +367,15 @@ Follow these steps to run the test:
./wrk -t2 -c10 -d30 http://cafe.example.com -H "header-50: header-50-val"
```

- Copy and paste the results to the [summary](/tests/scale/results/summary.md#scale-http-matches).
- Copy and paste the results into the results file.

- Clean up:

```console
kubectl delete -f manifests/scale-matches.yaml
```

### Collecting Metrics
### Analyze

- Query Prometheus for reload metrics. To access the Prometheus Server, run:

@@ -349,7 +389,7 @@ Follow these steps to run the test:

> Note:
> For the tests that write to a csv file, the `Test Start`, `Test End + 10s`, and `Duration` are at the
> end of the results.csv file in the results/<NGF_VERSION/<TEST_NAME> directory.
> end of the results.csv file in the `results/<NGF_VERSION>/<TEST_NAME>` directory.
> We are using `Test End + 10s` in the Prometheus query to account for the 10s scraping interval.
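
For example, to read those values for the Listeners test, assuming NGF version 1.0.0 and that you run the command from the `tests/scale` directory:

```console
tail -n 5 results/1.0.0/TestScale_Listeners/results.csv
```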

Total number of reloads:
@@ -371,7 +411,7 @@ Follow these steps to run the test:
rate(nginx_gateway_fabric_nginx_reloads_milliseconds_count[<Duration>] @ <Test End + 10s>)
```

Record these numbers in a table in the [results summary](/tests/scale/results/summary.md) doc.
Record these numbers in a table in the results file.

- Take screenshots of memory and CPU usage in GKE Dashboard

@@ -380,7 +420,7 @@ Follow these steps to run the test:

- Convert the `Start Time` and `End Time` UNIX timestamps to a date time using https://www.epochconverter.com/ (a command-line alternative is sketched after this list).
- Create a custom time frame for the graphs in GKE.
- Take a screenshot of the CPU and Memory graphs individually. Store them in the results/<NGF_VERSION>/<TEST_NAME>
- Take a screenshot of the CPU and Memory graphs individually. Store them in the `results/<NGF_VERSION>/<TEST_NAME>`
directory.
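
As a command-line alternative to the converter site, `date` can convert a UNIX timestamp directly (the timestamp shown is hypothetical):

```console
date -u -d @1696550400   # GNU date (Linux)
date -u -r 1696550400    # BSD date (macOS)
```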

- If the test writes time to ready numbers to a csv, create a time to ready graph.
@@ -391,8 +431,11 @@ Follow these steps to run the test:
- Set the Y axis to the Time to Ready column.
- Set the X axis to the number of resources column.
- Label the graph and take a screenshot.
- Store the graph in the results/<TEST_NAME> directory.
- Store the graph in the `results/<NGF_VERSION>/<TEST_NAME>` directory.

- Check for errors or restarts and record in the [results summary](/tests/scale/results/summary.md) doc. File a bug if
there's unexpected errors or restarts.
- Check for errors or restarts and record them in the results file. File a bug if there are unexpected errors or restarts.
- Check the NGINX config and make sure it looks correct. File a bug if there is an issue.
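
One way to inspect the generated configuration, assuming the default `nginx-gateway` namespace and an `nginx` container in the NGF Pod (adjust the names to your deployment):

```console
kubectl exec -n nginx-gateway <NGF_POD_NAME> -c nginx -- nginx -T
```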

### Results

- [1.0.0](/tests/scale/results/1.0.0/1.0.0.md)