Description
Bug Report
What did you do?
Install a HiveMQ Platform and HiveMQ Platform Operator Helm chart in a GKE Autopilot cluster.
What did you expect to see?
I expect a smooth reconciliation.
What did you see instead? Under which circumstances?
We see a constant mismatch of our StatefulSet resource, so it's updated on every reconciliation:
15:04:28.712 [INFO] c.h.p.o.d.StatefulSetResourceMatcher - Detected changes in StatefulSet specification:
Path: /spec/template/spec/containers/0/resources/limits/cpu
Actual value: "1"
Desired value: "1000m"
Path: /spec/template/spec/containers/0/resources/requests/cpu
Actual value: "1"
Desired value: "1000m"
(StatefulSetResourceMatcher extends SSABasedGenericKubernetesResourceMatcher and uses the internal, pruned actual and desired maps for the diff logging)
This mismatch should be prevented by the PodTemplateSpecSanitizer. The actual root cause of the mismatch is hidden due to an unlucky combination of resource requests/limits and the interference of GKE Autopilot:
- The HiveMQ Platform Helm chart configures cpu requests/limits of 1000m, which are serialized as 1 by K8s. So we rely on the PodTemplateSpecSanitizer in JOSDK to sanitize the actualMap and prevent false positive mismatches on our StatefulSet resource (see the sketch after this list).
- The HiveMQ Platform Helm chart doesn't configure ephemeral-storage requests/limits by default, but GKE Autopilot enforces them and updates our StatefulSet accordingly on the fly.
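The normalization of 1000m to 1 can be reproduced with the Fabric8 Quantity type directly; this is just a minimal sketch, assuming the Fabric8 kubernetes-model dependency is on the classpath:

import io.fabric8.kubernetes.api.model.Quantity;

// Minimal sketch: "1000m" and "1" describe the same CPU amount and only differ in
// their serialized form, which is why the desired spec says "1000m" while the
// cluster reports "1".
public class QuantityNormalization {
  public static void main(String[] args) {
    Quantity desired = new Quantity("1000m");
    Quantity actual = new Quantity("1");
    // getNumericalAmount() resolves both to the same numeric value
    System.out.println(desired.getNumericalAmount()); // 1.000
    System.out.println(actual.getNumericalAmount());  // 1
    System.out.println(desired.getNumericalAmount().compareTo(actual.getNumericalAmount()) == 0); // true
  }
}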
Under the hood we end up with these values in the matcher:
desired:
  resources:
    limits:
      cpu: 1000m
      memory: 2048M
    requests:
      cpu: 1000m
      memory: 2048M
actual:
  resources:
    limits:
      cpu: 1 # changed by K8s
      ephemeral-storage: 1Gi # added by GKE Autopilot
      memory: 2048M
    requests:
      cpu: 1 # changed by K8s
      ephemeral-storage: 1Gi # added by GKE Autopilot
      memory: 2048M

The size mismatch of the actual and desired maps triggers this early return in PodTemplateSpecSanitizer. So the cpu values are not sanitized and we end up with a false positive mismatch of the StatefulSet.
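To illustrate the effect of that size check, here is a simplified sketch (not the actual JOSDK code) using the values from above:

import io.fabric8.kubernetes.api.model.Quantity;
import java.util.Map;

// Simplified sketch of the size-based early return: sanitization only runs when the
// actual and desired resource maps have the same number of entries. With the
// Autopilot-injected ephemeral-storage entry the sizes differ (3 vs 2), so the cpu
// value "1" is never normalized against the desired "1000m".
public class EarlyReturnSketch {
  static boolean wouldSanitize(Map<String, Quantity> actual, Map<String, Quantity> desired) {
    // analogous to .filter(m -> m.size() == desiredResource.size())
    return actual.size() == desired.size();
  }

  public static void main(String[] args) {
    Map<String, Quantity> desired = Map.of(
        "cpu", new Quantity("1000m"),
        "memory", new Quantity("2048M"));
    Map<String, Quantity> actual = Map.of(
        "cpu", new Quantity("1"),
        "ephemeral-storage", new Quantity("1Gi"),
        "memory", new Quantity("2048M"));
    System.out.println(wouldSanitize(actual, desired)); // false -> cpu stays unsanitized
  }
}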
Since the desired state doesn't contain ephemeral-storage, there are no managed fields for this key in the requests/limits of our container. The SSABasedGenericKubernetesResourceMatcher therefore correctly prunes ephemeral-storage from the actual map, but in doing so it also hides the actual root cause of the wrong cpu mismatch. Even with debug logging, ephemeral-storage won't show up in the diff, because the diff is computed from the pruned actual map: var diff = getDiff(prunedActual, desiredMap, objectMapper);. The same applies to our custom logging, which also uses the pruned actual map.
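A rough illustration of that pruning effect (plain maps only, not the matcher's real implementation) shows why ephemeral-storage disappears before the diff is computed:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Rough sketch: only keys owned via managed fields survive the pruning, and the diff
// is computed on the pruned actual map. ephemeral-storage is removed before diffing,
// so it never shows up as the reason why the map sizes differed in the first place.
public class PruningSketch {
  public static void main(String[] args) {
    Map<String, String> actualLimits = new HashMap<>(Map.of(
        "cpu", "1",
        "ephemeral-storage", "1Gi",
        "memory", "2048M"));
    // ephemeral-storage has no managed-field entry, because it's not part of the desired state
    Set<String> managedKeys = Set.of("cpu", "memory");
    actualLimits.keySet().retainAll(managedKeys);
    System.out.println(actualLimits); // e.g. {cpu=1, memory=2048M} -> only the cpu mismatch is reported
  }
}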
Environment
Kubernetes cluster type: K8s 1.33.5 on GKE with Autopilot
java-operator-sdk version (from pom.xml):
5.1.4
$ java -version
openjdk version "21.0.8" 2025-07-15
OpenJDK Runtime Environment (build 21.0.8+9-Ubuntu-0ubuntu124.04.1)
OpenJDK 64-Bit Server VM (build 21.0.8+9-Ubuntu-0ubuntu124.04.1, mixed mode, sharing)
$ kubectl version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.33.5-gke.1080000
Possible Solution
The easiest solution would be to remove the early return: .filter(m -> m.size() == desiredResource.size()).
This shouldn't cost much performance, since there are still two more early returns before the equals() check that invokes the expensive getNumericalAmount().
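As a minimal sketch of the intended behavior (assuming the sanitizer keeps comparing quantities per desired key; this is not the actual JOSDK code), dropping the size check means keys that only exist in the actual map are simply ignored, while numerically equal quantities still get sanitized:

import io.fabric8.kubernetes.api.model.Quantity;
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposal: iterate over the desired keys only. Keys present solely in
// the actual map (e.g. the Autopilot-injected ephemeral-storage) are left untouched,
// while numerically equal quantities ("1" vs "1000m") are rewritten to the desired
// representation so the serialized forms no longer produce a false positive diff.
public class ProposedSanitizeSketch {
  static void sanitize(Map<String, Quantity> actual, Map<String, Quantity> desired) {
    desired.forEach((key, desiredQuantity) -> {
      Quantity actualQuantity = actual.get(key);
      if (actualQuantity != null && actualQuantity.equals(desiredQuantity)) {
        // Quantity.equals() compares the numerical amounts, so "1" equals "1000m" for cpu
        actual.put(key, desiredQuantity);
      }
    });
  }

  public static void main(String[] args) {
    Map<String, Quantity> desired = Map.of(
        "cpu", new Quantity("1000m"),
        "memory", new Quantity("2048M"));
    Map<String, Quantity> actual = new HashMap<>(Map.of(
        "cpu", new Quantity("1"),
        "ephemeral-storage", new Quantity("1Gi"),
        "memory", new Quantity("2048M")));
    sanitize(actual, desired);
    System.out.println(actual.get("cpu")); // 1000m -> no more false positive on cpu
  }
}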