Don't send any telemetry data if telemetry collection fails #1729

Closed
pleshakov opened this issue Mar 19, 2024 · 0 comments · Fixed by #1731

@pleshakov
Contributor

Describe the bug
When NGF fails to collect product telemetry, it sends empty telemetry data.

To Reproduce

cd tests
make create-kind-cluster
make build-images load-images TAG=$(whoami) TELEMETRY_ENDPOINT=otel-collector-opentelemetry-collector.collector.svc.cluster.local:4317 TELEMETRY_ENDPOINT_INSECURE=true
helm install otel-collector open-telemetry/opentelemetry-collector -f suite/manifests/telemetry/collector-values.yaml -n collector --create-namespace

Deploy NGF:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml

cd ..

helm install my-release ./deploy/helm-chart --create-namespace --wait --set service.type=NodePort --set nginxGateway.image.repository=nginx-gateway-fabric --set nginxGateway.image.tag=$(whoami) --set nginxGateway.image.pullPolicy=Never --set nginx.image.repository=nginx-gateway-fabric/nginx --set nginx.image.tag=$(whoami) --set nginx.image.pullPolicy=Never -n nginx-gateway

Edit the NGF cluster role and remove the RBAC rule that allows listing nodes:

kubectl edit clusterrole my-release-nginx-gateway-fabric

Remove:

- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list

Delete the NGF pod so that a new one is created:

kubectl -n nginx-gateway delete pod <pod-name>

Look at the NGF pod logs. Telemetry collection should fail because of the RBAC change:

{"level":"error","ts":"2024-03-19T21:21:12Z","logger":"telemetryJob","msg":"Failed to collect telemetry data","error":"failed to collect cluster information: failed to get NodeList: nodes is forbidden: User \"system:serviceaccount:nginx-gateway:my-release-nginx-gateway-fabric\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.createTelemetryJob.CreateTelemetryJobWorker.func4\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/mode/static/telemetry/job_worker.go:29\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\tk8s.io/[email protected]/pkg/util/wait/backoff.go:259\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables.(*CronJob).Start\n\tgithub.com/nginxinc/nginx-gateway-fabric/internal/framework/runnables/cronjob.go:53\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\tsigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}

Look at the collector logs:

kubectl -n collector logs <otel-collector-pod-name> | grep "dataType: Str(ngf-product-telemetry)" -A 19
     -> dataType: Str(ngf-product-telemetry)
     -> ImageSource: Str(local)
     -> ProjectName: Str(NGF)
     -> ProjectVersion: Str(edge)
     -> ProjectArchitecture: Str(amd64)
     -> ClusterID: Str(ced72774-ef05-403c-9a91-2acffc9c386f)
     -> ClusterVersion: Str(1.29.2)
     -> ClusterPlatform: Str(kind)
     -> InstallationID: Str(43a0a1be-919c-417b-b85e-782adb1e3f39)
     -> ClusterNodeCount: Int(1)
     -> FlagNames: Slice(["config","gateway","gateway-api-experimental-features","gateway-ctlr-name","gatewayclass","health-disable","health-port","help","leader-election-disable","leader-election-lock-name","metrics-disable","metrics-port","metrics-secure-serving","nginx-plus","product-telemetry-disable","service","update-gatewayclass-status","usage-report-cluster-name","usage-report-secret","usage-report-server-url","usage-report-skip-verify"])
     -> FlagValues: Slice(["user-defined","default","false","user-defined","user-defined","false","default","false","false","user-defined","false","default","false","false","false","user-defined","true","default","default","default","false"])
     -> GatewayCount: Int(0)
     -> GatewayClassCount: Int(1)
     -> HTTPRouteCount: Int(0)
     -> SecretCount: Int(0)
     -> ServiceCount: Int(0)
     -> EndpointCount: Int(0)
     -> NGFReplicaCount: Int(1)
        {"kind": "exporter", "data_type": "traces", "name": "debug"}
--
     -> dataType: Str(ngf-product-telemetry)
     -> ImageSource: Str()
     -> ProjectName: Str()
     -> ProjectVersion: Str()
     -> ProjectArchitecture: Str()
     -> ClusterID: Str()
     -> ClusterVersion: Str()
     -> ClusterPlatform: Str()
     -> InstallationID: Str()
     -> ClusterNodeCount: Int(0)
     -> FlagNames: Slice([])
     -> FlagValues: Slice([])
     -> GatewayCount: Int(0)
     -> GatewayClassCount: Int(0)
     -> HTTPRouteCount: Int(0)
     -> SecretCount: Int(0)
     -> ServiceCount: Int(0)
     -> EndpointCount: Int(0)
     -> NGFReplicaCount: Int(0)
        {"kind": "exporter", "data_type": "traces", "name": "debug"}

Note that the second report, sent by the new pod, contains only empty values.

Expected behavior

If telemetry collection fails, NGF should not send any telemetry data.

Your environment

  • NGF Edge version


@pleshakov pleshakov added the bug Something isn't working label Mar 19, 2024
@sindhushiv sindhushiv added this to the v1.2.0 milestone Mar 19, 2024
@bjee19 bjee19 self-assigned this Mar 19, 2024