feat: add option to enable ApiTracer #3095
Conversation
Adds an option to enable an ApiTracer for the client. An ApiTracer adds traces for all RPCs that are executed. These traces are only exported if an OpenTelemetry or OpenCensus trace exporter has been configured. An ApiTracer adds detailed information about the time that each RPC took, and can also be used to determine whether an RPC was retried.
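As an illustration of how this could be enabled: the following is a minimal sketch, assuming the `setEnableApiTracing` builder option discussed in this review and an OpenTelemetry SDK registered through `setOpenTelemetry`; the exporter configuration is elided.

```java
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import io.opentelemetry.sdk.OpenTelemetrySdk;

public class ApiTracingExample {
  public static void main(String[] args) {
    // Assumes an OpenTelemetrySdk configured with a trace exporter;
    // without an exporter, spans are created but never exported.
    OpenTelemetrySdk openTelemetry = OpenTelemetrySdk.builder().build();

    SpannerOptions options =
        SpannerOptions.newBuilder()
            .setOpenTelemetry(openTelemetry)
            // Enables the ApiTracer, so that every RPC (including retries) is traced.
            .setEnableApiTracing(true)
            .build();
    Spanner spanner = options.getService();
  }
}
```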
```java
}

@Override
public void lroStartFailed(Throwable error) {
```
Some of these methods are not currently used by anyone. For example, neither Bigtable's MetricsTracer nor Gax's MetricsTracer overrides `lroStartFailed`.

In general, any method that starts with `attempt` or `operation` should be well tested. Others, like `lroStartFailed`, `lroStartSucceeded`, `connectionSelected`, and `batchRequestSent`, should still mostly work, but I would use them with caution.
It's called from here: https://github.com/googleapis/sdk-platform-java/blob/5799827a864be627bac03969cc178efc9d6170aa/gax-java/gax/src/main/java/com/google/api/gax/tracing/TracedOperationInitialCallable.java#L85

This test case covers the kind of failure that would trigger this method to be called (line 346 in bfa1aac):

```java
public void testLroCreationFailed() {
```
I agree that it is relatively unlikely to happen in production, as it requires an RPC that would normally return an LRO to fail in a way that does not return an LRO, but instead just returns an error. But it is needed to implement the interface, and in theory it could happen (I could, for example, imagine an RPC like this returning a RESOURCE_EXHAUSTED error directly, instead of an LRO, if there is a limit on the number of operations running in parallel).
> it is needed to implement the interface

All methods in `ApiTracer` are default no-op methods now, so you don't have to override them.

What I'm trying to get at is: maybe we can start with the traces/events that we know will be most useful; we don't have to implement each and every scenario. Especially the LRO- and batch-related methods may not have been well tested before, and I'm not confident that they work as intended.

For example, maybe `lroStartFailed` should be called in more places than just `TracedOperationInitialCallable`, and we did find a few small bugs in the whole `ApiTracer` framework while implementing OpenTelemetry metrics in Gax.
That being said, if you do think they provide value to Spanner, we can start with it and fix bugs along the way when we find them.
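To illustrate the point about the default no-op methods: a hypothetical minimal tracer could override only the operation- and attempt-level callbacks and leave everything else untouched. This is a sketch, not code from this PR:

```java
import com.google.api.gax.tracing.ApiTracer;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;

// Hypothetical minimal tracer: only the well-tested operation- and
// attempt-level callbacks are overridden; all other ApiTracer methods
// keep their default no-op implementations.
class MinimalTracer implements ApiTracer {
  private final Span span;

  MinimalTracer(Span span) {
    this.span = span;
  }

  @Override
  public void attemptStarted(Object request, int attemptNumber) {
    span.addEvent("Attempt started: " + attemptNumber);
  }

  @Override
  public void operationSucceeded() {
    span.setStatus(StatusCode.OK);
    span.end();
  }

  @Override
  public void operationFailed(Throwable error) {
    span.recordException(error);
    span.setStatus(StatusCode.ERROR);
    span.end();
  }
}
```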
I would prefer to keep this, because:

- Tracing long-running operations adds value for Spanner, specifically because DDL statements are executed as long-running operations. Giving customers (and us) insight into these can help in debugging issues, especially if the customer uses statements like `create table if not exists ...`, which are relatively quick to execute, but still add significant latency if they are executed as part of, for example, standard application startup.
- This method covers one of the potential end states of such an LRO. Failing to implement it would leave the trace open (see the sketch after this list).
- If there are specific cases that do not call this method, then those will also leave the trace open, but getting the trace closed in most cases is still better than not implementing this method at all. The 'standard' way that this method should be called is covered by a test case, so I feel confident that the most common code flow covered by this method works as intended.
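For reference, a hedged sketch of what such an override could look like inside the tracer; the `span` field and the exact status handling are assumptions, not necessarily this PR's implementation:

```java
@Override
public void lroStartFailed(Throwable error) {
  // Hypothetical sketch: record the error and end the operation span, so the
  // trace is not left open when the initial LRO RPC fails with a direct error.
  span.recordException(error);
  span.setStatus(StatusCode.ERROR, "LRO start failed");
  span.end();
}
```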
```java
if (attemptNumber > 0 && operationType != OperationType.LongRunning) {
  // Add an event if the RPC retries, as this is otherwise transparent to the user. Retries
  // would then show up as higher latency without any logical explanation.
  span.addEvent("Starting RPC retry " + attemptNumber);
```
Have we considered representing retries as child spans? It would definitely make things more complicated, but maybe easier for customers to understand.
I had not considered it before implementing this. But also after having thought about it for a while, I don't think we should do that, because:

- We would need to choose between 'adding a span for each attempt (including the initial attempt)' and 'adding an extra span only for retry attempts'.
- The first would mean adding an extra span that does not really add any information in the vast majority of cases, when the RPC is not being retried. That would only add additional cost for customers.
- The second would mean a varying 'span depth' depending on whether an RPC is retried or not. I'm not sure that makes it easier for a customer to understand, and I'm worried that it makes it harder for a customer to search for slow requests. With this, a slow request could show up both as a single slow attempt and as a slow operation with multiple child spans, with the attributes spread across the parent and child spans.
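For comparison, the child-span alternative being discussed would look roughly like the following hypothetical sketch (not what this PR does); `tracer`, `span`, and `attemptNumber` are assumed to be in scope:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;

// Hypothetical alternative: create a child span per retry attempt,
// parented to the operation span, instead of adding an event to it.
Span attemptSpan =
    tracer
        .spanBuilder("RPC attempt " + attemptNumber)
        .setParent(Context.current().with(span))
        .startSpan();
try {
  // ... execute the attempt ...
} finally {
  attemptSpan.end();
}
```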
```java
 * {@link com.google.api.gax.tracing.ApiTracer} for use with OpenTelemetry. Based on {@link
 * com.google.api.gax.tracing.OpencensusTracer}.
 */
class OpenTelemetryApiTracer implements ApiTracer {
```
We decided to separate the logic that calculates the metrics from the logic that records them while implementing metrics in Gax. Hence MetricsTracer and MetricsRecorder are generic classes without any knowledge of OpenTelemetry; only OpenTelemetryMetricsRecorder depends on the OpenTelemetry APIs.

Spanner does not have to follow the same design principle, and it may not be feasible for traces either, but something to consider.
The Spanner client supports tracing with both OpenCensus and OpenTelemetry. The customer makes a choice at startup, and the client library then configures itself accordingly. We want to support the same for the `ApiTracer`, so based on that, it would make sense to create a technology-agnostic `ApiTracer`. However, there is already an `OpencensusTracer` in Gax that can be used with Spanner today (and that we need to continue to support). That means it is easier to create an OpenTelemetry sibling of this tracer, and choose one or the other based on the user configuration when the client is created, instead of adding another layer of abstraction between the client library and the `ApiTracer`. A rough sketch of that selection follows below.
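A hedged sketch of that selection; `OpenTelemetryApiTracerFactory` is the class added in this PR, but its constructor signature and the `isOpenTelemetryTracingEnabled()` helper are assumptions for illustration:

```java
import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.api.gax.tracing.OpencensusTracerFactory;

// Pick the tracer factory that matches the tracing framework the user
// configured when the client was created.
ApiTracerFactory tracerFactory =
    isOpenTelemetryTracingEnabled()
        ? new OpenTelemetryApiTracerFactory(openTelemetryTracer)
        : new OpencensusTracerFactory();
```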
```java
  return MoreObjects.firstNonNull(super.getApiTracerFactory(), getDefaultApiTracerFactory());
}

private ApiTracerFactory getDefaultApiTracerFactory() {
```
Probably out of scope for this PR, but once you start using Gax metrics, you will need to create a `CompositeTracerFactory` that includes both `OpenTelemetryApiTracerFactory` and `MetricsTracerFactory`.
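A rough sketch of that composition; `CompositeTracerFactory` is hypothetical here, and the exact class and constructor would depend on what Gax provides at that point:

```java
import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.common.collect.ImmutableList;

// Hypothetical: delegate to both the tracing factory and the metrics factory.
ApiTracerFactory combined =
    new CompositeTracerFactory(
        ImmutableList.of(openTelemetryApiTracerFactory, metricsTracerFactory));
```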
Thanks for the link. (And indeed out of scope for now; the immediate requirement is the RPC tracing, which will help customers understand why a given transaction takes X amount of time.)
```java
 * com.google.api.gax.tracing.OpencensusTracer}.
 */
class OpenTelemetryApiTracer implements ApiTracer {
  private final AttributeKey<Long> ATTEMPT_COUNT_KEY = AttributeKey.longKey("attempt.count");
```
Is it possible to document that these attribute keys are subject to change? If we are going to implement them in Gax, they may have different naming conventions.
I've added a comment to the class, but I don't think many people will actually read it. I'm also not sure it will be a very big problem for customers if they change in the future. I also added a comment regarding this to the README file.

The keys are, by the way, taken from the existing `OpencensusTracer` implementation.
> I also added a comment regarding this to the README file.

Thanks, this sounds good to me!
```
@@ -272,6 +272,22 @@ SpannerOptions options = SpannerOptions.newBuilder()
 This option can also be enabled by setting the environment variable
 `SPANNER_ENABLE_EXTENDED_TRACING=true`.

 #### OpenTelemetry API Tracing
```
Do we consider this a beta feature or a stable feature? If we consider this a beta feature, I would add `@BetaApi` to `setEnableApiTracing` and mention it here.
I don't see any specific reason why we would not consider this a stable feature, other than the comments already made regarding the possibility of changing attribute keys. In many ways, this is just an extension of something that was already supported (`ApiTracer` with OpenCensus) to the newer tracing framework, and the existing support is also considered a stable feature.
LGTM from Gax's point of view.