feat: add option to enable ApiTracer #3095

Merged
26 commits merged into main on Jun 7, 2024

Conversation

olavloite
Collaborator

Adds an option to enable an ApiTracer for the client. An ApiTracer adds traces for all RPCs that are executed. The traces are only exported if an OpenTelemetry or OpenCensus trace exporter has been configured.
An ApiTracer adds detailed information about the time that a single RPC took. It can also be used to determine whether an RPC was retried.
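
For illustration, a minimal sketch of how the option could be enabled, assuming the setEnableApiTracing builder method added in this PR and an OpenTelemetry instance that has already been configured with a trace exporter (setOpenTelemetry is assumed here to be the builder method that registers that instance with the client):

import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import io.opentelemetry.api.OpenTelemetry;

class ApiTracingExample {
  // Sketch only: openTelemetry must already be configured with a trace
  // exporter, otherwise the ApiTracer spans are created but never exported.
  static Spanner createClientWithApiTracing(OpenTelemetry openTelemetry) {
    SpannerOptions options =
        SpannerOptions.newBuilder()
            .setOpenTelemetry(openTelemetry) // assumed builder method for registering OpenTelemetry
            .setEnableApiTracing(true) // option added in this PR
            .build();
    return options.getService();
  }
}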

product-auto-label bot added the label size: l (Pull request size is large.) and the label api: spanner (Issues related to the googleapis/java-spanner API.) on May 6, 2024
product-auto-label bot added the label size: xl (Pull request size is extra large.) and removed the label size: l on May 14, 2024
olavloite marked this pull request as ready for review on May 14, 2024 17:39
olavloite requested a review from a team as a code owner on May 14, 2024 17:39
olavloite requested a review from a team as a code owner on May 14, 2024 17:52
}

@Override
public void lroStartFailed(Throwable error) {
Contributor

Some of these methods are not currently used by anyone. For example, neither Bigtable's MetricsTracer nor Gax's MetricsTracer overrides lroStartFailed.
In general, any method starting with attempt or operation should be well tested; others like lroStartFailed, lroStartSucceeded, connectionSelected, and batchRequestSent should still mostly work, but I would use them with caution.

Collaborator Author

It's called from here: https://github.com/googleapis/sdk-platform-java/blob/5799827a864be627bac03969cc178efc9d6170aa/gax-java/gax/src/main/java/com/google/api/gax/tracing/TracedOperationInitialCallable.java#L85

This test case covers the kind of failure that would trigger this method to be called:

I agree that it is relatively unlikely to happen in production, as it requires an RPC that would normally return an LRO to fail in a way that does not return an LRO, but instead just returns an error. But it is needed to implement the interface, and in theory it could happen (I could for example imagine an RPC like this returning a RESOURCE_EXHAUSTED error directly, instead of an LRO, if there is a limit on the number of operations running in parallel).

Contributor

it is needed to implement the interface

All methods in ApiTracer are default no-op methods now, you don't have to override them.

What I'm trying to get at is: maybe we can start with the traces/events that we know will be most useful; we don't have to implement each and every scenario. The LRO- and batch-related methods especially may not have been well tested before, and I'm not confident that they work as intended.
For example, maybe lroStartFailed should be called in more places than just TracedOperationInitialCallable, and we did find a few small bugs in the ApiTracer framework while implementing OpenTelemetry metrics in Gax.

That being said, if you do think they provide value to Spanner, we can start with it and fix bugs along the way when we find them.

Collaborator Author

I would prefer to keep this, because:

  1. Tracing long-running operations adds value for Spanner, specifically because DDL statements are executed as long-running operations. Giving customers (and us) insight into this can help in debugging issues, especially if the customer uses statements like create table if not exists..., which are relatively quick to execute, but still add significant latency if they are executed as part of, for example, standard application startup.
  2. This method covers one of the potential end states of such an LRO. Failing to implement that method would leave the trace open.
  3. If there are specific cases that do not call this method, then that will also leave the trace open, but getting the trace closed in most cases is still better than not implementing this method. The 'standard' way that this method should be called is covered by a test case, so I feel confident that the most common code flow that should be covered by this method works as intended.

if (attemptNumber > 0 && operationType != OperationType.LongRunning) {
  // Add an event if the RPC retries, as this is otherwise transparent to the user. Retries
  // would then show up as higher latency without any logical explanation.
  span.addEvent("Starting RPC retry " + attemptNumber);
Contributor

Have we considered representing retries as child spans? It would definitely make things more complicated, but maybe easier for customers to understand.

Collaborator Author

I had not considered it before implementing this, but after thinking about it for a while, I don't think we should do that. The reasons are:

  1. We would need to make a choice between 'adding a span for each attempt (including the initial attempt)' and 'adding an extra span only for retry attempts'.
  2. The first would mean that we would add an extra span that does not really add any information in the vast majority of the cases when the RPC is not being retried. That would only add additional costs for customers.
  3. The second would mean that we would get a varying 'span depth' depending on whether an RPC is being retried or not. I'm not sure that makes it easier to understand for a customer, and I'm worried that it makes it harder for a customer to search for slow requests. With this, a slow request could show up both as a single slow attempt, and as a slow operation with multiple child spans, where the attributes would be spread across the parent and child spans.

* {@link com.google.api.gax.tracing.ApiTracer} for use with OpenTelemetry. Based on {@link
* com.google.api.gax.tracing.OpencensusTracer}.
*/
class OpenTelemetryApiTracer implements ApiTracer {
Contributor

We decided to separate the logic that calculates the metrics from the logic that records the metrics while implementing metrics in Gax. Hence MetricsTracer and MetricsRecorder are generic classes without any knowledge of OpenTelemetry; only OpenTelemetryMetricsRecorder depends on the OpenTelemetry APIs.
Spanner does not have to follow the same design principle, and it may not be feasible for traces, but it is something to consider.

Collaborator Author

The Spanner client supports tracing with both OpenCensus and OpenTelemetry. The customer makes a choice at startup, and then the client library configures itself accordingly. We want to support the same for the ApiTracer, so based on that, it would make sense to create a technology-agnostic ApiTracer. However, there is already an OpencensusTracer in gax that can be used with Spanner today (and that we need to continue to support). That means that it is easier to create an OpenTelemetry sibling of this tracer, and choose one or the other based on the user configuration when the client is created, instead of adding another layer of abstraction between the client library and the ApiTracer.
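
To illustrate the idea, a rough sketch of that selection (the method name, the boolean flags, and the OpenTelemetryApiTracerFactory constructor are assumptions for the sake of the example; this is not the actual PR code):

import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.api.gax.tracing.BaseApiTracerFactory;
import com.google.api.gax.tracing.OpencensusTracerFactory;
import io.opentelemetry.api.trace.Tracer;

class ApiTracerSelectionSketch {
  // Pick the ApiTracer implementation that matches the tracing framework the
  // user configured when the client was created.
  static ApiTracerFactory chooseApiTracerFactory(
      boolean enableApiTracing, boolean useOpenTelemetry, Tracer openTelemetryTracer) {
    if (!enableApiTracing) {
      // Default no-op factory from gax; RPC-level tracing stays disabled.
      return BaseApiTracerFactory.getInstance();
    }
    return useOpenTelemetry
        ? new OpenTelemetryApiTracerFactory(openTelemetryTracer) // hypothetical constructor
        : new OpencensusTracerFactory();
  }
}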

return MoreObjects.firstNonNull(super.getApiTracerFactory(), getDefaultApiTracerFactory());
}

private ApiTracerFactory getDefaultApiTracerFactory() {
Contributor

Probably out of scope for this PR, but once you start using Gax metrics, you will need to create a CompositeTracerFactory that includes both OpenTelemetryApiTracerFactory and MetricsTracerFactory.

Collaborator Author

Thanks for the link. (And indeed out of scope for now; the immediate requirement is more about RPC tracing, which will help customers understand why a given transaction takes the time it does.)
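
For reference, a hypothetical sketch of what such a composite factory could look like, built only on the public gax ApiTracer and ApiTracerFactory interfaces (the class names and the subset of forwarded callbacks are illustrative; this is not code from this PR or from gax):

import com.google.api.gax.tracing.ApiTracer;
import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.api.gax.tracing.BaseApiTracer;
import com.google.api.gax.tracing.SpanName;
import com.google.common.collect.ImmutableList;
import java.util.List;

class CompositeApiTracerFactory implements ApiTracerFactory {
  private final List<ApiTracerFactory> factories;

  CompositeApiTracerFactory(List<ApiTracerFactory> factories) {
    this.factories = ImmutableList.copyOf(factories);
  }

  @Override
  public ApiTracer newTracer(ApiTracer parent, SpanName spanName, OperationType operationType) {
    // Create one child tracer per delegate factory (e.g. traces + metrics).
    ImmutableList.Builder<ApiTracer> children = ImmutableList.builder();
    for (ApiTracerFactory factory : factories) {
      children.add(factory.newTracer(parent, spanName, operationType));
    }
    return new CompositeApiTracer(children.build());
  }

  // Forwards a subset of the ApiTracer callbacks to every child; the remaining
  // methods fall back to BaseApiTracer's no-op defaults.
  private static class CompositeApiTracer extends BaseApiTracer {
    private final List<ApiTracer> children;

    CompositeApiTracer(List<ApiTracer> children) {
      this.children = children;
    }

    @Override
    public void attemptStarted(Object request, int attemptNumber) {
      children.forEach(child -> child.attemptStarted(request, attemptNumber));
    }

    @Override
    public void attemptSucceeded() {
      children.forEach(ApiTracer::attemptSucceeded);
    }

    @Override
    public void operationSucceeded() {
      children.forEach(ApiTracer::operationSucceeded);
    }

    @Override
    public void operationFailed(Throwable error) {
      children.forEach(child -> child.operationFailed(error));
    }
  }
}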

* com.google.api.gax.tracing.OpencensusTracer}.
*/
class OpenTelemetryApiTracer implements ApiTracer {
private final AttributeKey<Long> ATTEMPT_COUNT_KEY = AttributeKey.longKey("attempt.count");
Contributor

Is it possible to document that these attribute keys are subject to change? If we are going to implement them in Gax, they may have different naming conventions.

Collaborator Author

I've added a comment to the class, but I don't think many people will actually read it. I'm also not sure it will be a very big problem for customers if they change in the future. I also added a comment regarding this to the README file.

The keys are by the way taken from the existing OpencensusTracer implementation.

Contributor

I also added a comment regarding this to the README file.

Thanks, this sounds good to me!

olavloite requested a review from blakeli0 on May 21, 2024 08:32

@@ -272,6 +272,22 @@ SpannerOptions options = SpannerOptions.newBuilder()
This option can also be enabled by setting the environment variable
`SPANNER_ENABLE_EXTENDED_TRACING=true`.

#### OpenTelemetry API Tracing
Contributor

Do we consider this a beta feature or a stable feature? If we consider this a beta feature, I would add @BetaApi to setEnableApiTracing and mention it here.

Collaborator Author

I don't see any specific reason why we would not consider this a stable feature, other than the comments already mentioned regarding the possibility of changing attribute keys. In many ways, this is just an extension of something that was already supported (ApiTracer with OpenCensus) using the newer tracing framework, which is also considered a stable feature.

olavloite requested a review from blakeli0 on May 29, 2024 08:04
blakeli0 (Contributor) left a comment

LGTM from Gax's point of view.

olavloite merged commit a0a4bc5 into main on Jun 7, 2024
33 of 34 checks passed
olavloite deleted the add-option-for-api-tracer branch on June 7, 2024 04:54