feat: add option to enable ApiTracer #3095
Conversation
Adds an option to enable an ApiTracer for the client. An ApiTracer adds traces for all RPCs that are executed. These traces are only exported if an OpenTelemetry or OpenCensus trace exporter has been configured. An ApiTracer adds detailed information about the time that each RPC took, and can also be used to determine whether an RPC was retried.
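As an illustration of how this could be enabled: the following is a minimal sketch, assuming the `setEnableApiTracing` builder option discussed in this review and an OpenTelemetry SDK registered through `setOpenTelemetry`; the exporter configuration is elided.

```java
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import io.opentelemetry.sdk.OpenTelemetrySdk;

public class ApiTracingExample {
  public static void main(String[] args) {
    // Assumes an OpenTelemetrySdk configured with a trace exporter;
    // without an exporter, spans are created but never exported.
    OpenTelemetrySdk openTelemetry = OpenTelemetrySdk.builder().build();

    SpannerOptions options =
        SpannerOptions.newBuilder()
            .setOpenTelemetry(openTelemetry)
            // Enables the ApiTracer, so that every RPC (including retries) is traced.
            .setEnableApiTracing(true)
            .build();
    Spanner spanner = options.getService();
  }
}
```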
```java
}

@Override
public void lroStartFailed(Throwable error) {
```
Some of these methods are not currently used by anyone. For example, neither Bigtable's MetricsTracer nor Gax's MetricsTracer overrides `lroStartFailed`.

In general, any method that starts with `attempt` or `operation` should be well tested. Others, like `lroStartFailed`, `lroStartSucceeded`, `connectionSelected`, and `batchRequestSent`, should still mostly work, but I would use them with caution.
It's called from here: https://github.com/googleapis/sdk-platform-java/blob/5799827a864be627bac03969cc178efc9d6170aa/gax-java/gax/src/main/java/com/google/api/gax/tracing/TracedOperationInitialCallable.java#L85

This test case covers the kind of failure that would trigger this method to be called (line 346 in bfa1aac):

```java
public void testLroCreationFailed() {
```
I agree that it is relatively unlikely to happen in production, as it requires an RPC that would normally return an LRO to fail in a way that does not return an LRO, but instead just returns an error. But it is needed to implement the interface, and in theory it could happen (I could, for example, imagine an RPC like this returning a RESOURCE_EXHAUSTED error directly, instead of an LRO, if there is a limit on the number of operations running in parallel).
> it is needed to implement the interface

All methods in `ApiTracer` are default no-op methods now, so you don't have to override them.

What I'm trying to get at is: maybe we can start with the traces/events that we know will be most useful; we don't have to implement each and every scenario. Especially the LRO- and batch-related methods may not have been well tested before, and I'm not confident that they work as intended.

For example, maybe `lroStartFailed` should be called in more places than just `TracedOperationInitialCallable`, and we did find a few small bugs in the whole `ApiTracer` framework while implementing OpenTelemetry metrics in Gax.
That being said, if you do think they provide value to Spanner, we can start with it and fix bugs along the way when we find them.
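To illustrate the point about the default no-op methods: a hypothetical minimal tracer could override only the operation- and attempt-level callbacks and leave everything else untouched. This is a sketch, not code from this PR:

```java
import com.google.api.gax.tracing.ApiTracer;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;

// Hypothetical minimal tracer: only the well-tested operation- and
// attempt-level callbacks are overridden; all other ApiTracer methods
// keep their default no-op implementations.
class MinimalTracer implements ApiTracer {
  private final Span span;

  MinimalTracer(Span span) {
    this.span = span;
  }

  @Override
  public void attemptStarted(Object request, int attemptNumber) {
    span.addEvent("Attempt started: " + attemptNumber);
  }

  @Override
  public void operationSucceeded() {
    span.setStatus(StatusCode.OK);
    span.end();
  }

  @Override
  public void operationFailed(Throwable error) {
    span.recordException(error);
    span.setStatus(StatusCode.ERROR);
    span.end();
  }
}
```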
I would prefer to keep this, because:

- Tracing long-running operations adds value for Spanner, specifically because DDL statements are executed as long-running operations. Giving customers (and us) insight into these can help in debugging issues, especially if the customer uses statements like `create table if not exists ...`, which are relatively quick to execute, but still add significant latency if they are executed as part of, for example, standard application startup.
- This method covers one of the potential end states of such an LRO. Failing to implement it would leave the trace open (see the sketch after this list).
- If there are specific cases that do not call this method, then those will also leave the trace open, but getting the trace closed in most cases is still better than not implementing this method at all. The 'standard' way that this method should be called is covered by a test case, so I feel confident that the most common code flow covered by this method works as intended.
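For reference, a hedged sketch of what such an override could look like inside the tracer; the `span` field and the exact status handling are assumptions, not necessarily this PR's implementation:

```java
@Override
public void lroStartFailed(Throwable error) {
  // Hypothetical sketch: record the error and end the operation span, so the
  // trace is not left open when the initial LRO RPC fails with a direct error.
  span.recordException(error);
  span.setStatus(StatusCode.ERROR, "LRO start failed");
  span.end();
}
```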
```java
if (attemptNumber > 0 && operationType != OperationType.LongRunning) {
  // Add an event if the RPC retries, as this is otherwise transparent to the user. Retries
  // would then show up as higher latency without any logical explanation.
  span.addEvent("Starting RPC retry " + attemptNumber);
```
Have we considered representing retries as child spans? It would definitely make things more complicated, but maybe easier for customers to understand.
I had not considered it before implementing this. But also after having thought about it for a while, I don't think we should do that, because:

- We would need to choose between 'adding a span for each attempt (including the initial attempt)' and 'adding an extra span only for retry attempts'.
- The first would mean adding an extra span that does not really add any information in the vast majority of cases, when the RPC is not being retried. That would only add additional cost for customers.
- The second would mean a varying 'span depth' depending on whether an RPC is retried or not. I'm not sure that makes it easier for a customer to understand, and I'm worried that it makes it harder for a customer to search for slow requests. With this, a slow request could show up both as a single slow attempt and as a slow operation with multiple child spans, with the attributes spread across the parent and child spans.
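For comparison, the child-span alternative being discussed would look roughly like the following hypothetical sketch (not what this PR does); `tracer`, `span`, and `attemptNumber` are assumed to be in scope:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;

// Hypothetical alternative: create a child span per retry attempt,
// parented to the operation span, instead of adding an event to it.
Span attemptSpan =
    tracer
        .spanBuilder("RPC attempt " + attemptNumber)
        .setParent(Context.current().with(span))
        .startSpan();
try {
  // ... execute the attempt ...
} finally {
  attemptSpan.end();
}
```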
```java
 * {@link com.google.api.gax.tracing.ApiTracer} for use with OpenTelemetry. Based on {@link
 * com.google.api.gax.tracing.OpencensusTracer}.
 */
class OpenTelemetryApiTracer implements ApiTracer {
```
We decided to separate the logic that calculates the metrics from the logic that records them while implementing metrics in Gax. Hence MetricsTracer and MetricsRecorder are generic classes without any knowledge of OpenTelemetry; only OpenTelemetryMetricsRecorder depends on the OpenTelemetry APIs.

Spanner does not have to follow the same design principle, and it may not be feasible for traces either, but something to consider.
The Spanner client supports tracing with both OpenCensus and OpenTelemetry. The customer makes a choice at startup, and the client library then configures itself accordingly. We want to support the same for the `ApiTracer`, so based on that, it would make sense to create a technology-agnostic `ApiTracer`. However, there is already an `OpencensusTracer` in Gax that can be used with Spanner today (and that we need to continue to support). That means it is easier to create an OpenTelemetry sibling of this tracer, and choose one or the other based on the user configuration when the client is created, instead of adding another layer of abstraction between the client library and the `ApiTracer`. A rough sketch of that selection follows below.
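A hedged sketch of that selection; `OpenTelemetryApiTracerFactory` is the class added in this PR, but its constructor signature and the `isOpenTelemetryTracingEnabled()` helper are assumptions for illustration:

```java
import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.api.gax.tracing.OpencensusTracerFactory;

// Pick the tracer factory that matches the tracing framework the user
// configured when the client was created.
ApiTracerFactory tracerFactory =
    isOpenTelemetryTracingEnabled()
        ? new OpenTelemetryApiTracerFactory(openTelemetryTracer)
        : new OpencensusTracerFactory();
```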
```java
  return MoreObjects.firstNonNull(super.getApiTracerFactory(), getDefaultApiTracerFactory());
}

private ApiTracerFactory getDefaultApiTracerFactory() {
```
Probably out of scope for this PR, but once you start using Gax metrics, you will need to create a `CompositeTracerFactory` that includes both `OpenTelemetryApiTracerFactory` and `MetricsTracerFactory`.
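A rough sketch of that composition; `CompositeTracerFactory` is hypothetical here, and the exact class and constructor would depend on what Gax provides at that point:

```java
import com.google.api.gax.tracing.ApiTracerFactory;
import com.google.common.collect.ImmutableList;

// Hypothetical: delegate to both the tracing factory and the metrics factory.
ApiTracerFactory combined =
    new CompositeTracerFactory(
        ImmutableList.of(openTelemetryApiTracerFactory, metricsTracerFactory));
```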
Thanks for the link. (And indeed out of scope for now; the immediate requirement is the RPC tracing, which will help customers understand why a given transaction takes X amount of time.)
```java
 * com.google.api.gax.tracing.OpencensusTracer}.
 */
class OpenTelemetryApiTracer implements ApiTracer {
  private final AttributeKey<Long> ATTEMPT_COUNT_KEY = AttributeKey.longKey("attempt.count");
```
Is it possible to document that these attribute keys are subject to change? If we are going to implement them in Gax, they may have different naming conventions.
I've added a comment to the class, but I don't think many people will actually read it. I'm also not sure it will be a very big problem for customers if they change in the future. I also added a comment regarding this to the README file.

The keys are, by the way, taken from the existing `OpencensusTracer` implementation.
> I also added a comment regarding this to the README file.

Thanks, this sounds good to me!
```
@@ -272,6 +272,22 @@ SpannerOptions options = SpannerOptions.newBuilder()
 This option can also be enabled by setting the environment variable
 `SPANNER_ENABLE_EXTENDED_TRACING=true`.

 #### OpenTelemetry API Tracing
```
Do we consider this a beta feature or a stable feature? If we consider this a beta feature, I would add `@BetaApi` to `setEnableApiTracing` and mention it here.
I don't see any specific reason why we would not consider this a stable feature, other than the comments already made regarding the possibility of changing attribute keys. In many ways, this is just an extension of something that was already supported (`ApiTracer` with OpenCensus) to the newer tracing framework, and the existing support is also considered a stable feature.
LGTM from Gax's point of view.