-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Add GCS telemtry with ThreadLocal #125452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @mhl-b, I've created a changelog YAML for you. |
6983065
to
b5da931
Compare
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
Hi @mhl-b, I've created a changelog YAML for you. |
…rch into gcp-metrics-threadlocal
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageBlobStore.java
Show resolved
Hide resolved
...pository-gcs/src/main/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageService.java
Outdated
Show resolved
Hide resolved
static OperationStats initAndGet(OperationPurpose purpose, StorageOperation operation) { | ||
var stats = new OperationStats(purpose, operation); | ||
OPERATION_STATS.set(stats); | ||
return stats; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we assert here that we're not over-writing an existing OperationStats? it seems like if things are working properly we should never do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And below in set perhaps?
public Map<String, BlobStoreActionStats> operationsStats() { | ||
return restMetering.entrySet().stream().collect(Collectors.toMap(e -> e.getKey().key, e -> { | ||
var ops = e.getValue().sum(); | ||
return new BlobStoreActionStats(ops, ops); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be ops and requests (ops/ops) was just a stop-gap until we had the ability to count requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking BlobStoreActionStats is not practical. REST metering tracks operation counts only and APM structure does not fit in it, the error metrics in particular. I think we can remove BlobStoreActionStats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example record BillableRequests(long count){}
and Map<Operation,BillableRequests>
. That would be more accurate representation for REST API metering endpoint.
* Continue collecting metrics with given OperationStats. Useful for readers and writers. | ||
*/ | ||
public void continueAndCollect(OperationStats stats, CheckedRunnable<IOException> runnable) throws IOException { | ||
OperationStats.set(stats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should be more assertive about the current state of OperationStats
. In this case it's probably safe to clear it after each action right? then we could assert that it was empty each time we went to start a new action, to make it fail noisily if they started being interleaved somehow?
...ory-gcs/src/test/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageServiceTests.java
Show resolved
Hide resolved
* Create the map of attributes we expect to see on repository metrics | ||
*/ | ||
public static Map<String, Object> createAttributesMap(String repoType, String repoName, OperationPurpose purpose, String operation) { | ||
return Map.of("repo_type", repoType, "repo_name", repoName, "operation", operation, "purpose", purpose.getKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the additional layer of indirection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, will inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach much better, a few comments
@nicktindall , it's ready for another review. I havn't done APM testing yet, PR is already large. Will do a follow up PR. |
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, I have some queries around the counting logic and other minor stuff
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
@@ -34,14 +35,15 @@ public class GoogleCloudStoragePlugin extends Plugin implements RepositoryPlugin | |||
|
|||
@SuppressWarnings("this-escape") | |||
public GoogleCloudStoragePlugin(final Settings settings) { | |||
this.storageService = createStorageService(settings); | |||
var isServerless = settings.getAsBoolean(DiscoveryNode.STATELESS_ENABLED_SETTING_NAME, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a helper for this org.elasticsearch.cluster.node.DiscoveryNode#isStateless
, there is some sentiment that its use shouldn't be expanded but it just does what you're doing here.
I think we're supposed to prefer plugins/SPI to having explicit flag checks like this (maybe something we could consider later)
...pository-gcs/src/main/java/org/elasticsearch/repositories/gcs/GoogleCloudStorageService.java
Outdated
Show resolved
Hide resolved
...sitory-gcs/src/main/java/org/elasticsearch/repositories/gcs/GcsRepositoryStatsCollector.java
Outdated
Show resolved
Hide resolved
trackRequest(StorageOperation.LIST.key()); | ||
} else if (Regex.simpleMatch("POST /upload/storage/v1/b/*uploadType=resumable*", request)) { | ||
trackRequest(StorageOperation.INSERT.key()); | ||
} else if (Regex.simpleMatch("PUT /upload/storage/v1/b/*uploadType=resumable*", request)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count all requests now, not operations
@Override | ||
public void testRequestStats() throws Exception { | ||
super.testRequestStats(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
helper to run from ide
|
||
/** | ||
* Continue collecting metrics with given OperationStats. Useful for readers and writers. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy/paste error on the comment here? or just out of date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of date
} | ||
|
||
private void collect(OperationStats stats) { | ||
if (stats.reqAtt == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda think we shouldn't use abbreviations here, requestsAttempted
reads much nicer than Att
especially because att
can mean so many things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we have attr
floating around in the same block of code. It's a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with some minor comments 🏆
A followup issue on removing serverless flag from BlobStore#stats() #126262 |
A simplified, yet larger code size, implementation of GCP telemetry. All previous efforts to intercept HTTP calls and do meaningful correlation between operations and requests were fruitless. This approach stores stats in the ThreadLocal so HTTP response interceptor can populate request metrics for parent operation.
It's achieved with new abstraction on top of GCS interface -
MeteredStorage
. This class provides a family of "MeteredXXX" wrappers for GCS objects, such as Storage API itself,MeteredObjectsGetRequest
,MeteredWriteChannel
,MeteredReadChannel
,MeteredBlobPage
.Interesting bits for review
RepositoryStatsCollector
,OperationStats
, andMeteredStorage
.MeteredStorage
will execute GCS implementation and attach metrics to the ThreadLocal.RepositoryStatsCollector
will get stats from the ThreadLocal and populate metrics and then aggregate collected metrics and expose them to REST stats and APM.Every SDK call is wrapped into one of those helpers to populate ThreadLocal, collect results, and cleanup. See
RepositoryStatsCollector
.Response interceptor simplified to
PS, this approach does not require creation of clients for every OpearationPurpose, already removed.