-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Export desired balance node weight and its components as metrics #115854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Export desired balance node weight and its components as metrics #115854
Conversation
Pinging @elastic/es-distributed (Team:Distributed) |
public record DesiredBalance( | ||
long lastConvergedIndex, | ||
Map<ShardId, ShardAssignment> assignments, | ||
Map<DiscoveryNode, DesiredBalanceMetrics.NodeWeightStats> weightsPerNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this field is added to pass weightsPerNode
from computer to reconciler to record stats there.
Why not record them to metrics in computer instead?
This way we do not need to pass this information nor keep it in memory until next allocation cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the difference is that in computer we might publish stats that we never reconciled towards because it might be superseded by a new computation. I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.
This is true, however this does not guarantee that completely reconciled those.
It is very possible that we saw them once, reconciled 1 or 2 shards out of the entire diff before receiving a new balance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but to me it seems at least those are DesiredBalance
instances that we have actually acted on. Publishing from computer might publish some weights that we have not actually acted on. I think also conceptually the weight stats are not that irrelevant to the DesiredBalance
. Doing all of this in Reconciler also makes reusing DesiredBalanceMetrics
simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not entirely clear what this metric represents. It is a quality of computed balance, but only if it was picked up by reconciled, but not necessarily fully reconciled. It looks more driven by the specifics of implementation rather than by clarity of what it represents.
I believe we should instead compute those metrics on top of RoutingNodes.
We can do it in the end of
(1) Reconciliation. This would represent the quality of allocation that is currently achieved in the cluster. I believe this would correctly highlight the clusters that might have computed good balance but are failing to reconcile it as it requires to many shard movements.
(2) (optionally) Computation. This would allow us to see the quality (if we are interested) of the allocation as it is computed according to the latest cluster state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok we discussed elsewhere with Ievgen. We clarified that the metrics exposed here are just the weights and co that have resulted in the computed desired balance (that we start to reconcile towards), i.e., the details of the final state that the balancer considers to be desired.
What Ievgen was asking for (based on RoutingNodes
) is that once we know the goal that we are reconciling towards, we should also publish metrics for each middle state since it might take multiple reconciliations to arrive at the desired balance. That is something I will do in ES-9873. And the combination of these two could show what the goal is and at each specific time and how far away we are from that goal (based on the three weight components exposed here vs. their counterpart metrics that will be exposed after each reconciliation based on RoutingNodes
, ClusterState
, etc. similar to what is done in AllocationStatsService
).
...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
Show resolved
Hide resolved
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor comments.
@@ -368,7 +368,7 @@ public DesiredBalance compute( | |||
} | |||
|
|||
long lastConvergedIndex = hasChanges ? previousDesiredBalance.lastConvergedIndex() : desiredBalanceInput.index(); | |||
return new DesiredBalance(lastConvergedIndex, assignments); | |||
return new DesiredBalance(lastConvergedIndex, assignments, delegateAllocator.getNodeWeightStats()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether we are interested in how many iterations it takes to get the desired balance and whether it is interupted due to things like new state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We might replicate some of the org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceStats stats as APM metrics. Additionally it would be nice to have a metric representing the fraction of reconciliations resulted in an allocation same to desired one (in other words a metric representing if cluster can keep up with computed balances).
As commented in https://github.com/elastic/elasticsearch/pull/115854/files#r1822994367, this might be part of a different issue and a separate pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, those are also interesting. there are a lot of metrics we can expose about the allocator. but not all in the same PR! :) What you're asking for seems to be part of ES-9872.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I was not aware these computer level metrics are already exposed via the allocatorStats. Thanks for explaining. 👍
@@ -159,6 +162,13 @@ public void allocate(RoutingAllocation allocation) { | |||
balancer.allocateUnassigned(); | |||
balancer.moveShards(); | |||
balancer.balance(); | |||
|
|||
nodeWeightStatsSupplier = new BalancedShardsAllocatorNodeWeightStats(balancer, weightFunction)::get; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies I am going to repeat my comment from the other place to try promote it one more time 🙏 I think we should return this statsSupplier instead of keeping it around as a field which unnecessarily lingers around and retains references to a bunch of objects after the computation is done. It is likely trivial in practice since this method should be called often enough so that it does not stay around for too long. But conceptually still feels a bit odd. The method computes something that is immediately needed (when necessary) by the caller. To me, this feels naturally as returning a value so that the order is clear, i.e. you cannot use the value before calling this method. I don't believe we'll do that in the code. But keeping it as a field opens up for that possibility. Personally I also generally prefer fewer mutable fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned about this one. It captures a reference to the balancer and keeps it in memory for some time.
Balancer might be a somewhat sizeable (it has its internal model representing every node, every index and every shard).
May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.
Possibly there could be another way to reduce the scope of the stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return and assign it to a local variable in a tight loop. It and its referenced balancer should not be kept around for long after each loop. My suggestion was assuming we don't want to eagerly compute the stats. If that is not the case, I'd suggest we return the stats from this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.
Isn't that what I had first (returned via RoutingNodes
) and you suggested it might be expensive? I can move the supplier to be only the return type of allocate
to further reduce how long we keep it. If we want to make that stats calculation eager or not, seems to be unclear. it is a trade-off. I'm ok with both. Considering this balancer calls are one at a time (I think), keeping the last internal state as long as the allocate
call is in scope doesn't seem that concerning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually considering allocate
has a sync and async version, for the DesiredBalanceShardAllocator
this becomes a bit awkward to return from the listener these stats. I do agree that the extra stats method on the allocator interface is also awkward. Frankly, I think keeping these stats internally in RoutingNodes
(eagerly calculated) is not a bad option. The pre-calculation doesn't seem that expensive and conceptually not meaningless, because each run of the balancer does produce its own middle state anyway. Also each run of the old balancer does make its changes to the RoutingNodes and records the weights that lead to this decision. So to me it seems recording these stats along with RoutingNodes is the most natural way that fits to the code. Yes, we make multiple calls to the old balancer and we don't use these stats, but that is the overall pattern with how the new balancer wraps the old balancer, e.g. each time the old balancer also rebuilds a model from routing nodes, etc. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also I think what we're trying to export here as metrics, is something very internal to the allocator(s) and escalating it to the interface doesn't have any other use and in a way leaks this internal state/info. So I think RoutingNodes
and DesiredBalance
are suitable to pass along this extra bit of info between old allocator, new allocator and the reconciler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me the discussion began because we wanted to avoid the eager computation since it may be expensive. If that is not true, I don't really have strong opinon on whether it should be part of RoutingNodes or the return value as long as it is not an object field :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think eager or not is still relevant and orthogonal to whether we add it to the allocator interface or RoutingNodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going back to eagerly setting the stats in RoutingNodes
after discussing with Ievgen. 1defb46
stat.diskUsageInBytes(), | ||
"write_load", | ||
stat.writeLoad() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we shouldn't be recording continuous values as attribute values, as I understand it each distinct combination of attribute values effectively creates a separate metric, so having numbers like write load, disk usage, shard count in there will cause an explosion of metrics. I believe attributes should (mostly) be limited to low-cardinality, discrete values. See conversation about that here.
I wonder if instead we should track those values as separate metrics, that would also allow them to be plotted. We could have a chart that plotted the constituent values along with the resulting weight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only discussion I could find on it but I think it's not splunk-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I do not think we can even plot labels, can we?
I suspect this could be a metric per component + the combined one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can plot labels. For example, the "Docker images with commit SHAs" visualization on the overview dashboard is done with labels. That is not say we should add these values as labels. I don't think Elasticsearch as a storage for APM data really cares about this. But it could be a problem for the APM agent or maybe even the APM server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we shouldn't be recording continuous values as attribute values
You might be right. I quickly tried to plot some of the es.autoscaling.indexing.node_ingest_load.current
labels (that's the example that made me think I can use labels) and I got a This field does not work with the selected function.
when trying to use last_value function. not sure if it is because of the label types. Nonetheless, to be on the safe side, I'll break this up into difference metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a This field does not work with the selected function. when trying to use last_value function.
👍 good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've split them in e79b5e1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is good to be merged (assuming the question with labels is resolved and we can confirm the metrics are plotted efficiently).
We could followup on exact code computing metrics in a followups where we compute the same set of metrics for both computed balance and a current one.
This reverts commit dad0f67.
…WeightsAsMetric' into ps241024-recordNodeWeightsAsMetric
Serverless Failure is #116136. I'm not able to relate that to changes in this PR. |
…stic#115854) Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. - some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together. - How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code! - I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics. Relates ES-9866
At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in #115854, and relies on the existing `AllocationStatsService`. Relates ES-9873
At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in elastic#115854, and relies on the existing `AllocationStatsService`. Relates ES-9873
…5854) Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. - some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together. - How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code! - I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics. Relates ES-9866
At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in #115854, and relies on the existing `AllocationStatsService`. Relates ES-9873
At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in elastic#115854, and relies on the existing `AllocationStatsService`. Relates ES-9873
At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in elastic#115854, and relies on the existing `AllocationStatsService`. Relates ES-9873
Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all.
AllocationStatsService
) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together.RoutingNodes
andDesiredBalance
seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code!DesiredBalanceMetrics
and how its used so I tried to clean up its existing usage a bit and colocate the new metrics.Relates ES-9866