Skip to content

Export desired balance node weight and its components as metrics #115854

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 4, 2024

Conversation

pxsalehi
Copy link
Member

@pxsalehi pxsalehi commented Oct 29, 2024

Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all.

  • some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. AllocationStatsService) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together.
  • How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using RoutingNodes and DesiredBalance seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code!
  • I liked the DesiredBalanceMetrics and how its used so I tried to clean up its existing usage a bit and colocate the new metrics.

Relates ES-9866

@pxsalehi pxsalehi added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Oct 29, 2024
@pxsalehi pxsalehi marked this pull request as ready for review October 29, 2024 17:00
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 29, 2024
public record DesiredBalance(
long lastConvergedIndex,
Map<ShardId, ShardAssignment> assignments,
Map<DiscoveryNode, DesiredBalanceMetrics.NodeWeightStats> weightsPerNode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this field is added to pass weightsPerNode from computer to reconciler to record stats there.

Why not record them to metrics in computer instead?
This way we do not need to pass this information nor keep it in memory until next allocation cycle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the difference is that in computer we might publish stats that we never reconciled towards because it might be superseded by a new computation. I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.

This is true, however this does not guarantee that completely reconciled those.
It is very possible that we saw them once, reconciled 1 or 2 shards out of the entire diff before receiving a new balance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but to me it seems at least those are DesiredBalance instances that we have actually acted on. Publishing from computer might publish some weights that we have not actually acted on. I think also conceptually the weight stats are not that irrelevant to the DesiredBalance. Doing all of this in Reconciler also makes reusing DesiredBalanceMetrics simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not entirely clear what this metric represents. It is a quality of computed balance, but only if it was picked up by reconciled, but not necessarily fully reconciled. It looks more driven by the specifics of implementation rather than by clarity of what it represents.

I believe we should instead compute those metrics on top of RoutingNodes.
We can do it in the end of
(1) Reconciliation. This would represent the quality of allocation that is currently achieved in the cluster. I believe this would correctly highlight the clusters that might have computed good balance but are failing to reconcile it as it requires to many shard movements.
(2) (optionally) Computation. This would allow us to see the quality (if we are interested) of the allocation as it is computed according to the latest cluster state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok we discussed elsewhere with Ievgen. We clarified that the metrics exposed here are just the weights and co that have resulted in the computed desired balance (that we start to reconcile towards), i.e., the details of the final state that the balancer considers to be desired.

What Ievgen was asking for (based on RoutingNodes) is that once we know the goal that we are reconciling towards, we should also publish metrics for each middle state since it might take multiple reconciliations to arrive at the desired balance. That is something I will do in ES-9873. And the combination of these two could show what the goal is and at each specific time and how far away we are from that goal (based on the three weight components exposed here vs. their counterpart metrics that will be exposed after each reconciliation based on RoutingNodes, ClusterState, etc. similar to what is done in AllocationStatsService).

@pxsalehi
Copy link
Member Author

@elasticmachine update branch

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments.

@@ -368,7 +368,7 @@ public DesiredBalance compute(
}

long lastConvergedIndex = hasChanges ? previousDesiredBalance.lastConvergedIndex() : desiredBalanceInput.index();
return new DesiredBalance(lastConvergedIndex, assignments);
return new DesiredBalance(lastConvergedIndex, assignments, delegateAllocator.getNodeWeightStats());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we are interested in how many iterations it takes to get the desired balance and whether it is interupted due to things like new state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We might replicate some of the org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceStats stats as APM metrics. Additionally it would be nice to have a metric representing the fraction of reconciliations resulted in an allocation same to desired one (in other words a metric representing if cluster can keep up with computed balances).

As commented in https://github.com/elastic/elasticsearch/pull/115854/files#r1822994367, this might be part of a different issue and a separate pr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, those are also interesting. there are a lot of metrics we can expose about the allocator. but not all in the same PR! :) What you're asking for seems to be part of ES-9872.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I was not aware these computer level metrics are already exposed via the allocatorStats. Thanks for explaining. 👍

@@ -159,6 +162,13 @@ public void allocate(RoutingAllocation allocation) {
balancer.allocateUnassigned();
balancer.moveShards();
balancer.balance();

nodeWeightStatsSupplier = new BalancedShardsAllocatorNodeWeightStats(balancer, weightFunction)::get;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I am going to repeat my comment from the other place to try promote it one more time 🙏 I think we should return this statsSupplier instead of keeping it around as a field which unnecessarily lingers around and retains references to a bunch of objects after the computation is done. It is likely trivial in practice since this method should be called often enough so that it does not stay around for too long. But conceptually still feels a bit odd. The method computes something that is immediately needed (when necessary) by the caller. To me, this feels naturally as returning a value so that the order is clear, i.e. you cannot use the value before calling this method. I don't believe we'll do that in the code. But keeping it as a field opens up for that possibility. Personally I also generally prefer fewer mutable fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about this one. It captures a reference to the balancer and keeps it in memory for some time.
Balancer might be a somewhat sizeable (it has its internal model representing every node, every index and every shard).

May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.
Possibly there could be another way to reduce the scope of the stats

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return and assign it to a local variable in a tight loop. It and its referenced balancer should not be kept around for long after each loop. My suggestion was assuming we don't want to eagerly compute the stats. If that is not the case, I'd suggest we return the stats from this method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.

Isn't that what I had first (returned via RoutingNodes) and you suggested it might be expensive? I can move the supplier to be only the return type of allocate to further reduce how long we keep it. If we want to make that stats calculation eager or not, seems to be unclear. it is a trade-off. I'm ok with both. Considering this balancer calls are one at a time (I think), keeping the last internal state as long as the allocate call is in scope doesn't seem that concerning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually considering allocate has a sync and async version, for the DesiredBalanceShardAllocator this becomes a bit awkward to return from the listener these stats. I do agree that the extra stats method on the allocator interface is also awkward. Frankly, I think keeping these stats internally in RoutingNodes (eagerly calculated) is not a bad option. The pre-calculation doesn't seem that expensive and conceptually not meaningless, because each run of the balancer does produce its own middle state anyway. Also each run of the old balancer does make its changes to the RoutingNodes and records the weights that lead to this decision. So to me it seems recording these stats along with RoutingNodes is the most natural way that fits to the code. Yes, we make multiple calls to the old balancer and we don't use these stats, but that is the overall pattern with how the new balancer wraps the old balancer, e.g. each time the old balancer also rebuilds a model from routing nodes, etc. Any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I think what we're trying to export here as metrics, is something very internal to the allocator(s) and escalating it to the interface doesn't have any other use and in a way leaks this internal state/info. So I think RoutingNodes and DesiredBalance are suitable to pass along this extra bit of info between old allocator, new allocator and the reconciler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the discussion began because we wanted to avoid the eager computation since it may be expensive. If that is not true, I don't really have strong opinon on whether it should be part of RoutingNodes or the return value as long as it is not an object field :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eager or not is still relevant and orthogonal to whether we add it to the allocator interface or RoutingNodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going back to eagerly setting the stats in RoutingNodes after discussing with Ievgen. 1defb46

stat.diskUsageInBytes(),
"write_load",
stat.writeLoad()
)
Copy link
Contributor

@nicktindall nicktindall Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we shouldn't be recording continuous values as attribute values, as I understand it each distinct combination of attribute values effectively creates a separate metric, so having numbers like write load, disk usage, shard count in there will cause an explosion of metrics. I believe attributes should (mostly) be limited to low-cardinality, discrete values. See conversation about that here.

I wonder if instead we should track those values as separate metrics, that would also allow them to be plotted. We could have a chart that plotted the constituent values along with the resulting weight.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only discussion I could find on it but I think it's not splunk-specific.

https://docs.splunk.com/observability/en/gdi/opentelemetry/tags.html#considerations-on-metric-cardinality

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I do not think we can even plot labels, can we?
I suspect this could be a metric per component + the combined one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can plot labels. For example, the "Docker images with commit SHAs" visualization on the overview dashboard is done with labels. That is not say we should add these values as labels. I don't think Elasticsearch as a storage for APM data really cares about this. But it could be a problem for the APM agent or maybe even the APM server.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we shouldn't be recording continuous values as attribute values

You might be right. I quickly tried to plot some of the es.autoscaling.indexing.node_ingest_load.current labels (that's the example that made me think I can use labels) and I got a This field does not work with the selected function. when trying to use last_value function. not sure if it is because of the label types. Nonetheless, to be on the safe side, I'll break this up into difference metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a This field does not work with the selected function. when trying to use last_value function.

👍 good to know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split them in e79b5e1.

Copy link
Contributor

@idegtiarenko idegtiarenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is good to be merged (assuming the question with labels is resolved and we can confirm the metrics are plotted efficiently).
We could followup on exact code computing metrics in a followups where we compute the same set of metrics for both computed balance and a current one.

@pxsalehi pxsalehi added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Nov 4, 2024
@pxsalehi
Copy link
Member Author

pxsalehi commented Nov 4, 2024

Serverless Failure is #116136. I'm not able to relate that to changes in this PR.

@pxsalehi pxsalehi merged commit 69150c8 into elastic:main Nov 4, 2024
16 checks passed
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…stic#115854)

Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. 

- some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together.
- How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code!
- I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics.

Relates ES-9866
elasticsearchmachine pushed a commit that referenced this pull request Nov 12, 2024
At the end of each reconciliation round, also export the current
allocation stats for each node. This is intended to show the gradual
progress (or divergence!) towards the desired values exported in
#115854, and relies on the
existing `AllocationStatsService`.

Relates ES-9873
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Nov 12, 2024
At the end of each reconciliation round, also export the current
allocation stats for each node. This is intended to show the gradual
progress (or divergence!) towards the desired values exported in
elastic#115854, and relies on the
existing `AllocationStatsService`.

Relates ES-9873
jozala pushed a commit that referenced this pull request Nov 13, 2024
…5854)

Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. 

- some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together.
- How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code!
- I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics.

Relates ES-9866
jozala pushed a commit that referenced this pull request Nov 13, 2024
At the end of each reconciliation round, also export the current
allocation stats for each node. This is intended to show the gradual
progress (or divergence!) towards the desired values exported in
#115854, and relies on the
existing `AllocationStatsService`.

Relates ES-9873
afoucret pushed a commit to afoucret/elasticsearch that referenced this pull request Nov 14, 2024
At the end of each reconciliation round, also export the current
allocation stats for each node. This is intended to show the gradual
progress (or divergence!) towards the desired values exported in
elastic#115854, and relies on the
existing `AllocationStatsService`.

Relates ES-9873
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
At the end of each reconciliation round, also export the current
allocation stats for each node. This is intended to show the gradual
progress (or divergence!) towards the desired values exported in
elastic#115854, and relies on the
existing `AllocationStatsService`.

Relates ES-9873
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants