Export desired balance node weight and its components as metrics #115854

pxsalehi · 2024-10-29T15:41:04Z

Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all.

some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. AllocationStatsService) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together.
How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using RoutingNodes and DesiredBalance seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code!
I liked the DesiredBalanceMetrics and how its used so I tried to clean up its existing usage a bit and colocate the new metrics.

Relates ES-9866

elasticsearchmachine · 2024-10-29T17:01:04Z

Pinging @elastic/es-distributed (Team:Distributed)

idegtiarenko · 2024-10-30T07:50:48Z

server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalance.java

+public record DesiredBalance(
+    long lastConvergedIndex,
+    Map<ShardId, ShardAssignment> assignments,
+    Map<DiscoveryNode, DesiredBalanceMetrics.NodeWeightStats> weightsPerNode


It looks like this field is added to pass weightsPerNode from computer to reconciler to record stats there.

Why not record them to metrics in computer instead?
This way we do not need to pass this information nor keep it in memory until next allocation cycle.

I think the difference is that in computer we might publish stats that we never reconciled towards because it might be superseded by a new computation. I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.

I thought this makes sure we publish stats for a desired balance we have submitted a reconcile task for.

This is true, however this does not guarantee that completely reconciled those.
It is very possible that we saw them once, reconciled 1 or 2 shards out of the entire diff before receiving a new balance.

yes, but to me it seems at least those are DesiredBalance instances that we have actually acted on. Publishing from computer might publish some weights that we have not actually acted on. I think also conceptually the weight stats are not that irrelevant to the DesiredBalance. Doing all of this in Reconciler also makes reusing DesiredBalanceMetrics simpler.

It is not entirely clear what this metric represents. It is a quality of computed balance, but only if it was picked up by reconciled, but not necessarily fully reconciled. It looks more driven by the specifics of implementation rather than by clarity of what it represents.

I believe we should instead compute those metrics on top of RoutingNodes.
We can do it in the end of
(1) Reconciliation. This would represent the quality of allocation that is currently achieved in the cluster. I believe this would correctly highlight the clusters that might have computed good balance but are failing to reconcile it as it requires to many shard movements.
(2) (optionally) Computation. This would allow us to see the quality (if we are interested) of the allocation as it is computed according to the latest cluster state.

ok we discussed elsewhere with Ievgen. We clarified that the metrics exposed here are just the weights and co that have resulted in the computed desired balance (that we start to reconcile towards), i.e., the details of the final state that the balancer considers to be desired.

What Ievgen was asking for (based on RoutingNodes) is that once we know the goal that we are reconciling towards, we should also publish metrics for each middle state since it might take multiple reconciliations to arrive at the desired balance. That is something I will do in ES-9873. And the combination of these two could show what the goal is and at each specific time and how far away we are from that goal (based on the three weight components exposed here vs. their counterpart metrics that will be exposed after each reconciliation based on RoutingNodes, ClusterState, etc. similar to what is done in AllocationStatsService).

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

pxsalehi · 2024-10-30T15:01:33Z

@elasticmachine update branch

ywangd

I left some minor comments.

ywangd · 2024-10-31T01:47:48Z

...main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputer.java

@@ -368,7 +368,7 @@ public DesiredBalance compute(
        }

        long lastConvergedIndex = hasChanges ? previousDesiredBalance.lastConvergedIndex() : desiredBalanceInput.index();
-        return new DesiredBalance(lastConvergedIndex, assignments);
+        return new DesiredBalance(lastConvergedIndex, assignments, delegateAllocator.getNodeWeightStats());


I wonder whether we are interested in how many iterations it takes to get the desired balance and whether it is interupted due to things like new state?

+1. We might replicate some of the org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceStats stats as APM metrics. Additionally it would be nice to have a metric representing the fraction of reconciliations resulted in an allocation same to desired one (in other words a metric representing if cluster can keep up with computed balances).

As commented in https://github.com/elastic/elasticsearch/pull/115854/files#r1822994367, this might be part of a different issue and a separate pr.

yes, those are also interesting. there are a lot of metrics we can expose about the allocator. but not all in the same PR! :) What you're asking for seems to be part of ES-9872.

Ah I was not aware these computer level metrics are already exposed via the allocatorStats. Thanks for explaining. 👍

ywangd · 2024-10-31T01:56:35Z

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

@@ -159,6 +162,13 @@ public void allocate(RoutingAllocation allocation) {
        balancer.allocateUnassigned();
        balancer.moveShards();
        balancer.balance();
+
+        nodeWeightStatsSupplier = new BalancedShardsAllocatorNodeWeightStats(balancer, weightFunction)::get;


Apologies I am going to repeat my comment from the other place to try promote it one more time 🙏 I think we should return this statsSupplier instead of keeping it around as a field which unnecessarily lingers around and retains references to a bunch of objects after the computation is done. It is likely trivial in practice since this method should be called often enough so that it does not stay around for too long. But conceptually still feels a bit odd. The method computes something that is immediately needed (when necessary) by the caller. To me, this feels naturally as returning a value so that the order is clear, i.e. you cannot use the value before calling this method. I don't believe we'll do that in the code. But keeping it as a field opens up for that possibility. Personally I also generally prefer fewer mutable fields.

I am a bit concerned about this one. It captures a reference to the balancer and keeps it in memory for some time.
Balancer might be a somewhat sizeable (it has its internal model representing every node, every index and every shard).

May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.
Possibly there could be another way to reduce the scope of the stats

If we return and assign it to a local variable in a tight loop. It and its referenced balancer should not be kept around for long after each loop. My suggestion was assuming we don't want to eagerly compute the stats. If that is not the case, I'd suggest we return the stats from this method.

May be it is worth to eagerly compute stats and keep them in the field? Per node data structure sounds much more compact.

Isn't that what I had first (returned via RoutingNodes) and you suggested it might be expensive? I can move the supplier to be only the return type of allocate to further reduce how long we keep it. If we want to make that stats calculation eager or not, seems to be unclear. it is a trade-off. I'm ok with both. Considering this balancer calls are one at a time (I think), keeping the last internal state as long as the allocate call is in scope doesn't seem that concerning.

actually considering allocate has a sync and async version, for the DesiredBalanceShardAllocator this becomes a bit awkward to return from the listener these stats. I do agree that the extra stats method on the allocator interface is also awkward. Frankly, I think keeping these stats internally in RoutingNodes (eagerly calculated) is not a bad option. The pre-calculation doesn't seem that expensive and conceptually not meaningless, because each run of the balancer does produce its own middle state anyway. Also each run of the old balancer does make its changes to the RoutingNodes and records the weights that lead to this decision. So to me it seems recording these stats along with RoutingNodes is the most natural way that fits to the code. Yes, we make multiple calls to the old balancer and we don't use these stats, but that is the overall pattern with how the new balancer wraps the old balancer, e.g. each time the old balancer also rebuilds a model from routing nodes, etc. Any thoughts?

also I think what we're trying to export here as metrics, is something very internal to the allocator(s) and escalating it to the interface doesn't have any other use and in a way leaks this internal state/info. So I think RoutingNodes and DesiredBalance are suitable to pass along this extra bit of info between old allocator, new allocator and the reconciler.

For me the discussion began because we wanted to avoid the eager computation since it may be expensive. If that is not true, I don't really have strong opinon on whether it should be part of RoutingNodes or the return value as long as it is not an object field :)

I think eager or not is still relevant and orthogonal to whether we add it to the allocator interface or RoutingNodes.

going back to eagerly setting the stats in RoutingNodes after discussing with Ievgen. 1defb46

nicktindall · 2024-10-31T04:01:11Z

.../main/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceMetrics.java

+                        stat.diskUsageInBytes(),
+                        "write_load",
+                        stat.writeLoad()
+                    )


I'm pretty sure we shouldn't be recording continuous values as attribute values, as I understand it each distinct combination of attribute values effectively creates a separate metric, so having numbers like write load, disk usage, shard count in there will cause an explosion of metrics. I believe attributes should (mostly) be limited to low-cardinality, discrete values. See conversation about that here.

I wonder if instead we should track those values as separate metrics, that would also allow them to be plotted. We could have a chart that plotted the constituent values along with the resulting weight.

This is the only discussion I could find on it but I think it's not splunk-specific.

https://docs.splunk.com/observability/en/gdi/opentelemetry/tags.html#considerations-on-metric-cardinality

I agree. I do not think we can even plot labels, can we?
I suspect this could be a metric per component + the combined one.

We can plot labels. For example, the "Docker images with commit SHAs" visualization on the overview dashboard is done with labels. That is not say we should add these values as labels. I don't think Elasticsearch as a storage for APM data really cares about this. But it could be a problem for the APM agent or maybe even the APM server.

I'm pretty sure we shouldn't be recording continuous values as attribute values

You might be right. I quickly tried to plot some of the es.autoscaling.indexing.node_ingest_load.current labels (that's the example that made me think I can use labels) and I got a This field does not work with the selected function. when trying to use last_value function. not sure if it is because of the label types. Nonetheless, to be on the safe side, I'll break this up into difference metrics.

I got a This field does not work with the selected function. when trying to use last_value function.

👍 good to know.

I've split them in e79b5e1.

…WeightsAsMetric

idegtiarenko

I believe this is good to be merged (assuming the question with labels is resolved and we can confirm the metrics are plotted efficiently).
We could followup on exact code computing metrics in a followups where we compute the same set of metrics for both computed balance and a current one.

This reverts commit dad0f67.

…WeightsAsMetric' into ps241024-recordNodeWeightsAsMetric

…WeightsAsMetric

pxsalehi · 2024-11-04T12:26:38Z

Serverless Failure is #116136. I'm not able to relate that to changes in this PR.

…stic#115854) Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. - some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together. - How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code! - I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics. Relates ES-9866

At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in #115854, and relies on the existing `AllocationStatsService`. Relates ES-9873

At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in elastic#115854, and relies on the existing `AllocationStatsService`. Relates ES-9873

…5854) Note that this includes only the three node-level weight components (out of the four), as we were not sure how to aggregate and expose the index-specific component and how useful it will be at all. - some of the weight components are also recalculated or exposed as stats (not APM metrics) else where (e.g. `AllocationStatsService`) but since they are available right where we calculate the weight (which we also want), I have just exported all of them together. - How to pass the weight from the BalancedAllocator which is used as a delegated allocator in the desired balance allocator, and from there to the reconciler where we publish, could probably also be done differently, but using `RoutingNodes` and `DesiredBalance` seemed to make more sense to me. Not sure if it is blasphemy for those more familiar with the allocation code! - I liked the `DesiredBalanceMetrics` and how its used so I tried to clean up its existing usage a bit and colocate the new metrics. Relates ES-9866

At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in #115854, and relies on the existing `AllocationStatsService`. Relates ES-9873

At the end of each reconciliation round, also export the current allocation stats for each node. This is intended to show the gradual progress (or divergence!) towards the desired values exported in elastic#115854, and relies on the existing `AllocationStatsService`. Relates ES-9873

Export desired balance node weight and its components as metrics

03326df

pxsalehi added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Oct 29, 2024

elasticsearchmachine added the v9.0.0 label Oct 29, 2024

pxsalehi requested review from idegtiarenko and nicktindall October 29, 2024 17:00

pxsalehi marked this pull request as ready for review October 29, 2024 17:00

pxsalehi requested a review from DiannaHohensee October 29, 2024 17:01

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 29, 2024

idegtiarenko reviewed Oct 30, 2024

View reviewed changes

...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java Show resolved Hide resolved

move node weight stats to the allocator interface

dad0f67

Merge branch 'main' into ps241024-recordNodeWeightsAsMetric

5ec3841

pxsalehi requested review from idegtiarenko and ywangd October 30, 2024 15:01

ywangd reviewed Oct 31, 2024

View reviewed changes

nicktindall reviewed Oct 31, 2024

View reviewed changes

Merge remote-tracking branch 'upstream/main' into ps241024-recordNode…

d25893c

…WeightsAsMetric

idegtiarenko approved these changes Nov 4, 2024

View reviewed changes

pxsalehi added 5 commits November 4, 2024 11:34

split metrics

e79b5e1

Revert "move node weight stats to the allocator interface"

1defb46

This reverts commit dad0f67.

Merge remote-tracking branch 'refs/remotes/origin/ps241024-recordNode…

8408060

…WeightsAsMetric' into ps241024-recordNodeWeightsAsMetric

clean up

492725d

Merge remote-tracking branch 'upstream/main' into ps241024-recordNode…

e301f19

…WeightsAsMetric

pxsalehi added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Nov 4, 2024

pxsalehi merged commit 69150c8 into elastic:main Nov 4, 2024
16 checks passed

pxsalehi mentioned this pull request Nov 11, 2024

Export current node allocation stats as APM metrics #116585

Merged

ywangd mentioned this pull request Apr 10, 2025

Expose node weights in GetDesiredBalance API response #126579

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export desired balance node weight and its components as metrics #115854

Export desired balance node weight and its components as metrics #115854

pxsalehi commented Oct 29, 2024 •

edited

Loading

elasticsearchmachine commented Oct 29, 2024

idegtiarenko Oct 30, 2024

pxsalehi Oct 30, 2024

idegtiarenko Oct 30, 2024

pxsalehi Oct 30, 2024

idegtiarenko Oct 30, 2024

pxsalehi Oct 30, 2024

pxsalehi commented Oct 30, 2024

ywangd left a comment

ywangd Oct 31, 2024

idegtiarenko Oct 31, 2024

pxsalehi Oct 31, 2024

ywangd Nov 1, 2024

ywangd Oct 31, 2024

idegtiarenko Oct 31, 2024

ywangd Oct 31, 2024

pxsalehi Oct 31, 2024

pxsalehi Oct 31, 2024

pxsalehi Oct 31, 2024

ywangd Nov 1, 2024

pxsalehi Nov 4, 2024

pxsalehi Nov 4, 2024

nicktindall Oct 31, 2024 •

edited

Loading

nicktindall Oct 31, 2024

idegtiarenko Oct 31, 2024

ywangd Oct 31, 2024

pxsalehi Oct 31, 2024

ywangd Nov 1, 2024

pxsalehi Nov 4, 2024

idegtiarenko left a comment

pxsalehi commented Nov 4, 2024

Export desired balance node weight and its components as metrics #115854

Export desired balance node weight and its components as metrics #115854

Conversation

pxsalehi commented Oct 29, 2024 • edited Loading

elasticsearchmachine commented Oct 29, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pxsalehi commented Oct 30, 2024

ywangd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicktindall Oct 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

idegtiarenko left a comment

Choose a reason for hiding this comment

pxsalehi commented Nov 4, 2024

pxsalehi commented Oct 29, 2024 •

edited

Loading

nicktindall Oct 31, 2024 •

edited

Loading