Add to allocation architecture guide #125328

DiannaHohensee · 2025-03-20T16:28:32Z

Adds discussion of index shards and their states, as well as
the communication flow between the master node and data
nodes for shard allocation changes.

Relates ES-7874

The first section is an attempt to move some of the allocation brain dump Google document into the architecture guide.

elasticsearchmachine · 2025-03-20T16:29:37Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

DiannaHohensee · 2025-03-20T21:11:03Z

CC @JeremyDahlgren just fyi.

DaveCTurner

Info is all good but I think would be better placed in Javadocs, either duplicated here or else with pointers from here to the Javadocs.

DaveCTurner · 2025-03-21T08:25:52Z

docs/internal/DistributedArchitectureGuide.md

+### Indexes and Shards
+
+Each index consists of a fixed number of primary shards. The number of primary shards cannot be changed for the lifetime of the index. Each
+primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed in runtime. Each


Suggested change

primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed in runtime. Each

primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed dynamically. Each

(or maybe "at runtime")?

DiannaHohensee · 2025-04-15T18:37:53Z

Applied dynamically 👍

DaveCTurner · 2025-03-21T08:31:56Z

docs/internal/DistributedArchitectureGuide.md

+
+Each index consists of a fixed number of primary shards. The number of primary shards cannot be changed for the lifetime of the index. Each
+primary shard can have zero-to-many replicas used for data redundancy. The number of replicas per shard can be changed in runtime. Each
+shard copy (primary or replica) can be in one of four states:


Maybe worth mentioning here that these states are the org.elasticsearch.cluster.routing.ShardRoutingState values, i.e. the ones in the routing table (within the ClusterState), and that the transitions between these states are part of the dance between data node and master node to reflect the lifecycle of a shard:

- UNASSIGNED -> INITIALIZING happens when the master wants the data node to start creating this shard copy.
- INITIALIZING -> STARTED happens when recovery is fully complete and the data node tells the master it's ready to serve requests.
- STARTED -> RELOCATING happens when the master wants to initialize the shard elsewhere.

A failure can take a shard in any state back to UNASSIGNED. Or the shard entry can be removed entirely from the cluster state. In either case, that tells the data node to stop whatever it is doing and shut down the shard.
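The transitions listed above can be sketched as a small state machine. This is an illustrative sketch only: the enum reuses the four state names from the comment, but `ShardStateSketch`, `canTransition`, and the transition table are hypothetical names that do not exist in Elasticsearch, and the real `ShardRoutingState` does not enforce transitions itself. Only the transitions named in this comment are modelled; removal of the shard entry from the cluster state is not.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Elasticsearch code: models only the transitions described in the review comment.
class ShardStateSketch {
    public enum State { UNASSIGNED, INITIALIZING, STARTED, RELOCATING }

    // Forward (non-failure) transitions, one per bullet in the comment.
    private static final Map<State, Set<State>> FORWARD = new EnumMap<>(State.class);
    static {
        FORWARD.put(State.UNASSIGNED, EnumSet.of(State.INITIALIZING));
        FORWARD.put(State.INITIALIZING, EnumSet.of(State.STARTED));
        FORWARD.put(State.STARTED, EnumSet.of(State.RELOCATING));
        FORWARD.put(State.RELOCATING, EnumSet.noneOf(State.class));
    }

    // A failure can take a shard in any other state back to UNASSIGNED;
    // everything else must follow the forward table.
    public static boolean canTransition(State from, State to) {
        if (to == State.UNASSIGNED && from != State.UNASSIGNED) {
            return true;
        }
        return FORWARD.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.UNASSIGNED, State.INITIALIZING)); // true
        System.out.println(canTransition(State.UNASSIGNED, State.STARTED));      // false
        System.out.println(canTransition(State.RELOCATING, State.UNASSIGNED));   // true
    }
}
```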

Also IMO it'd be more discoverable to expand the Javadocs for ShardRoutingState with all this detail rather than hiding it away here.

DiannaHohensee · 2025-04-15T19:50:47Z

Thanks for the info. I was trying to shove Ievgen's brain dump info into here, but I wasn't that familiar with the relevant code. I agree, this information is too low-level. I pushed it down into ShardRoutingState in a separate PR because I want to add a reference to the new documentation in this PR.

I wasn't too sure where to push the state transitions. I guess we can prescribe how to transition states in the ShardRoutingState, too, but it's a little strange to describe how the class is used by other classes when ShardRoutingState doesn't enforce those transitions itself. Oh well 🤷‍♀️

DaveCTurner · 2025-03-21T08:33:40Z

docs/internal/DistributedArchitectureGuide.md

+updated `RoutingTable`. The `RoutingTable` is part of the cluster state, so the master node updates the cluster state with the new
+(incremental) desired shard allocation information. The updated cluster state is then published to the data nodes. Each data node will
+observe any change in shard allocation related to that node and take action to achieve the new shard allocation by initiating creation of a
+new empty shard, starting recovery (copying) of an existing shard from another data node, or remove a shard. When the data node finishes


grammar nit:

Suggested change

new empty shard, starting recovery (copying) of an existing shard from another data node, or remove a shard. When the data node finishes

new empty shard, starting recovery (copying) of an existing shard from another data node, or removing a shard. When the data node finishes

but also I feel we should expand a bit on what "removing a shard" means in this context? It means actively shutting down the corresponding running (or recovering) IndexShard instance, releasing all its resources.
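The data-node side of the flow in the quoted text can be sketched as a reconciliation between the shards that the published routing table assigns to a node and the shards the node currently runs. This is a simplified sketch under stated assumptions: `ReconcileSketch`, `Action`, `plan`, the string shard ids, and the boolean "an existing copy is available elsewhere" flag are all hypothetical, and none of this is how Elasticsearch actually implements the behaviour.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not Elasticsearch code: a data node comparing assigned shards with local shards.
class ReconcileSketch {
    public enum Action { CREATE_EMPTY, RECOVER_FROM_PEER, REMOVE }

    // assigned: shard id -> true if an existing copy can be recovered from another data node (assumed flag).
    // local:    shard ids this node is currently running or recovering.
    public static Map<String, Action> plan(Map<String, Boolean> assigned, Set<String> local) {
        Map<String, Action> actions = new TreeMap<>();
        for (Map.Entry<String, Boolean> entry : assigned.entrySet()) {
            if (!local.contains(entry.getKey())) {
                // Newly assigned to this node: create a new empty shard or recover (copy) an existing one.
                actions.put(entry.getKey(), entry.getValue() ? Action.RECOVER_FROM_PEER : Action.CREATE_EMPTY);
            }
        }
        for (String shardId : local) {
            if (!assigned.containsKey(shardId)) {
                // No longer assigned here: shut the local shard down and release its resources.
                actions.put(shardId, Action.REMOVE);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Map<String, Boolean> assigned = new TreeMap<>();
        assigned.put("logs[0]", false);
        assigned.put("logs[1]", true);
        Set<String> local = new TreeSet<>();
        local.add("logs[2]");
        System.out.println(plan(assigned, local));
    }
}
```

Shards that are both assigned and already local produce no action in this sketch, which matches the idea that a data node only reacts to changes in the allocation that concern it.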

Again, this feels like detail that should be in a Javadoc somewhere, maybe org.elasticsearch.cluster.routing.GlobalRoutingTable, and linked to from RoutingTable, IndexRoutingTable, IndexShardRoutingTable and ShardRouting.

DiannaHohensee · 2025-04-15T18:43:59Z

Applied removing 👍

> but also I feel we should expand a bit on what "removing a shard" means in this context? It means actively shutting down the corresponding running (or recovering) IndexShard instance, releasing all its resources.

The details of removal of a shard seem akin to the details of recovery of a shard, which is a separate component from allocation? It seems off topic to me: this section explains how the master and a data node communicate to change shard allocations. Though perhaps you're thinking about it from some other angle?

DiannaHohensee · 2025-04-15T19:00:07Z

> Again, this feels like detail that should be in a Javadoc somewhere, maybe org.elasticsearch.cluster.routing.GlobalRoutingTable, and linked to from RoutingTable, IndexRoutingTable, IndexShardRoutingTable and ShardRouting.

We should be able to drill down from the high level architecture guide to the package-info and down into a class file, with increasing amounts of detail. The presence of lower level documentation wouldn't eliminate the need for the architecture guide.

I can't find the classes that would contain additional documentation unless I know how to find them. I had not previously heard of GlobalRoutingTable. I consider this documentation's level of detail insufficient for a class level comment and too wide ranging across multiple classes.

DiannaHohensee

Thanks for the feedback, I've finally updated the text.

> Info is all good but I think would be better placed in Javadocs, either duplicated here or else with pointers from here to the Javadocs.

I think it depends on the level of detail, and there's a discoverability concern. I've responded in the threads. I've also spun off a separate PR that should go in first.

DiannaHohensee · 2025-04-18T13:38:04Z

#126875 has been committed and I've updated the text here to reference it. Ready for another review 👍

DaveCTurner

LGTM; we could reasonably link to here from the new ShardRoutingState docs to aid with discoverability.

DiannaHohensee · 2025-04-18T14:21:37Z

> we could reasonably link to here from the new ShardRoutingState docs to aid with discoverability.

I was thinking that the Java class-level documentation was hard to find, not the architecture guide 🙃 I see expectations vary 😁

DaveCTurner · 2025-04-18T14:23:25Z

Links all round :)

Add to allocation architecture guide

a35dfa1

DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, and Team:Distributed Coordination labels Mar 20, 2025

DiannaHohensee self-assigned this Mar 20, 2025

elasticsearchmachine added the v9.1.0 label Mar 20, 2025

DiannaHohensee requested a review from DaveCTurner March 20, 2025 21:10

DaveCTurner reviewed Mar 21, 2025

Merge branch 'main' into 2025/03/20/allocation-guide

c3b987f

DiannaHohensee mentioned this pull request Apr 15, 2025

Improve ShardRoutingState docs #126875

Merged

review changes, remove ShardRoutingState info

10e6a91

DiannaHohensee commented Apr 15, 2025

DiannaHohensee added 2 commits April 18, 2025 09:25

Merge branch 'main' into 2025/03/20/allocation-guide

a705892

touch ups, reference the new ShardRoutingState documentation

101b097

DiannaHohensee requested a review from DaveCTurner April 18, 2025 13:37

DaveCTurner approved these changes Apr 18, 2025

Update docs/internal/DistributedArchitectureGuide.md

2d980f5

Co-authored-by: David Turner <[email protected]>

DiannaHohensee added the auto-merge-without-approval label Apr 18, 2025

DiannaHohensee merged commit 72b4ed2 into elastic:main Apr 18, 2025
6 of 7 checks passed

DiannaHohensee deleted the 2025/03/20/allocation-guide branch April 18, 2025 18:56
