total_shards_per_node can prevent cold phase searchable snapshots from mounting #115479

Closed
n0othing opened this issue Oct 23, 2024 · 7 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@n0othing
Member

n0othing commented Oct 23, 2024

Elasticsearch Version

8.15.3

Installed Plugins

No response

Java Version

bundled

OS Version

ESS

Problem Description

Related to:

The allocate action appears to take place after the searchable snapshot action. This means a setting like total_shards_per_node (e.g. set at index creation time) can cause an index to become stuck in the cold phase, unable to complete its searchable snapshot action, if the number of cold nodes can't accommodate total_shards_per_node.

Steps to Reproduce

Set up a 4-node cluster: 3 hot nodes + 1 cold node.
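
If reproducing on a self-managed cluster (the original repro ran on ESS), node roles along these lines should work; the exact role lists are an assumption, not from the report:

# elasticsearch.yml on each of the three hot nodes
node.roles: [ master, data_hot, data_content, ingest ]

# elasticsearch.yml on the cold node
node.roles: [ data_cold ]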

# Set ILM poll interval to 10s for faster testing

PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
# Create ILM policy

PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 3
          },
          "set_priority": {
            "priority": 100
          }
        },
        "min_age": "0ms"
      },
      "cold": {
        "min_age": "15s",
        "actions": {
          "allocate": {
            "total_shards_per_node": 3
          },
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}
# Create index template w/ ILM policy attached.

PUT _index_template/test-template
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "test-policy",
          "rollover_alias": "test"
        },
        "number_of_replicas": "0",
        "number_of_shards": "3",
        "routing.allocation.total_shards_per_node": "1"
      }
    }
  },
  "index_patterns": [
    "test-*"
  ],
  "composed_of": []
}
# Bootstrap the first index

PUT test-000001
{
  "aliases": {
    "test": {
      "is_write_index": true
    }
  }
}
# Index some data

POST _bulk?refresh=wait_for
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
# Confirm rollover after ~10s

GET _cat/indices/*test*?v
# Check allocation of our searchable snapshot

GET _cat/shards/restored-test-000001?v

index                shard prirep state      docs store dataset ip          node
restored-test-000001 0     p      STARTED       0  227b    227b 10.46.66.98 instance-0000000003
restored-test-000001 1     p      UNASSIGNED                                
restored-test-000001 2     p      UNASSIGNED                
# Check allocation explain

GET _cluster/allocation/explain
{
  "index": "restored-test-000001",
  "shard": 1,
  "primary": true
}
{
  "index": "restored-test-000001",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NEW_INDEX_RESTORED",
    "at": "2024-10-23T22:00:03.477Z",
    "details": "restore_source[found-snapshots/2024.10.23-test-000001-test-policy-74cxzb2lsrimys49turtmg]",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
  "node_allocation_decisions": [
    {
      "node_id": "29eoo3ieTSCKiO7UtSxMmA",
      "node_name": "instance-0000000002",
      "transport_address": "10.46.65.154:19583",
      "node_attributes": {
        "logical_availability_zone": "zone-2",
        "availability_zone": "us-east4-b",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000002.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id": "3idS6nroSrGp0AHT3HVsAg",
      "node_name": "instance-0000000001",
      "transport_address": "10.46.64.24:19733",
      "node_attributes": {
        "logical_availability_zone": "zone-1",
        "availability_zone": "us-east4-a",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000001.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "xpack.installed": "true",
        "transform.config_version": "10.0.0",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id": "aLXOkK2dShSR9AzI94EoGQ",
      "node_name": "instance-0000000003",
      "transport_address": "10.46.66.98:19272",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-east4-a",
        "instance_configuration": "gcp.es.datacold.n2.68x10x190",
        "region": "unknown-region",
        "server_name": "instance-0000000003.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "cold"
      },
      "roles": [
        "data_cold",
        "remote_cluster_client"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "shards_limit",
          "decision": "NO",
          "explanation": "too many shards [1] allocated to this node for index [restored-test-000001], index setting [index.routing.allocation.total_shards_per_node=1]"
        }
      ]
    },
    {
      "node_id": "tKgoJAjoTTar3do03IeJxw",
      "node_name": "instance-0000000000",
      "transport_address": "10.46.66.23:19721",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-east4-c",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000000.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    }
  ]
}

We can unstick the index by clearing total_shards_per_node

PUT restored-test-000001/_settings
{
  "index.routing.allocation.total_shards_per_node": null
}
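
If the shards are not picked up right away after the setting change, retrying previously failed allocations can also help (a general recovery step, not something from the original report):

POST _cluster/reroute?retry_failed=true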

We can also set total_shards_per_node via an allocate action in the warm phase as a workaround.

Logs (if relevant)

No response

@n0othing n0othing added >bug needs:triage Requires assignment of a team area label labels Oct 23, 2024
@astefan astefan added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed needs:triage Requires assignment of a team area label labels Oct 25, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Oct 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@passing

passing commented Dec 9, 2024

We are affected by this issue as well and discovered the following:

When ILM moves an index from the hot to the cold phase, it preserves the current index setting total_shards_per_node while executing several steps, including wait-for-index-color. Only after it has passed the wait-for-index-color step does ILM update the index setting total_shards_per_node to the value configured in the ILM policy for the cold phase.

So when total_shards_per_node needs to be increased for the index to be allocated completely on the cold tier (as in the example given), the index gets stuck in the wait-for-index-color step and some shards never get allocated, causing a red cluster state.

If the total_shards_per_node option in the allocate action is used, ILM would need to apply it to the index before the wait-for-index-color step to solve this issue.
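
For anyone verifying this state, the ILM explain API shows which step an index is stuck on (the index pattern below matches the repro from the issue description and is only illustrative):

GET *test*/_ilm/explain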

@dakrone
Member

dakrone commented Dec 9, 2024

This should be resolved by #112972, which allows configuring the total_shards_per_node within the searchable_snapshot action, instead of the allocate action.

@dakrone dakrone closed this as completed Dec 9, 2024
@passing

passing commented Dec 16, 2024

Hi @dakrone

I have just upgraded our cluster to v8.16.1 and changed the ILM policy according to your suggestion:

      "cold": {
        "min_age": "25h",
        "actions": {
          "set_priority": {
            "priority": 80
          },
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots",
            "force_merge_index": false,
            "total_shards_per_node": 2
          }
        }
      },

Unfortunately, that doesn't have any effect: an index that had "total_shards_per_node": 1 in the hot phase still ends up with the same value on the restored index in the cold tier.
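
One way to confirm the carried-over value on the restored index (index pattern illustrative):

GET restored-*/_settings/index.routing.allocation.total_shards_per_node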

Also looking at the code changed in #112972, this feature still seems to be limited to the frozen phase:

if (TimeseriesLifecycleType.FROZEN_PHASE.equals(this.getKey().phase()) && this.totalShardsPerNode == null) {
    ignoredSettings.add(ShardsLimitAllocationDecider.INDEX_TOTAL_SHARDS_PER_NODE_SETTING.getKey());
}

Therefore, can you please reopen this issue?

@dakrone
Member

dakrone commented Dec 16, 2024

Ahh yes, apologies, I missed the part that this was cold instead of frozen. I'll re-open.

@dakrone dakrone reopened this Dec 16, 2024
@VimCommando
Contributor

As a workaround for anyone else hitting this, you can use the warm ILM phase to change or remove total_shards_per_node, even if you do not have warm nodes in the cluster. Because this is applied before the force merge action, and force merge triggers a snapshot, the new value gets saved, not the one from the hot phase.

Updating the test policy from the original post would look like this:

// PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            // Rollover targets, 3 documents here is for testing only.
            "max_docs": 3
            // Recommended rollover docs
            // "max_primary_shard_docs": "200m"
            // Recommended rollover size
            // "max_primary_shard_size": "30gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        // Setting min_age to 0ms moves shards into the warm phase immediately
        // after rollover. If there are no warm nodes in the cluster, the shards
        // simply stay on the hot nodes.
        "min_age": "0ms",
        "actions": {
          "allocate": {
            // May be set to -1 to disable limits
            "total_shards_per_node": 3
          }
        }
      },
      "cold": {
        // Move the shard to the cold phase 15 seconds after rollover
        "min_age": "15s",
        "actions": {
          // This action will trigger the force merge and snapshot at the end
          // of the warm phase, _after_ `total_shards_per_node` was changed
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}

@dakrone
Member

dakrone commented Apr 14, 2025

I just retested this, and it does appear that #112972 added support for this in the cold tier, not just the frozen one. I tested the scenario on the main branch, where it appears to work. I'm not sure why it did not work for you, but I'm going to close this as completed for now.

@dakrone dakrone closed this as completed Apr 14, 2025