total_shards_per_node can prevent cold phase searchable snapshots from mounting #115479

Closed
n0othing opened this issue Oct 23, 2024 · 7 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@n0othing
Member

n0othing commented Oct 23, 2024

Elasticsearch Version

8.15.3

Installed Plugins

No response

Java Version

bundled

OS Version

ESS

Problem Description

Related to:

The allocate action appears to take place after the searchable snapshot action. This means a setting like total_shards_per_node (e.g. set at index creation time) can cause an index to become stuck in the cold phase, unable to complete its searchable snapshot action, if the number of cold nodes can't accommodate total_shards_per_node.

Steps to Reproduce

Set up a 4-node cluster: 3 hot nodes + 1 cold node.
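
If reproducing on a self-managed cluster (the original repro ran on ESS), node roles along these lines should work; the exact role lists are an assumption, not from the report:

# elasticsearch.yml on each of the three hot nodes
node.roles: [ master, data_hot, data_content, ingest ]

# elasticsearch.yml on the cold node
node.roles: [ data_cold ]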

# Set ILM poll interval to 10s for faster testing

PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
# Create ILM policy

PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_docs": 3
          },
          "set_priority": {
            "priority": 100
          }
        },
        "min_age": "0ms"
      },
      "cold": {
        "min_age": "15s",
        "actions": {
          "allocate": {
            "total_shards_per_node": 3
          },
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}
# Create index template w/ ILM policy attached.

PUT _index_template/test-template
{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "test-policy",
          "rollover_alias": "test"
        },
        "number_of_replicas": "0",
        "number_of_shards": "3",
        "routing.allocation.total_shards_per_node": "1"
      }
    }
  },
  "index_patterns": [
    "test-*"
  ],
  "composed_of": []
}
# Bootstrap the first index

PUT test-000001
{
  "aliases": {
    "test": {
      "is_write_index": true
    }
  }
}
# Index some data

POST _bulk?refresh=wait_for
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
{ "index" : { "_index" : "test" } }
{ "field" : "Hello World!" }
# Confirm rollover after ~10s

GET _cat/indices/*test*?v
# Check allocation of our searchable snapshot

GET _cat/shards/restored-test-000001?v

index                shard prirep state      docs store dataset ip          node
restored-test-000001 0     p      STARTED       0  227b    227b 10.46.66.98 instance-0000000003
restored-test-000001 1     p      UNASSIGNED                                
restored-test-000001 2     p      UNASSIGNED                
# Check allocation explain

GET _cluster/allocation/explain
{
  "index": "restored-test-000001",
  "shard": 1,
  "primary": true
}
{
  "index": "restored-test-000001",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NEW_INDEX_RESTORED",
    "at": "2024-10-23T22:00:03.477Z",
    "details": "restore_source[found-snapshots/2024.10.23-test-000001-test-policy-74cxzb2lsrimys49turtmg]",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
  "node_allocation_decisions": [
    {
      "node_id": "29eoo3ieTSCKiO7UtSxMmA",
      "node_name": "instance-0000000002",
      "transport_address": "10.46.65.154:19583",
      "node_attributes": {
        "logical_availability_zone": "zone-2",
        "availability_zone": "us-east4-b",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000002.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id": "3idS6nroSrGp0AHT3HVsAg",
      "node_name": "instance-0000000001",
      "transport_address": "10.46.64.24:19733",
      "node_attributes": {
        "logical_availability_zone": "zone-1",
        "availability_zone": "us-east4-a",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000001.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "xpack.installed": "true",
        "transform.config_version": "10.0.0",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id": "aLXOkK2dShSR9AzI94EoGQ",
      "node_name": "instance-0000000003",
      "transport_address": "10.46.66.98:19272",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-east4-a",
        "instance_configuration": "gcp.es.datacold.n2.68x10x190",
        "region": "unknown-region",
        "server_name": "instance-0000000003.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "cold"
      },
      "roles": [
        "data_cold",
        "remote_cluster_client"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "shards_limit",
          "decision": "NO",
          "explanation": "too many shards [1] allocated to this node for index [restored-test-000001], index setting [index.routing.allocation.total_shards_per_node=1]"
        }
      ]
    },
    {
      "node_id": "tKgoJAjoTTar3do03IeJxw",
      "node_name": "instance-0000000000",
      "transport_address": "10.46.66.23:19721",
      "node_attributes": {
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-east4-c",
        "instance_configuration": "gcp.es.datahot.n2.68x32x45",
        "region": "unknown-region",
        "server_name": "instance-0000000000.0fef76a1c07c49d29a1b3dd34e5fae4d",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "hot"
      },
      "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "no",
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    }
  ]
}

We can unstick the index by clearing total_shards_per_node

PUT restored-test-000001/_settings
{
  "index.routing.allocation.total_shards_per_node": null
}
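
If the shards are not picked up right away after the setting change, retrying previously failed allocations can also help (a general recovery step, not something from the original report):

POST _cluster/reroute?retry_failed=true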

We can also set total_shards_per_node via an allocate action in the warm phase as a workaround.

Logs (if relevant)

No response

@n0othing n0othing added >bug needs:triage Requires assignment of a team area label labels Oct 23, 2024
@astefan astefan added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed needs:triage Requires assignment of a team area label labels Oct 25, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Oct 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@passing

passing commented Dec 9, 2024

We are affected by this issue as well and discovered the following:

When ILM moves an index from the hot to the cold phase, it preserves the current index setting total_shards_per_node while executing several steps, including wait-for-index-color. Only after it has passed the wait-for-index-color step does ILM update the index setting total_shards_per_node to the value configured in the ILM policy for the cold phase.

So when total_shards_per_node needs to be increased for the index to be allocated completely on the cold tier (as in the example given), the index gets stuck in the wait-for-index-color step and some shards never get allocated, causing a red cluster state.

If the total_shards_per_node option in the allocate action is used, ILM would need to apply it to the index before the wait-for-index-color step to solve this issue.
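
For anyone verifying this state, the ILM explain API shows which step an index is stuck on (the index pattern below matches the repro from the issue description and is only illustrative):

GET *test*/_ilm/explain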

@dakrone
Member

dakrone commented Dec 9, 2024

This should be resolved by #112972, which allows configuring the total_shards_per_node within the searchable_snapshot action, instead of the allocate action.

@dakrone dakrone closed this as completed Dec 9, 2024
@passing

passing commented Dec 16, 2024

Hi @dakrone

I have just upgraded our cluster to v8.16.1 and changed the ILM policy according to your suggestion:

      "cold": {
        "min_age": "25h",
        "actions": {
          "set_priority": {
            "priority": 80
          },
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots",
            "force_merge_index": false,
            "total_shards_per_node": 2
          }
        }
      },

Unfortunately, that doesn't have any effect: an index that had "total_shards_per_node": 1 in the hot phase still ends up with the same value on the restored index in the cold tier.
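
One way to confirm the carried-over value on the restored index (index pattern illustrative):

GET restored-*/_settings/index.routing.allocation.total_shards_per_node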

Also looking at the code changed in #112972, this feature still seems to be limited to the frozen phase:

if (TimeseriesLifecycleType.FROZEN_PHASE.equals(this.getKey().phase()) && this.totalShardsPerNode == null) {
    ignoredSettings.add(ShardsLimitAllocationDecider.INDEX_TOTAL_SHARDS_PER_NODE_SETTING.getKey());
}

Therefore, can you please reopen this issue?

@dakrone
Member

dakrone commented Dec 16, 2024

Ahh yes, apologies, I missed the part that this was cold instead of frozen. I'll re-open.

@dakrone dakrone reopened this Dec 16, 2024
@VimCommando
Contributor

As a workaround for anyone else hitting this, you can use the warm ILM phase to change or remove total_shards_per_node, even if you do not have warm nodes in the cluster. Because this is applied before the force merge action, and force merge triggers a snapshot, the new value gets saved, not the one from the hot phase.

Updating the test policy from the original post would look like this:

// PUT _ilm/policy/test-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            // Rollover targets, 3 documents here is for testing only.
            "max_docs": 3
            // Recommended rollover docs
            // "max_primary_shard_docs": "200m"
            // Recommended rollover size
            // "max_primary_shard_size": "30gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        // Setting min_age to 0ms moves shards into the warm phase immediately
        // after rollover. If there are no warm nodes in the cluster, the shards
        // simply stay on the hot nodes.
        "min_age": "0ms",
        "actions": {
          "allocate": {
            // May be set to -1 to disable limits
            "total_shards_per_node": 3
          }
        }
      },
      "cold": {
        // Move the shard to the cold phase 15 seconds after rollover
        "min_age": "15s",
        "actions": {
          // This action will trigger the force merge and snapshot at the end
          // of the warm phase, _after_ `total_shards_per_node` was changed
          "searchable_snapshot": {
            "snapshot_repository": "found-snapshots"
          }
        }
      }
    }
  }
}

@dakrone
Member

dakrone commented Apr 14, 2025

I just retested this, and it does appear that #112972 added support for this in the cold tier, not just the frozen one. I tested the scenario on the main branch, where it appears to work. I'm not sure why it did not work for you, but I'm going to close this as completed for now.

@dakrone dakrone closed this as completed Apr 14, 2025