Add a max_primary_shard_docs to default ILM policies #87246

@jpountz

Description

Elasticsearch ships with some built-in ILM policies, e.g. there is one called 30-days-default that looks like this:

{
  "phases": {
    "hot": {
      "actions": {
        "rollover": {
          "max_primary_shard_size": "50gb",
          "max_age": "30d"
        }
      }
    },
    "warm": {
      "min_age": "2d",
      "actions": {
        "shrink": {
          "number_of_shards": 1
        },
        "forcemerge": {
          "max_num_segments": 1
        }
      }
    },
    "delete": {
      "min_age": "30d",
      "actions":{
        "delete": {}
      }
    }
  },
  "_meta": {
    "description": "built-in ILM policy using the hot and warm phases with a retention of 30 days",
    "managed": true
  }
}

I would suggest adding a max_primary_shard_docs next to the max_primary_shard_size. The actual value needs more discussion, but I'm thinking of something on the order of 200M in order to get a better search experience with space-efficient datasets. Datasets where documents take more than 50GB/200M = 268 bytes on average would still roll over at 50GB, while more space-efficient datasets would roll over at 200M documents. My motivation is the following:

  • While shards could roll over before reaching 50GB, they would still not be small: 200M docs is not a small number of documents for a shard. As a data point, shards have a hard limit of 2B docs.
  • Search performance depends more directly on the number of docs than on byte size, so having more control on the number of docs of a shard would give stronger guarantees on the sort of worst-case latency that shards can provide. This is especially relevant given recent/ongoing efforts to improve the space efficiency of Elasticsearch with runtime fields, doc-value-only fields or synthetic source.
  • Aggregations have optimizations for the case when a range filter rewrites to a match_all query. Bounding the number of docs per primary shard makes it a bit more likely that some shards fully match the query, and the fact that shards that partially match the range filter have a bounded number of docs also helps bound tail latencies.
  • Rollups can only operate on a rolled-over index, so users cannot enjoy the query speedup of rollups until the primary shard size is reached. Bounding the size of primary shards in a way that makes it easier to reason about query performance helps there too.
  • Many use-cases for Elasticsearch involve terms or composite aggregations that need to build global ordinals under the hood. Global ordinals need to be built for the entire shard even though the query might only match a tiny part of it. Bounding the number of docs in a shard also helps bound the time it takes to build global ordinals.
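Concretely, the rollover action in the hot phase of 30-days-default could look like the sketch below. The 200M value is illustrative and, as noted above, the actual number still needs discussion:

```json
{
  "rollover": {
    "max_primary_shard_size": "50gb",
    "max_primary_shard_docs": 200000000,
    "max_age": "30d"
  }
}
```

Since rollover conditions are OR-ed together, whichever of the three thresholds is hit first would trigger the rollover, so existing behavior for byte-dense datasets would be unchanged.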
