Elasticsearch ships with some built-in ILM policies, e.g. there is one called `30-days-default` that looks like this:

```json
{
  "phases": {
    "hot": {
      "actions": {
        "rollover": {
          "max_primary_shard_size": "50gb",
          "max_age": "30d"
        }
      }
    },
    "warm": {
      "min_age": "2d",
      "actions": {
        "shrink": {
          "number_of_shards": 1
        },
        "forcemerge": {
          "max_num_segments": 1
        }
      }
    },
    "delete": {
      "min_age": "30d",
      "actions": {
        "delete": {}
      }
    }
  },
  "_meta": {
    "description": "built-in ILM policy using the hot and warm phases with a retention of 30 days",
    "managed": true
  }
}
```

I would suggest adding a `max_primary_shard_docs` condition next to `max_primary_shard_size`. The actual value needs more discussion, but I'm thinking of something on the order of 200M in order to get a better search experience with space-efficient datasets. Datasets whose documents take more than 50GB / 200M ≈ 268 bytes on average would still roll over at 50GB, while more space-efficient datasets would roll over at 200M documents. My motivation is the following:
- While shards could roll over before reaching 50GB, they would still not be small: 200M docs is not a small number of documents for a shard. As a data point, shards have a hard limit of 2B docs.
- Search performance depends more directly on the number of docs than on byte size, so having more control over the number of docs in a shard would give stronger guarantees on the sort of worst-case latency that shards can provide. This is especially relevant given recent/ongoing efforts to improve the space efficiency of Elasticsearch with runtime fields, doc-value-only fields, or synthetic source.
- Aggregations have optimizations for the case when the range filter rewrites to a `match_all` query. Bounding the number of docs per primary shard makes it a bit more likely that some shards fully match the query, and the fact that shards that partially match the range filter have a bounded number of docs also helps bound tail latencies.
- Rollups can only operate on a rolled-over index, so users cannot enjoy the query speedup of rollups until the primary shard size is reached. Bounding the size of primary shards in a way that makes it easier to reason about query performance helps there too.
- Many use-cases for Elasticsearch involve `terms` or `composite` aggregations that need to build global ordinals under the hood. Global ordinals need to be built for the entire shard even though the query might only match a tiny part of it. Bounding the number of docs in a shard also helps bound the time it takes to build global ordinals.
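Concretely, the `rollover` action in the hot phase could then look like the sketch below. The 200M threshold is illustrative only, pending the discussion above; whichever condition is reached first would trigger the rollover:

```json
"rollover": {
  "max_primary_shard_size": "50gb",
  "max_primary_shard_docs": 200000000,
  "max_age": "30d"
}
```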