Check array size at parse time #129997
Conversation
@@ -749,8 +749,20 @@ private static void parseNonDynamicArray(
        XContentParser.Token token;
        XContentParser.Token previousToken = parser.currentToken();
        int elements = 0;
        int countArray = 0;
        long nestedDocsLimit = context.indexSettings().getMappingNestedDocsLimit();
I am not sure you are looking at the correct limit for the scenario you are covering. Could you review https://www.elastic.co/guide/en/elasticsearch/reference/8.18/mapping-settings-limit.html and its different limits, please? The one you are checking is specific to the nested type, if I am not mistaken.
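For context, the limits on that page are per-index settings; below is a minimal sketch of how two of them are applied (the index name and values are illustrative only, not recommendations):

PUT /my-index
{
  "settings": {
    "index.mapping.nested_objects.limit": 10000,
    "index.mapping.depth.limit": 20
  }
}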
I used this setting because it controls the number of elements allowed in arrays when the field is defined as a nested array. I couldn't find a more suitable existing one. If we follow this approach, should we create a new one?
PUT /index-large-array
{
  "mappings": {
    "properties": {
      "array": {
        "type": "nested",
        "properties": {
          "value": { "type": "integer" }
        }
      }
    }
  }
}
The issue I see with the setting you used is that it's specific to the nested type, and your change reuses it for something unrelated. I am not sure that's a good idea. Maybe it deserves discussion with the team. An alternative is index.mapping.depth.limit, but it's also not a great fit. Adding a new one is an option too.
Introduced a new setting, index.mapping.array_objects.limit, to enforce a limit on array size during parsing.
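As an illustration of how the new setting would be used, here is a sketch of applying it at index creation time (the setting name comes from the comment above; the value shown and whether the setting can be updated dynamically are assumptions):

PUT /my-index
{
  "settings": {
    "index.mapping.array_objects.limit": 10000
  }
}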
        while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
            if (token == XContentParser.Token.START_OBJECT) {
                if (countArray++ >= nestedDocsLimit) {
I worry that this could be a performance hit when parsing documents. We should have micro benchmarks for document parsing under ./benchmarks. It is probably worth double-checking the impact of this change.
We should have micro benchmarks for document parsing under ./benchmarks.
What’s the recommended way to micro-benchmark this functionality?
I would run the existing micro benchmarks for document parsing and check whether they detect regressions around the counting you added, or add a new benchmark specific to this scenario if needed. We may actually need documents with an array containing many inner objects, which we likely don't have in the existing benchmarks.
I did some micro-benchmarking to measure the time it takes to parse documents with varying array sizes. Run 1 is for the main branch and Run 2 is for the current branch (with this change).
Array Size | Run 1 Avg (ns/op) | Run 2 Avg (ns/op) | Difference (ns) | % Change
--------------------------------------------------------------------------------
10 | 1983.711 | 2223.238 | +239.527 | +12.07%
100 | 15356.069 | 15497.797 | +141.728 | +0.92%
1000 | 141146.977 | 145791.315 | +4644.338 | +3.29%
10000 | 1432616.938 | 1443769.196 | +11152.258 | +0.78%
Performance degraded slightly in the second run for all array sizes. For the larger arrays (1000 and 10000) the relative overhead remained small, while the array size of 10 shows the largest relative increase, though the absolute impact there is small.
I executed the benchmark locally, aiming to minimize system load by stopping all resource-intensive processes. However, some system noise or garbage collection activity may have influenced the results.
The code I executed was the following:

@Fork(value = 1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class DocumentParserBenchmark {

    static {
        // Required so document parsing can run outside a full node.
        LogConfigurator.loadLog4jPlugins();
        LogConfigurator.configureESLogging();
    }

    @Param({"10", "100", "1000", "10000"})
    private int arraySize;

    private MapperService mapperService;
    private SourceToParse sourceToParse;

    @Setup
    public void setUp() {
        // Mapping with a plain (non-nested) object field holding the array elements.
        this.mapperService = MapperServiceFactory.create("""
            {
              "properties": {
                "array": {
                  "properties": {
                    "value": { "type": "integer" }
                  }
                }
              }
            }
            """);
        // One source document per @Param value, containing `arraySize` inner objects.
        this.sourceToParse = new SourceToParse(
            UUIDs.randomBase64UUID(),
            new BytesArray(generateDocWithArray(arraySize)),
            XContentType.JSON
        );
    }

    @Benchmark
    public List<LuceneDocument> benchmarkDocumentWithLargeArray() {
        return mapperService.documentMapper().parse(sourceToParse).docs();
    }

    // Builds {"array":[{"value":1}, ... ,{"value":N}]} for the given array size.
    private static String generateDocWithArray(int arraySize) {
        StringBuilder sb = new StringBuilder();
        sb.append("{ \"array\": [");
        for (int i = 1; i <= arraySize; i++) {
            sb.append("{\"value\": ").append(i).append("}");
            if (i < arraySize) {
                sb.append(",");
            }
        }
        sb.append("]}");
        return sb.toString();
    }
}
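For anyone reproducing this: JMH benchmarks in this repository are normally launched from the benchmarks module, e.g. ./gradlew -p benchmarks run --args 'DocumentParserBenchmark' (assuming the class above is added under ./benchmarks; see benchmarks/README.md for the exact invocation).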
For an index with an array field defined as nested, indexing is protected by the index.mapping.nested_objects.limit setting, which restricts the number of nested objects allowed per document. However, when an index is created via dynamic mapping without predefined limits, no adequate control is in place. This can result in the ingestion of a poison document: a document containing an array with a vast number of objects, which can severely degrade Elasticsearch cluster performance or even lead to node crashes due to excessive memory consumption and processing overhead.

This is a preliminary draft that attempts to address that scenario by incorporating safeguards: enforcement of the index.mapping.nested_objects.limit setting and additional validation checks during document parsing. The approach aims to mitigate the risk of ingesting poison documents by combining structural controls and runtime validation to limit the number of objects.
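To make the failure mode concrete, here is a sketch of the kind of poison document this draft targets (index and field names are illustrative); with dynamic mappings there is currently nothing that bounds how many inner objects such an array may contain, so the array below could just as well hold millions of entries:

POST /my-dynamic-index/_doc
{
  "array": [
    { "value": 1 },
    { "value": 2 },
    { "value": 3 }
  ]
}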