Check array size at parse time #129997

Status: Draft. Wants to merge 20 commits into base: main
Conversation

drempapis (Contributor):

For an index having an array defined as nested

PUT /test-large-array
{
  "mappings": {
    "properties": {
      "array": {
        "type": "nested",           
        "properties": {
          "value": { "type": "integer" }
        }
      }
    }
  }
}

Indexing is protected by the index.mapping.nested_objects.limit setting, which restricts the number of nested objects allowed per document. However, when an index is created via dynamic mapping without predefined limits, no adequate control is in place. This can result in the ingestion of a poison document, i.e. a document containing an array with a vast number of objects, which can severely degrade Elasticsearch cluster performance or even crash nodes through excessive memory consumption and processing overhead.
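For illustration, with dynamic mapping and no limits configured, a single (hypothetical) indexing request such as the one below creates the array field on the fly; nothing prevents the array from carrying millions of such objects instead of three:

POST /dynamic-index/_doc
{
  "array": [
    { "value": 1 },
    { "value": 2 },
    { "value": 3 }
  ]
}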

This is a preliminary draft that attempts to address the scenario by incorporating safeguards: enforcement of the index.mapping.nested_objects.limit setting plus additional validation checks during document parsing. The approach aims to mitigate the risk of ingesting poison documents by combining structural controls with runtime validation that bounds the number of objects.
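Conceptually, the safeguard amounts to counting START_OBJECT tokens while streaming through an array and failing fast once the configured limit is exceeded. A minimal sketch of the idea, where the method name, exception type, and per-object handling are illustrative assumptions rather than the PR's exact code:

import java.io.IOException;

import org.elasticsearch.xcontent.XContentParser;

// Sketch only: fail fast once a configured object limit is exceeded,
// instead of buffering an unbounded array into memory.
static void parseArrayWithLimit(XContentParser parser, long limit) throws IOException {
    long objects = 0;
    XContentParser.Token token;
    while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
        if (token == XContentParser.Token.START_OBJECT) {
            if (++objects > limit) {
                throw new IllegalArgumentException(
                    "The number of objects [" + objects + "] in the array exceeds the limit [" + limit + "]"
                );
            }
            parser.skipChildren(); // stand-in for the real per-object parsing
        }
    }
}

Because the check runs token by token during parsing, the request can be rejected before the oversized array is fully materialized.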

@@ -749,8 +749,20 @@ private static void parseNonDynamicArray(
        XContentParser.Token token;
        XContentParser.Token previousToken = parser.currentToken();
        int elements = 0;
+       int countArray = 0;
+       long nestedDocsLimit = context.indexSettings().getMappingNestedDocsLimit();

Member:

I am not sure if you are looking at the correct limit for the scenario you are covering. Could you review https://www.elastic.co/guide/en/elasticsearch/reference/8.18/mapping-settings-limit.html and its different limits, please? The one you are checking is specific to the nested type, if I am not mistaken.

drempapis (Contributor Author):

I used this setting because it controls the number of elements allowed in arrays when the field is defined as a nested array. I couldn't find a more suitable existing one. If we follow this approach, should we create a new one?

PUT /index-large-array
{
  "mappings": {
    "properties": {
      "array": {
        "type": "nested",           
        "properties": {
          "value": { "type": "integer" }
        }
      }
    }
  }
}

Member:

The issue I see with the setting you used is that it's specific to the nested type, and your change reuses it for something unrelated. I am not sure that's a good idea. Maybe it deserves discussion with the team. An alternative is index.mapping.depth.limit, but it's also not a great fit. Adding a new one is an option too.

drempapis (Contributor Author), Jul 2, 2025:

Introduced a new setting, index.mapping.array_objects.limit, to enforce a limit on array size during parsing.
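For context, index-scoped settings in Elasticsearch are declared through the Setting API. A hypothetical declaration of the new setting (the default value, lower bound, and properties shown here are assumptions, not necessarily what this PR uses):

import org.elasticsearch.common.settings.Setting;

// Hypothetical declaration; the default of 10000 mirrors the default of
// index.mapping.nested_objects.limit and is an assumption.
public static final Setting<Long> INDEX_MAPPING_ARRAY_OBJECTS_LIMIT_SETTING = Setting.longSetting(
    "index.mapping.array_objects.limit",
    10000L,                       // assumed default
    0,                            // minimum allowed value
    Setting.Property.Dynamic,
    Setting.Property.IndexScope
);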

        while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
            if (token == XContentParser.Token.START_OBJECT) {
+               if (countArray++ >= nestedDocsLimit) {

Member:

I worry that this may be a performance hit when parsing documents. We should have micro benchmarks for document parsing under ./benchmarks. It is probably worth double-checking the impact of this change.

drempapis (Contributor Author), Jun 26, 2025:

> We should have micro benchmarks for document parsing under ./benchmarks.

What's the recommended way to micro-benchmark this functionality?

Member:

I would run the existing micro benchmarks for document parsing and check whether they would detect regressions around the counting you added, or add a new one specific to this scenario if needed. We may actually need documents with an array containing many inner objects, which the existing benchmarks likely don't have.
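For reference, JMH benchmarks in the Elasticsearch repository are typically launched through the benchmarks module; assuming the new class is called DocumentParserBenchmark (as in the code further below), the invocation would look roughly like:

./gradlew -p benchmarks run --args 'DocumentParserBenchmark'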

drempapis (Contributor Author), Jul 1, 2025:

I ran a micro-benchmark to measure the time it takes to parse documents with varying array sizes. Run 1 is for the main branch; Run 2 is for the current branch (with this update).


Array Size | Run 1 Avg (ns/op) | Run 2 Avg (ns/op) | Difference (ns) | % Change
--------------------------------------------------------------------------------
10         |      1983.711     |      2223.238     |    +239.527    |   +12.07%
100        |     15356.069     |     15497.797     |    +141.728    |    +0.92%
1000       |    141146.977     |    145791.315     |   +4644.338    |    +3.29%
10000      |   1432616.938     |   1443769.196     |  +11152.258    |    +0.78%

Performance degraded slightly in the second run for all array sizes, though for the larger arrays (1000 and 10000) the relative overhead remained small.

The array size of 10 shows the largest relative increase (+12.07%), though the absolute impact (~240 ns/op) is small.

I executed the benchmark locally, aiming to minimize system load by stopping all resource-intensive processes. However, some system noise or garbage collection activity may have influenced the results.

The code I executed was the following:

// Imports assume the Elasticsearch benchmarks module layout.
import org.elasticsearch.benchmark.index.mapper.MapperServiceFactory;
import org.elasticsearch.common.UUIDs;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.logging.LogConfigurator;
import org.elasticsearch.index.mapper.LuceneDocument;
import org.elasticsearch.index.mapper.MapperService;
import org.elasticsearch.index.mapper.SourceToParse;
import org.elasticsearch.xcontent.XContentType;
import org.openjdk.jmh.annotations.*;

import java.util.List;
import java.util.concurrent.TimeUnit;

@Fork(value = 1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class DocumentParserBenchmark {

    static {
        LogConfigurator.loadLog4jPlugins();
        LogConfigurator.configureESLogging();
    }

    @Param({"10", "100", "1000", "10000"})
    private int arraySize;

    private MapperService mapperService;

    private SourceToParse sourceToParse;

    @Setup
    public void setUp() {
        this.mapperService = MapperServiceFactory.create("""
            {
              "properties": {
                 "array": {
                    "properties": {
                       "value": { "type": "integer" }
                    }
                 }
              }
            }
            \s""");

        this.sourceToParse = new SourceToParse(UUIDs.randomBase64UUID(),
            new BytesArray(generateDocWithArray(arraySize)), XContentType.JSON);
    }

    @Benchmark
    public List<LuceneDocument> benchmarkDocumentWithLargeArray() {
        return mapperService.documentMapper().parse(sourceToParse).docs();
    }

    private static String generateDocWithArray(int arraySize) {
        StringBuilder sb = new StringBuilder();
        sb.append("{ \"array\": [");
        for (int i = 1; i <= arraySize; i++) {
            sb.append("{\"value\": ").append(i).append("}");
            if (i < arraySize) {
                sb.append(",");
            }
        }
        sb.append("]}");
        return sb.toString();
    }
}

@drempapis added the Team:Search Foundations and :Search Foundations/Search labels on Jun 26, 2025
Labels: >non-issue, :Search Foundations/Search, Team:Search Foundations, v9.2.0