First step optimizing tsdb doc values codec merging. #125403

martijnvg · 2025-03-21T12:58:30Z

The doc values codec iterates several times over the doc values instance that needs to be written to disk. When merging with index sorting enabled, this is much more expensive, because each iteration of the doc values instance performs a merge sort (in order to get the doc ids of the new segment in index-sort order).

There are several reasons why the doc values instance is iterated multiple times:

* To compute stats (number of values, number of docs with a value) required for writing values to disk.
* To write the bitset that indicates which documents have a value (indexed DISI, jump table).
* To write the actual values to disk.
* To write the addresses to disk (in case docs have multiple values).

This applies to numeric doc values, and also to the ordinals of sorted (set) doc values.
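
To make the cost concrete, here is a minimal sketch of the four passes. `MultiPassSketch` and its methods are hypothetical names made up for illustration (they are not Lucene or Elasticsearch classes), and sorting a plain array stands in for the merge sort of doc ids that happens on each iteration.

```java
import java.util.Arrays;

// Hypothetical stand-in for a merged doc values instance; not a Lucene or Elasticsearch class.
final class MultiPassSketch {
    private final long[][] segments; // per-segment sort keys, each already sorted
    private int sortRuns = 0;

    MultiPassSketch(long[][] segments) {
        this.segments = segments;
    }

    // Models one full iteration: the merged order is re-derived on every call.
    long[] iterate() {
        sortRuns++;
        long[] merged = Arrays.stream(segments).flatMapToLong(Arrays::stream).toArray();
        Arrays.sort(merged); // stand-in for merge sorting doc ids into index-sort order
        return merged;
    }

    // Models the four passes listed above and returns how many sorts they triggered.
    int writeField() {
        iterate(); // 1. compute stats
        iterate(); // 2. write the docs-with-value bitset
        iterate(); // 3. write the values
        iterate(); // 4. write the addresses (multi-valued fields)
        return sortRuns;
    }
}
```

In this model every pass pays for the sort again, so removing a single pass removes one full sort over the merged data.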

This PR addresses the first reason why the doc values instance needs to be iterated. The optimization only applies when merging, and only when the segments being merged also use es87 doc values, the codec version is the same, and there are no deletes. Note that this optimized merge is behind a feature flag for now.
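
A rough sketch of the idea, under stated assumptions: if every segment to be merged is compatible, the stats can be summed from per-segment metadata instead of iterating the merged instance. The types `SegmentField` and `MergeStats`, the `"es87"` string, and the version constant are all hypothetical placeholders, not the actual code in this PR.

```java
import java.util.List;
import java.util.Optional;

final class OptimizedMergeSketch {
    // Hypothetical per-segment view of one field; names are illustrative only.
    record SegmentField(String codec, int version, boolean hasDeletes, long numValues, int numDocsWithField) {}

    record MergeStats(long sumNumValues, int sumNumDocsWithField) {}

    static final String EXPECTED_CODEC = "es87"; // assumption: placeholder identifier
    static final int CURRENT_VERSION = 0;        // assumption: placeholder version

    // Returns summed stats when all segments qualify, otherwise empty (fall back to iterating).
    static Optional<MergeStats> precomputedStats(boolean featureEnabled, List<SegmentField> segments) {
        if (featureEnabled == false) {
            return Optional.empty();
        }
        long sumNumValues = 0;
        int sumNumDocsWithField = 0;
        for (SegmentField segment : segments) {
            boolean compatible = EXPECTED_CODEC.equals(segment.codec())
                && segment.version() == CURRENT_VERSION
                && segment.hasDeletes() == false;
            if (compatible == false) {
                return Optional.empty();
            }
            sumNumValues += segment.numValues();
            sumNumDocsWithField += segment.numDocsWithField();
        }
        return Optional.of(new MergeStats(sumNumValues, sumNumDocsWithField));
    }
}
```

The no-deletes condition matters in this sketch because per-segment counts would otherwise include documents that do not survive the merge.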

fixed sorted set dv added unit test with index sorting

martijnvg · 2025-03-21T18:11:32Z

The attached micro benchmark, which tests the tsdb doc values codec with force merge, suggests the following:

Benchmark                                                            (deltaTime)   (nDocs)  (seed)    Mode     Cnt     Score   Error  Units
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge                    1000  13431204      42  sample  322932     0.012 ± 0.037  ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.00              1000  13431204      42  sample            ≈ 10⁻⁴          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.50              1000  13431204      42  sample            ≈ 10⁻⁴          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.90              1000  13431204      42  sample            ≈ 10⁻³          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.95              1000  13431204      42  sample            ≈ 10⁻³          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.99              1000  13431204      42  sample             0.001          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.999             1000  13431204      42  sample             0.006          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p0.9999            1000  13431204      42  sample             0.031          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge:p1.00              1000  13431204      42  sample          3611.296          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge                 1000  13431204      42  sample  301539     0.014 ± 0.044  ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.00           1000  13431204      42  sample            ≈ 10⁻⁴          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.50           1000  13431204      42  sample            ≈ 10⁻⁴          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.90           1000  13431204      42  sample            ≈ 10⁻³          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.95           1000  13431204      42  sample            ≈ 10⁻³          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.99           1000  13431204      42  sample             0.001          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.999          1000  13431204      42  sample             0.006          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p0.9999         1000  13431204      42  sample             0.026          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge:p1.00           1000  13431204      42  sample          4060.086          ms/op
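
For reading the table, a small sketch of the relative-change arithmetic applied to the p1.00 rows (the slowest sample of each variant). `PercentChangeSketch` is a hypothetical helper, not part of the benchmark. A single worst-case sample is noisy, and the error bounds on the mean scores overlap, so this is only a rough indication.

```java
import java.util.Locale;

final class PercentChangeSketch {
    // Relative change of the contender versus the baseline, in percent.
    static double percentChange(double baseline, double contender) {
        return (contender - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // p1.00 in ms/op from the table above: without vs. with the optimized merge.
        double change = percentChange(4060.086, 3611.296);
        System.out.println(String.format(Locale.ROOT, "p1.00 change: %.2f%%", change));
    }
}
```

With these two numbers the slowest sample is roughly 11% lower with the optimized merge.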

martijnvg · 2025-03-21T18:19:25Z

This PR adds more code than I thought I needed, but the good thing is that the 'format' code that writes to disk didn't need to be changed.

martijnvg · 2025-03-24T17:25:03Z

Running elastic/logs track (logging-indexing challenge) without this change as baseline and with this change as contender:

|                                                        Metric |                                   Task |        Baseline |       Contender |            Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|---------------------------------------:|----------------:|----------------:|----------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                                        |  1466.96        |  1376.13        |    -90.8323     |    min |   -6.19% |
|             Min cumulative indexing time across primary shard |                                        |    13.6549      |    13.2534      |     -0.40147    |    min |   -2.94% |
|          Median cumulative indexing time across primary shard |                                        |    51.722       |    49.3463      |     -2.37568    |    min |   -4.59% |
|             Max cumulative indexing time across primary shard |                                        |   530.91        |   516.047       |    -14.8631     |    min |   -2.80% |
|           Cumulative indexing throttle time of primary shards |                                        |     0           |     0           |      0          |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |                                        |     0           |     0           |      0          |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |                                        |     0           |     0           |      0          |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |                                        |     0           |     0           |      0          |    min |    0.00% |
|                       Cumulative merge time of primary shards |                                        |   372.98        |   368.189       |     -4.79062    |    min |   -1.28% |
|                      Cumulative merge count of primary shards |                                        |   287           |   316           |     29          |        |  +10.10% |
|                Min cumulative merge time across primary shard |                                        |     1.87335     |     1.73993     |     -0.13342    |    min |   -7.12% |
|             Median cumulative merge time across primary shard |                                        |     7.52116     |     7.99812     |      0.47696    |    min |   +6.34% |
|                Max cumulative merge time across primary shard |                                        |   198.019       |   183.153       |    -14.8651     |    min |   -7.51% |
|              Cumulative merge throttle time of primary shards |                                        |    95.1297      |    99.2514      |      4.12168    |    min |   +4.33% |
|       Min cumulative merge throttle time across primary shard |                                        |     0.312833    |     0.286733    |     -0.0261     |    min |   -8.34% |
|    Median cumulative merge throttle time across primary shard |                                        |     1.6415      |     1.56289     |     -0.07861    |    min |   -4.79% |
|       Max cumulative merge throttle time across primary shard |                                        |    46.661       |    45.5059      |     -1.15508    |    min |   -2.48% |
|                     Cumulative refresh time of primary shards |                                        |     6.74532     |     5.47937     |     -1.26595    |    min |  -18.77% |
|                    Cumulative refresh count of primary shards |                                        |  1823           |  1820           |     -3          |        |   -0.16% |
|              Min cumulative refresh time across primary shard |                                        |     0.0782167   |     0.147883    |      0.06967    |    min |  +89.07% |
|           Median cumulative refresh time across primary shard |                                        |     0.344267    |     0.25715     |     -0.08712    |    min |  -25.30% |
|              Max cumulative refresh time across primary shard |                                        |     1.79267     |     1.4992      |     -0.29347    |    min |  -16.37% |
|                       Cumulative flush time of primary shards |                                        |    85.7167      |    70.3317      |    -15.385      |    min |  -17.95% |
|                      Cumulative flush count of primary shards |                                        |  1761           |  1738           |    -23          |        |   -1.31% |
|                Min cumulative flush time across primary shard |                                        |     2.4082      |     1.96297     |     -0.44523    |    min |  -18.49% |
|             Median cumulative flush time across primary shard |                                        |     5.72419     |     4.48419     |     -1.24       |    min |  -21.66% |
|                Max cumulative flush time across primary shard |                                        |    13.6429      |    13.0937      |     -0.5492     |    min |   -4.03% |
|                                       Total Young Gen GC time |                                        |   134.348       |    97.011       |    -37.337      |      s |  -27.79% |
|                                      Total Young Gen GC count |                                        |  6484           |  6100           |   -384          |        |   -5.92% |
|                                         Total Old Gen GC time |                                        |     0           |     0           |      0          |      s |    0.00% |
|                                        Total Old Gen GC count |                                        |     0           |     0           |      0          |        |    0.00% |
|                                                    Store size |                                        |    44.6637      |    47.8919      |      3.22829    |     GB |   +7.23% |
|                                                 Translog size |                                        |     0.477316    |     0.609365    |      0.13205    |     GB |  +27.66% |
|                                        Heap used for segments |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                      Heap used for doc values |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                           Heap used for terms |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                           Heap used for norms |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                          Heap used for points |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                   Heap used for stored fields |                                        |     0           |     0           |      0          |     MB |    0.00% |
|                                                 Segment count |                                        |   525           |   582           |     57          |        |  +10.86% |
|                                   Total Ingest Pipeline count |                                        |     4.88641e+08 |     4.88617e+08 | -24000          |        |   -0.00% |
|                                    Total Ingest Pipeline time |                                        |     4.37256e+07 |     4.13417e+07 |     -2.3839e+06 |     ms |   -5.45% |
|                                  Total Ingest Pipeline failed |                                        |     0           |     0           |      0          |        |    0.00% |
|                                                Min Throughput |                       insert-pipelines |    12.8772      |    14.1695      |      1.29227    |  ops/s |  +10.04% |
|                                               Mean Throughput |                       insert-pipelines |    12.8772      |    14.1695      |      1.29227    |  ops/s |  +10.04% |
|                                             Median Throughput |                       insert-pipelines |    12.8772      |    14.1695      |      1.29227    |  ops/s |  +10.04% |
|                                                Max Throughput |                       insert-pipelines |    12.8772      |    14.1695      |      1.29227    |  ops/s |  +10.04% |
|                                      100th percentile latency |                       insert-pipelines |  1107.98        |   986.273       |   -121.702      |     ms |  -10.98% |
|                                 100th percentile service time |                       insert-pipelines |  1107.98        |   986.273       |   -121.702      |     ms |  -10.98% |
|                                                    error rate |                       insert-pipelines |     0           |     0           |      0          |      % |    0.00% |
|                                                Min Throughput |                             insert-ilm |    25.0239      |    27.0675      |      2.04365    |  ops/s |   +8.17% |
|                                               Mean Throughput |                             insert-ilm |    25.0239      |    27.0675      |      2.04365    |  ops/s |   +8.17% |
|                                             Median Throughput |                             insert-ilm |    25.0239      |    27.0675      |      2.04365    |  ops/s |   +8.17% |
|                                                Max Throughput |                             insert-ilm |    25.0239      |    27.0675      |      2.04365    |  ops/s |   +8.17% |
|                                      100th percentile latency |                             insert-ilm |    38.8762      |    35.8612      |     -3.01497    |     ms |   -7.76% |
|                                 100th percentile service time |                             insert-ilm |    38.8762      |    35.8612      |     -3.01497    |     ms |   -7.76% |
|                                                    error rate |                             insert-ilm |     0           |     0           |      0          |      % |    0.00% |
|                                                Min Throughput | validate-package-template-installation |    45.8409      |    49.1556      |      3.31475    |  ops/s |   +7.23% |
|                                               Mean Throughput | validate-package-template-installation |    45.8409      |    49.1556      |      3.31475    |  ops/s |   +7.23% |
|                                             Median Throughput | validate-package-template-installation |    45.8409      |    49.1556      |      3.31475    |  ops/s |   +7.23% |
|                                                Max Throughput | validate-package-template-installation |    45.8409      |    49.1556      |      3.31475    |  ops/s |   +7.23% |
|                                      100th percentile latency | validate-package-template-installation |    21.491       |    20.0744      |     -1.41665    |     ms |   -6.59% |
|                                 100th percentile service time | validate-package-template-installation |    21.491       |    20.0744      |     -1.41665    |     ms |   -6.59% |
|                                                    error rate | validate-package-template-installation |     0           |     0           |      0          |      % |    0.00% |
|                                                Min Throughput |        update-custom-package-templates |    27.9008      |    30.5407      |      2.63995    |  ops/s |   +9.46% |
|                                               Mean Throughput |        update-custom-package-templates |    27.9008      |    30.5407      |      2.63995    |  ops/s |   +9.46% |
|                                             Median Throughput |        update-custom-package-templates |    27.9008      |    30.5407      |      2.63995    |  ops/s |   +9.46% |
|                                                Max Throughput |        update-custom-package-templates |    27.9008      |    30.5407      |      2.63995    |  ops/s |   +9.46% |
|                                      100th percentile latency |        update-custom-package-templates |   429.812       |   392.63        |    -37.1817     |     ms |   -8.65% |
|                                 100th percentile service time |        update-custom-package-templates |   429.812       |   392.63        |    -37.1817     |     ms |   -8.65% |
|                                                    error rate |        update-custom-package-templates |     0           |     0           |      0          |      % |    0.00% |
|                                                Min Throughput |                             bulk-index |   892.867       |   525.49        |   -367.378      | docs/s |  -41.15% |
|                                               Mean Throughput |                             bulk-index | 56625.3         | 58508.5         |   1883.17       | docs/s |   +3.33% |
|                                             Median Throughput |                             bulk-index | 56797.8         | 58562.7         |   1764.92       | docs/s |   +3.11% |
|                                                Max Throughput |                             bulk-index | 57996.2         | 60975.4         |   2979.21       | docs/s |   +5.14% |
|                                       50th percentile latency |                             bulk-index |  1647.44        |   315.067       |  -1332.37       |     ms |  -80.88% |
|                                       90th percentile latency |                             bulk-index |  3106.41        |   609.538       |  -2496.87       |     ms |  -80.38% |
|                                       99th percentile latency |                             bulk-index |  5699.55        |  1121.14        |  -4578.41       |     ms |  -80.33% |
|                                     99.9th percentile latency |                             bulk-index |  8431.32        |  4875.32        |  -3556          |     ms |  -42.18% |
|                                    99.99th percentile latency |                             bulk-index | 11082           |  7170.7         |  -3911.33       |     ms |  -35.29% |
|                                      100th percentile latency |                             bulk-index | 15475           | 11908.2         |  -3566.86       |     ms |  -23.05% |
|                                  50th percentile service time |                             bulk-index |  1646.87        |   316.439       |  -1330.43       |     ms |  -80.79% |
|                                  90th percentile service time |                             bulk-index |  3105.91        |   612.673       |  -2493.24       |     ms |  -80.27% |
|                                  99th percentile service time |                             bulk-index |  5701.42        |  1092.73        |  -4608.69       |     ms |  -80.83% |
|                                99.9th percentile service time |                             bulk-index |  8423.32        |  4887.69        |  -3535.63       |     ms |  -41.97% |
|                               99.99th percentile service time |                             bulk-index | 11085.5         |  7215.36        |  -3870.11       |     ms |  -34.91% |
|                                 100th percentile service time |                             bulk-index | 15475           | 11908.2         |  -3566.86       |     ms |  -23.05% |
|                                                    error rate |                             bulk-index |     0           |     0           |      0          |      % |    0.00% |

dnhatn

I've left some comments. Thanks @martijnvg.

dnhatn · 2025-03-25T00:11:21Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/DocValuesConsumerUtil.java

+            sumNumDocsWithField += entry.numDocsWithField;
+        }
+
+        // Documents marked as deleted should be rare. Maybe in the case of noop operation?


Should we check this before getting docValues?

dnhatn · 2025-03-25T00:12:13Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/ES87TSDBDocValuesConsumer.java

+    @Override
+    public void mergeNumericField(FieldInfo mergeFieldInfo, MergeState mergeState) throws IOException {
+        var result = compatibleWithOptimizedMerge(enableOptimizedMerge, mergeFieldInfo, mergeState, (docValuesProducer) -> {
+            var numeric = docValuesProducer.getNumeric(mergeFieldInfo);


Should we query the NumericEntry directly?

    if (docValuesProducer instanceof ES87TSDBDocValuesProducer producer && producer.version == VERSION_CURRENT) {
        var entry = producer.numerics.get(mergeFieldInfo.name);
        return new DocValuesConsumerUtil.FieldEntry(entry.docsWithFieldOffset, entry.numValues, -1);
    }

This is what I had initially. However, the type of the producer isn't ES87TSDBDocValuesProducer but PerFieldDocValuesFormat.FieldsReader, so I ended up with the current workaround. I'm not happy about it, but I don't see another way.
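
To illustrate why the direct check doesn't match, here is a simplified sketch. All types below are hypothetical stand-ins, and the second method is only an assumption about the general shape of the workaround (going through `getNumeric` and inspecting what comes back), not the PR's actual code.

```java
import java.util.Map;

final class WrappedProducerSketch {
    // Hypothetical, simplified stand-ins; not the Lucene or Elasticsearch types.
    interface Producer {
        Values getNumeric(String field);
    }

    interface Values {}

    record Entry(long numValues, int numDocsWithField) {}

    // Values handed out by the TSDB-style producer carry their metadata entry.
    record TsdbValues(Entry entry) implements Values {}

    record TsdbProducer(Map<String, Entry> numerics) implements Producer {
        public Values getNumeric(String field) {
            return new TsdbValues(numerics.get(field));
        }
    }

    // Models a per-field wrapper that hides the concrete producer behind delegation.
    static final class PerFieldWrapper implements Producer {
        private final Producer delegate;

        PerFieldWrapper(Producer delegate) {
            this.delegate = delegate;
        }

        public Values getNumeric(String field) {
            return delegate.getNumeric(field);
        }
    }

    // Direct type check on the producer: fails as soon as the producer is wrapped.
    static Entry viaProducerType(Producer producer, String field) {
        if (producer instanceof TsdbProducer tsdb) {
            return tsdb.numerics().get(field);
        }
        return null;
    }

    // Assumed workaround shape: go through the public API and inspect the returned values.
    static Entry viaReturnedValues(Producer producer, String field) {
        if (producer.getNumeric(field) instanceof TsdbValues values) {
            return values.entry();
        }
        return null;
    }
}
```

In this model, `viaProducerType` returns `null` for a wrapped producer, while `viaReturnedValues` still reaches the entry because the wrapper passes the delegate's values through unchanged.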

dnhatn · 2025-03-25T01:33:50Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/DocValuesConsumerUtil.java

+        };
+    }
+
+    static NumericDocValues mergeNumericValues(List<NumericDocValuesSub> subs, boolean indexIsSorted) throws IOException {


Did we copy these from DocValuesConsumer? Is there any chance we can avoid copying this?

Yes. I don't see another way here. All the logic that we need here is private to DocValuesConsumer only.
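
For context on what this kind of merge logic does, here is a simplified sketch. As far as I know, Lucene handles this with `DocIDMerger`; the code below is not that implementation. `DocIdMergeSketch` and `Sub` are hypothetical, and the sketch only shows the general technique: concatenate segments when the index is not sorted, otherwise do a k-way merge by mapped doc id using a priority queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class DocIdMergeSketch {
    // Hypothetical sub-iterator: one segment's values plus its doc ids mapped into the merged segment.
    static final class Sub {
        final int[] mappedDocIds; // ascending within a segment
        final long[] values;
        int pos = 0;

        Sub(int[] mappedDocIds, long[] values) {
            this.mappedDocIds = mappedDocIds;
            this.values = values;
        }

        boolean exhausted() {
            return pos >= mappedDocIds.length;
        }

        int currentDoc() {
            return mappedDocIds[pos];
        }
    }

    static List<Long> mergeNumericValues(List<Sub> subs, boolean indexIsSorted) {
        List<Long> out = new ArrayList<>();
        if (indexIsSorted == false) {
            // Without index sorting, segments are appended one after another.
            for (Sub sub : subs) {
                for (long value : sub.values) {
                    out.add(value);
                }
            }
            return out;
        }
        // With index sorting, always emit the sub whose current mapped doc id is smallest.
        PriorityQueue<Sub> queue = new PriorityQueue<>(Comparator.comparingInt(Sub::currentDoc));
        for (Sub sub : subs) {
            if (sub.exhausted() == false) {
                queue.add(sub);
            }
        }
        while (queue.isEmpty() == false) {
            Sub top = queue.poll();
            out.add(top.values[top.pos]);
            top.pos++; // safe to advance: the sub is not in the queue at this point
            if (top.exhausted() == false) {
                queue.add(top);
            }
        }
        return out;
    }
}
```

The priority-queue path is the expensive one, which is why repeating it for every pass over the doc values adds up during sorted merges.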

martijnvg · 2025-03-25T13:28:40Z

Running tsdb track without this change as baseline and with this change as contender:

|                                                                   Metric |                    Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|-------------------------------------------------------------------------:|------------------------:|----------------:|----------------:|------------:|-------:|---------:|
|                               Cumulative indexing time of primary shards |                         |   267.277       |   272.379       |     5.10255 |    min |   +1.91% |
|                        Min cumulative indexing time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                     Median cumulative indexing time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                        Max cumulative indexing time across primary shard |                         |   267.277       |   272.379       |     5.10255 |    min |   +1.91% |
|                      Cumulative indexing throttle time of primary shards |                         |     0           |     0           |     0       |    min |    0.00% |
|               Min cumulative indexing throttle time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|            Median cumulative indexing throttle time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|               Max cumulative indexing throttle time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                                  Cumulative merge time of primary shards |                         |    88.0096      |    85.7457      |    -2.26392 |    min |   -2.57% |
|                                 Cumulative merge count of primary shards |                         |    60           |    59           |    -1       |        |   -1.67% |
|                           Min cumulative merge time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                        Median cumulative merge time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                           Max cumulative merge time across primary shard |                         |    88.0096      |    85.7457      |    -2.26392 |    min |   -2.57% |
|                         Cumulative merge throttle time of primary shards |                         |    15.859       |    14.9264      |    -0.9327  |    min |   -5.88% |
|                  Min cumulative merge throttle time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|               Median cumulative merge throttle time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                  Max cumulative merge throttle time across primary shard |                         |    15.859       |    14.9264      |    -0.9327  |    min |   -5.88% |
|                                Cumulative refresh time of primary shards |                         |     1.61935     |     1.67273     |     0.05338 |    min |   +3.30% |
|                               Cumulative refresh count of primary shards |                         |   163           |   162           |    -1       |        |   -0.61% |
|                         Min cumulative refresh time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                      Median cumulative refresh time across primary shard |                         |     0           |     0           |     0       |    min |    0.00% |
|                         Max cumulative refresh time across primary shard |                         |     1.61935     |     1.67273     |     0.05338 |    min |   +3.30% |
|                                  Cumulative flush time of primary shards |                         |     0.0152333   |     0.0009      |    -0.01433 |    min |  -94.09% |
|                                 Cumulative flush count of primary shards |                         |    14           |    13           |    -1       |        |   -7.14% |
|                           Min cumulative flush time across primary shard |                         |     3.33333e-05 |     3.33333e-05 |     0       |    min |    0.00% |
|                        Median cumulative flush time across primary shard |                         |     3.33333e-05 |     3.33333e-05 |     0       |    min |    0.00% |
|                           Max cumulative flush time across primary shard |                         |     0.0147667   |     0.000433333 |    -0.01433 |    min |  -97.07% |
|                                                  Total Young Gen GC time |                         |    36.007       |    37.219       |     1.212   |      s |   +3.37% |
|                                                 Total Young Gen GC count |                         |  1598           |  1602           |     4       |        |   +0.25% |
|                                                    Total Old Gen GC time |                         |     0           |     0           |     0       |      s |    0.00% |
|                                                   Total Old Gen GC count |                         |     0           |     0           |     0       |        |    0.00% |
|                                                               Store size |                         |     4.64955     |     4.67477     |     0.02522 |     GB |   +0.54% |
|                                                            Translog size |                         |     6.65896e-07 |     6.65896e-07 |     0       |     GB |    0.00% |
|                                                   Heap used for segments |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Heap used for doc values |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                                      Heap used for terms |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                                      Heap used for norms |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                                     Heap used for points |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                              Heap used for stored fields |                         |     0           |     0           |     0       |     MB |    0.00% |
|                                                            Segment count |                         |     6           |    13           |     7       |        | +116.67% |
|                                              Total Ingest Pipeline count |                         |     0           |     0           |     0       |        |    0.00% |
|                                               Total Ingest Pipeline time |                         |     0           |     0           |     0       |     ms |    0.00% |
|                                             Total Ingest Pipeline failed |                         |     0           |     0           |     0       |        |    0.00% |
|                                                           Min Throughput |                   index | 82125.6         | 80755.3         | -1370.32    | docs/s |   -1.67% |
|                                                          Mean Throughput |                   index | 87956.1         | 87051.1         |  -904.987   | docs/s |   -1.03% |
|                                                        Median Throughput |                   index | 88444.2         | 87573.7         |  -870.508   | docs/s |   -0.98% |
|                                                           Max Throughput |                   index | 93514.4         | 91517.8         | -1996.6     | docs/s |   -2.14% |
|                                                  50th percentile latency |                   index |   851.696       |   856.605       |     4.90901 |     ms |   +0.58% |
|                                                  90th percentile latency |                   index |  1113.73        |  1136.79        |    23.0587  |     ms |   +2.07% |
|                                                  99th percentile latency |                   index |  2823.29        |  2941.41        |   118.118   |     ms |   +4.18% |
|                                                99.9th percentile latency |                   index |  4246.65        |  4373.47        |   126.818   |     ms |   +2.99% |
|                                               99.99th percentile latency |                   index |  5733.52        |  5060.5         |  -673.019   |     ms |  -11.74% |
|                                                 100th percentile latency |                   index |  5980.77        |  5550.07        |  -430.702   |     ms |   -7.20% |
|                                             50th percentile service time |                   index |   851.235       |   856.613       |     5.37861 |     ms |   +0.63% |
|                                             90th percentile service time |                   index |  1122.35        |  1135.69        |    13.3398  |     ms |   +1.19% |
|                                             99th percentile service time |                   index |  2823.5         |  2934.71        |   111.212   |     ms |   +3.94% |
|                                           99.9th percentile service time |                   index |  4220.14        |  4372.69        |   152.55    |     ms |   +3.61% |
|                                          99.99th percentile service time |                   index |  5733.52        |  5060.5         |  -673.019   |     ms |  -11.74% |
|                                            100th percentile service time |                   index |  5980.77        |  5550.07        |  -430.702   |     ms |   -7.20% |
|                                                               error rate |                   index |     0           |     0           |     0       |      % |    0.00% |
|                                                           Min Throughput |                 default |    72.7266      |    63.9027      |    -8.82396 |  ops/s |  -12.13% |
|                                                          Mean Throughput |                 default |    72.7266      |    66.0422      |    -6.68449 |  ops/s |   -9.19% |
|                                                        Median Throughput |                 default |    72.7266      |    66.0422      |    -6.68449 |  ops/s |   -9.19% |
|                                                           Max Throughput |                 default |    72.7266      |    68.1816      |    -4.54502 |  ops/s |   -6.25% |
|                                                  50th percentile latency |                 default |    11.3524      |    12.8836      |     1.53121 |     ms |  +13.49% |
|                                                  90th percentile latency |                 default |    11.9501      |    13.4078      |     1.45768 |     ms |  +12.20% |
|                                                  99th percentile latency |                 default |    17.505       |    15.3226      |    -2.18244 |     ms |  -12.47% |
|                                                 100th percentile latency |                 default |    17.6197      |    16.5698      |    -1.04989 |     ms |   -5.96% |
|                                             50th percentile service time |                 default |    11.3524      |    12.8836      |     1.53121 |     ms |  +13.49% |
|                                             90th percentile service time |                 default |    11.9501      |    13.4078      |     1.45768 |     ms |  +12.20% |
|                                             99th percentile service time |                 default |    17.505       |    15.3226      |    -2.18244 |     ms |  -12.47% |
|                                            100th percentile service time |                 default |    17.6197      |    16.5698      |    -1.04989 |     ms |   -5.96% |
|                                                               error rate |                 default |     0           |     0           |     0       |      % |    0.00% |
|                                                           Min Throughput |              default_1k |    29.1361      |    27.9969      |    -1.1392  |  ops/s |   -3.91% |
|                                                          Mean Throughput |              default_1k |    29.5797      |    28.6573      |    -0.92236 |  ops/s |   -3.12% |
|                                                        Median Throughput |              default_1k |    29.6672      |    28.7824      |    -0.88483 |  ops/s |   -2.98% |
|                                                           Max Throughput |              default_1k |    29.8482      |    29.0676      |    -0.78057 |  ops/s |   -2.62% |
|                                                  50th percentile latency |              default_1k |    32.2199      |    33.0253      |     0.80531 |     ms |   +2.50% |
|                                                  90th percentile latency |              default_1k |    33.0007      |    33.6412      |     0.64048 |     ms |   +1.94% |
|                                                  99th percentile latency |              default_1k |    51.6371      |    38.0896      |   -13.5476  |     ms |  -26.24% |
|                                                 100th percentile latency |              default_1k |    54.7434      |    48.1042      |    -6.63918 |     ms |  -12.13% |
|                                             50th percentile service time |              default_1k |    32.2199      |    33.0253      |     0.80531 |     ms |   +2.50% |
|                                             90th percentile service time |              default_1k |    33.0007      |    33.6412      |     0.64048 |     ms |   +1.94% |
|                                             99th percentile service time |              default_1k |    51.6371      |    38.0896      |   -13.5476  |     ms |  -26.24% |
|                                            100th percentile service time |              default_1k |    54.7434      |    48.1042      |    -6.63918 |     ms |  -12.13% |
|                                                               error rate |              default_1k |     0           |     0           |     0       |      % |    0.00% |
|                                                           Min Throughput | date-histo-entire-range |   317.361       |   319.564       |     2.20343 |  ops/s |   +0.69% |
|                                                          Mean Throughput | date-histo-entire-range |   317.361       |   319.564       |     2.20343 |  ops/s |   +0.69% |
|                                                        Median Throughput | date-histo-entire-range |   317.361       |   319.564       |     2.20343 |  ops/s |   +0.69% |
|                                                           Max Throughput | date-histo-entire-range |   317.361       |   319.564       |     2.20343 |  ops/s |   +0.69% |
|                                                  50th percentile latency | date-histo-entire-range |     2.66948     |     2.59981     |    -0.06967 |     ms |   -2.61% |
|                                                  90th percentile latency | date-histo-entire-range |     2.94906     |     2.74951     |    -0.19955 |     ms |   -6.77% |
|                                                  99th percentile latency | date-histo-entire-range |     3.67115     |     3.05257     |    -0.61857 |     ms |  -16.85% |
|                                                 100th percentile latency | date-histo-entire-range |     3.7502      |     3.14106     |    -0.60914 |     ms |  -16.24% |
|                                             50th percentile service time | date-histo-entire-range |     2.66948     |     2.59981     |    -0.06967 |     ms |   -2.61% |
|                                             90th percentile service time | date-histo-entire-range |     2.94906     |     2.74951     |    -0.19955 |     ms |   -6.77% |
|                                             99th percentile service time | date-histo-entire-range |     3.67115     |     3.05257     |    -0.61857 |     ms |  -16.85% |
|                                            100th percentile service time | date-histo-entire-range |     3.7502      |     3.14106     |    -0.60914 |     ms |  -16.24% |
|                                                               error rate | date-histo-entire-range |     0           |     0           |     0       |      % |    0.00% |
|                                                           Min Throughput |          esql-fetch-500 |     8.95722     |     9.22561     |     0.26839 |  ops/s |   +3.00% |
|                                                          Mean Throughput |          esql-fetch-500 |     9.60207     |     9.79872     |     0.19664 |  ops/s |   +2.05% |
|                                                        Median Throughput |          esql-fetch-500 |     9.65652     |     9.85814     |     0.20162 |  ops/s |   +2.09% |
|                                                           Max Throughput |          esql-fetch-500 |    10.0294      |    10.1854      |     0.156   |  ops/s |   +1.56% |
|                                                  50th percentile latency |          esql-fetch-500 |    90.0991      |    89.3751      |    -0.72395 |     ms |   -0.80% |
|                                                  90th percentile latency |          esql-fetch-500 |    97.2384      |    95.5091      |    -1.72926 |     ms |   -1.78% |
|                                                  99th percentile latency |          esql-fetch-500 |   108.207       |   122.953       |    14.7463  |     ms |  +13.63% |
|                                                 100th percentile latency |          esql-fetch-500 |   110.563       |   125.449       |    14.8864  |     ms |  +13.46% |
|                                             50th percentile service time |          esql-fetch-500 |    90.0991      |    89.3751      |    -0.72395 |     ms |   -0.80% |
|                                             90th percentile service time |          esql-fetch-500 |    97.2384      |    95.5091      |    -1.72926 |     ms |   -1.78% |
|                                             99th percentile service time |          esql-fetch-500 |   108.207       |   122.953       |    14.7463  |     ms |  +13.63% |
|                                            100th percentile service time |          esql-fetch-500 |   110.563       |   125.449       |    14.8864  |     ms |  +13.46% |
|                                                               error rate |          esql-fetch-500 |     0           |     0           |     0       |      % |    0.00% |

Unfortunately the improvement is less visible here. Looks like ~3% less time spent on merging.

martijnvg · 2025-03-25T14:18:29Z

Thanks Nhat for the review!

dnhatn

LGTM. Thanks Martijn!

martijnvg · 2025-04-08T10:54:54Z

Unfortunately it isn't possible to run the release tests due to:

* What went wrong:
Could not determine the dependencies of task ':distribution:docker:buildAarch64FipsDockerContext'.
> Could not resolve all dependencies for configuration ':distribution:docker:metricbeat_fips_aarch64'.
   > Could not find beats:metricbeat-fips:9.1.0.
     Required by:
         project :distribution:docker

I did confirm locally that the unit tests pass with a release build (meaning the feature flag is disabled):

./gradlew ":server:test" --tests "org.elasticsearch.index.codec.*" -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dlicense.key=x-pack/license-tools/src/test/resources/public.key

elasticsearchmachine · 2025-04-09T05:51:41Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 125403

…5933)

The change contains the following changes:

* The numDocsWithField field moved from SortedNumericEntry to NumericEntry. Making this statistic always available.
* Store jump table after values in ES87TSDBDocValuesConsumer#writeField(...). Currently it is stored before storing values. This will allow us later to iterate over the SortedNumericDocValues once. When merging, this is expensive as a merge sort on the fly is being executed.

This change will allow all the optimizations that are listed in elastic#125403

* [8.x] First step optimizing tsdb doc values codec merging.

  Backporting #125403 to the 8.x branch.

  The doc values codec iterates a few times over the doc value instance that needs to be written to disk. In case when merging and index sorting is enabled, this is much more expensive, as each time the doc values instance is iterated a merge sorting is performed (in order to get the doc ids of new segment in order of index sorting).

  There are several reasons why the doc value instance is iterated multiple times:

  * To compute stats (num values, number of docs with value) required for writing values to disk.
  * To write bitset that indicate which documents have a value. (indexed disi, jump table)
  * To write the actual values to disk.
  * To write the addresses to disk (in case docs have multiple values)

  This applies for numeric doc values, but also for the ordinals of sorted (set) doc values.

  This PR addresses solving the first reason why doc value instance needs to be iterated. This is done only when in case of merging and when the segments to be merged with are also of type es87 doc values, codec version is the same and there are no deletes. Note this optimized merged is behind a feature flag for now.

* fixed compile errors in benchmark

* Fix DocValuesConsumerUtil (#126836)

  The compatibleWithOptimizedMerge() method doesn't handle codec readers that are wrapped by our source pruning filter codec reader. This change addresses that. Failing to detect this means that the optimized merge will not kick in.
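The compatibility check and the stats shortcut described in this commit message can be illustrated with a small, self-contained sketch. It is not the Elasticsearch implementation: `SegmentView`, `PlainSegment`, `WrappedSegment`, `FieldStats`, `unwrap` and `mergedStats` are hypothetical names invented for illustration, and the codec name string is only a placeholder. The sketch shows why the check has to look through filter-style wrappers, and how the stats could be summed from per-segment metadata when every segment is compatible and has no deletes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins; none of these types exist in Elasticsearch or Lucene.
final class OptimizedMergeSketch {

    /** Per-field stats that a segment is assumed to already store in its metadata. */
    record FieldStats(long numValues, int numDocsWithField) {}

    /** Minimal view of a segment reader for this sketch. */
    interface SegmentView {
        String codecName();
        int codecVersion();
        boolean hasDeletes();
        FieldStats stats();
        /** Returns the wrapped reader, or null if this reader wraps nothing. */
        default SegmentView delegate() { return null; }
    }

    record PlainSegment(String codecName, int codecVersion, boolean hasDeletes, FieldStats stats)
        implements SegmentView {}

    /** A filter-style wrapper that hides the codec details of the reader it wraps. */
    record WrappedSegment(SegmentView in) implements SegmentView {
        public String codecName() { return "wrapper"; }
        public int codecVersion() { return -1; }
        public boolean hasDeletes() { return in.hasDeletes(); }
        public FieldStats stats() { throw new UnsupportedOperationException("hidden by wrapper"); }
        public SegmentView delegate() { return in; }
    }

    /** Peels off wrappers until the innermost reader is reached. */
    static SegmentView unwrap(SegmentView reader) {
        while (reader.delegate() != null) {
            reader = reader.delegate();
        }
        return reader;
    }

    /**
     * Sums the stored stats if every segment uses the expected codec and version
     * and has no deletes; otherwise returns empty, meaning the caller would fall
     * back to iterating the merged doc values to compute the stats.
     */
    static Optional<FieldStats> mergedStats(List<? extends SegmentView> segments, String codecName, int codecVersion) {
        long numValues = 0;
        int numDocsWithField = 0;
        for (SegmentView segment : segments) {
            SegmentView core = unwrap(segment);
            boolean compatible = codecName.equals(core.codecName())
                && core.codecVersion() == codecVersion
                && segment.hasDeletes() == false;
            if (compatible == false) {
                return Optional.empty();
            }
            numValues += core.stats().numValues();
            numDocsWithField += core.stats().numDocsWithField();
        }
        return Optional.of(new FieldStats(numValues, numDocsWithField));
    }

    public static void main(String[] args) {
        SegmentView first = new PlainSegment("ES87TSDB", 0, false, new FieldStats(10, 4));
        SegmentView second = new WrappedSegment(new PlainSegment("ES87TSDB", 0, false, new FieldStats(5, 5)));
        System.out.println(mergedStats(List.of(first, second), "ES87TSDB", 0));
    }
}
```

Without the `unwrap` step the wrapped segment would look incompatible and the sketch would return empty, which mirrors the "optimized merge will not kick in" symptom described above.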

Build the jump table (disi) while iterating over SortedNumericDocValues to encode the values, instead of iterating over SortedNumericDocValues a second time just to build the jump table. When index sorting is active, that separate iteration requires an additional merge sort. Follow up from #125403
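As a rough illustration of the idea, the sketch below compares a two-pass writer with a single-pass one. It is a simplified model and not the codec's code: a plain `java.util.BitSet` stands in for the indexed disi / jump table, and `DocValues`, `Encoded`, `CountingSource`, `twoPasses` and `singlePass` are hypothetical names. `CountingSource` counts how often the source is iterated, since each iteration is what would trigger a merge sort when index sorting is enabled.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

// Simplified model; a BitSet stands in for the real indexed disi / jump table.
final class SinglePassJumpTableSketch {

    /** One document and its values; documents are assumed to arrive in doc id order. */
    record DocValues(int docId, long[] values) {}

    /** Flattened values plus the set of documents that have at least one value. */
    record Encoded(List<Long> values, BitSet docsWithField) {}

    /** Counts iterations; each one models an expensive merge-sorted pass. */
    static final class CountingSource implements Iterable<DocValues> {
        private final List<DocValues> docs;
        private int iterations = 0;

        CountingSource(List<DocValues> docs) {
            this.docs = docs;
        }

        @Override
        public Iterator<DocValues> iterator() {
            iterations++;
            return docs.iterator();
        }

        int iterations() {
            return iterations;
        }
    }

    /** Baseline: one pass to build the bitset, a second pass to collect the values. */
    static Encoded twoPasses(Iterable<DocValues> source) {
        BitSet docsWithField = new BitSet();
        for (DocValues doc : source) {
            docsWithField.set(doc.docId());
        }
        List<Long> values = new ArrayList<>();
        for (DocValues doc : source) {
            for (long value : doc.values()) {
                values.add(value);
            }
        }
        return new Encoded(values, docsWithField);
    }

    /** Builds the bitset while collecting the values, so the source is iterated once. */
    static Encoded singlePass(Iterable<DocValues> source) {
        BitSet docsWithField = new BitSet();
        List<Long> values = new ArrayList<>();
        for (DocValues doc : source) {
            docsWithField.set(doc.docId());
            for (long value : doc.values()) {
                values.add(value);
            }
        }
        return new Encoded(values, docsWithField);
    }

    public static void main(String[] args) {
        List<DocValues> docs = List.of(new DocValues(0, new long[] { 1, 2 }), new DocValues(3, new long[] { 7 }));
        CountingSource source = new CountingSource(docs);
        Encoded encoded = singlePass(source);
        System.out.println(encoded.values() + " " + encoded.docsWithField() + " iterations=" + source.iterations());
    }
}
```

Both methods produce the same result; the difference is only in how many times the source is traversed.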

elasticsearchmachine added the v9.1.0 label Mar 21, 2025

elasticsearchmachine and others added 2 commits March 21, 2025 13:10

[CI] Auto commit changes from spotless

9bd2907

actually use OrdinalMap when merging sorted and sorted dv

65d97e5

fixed sorted set dv added unit test with index sorting

martijnvg force-pushed the mergeSortedNumericField_3 branch from 52f3084 to 65d97e5 Compare March 21, 2025 15:50

martijnvg and others added 4 commits March 21, 2025 17:13

fix test

7369a22

[CI] Auto commit changes from spotless

3b7822d

fix test (2)

ce4b326

fix lost of stuff

486ea20

martijnvg added 5 commits March 21, 2025 20:16

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

16c0a00

iter

984513a

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

5a575d6

iter test

3b53705

moving code around

9fb38b6

dnhatn reviewed Mar 25, 2025

martijnvg added 2 commits March 25, 2025 10:05

benchmark iter

1e0e2f8

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

65741c4

Check for deleted docs before getting doc value instances.

1ec6308

martijnvg added 6 commits March 25, 2025 15:30

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

ccae570

remove doc value skipper check

5e7cc11

Remove getEntryFunction lamda and delegate to doc value instance directly in compatibleWithOptimizedMerge(...) method.

744a665

lower doc count in benchmark

176fac7

added node setting to control whether optimized merge is enabled.

ec998a3

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

5425079

martijnvg added the :StorageEngine/Codec label Mar 25, 2025

martijnvg added 4 commits April 5, 2025 09:04

Remove BaseNumericDocValues and BaseSortedNumericDocValues

39dc98f

improve TsdbDocValueBwcTests

5bfb302

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

b641e01

Assert per field format field info attributes.

5e8ea42

dnhatn approved these changes Apr 7, 2025

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

1c4efa6

martijnvg added the auto-backport (Automatically create backport pull requests when merged), test-full-bwc (Trigger full BWC version matrix tests) and test-release (Trigger CI checks against release build) labels Apr 8, 2025

martijnvg added 2 commits April 8, 2025 09:54

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

57e2996

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

b5f332b

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

66c7efb

martijnvg removed the test-release (Trigger CI checks against release build) label Apr 8, 2025

martijnvg added 2 commits April 8, 2025 16:20

Merge remote-tracking branch 'es/main' into mergeSortedNumericField_3

86b4d22

move per field dv code to dedicated package

a44ab59

martijnvg merged commit 065c583 into elastic:main Apr 9, 2025
17 checks passed

elasticsearchmachine added the backport pending label Apr 9, 2025

martijnvg mentioned this pull request Apr 9, 2025

Tsdb doc values inline building jump table #126499

Merged

martijnvg mentioned this pull request Apr 15, 2025

Coalesce getSortedNumeric calls for ES819 doc values merging #126732

Merged

martijnvg mentioned this pull request Apr 15, 2025

[8.x] First step optimizing tsdb doc values codec merging. #126827

Merged

martijnvg removed the backport pending label Apr 17, 2025
