
Skip UTF8 to UTF16 conversion during document indexing #126492


Open
jordan-powers wants to merge 43 commits into base: main

Conversation

@jordan-powers (Contributor) commented Apr 8, 2025

Initial implementation that avoids the UTF-8 to UTF-16 and back to UTF-8 conversion that currently happens when parsing JSON.

Currently, the optimization only applies to ASCII-only strings that do not contain escaped characters (e.g. \").

Also, the optimization is currently only implemented for keyword fields.

The core logic to return the underlying bytes is implemented in ESUTF8StreamJsonParser. The other "ES" versions of Jackson classes are just the glue logic required to make Jackson create instances of ESUTF8StreamJsonParser instead of the superclass UTF8StreamJsonParser.
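For context, here is a minimal, self-contained sketch of the round trip being avoided and of the eligibility check described above. It is illustrative only (class and method names are made up, and the real logic lives in ESUTF8StreamJsonParser), but it shows why an ASCII-only, escape-free value can be passed through as raw bytes:

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTripSketch {
    // Today's path (conceptually): UTF-8 bytes -> UTF-16 String -> UTF-8 bytes.
    static byte[] viaString(byte[] utf8, int offset, int len) {
        String s = new String(utf8, offset, len, StandardCharsets.UTF_8); // decode to UTF-16
        return s.getBytes(StandardCharsets.UTF_8);                        // re-encode to UTF-8
    }

    // Optimized idea: an ASCII-only, escape-free slice already is the UTF-8 form we need.
    static byte[] direct(byte[] utf8, int offset, int len) {
        byte[] out = new byte[len];
        System.arraycopy(utf8, offset, out, 0, len);
        return out;
    }

    // Eligibility check along the lines the description requires:
    // every byte is ASCII (< 0x80) and there is no backslash, so no JSON escapes to decode.
    static boolean isAsciiWithoutEscapes(byte[] utf8, int offset, int len) {
        for (int i = offset; i < offset + len; i++) {
            if (utf8[i] < 0 || utf8[i] == '\\') {
                return false;
            }
        }
        return true;
    }
}
```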


I ran the prototype in esbench using the logging-indexing challenge on a standard-mode index, and I got these results:

Baseline:

|                                                         Metric |                                   Task |           Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   855.769       |    min |
|             Min cumulative indexing time across primary shards |                                        |     3.16602     |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.4853      |    min |
|             Max cumulative indexing time across primary shards |                                        |   142.816       |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0           |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0           |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|                        Cumulative merge time of primary shards |                                        |   610.819       |    min |
|                       Cumulative merge count of primary shards |                                        |  1018           |        |
|                Min cumulative merge time across primary shards |                                        |     0.395583    |    min |
|             Median cumulative merge time across primary shards |                                        |     3.37862     |    min |
|                Max cumulative merge time across primary shards |                                        |   142.732       |    min |
|               Cumulative merge throttle time of primary shards |                                        |   282.793       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.14925     |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     1.39443     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    63.0728      |    min |
|                      Cumulative refresh time of primary shards |                                        |    58.5789      |    min |
|                     Cumulative refresh count of primary shards |                                        | 19551           |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0121333   |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0492833   |    min |
|              Max cumulative refresh time across primary shards |                                        |    19.3951      |    min |
|                        Cumulative flush time of primary shards |                                        |    72.752       |    min |
|                       Cumulative flush count of primary shards |                                        |  7713           |        |
|                Min cumulative flush time across primary shards |                                        |     0.47355     |    min |
|             Median cumulative flush time across primary shards |                                        |     1.58855     |    min |
|                Max cumulative flush time across primary shards |                                        |     5.32642     |    min |

Contender:

|                                                         Metric |                                   Task |          Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|---------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   724.972      |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.45815    |    min |
|          Median cumulative indexing time across primary shards |                                        |     9.29517    |    min |
|             Max cumulative indexing time across primary shards |                                        |   127          |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0          |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0          |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|                        Cumulative merge time of primary shards |                                        |   611.619      |    min |
|                       Cumulative merge count of primary shards |                                        |  1135          |        |
|                Min cumulative merge time across primary shards |                                        |     0.991983   |    min |
|             Median cumulative merge time across primary shards |                                        |     3.72563    |    min |
|                Max cumulative merge time across primary shards |                                        |   144.073      |    min |
|               Cumulative merge throttle time of primary shards |                                        |   277.04       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.426367   |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     1.7056     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    65.306      |    min |
|                      Cumulative refresh time of primary shards |                                        |    62.5784     |    min |
|                     Cumulative refresh count of primary shards |                                        | 18632          |        |
|              Min cumulative refresh time across primary shards |                                        |     0.01085    |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0686167  |    min |
|              Max cumulative refresh time across primary shards |                                        |    19.9901     |    min |
|                        Cumulative flush time of primary shards |                                        |    69.0186     |    min |
|                       Cumulative flush count of primary shards |                                        |  6940          |        |
|                Min cumulative flush time across primary shards |                                        |     0.448817   |    min |
|             Median cumulative flush time across primary shards |                                        |     1.4913     |    min |
|                Max cumulative flush time across primary shards |                                        |     4.95223    |    min |

@jordan-powers (Contributor, Author)

Ok, I figured out a way to avoid cloning the file. The problem was that detecting the encoding may consume some bytes from the beginning of the input if a BOM is present, and the counter tracking how many bytes had been consumed is internal to ByteSourceJsonBootstrapper and not visible. However, I can use ByteSourceJsonBootstrapper just to detect the encoding, and then, if it's UTF-8, manually check for and skip over the BOM.
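For illustration, a standalone sketch of that BOM handling (hypothetical names, not the PR's actual code): the UTF-8 byte order mark is the byte sequence 0xEF 0xBB 0xBF, so once the encoding is known to be UTF-8 it can be skipped by advancing the start offset.

```java
class Utf8BomSketch {
    // Returns the offset at which JSON content starts, skipping a UTF-8 BOM if one is present.
    static int skipUtf8Bom(byte[] input, int offset, int len) {
        if (len >= 3
            && input[offset] == (byte) 0xEF
            && input[offset + 1] == (byte) 0xBB
            && input[offset + 2] == (byte) 0xBF) {
            return offset + 3;
        }
        return offset;
    }
}
```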

Also, looking at the API docs, it seems Elasticsearch only supports UTF-8 inputs anyway, so maybe I don't need to check the character encoding at all. That said, I don't know for certain that Elasticsearch won't parse JSON from another source (such as a configuration file) that's not UTF-8 encoded, so I'll leave the check in.

@martijnvg (Member) left a comment

This looks great, Jordan! Happy to see we don't need to fork entire files to get this.

Did you also see an increase in bulk indexing median throughput when running the elastic/logs track?

/**
* Class that holds either a UTF-16 String or a UTF-8 BytesRef, and lazily converts between the two.
*/
private static class RawString {
Member

Can this be a record?

@jordan-powers (Contributor, Author)

Possibly, but it would be a bit tough since records are immutable. The lazy conversion works by only setting the bytesValue or stringValue once they've been converted, which wouldn't work if the properties were final.
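For illustration, a minimal sketch of that lazy-conversion pattern (field and method names are hypothetical, and the real class holds a BytesRef rather than a byte[]): whichever representation is asked for first is computed once and cached in a mutable field, which is exactly what final record components would rule out.

```java
import java.nio.charset.StandardCharsets;

class LazyRawStringSketch {
    private String stringValue; // UTF-16 view, computed on demand
    private byte[] bytesValue;  // UTF-8 view, computed on demand

    LazyRawStringSketch(String value) { this.stringValue = value; }
    LazyRawStringSketch(byte[] utf8) { this.bytesValue = utf8; }

    String string() {
        if (stringValue == null) {
            stringValue = new String(bytesValue, StandardCharsets.UTF_8); // convert lazily, then cache
        }
        return stringValue;
    }

    byte[] bytes() {
        if (bytesValue == null) {
            bytesValue = stringValue.getBytes(StandardCharsets.UTF_8); // convert lazily, then cache
        }
        return bytesValue;
    }
}
```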

@martijnvg (Member) Apr 17, 2025

Ok, I falsely assumed that making RawString's properties final would be straightforward. In that case turning this class into a record doesn't make sense.

/**
* Method that will try to get underlying UTF-8 encoded bytes of the current string token.
* This is only a best-effort attempt; if there is some reason the bytes cannot be retrieved, this method will return null.
* Currently, this is only implemented for ascii-only strings that do not contain escaped characters.
Member

I suspect we need to support escaped characters? I don't think we have a test for this. Maybe a REST test (YAML or Java) that tries to index an escaped JSON string field using the keyword field mapper?

@jordan-powers (Contributor, Author)

I'll definitely add a test.

I was thinking that support for escaped characters and/or possibly full Unicode (not just ASCII) could happen in a follow-up after this is merged. The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).
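For illustration, a sketch of that fallback contract (the parser type and method names below are stand-ins, not the real XContentParser API): the mapper asks for raw UTF-8 bytes first and, when that returns null, takes the existing String-based path, so escaped or non-ASCII values still index correctly, just without the optimization.

```java
import java.nio.charset.StandardCharsets;

class KeywordValueSketch {
    // Stand-in for the best-effort bytes accessor described in the javadoc above.
    interface StringTokenSource {
        byte[] valueAsUtf8BytesOrNull(); // hypothetical name; returns null when not optimizable
        String valueAsString();
    }

    static byte[] utf8Value(StringTokenSource parser) {
        byte[] optimized = parser.valueAsUtf8BytesOrNull();
        if (optimized != null) {
            return optimized; // optimized path: no UTF-16 round trip
        }
        return parser.valueAsString().getBytes(StandardCharsets.UTF_8); // fallback: decode, then re-encode
    }
}
```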

Member

The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).

That makes sense to me.

Contributor

++ it makes sense to have it in a follow-up; this PR is already big enough :)
The important thing is to test that nothing is broken and that it takes the non-optimized path.

@jordan-powers (Contributor, Author)

Did you also see an increase in bulk indexing median throughput when running the elastic/logs track?

No, it was actually 0.27% slower, but that's such a small change that I think it's noise. Latency was much lower, though.

I plan to rerun the benchmarks with logsdb. Hopefully that shows some improvement to throughput.

Baseline:

|                          Metric |                                   Task |           Value |   Unit |
|--------------------------------:|---------------------------------------:|----------------:|-------:|
|                  Min Throughput |                             bulk-index |  1133.11        | docs/s |
|                 Mean Throughput |                             bulk-index | 22013.5         | docs/s |
|               Median Throughput |                             bulk-index | 21980.4         | docs/s |
|                  Max Throughput |                             bulk-index | 23748.9         | docs/s |
|         50th percentile latency |                             bulk-index |  2566.75        |     ms |
|         90th percentile latency |                             bulk-index |  3637.47        |     ms |
|         99th percentile latency |                             bulk-index |  4794.34        |     ms |
|       99.9th percentile latency |                             bulk-index |  5885.33        |     ms |
|      99.99th percentile latency |                             bulk-index |  7442.52        |     ms |
|        100th percentile latency |                             bulk-index |  8522.51        |     ms |
|    50th percentile service time |                             bulk-index |  2564.74        |     ms |
|    90th percentile service time |                             bulk-index |  3637.85        |     ms |
|    99th percentile service time |                             bulk-index |  4799.66        |     ms |
|  99.9th percentile service time |                             bulk-index |  5910.21        |     ms |
| 99.99th percentile service time |                             bulk-index |  7434.99        |     ms |
|   100th percentile service time |                             bulk-index |  8522.51        |     ms |
|                      error rate |                             bulk-index |     0           |      % |

Contender:

|                          Metric |                                   Task |          Value |   Unit |
|--------------------------------:|---------------------------------------:|---------------:|-------:|
|                  Min Throughput |                             bulk-index |   563.25       | docs/s |
|                 Mean Throughput |                             bulk-index | 21945.6        | docs/s |
|               Median Throughput |                             bulk-index | 21920.7        | docs/s |
|                  Max Throughput |                             bulk-index | 23439.9        | docs/s |
|         50th percentile latency |                             bulk-index |   493.49       |     ms |
|         90th percentile latency |                             bulk-index |   757.997      |     ms |
|         99th percentile latency |                             bulk-index |  1166.96       |     ms |
|       99.9th percentile latency |                             bulk-index |  2113.26       |     ms |
|      99.99th percentile latency |                             bulk-index |  2723.74       |     ms |
|        100th percentile latency |                             bulk-index |  3567.55       |     ms |
|    50th percentile service time |                             bulk-index |   492.466      |     ms |
|    90th percentile service time |                             bulk-index |   757.44       |     ms |
|    99th percentile service time |                             bulk-index |  1163.02       |     ms |
|  99.9th percentile service time |                             bulk-index |  2109.98       |     ms |
| 99.99th percentile service time |                             bulk-index |  2719.76       |     ms |
|   100th percentile service time |                             bulk-index |  3567.55       |     ms |
|                      error rate |                             bulk-index |     0          |      % |

@jordan-powers (Contributor, Author)

Ok, I ran another benchmark, this time a multi-node logsdb run similar to the nightlies.

It seems that in this run, cumulative indexing time, latency, service time, and indexing throughput all increased.

I'm not sure why indexing time, latency, and service time increased; those values may just be noisy.
But median indexing throughput increased by ~9%, which is a good sign.

Baseline:

|                                                         Metric |                                   Task |          Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|---------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   865.972      |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.79818    |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.1573     |    min |
|             Max cumulative indexing time across primary shards |                                        |   159.704      |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0          |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0          |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|                        Cumulative merge time of primary shards |                                        |   415.579      |    min |
|                       Cumulative merge count of primary shards |                                        |  1316          |        |
|                Min cumulative merge time across primary shards |                                        |     0.556283   |    min |
|             Median cumulative merge time across primary shards |                                        |     2.5208     |    min |
|                Max cumulative merge time across primary shards |                                        |   101.172      |    min |
|               Cumulative merge throttle time of primary shards |                                        |   118.28       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.14245    |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     0.527983   |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    28.2791     |    min |
|                      Cumulative refresh time of primary shards |                                        |    86.7655     |    min |
|                     Cumulative refresh count of primary shards |                                        | 14976          |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0321333  |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.146817   |    min |
|              Max cumulative refresh time across primary shards |                                        |    29.7167     |    min |
|                        Cumulative flush time of primary shards |                                        |   108.617      |    min |
|                       Cumulative flush count of primary shards |                                        |  6216          |        |
|                Min cumulative flush time across primary shards |                                        |     0.69835    |    min |
|             Median cumulative flush time across primary shards |                                        |     2.60913    |    min |
|                Max cumulative flush time across primary shards |                                        |     7.87545    |    min |
|                                        Total Young Gen GC time |                                        |   202.142      |      s |
|                                       Total Young Gen GC count |                                        |  8909          |        |
|                                          Total Old Gen GC time |                                        |     0          |      s |
|                                         Total Old Gen GC count |                                        |     0          |        |
|                                                   Dataset size |                                        |    73.4386     |     GB |
|                                                     Store size |                                        |    73.4386     |     GB |
|                                                  Translog size |                                        |     3.03224    |     GB |
|                                         Heap used for segments |                                        |     0          |     MB |
|                                       Heap used for doc values |                                        |     0          |     MB |
|                                            Heap used for terms |                                        |     0          |     MB |
|                                            Heap used for norms |                                        |     0          |     MB |
|                                           Heap used for points |                                        |     0          |     MB |
|                                    Heap used for stored fields |                                        |     0          |     MB |
|                                                  Segment count |                                        |  1628          |        |
|                                    Total Ingest Pipeline count |                                        |     4.8861e+08 |        |
|                                     Total Ingest Pipeline time |                                        | 29324.6        |      s |
|                                   Total Ingest Pipeline failed |                                        |     0          |        |
|                                                 Min Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                                Mean Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                              Median Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                                 Max Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                       100th percentile latency |                       insert-pipelines |  2159.06       |     ms |
|                                  100th percentile service time |                       insert-pipelines |  2159.06       |     ms |
|                                                     error rate |                       insert-pipelines |     0          |      % |
|                                                 Min Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                                Mean Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                              Median Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                                 Max Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                       100th percentile latency |                             insert-ilm |    49.4519     |     ms |
|                                  100th percentile service time |                             insert-ilm |    49.4519     |     ms |
|                                                     error rate |                             insert-ilm |     0          |      % |
|                                                 Min Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                                Mean Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                              Median Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                                 Max Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                       100th percentile latency | validate-package-template-installation |    18.8955     |     ms |
|                                  100th percentile service time | validate-package-template-installation |    18.8955     |     ms |
|                                                     error rate | validate-package-template-installation |     0          |      % |
|                                                 Min Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                                Mean Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                              Median Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                                 Max Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                       100th percentile latency |        update-custom-package-templates |   532.922      |     ms |
|                                  100th percentile service time |        update-custom-package-templates |   532.922      |     ms |
|                                                     error rate |        update-custom-package-templates |     0          |      % |
|                                                 Min Throughput |                             bulk-index |   531.71       | docs/s |
|                                                Mean Throughput |                             bulk-index | 28977.3        | docs/s |
|                                              Median Throughput |                             bulk-index | 28876.9        | docs/s |
|                                                 Max Throughput |                             bulk-index | 30829.5        | docs/s |
|                                        50th percentile latency |                             bulk-index |   352.035      |     ms |
|                                        90th percentile latency |                             bulk-index |   657.696      |     ms |
|                                        99th percentile latency |                             bulk-index |  1107.89       |     ms |
|                                      99.9th percentile latency |                             bulk-index |  1804.35       |     ms |
|                                     99.99th percentile latency |                             bulk-index |  3264.32       |     ms |
|                                       100th percentile latency |                             bulk-index |  6313.07       |     ms |
|                                   50th percentile service time |                             bulk-index |   351.93       |     ms |
|                                   90th percentile service time |                             bulk-index |   659.242      |     ms |
|                                   99th percentile service time |                             bulk-index |  1107.04       |     ms |
|                                 99.9th percentile service time |                             bulk-index |  1809.02       |     ms |
|                                99.99th percentile service time |                             bulk-index |  3267.62       |     ms |
|                                  100th percentile service time |                             bulk-index |  6313.07       |     ms |
|                                                     error rate |                             bulk-index |     0          |      % |

Contender:

|                                                         Metric |                                   Task |           Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   899.486       |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.95095     |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.6404      |    min |
|             Max cumulative indexing time across primary shards |                                        |   158.792       |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0           |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0           |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|                        Cumulative merge time of primary shards |                                        |   426.511       |    min |
|                       Cumulative merge count of primary shards |                                        |  1024           |        |
|                Min cumulative merge time across primary shards |                                        |     0.512583    |    min |
|             Median cumulative merge time across primary shards |                                        |     2.31557     |    min |
|                Max cumulative merge time across primary shards |                                        |   114.794       |    min |
|               Cumulative merge throttle time of primary shards |                                        |   113.117       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.0915333   |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     0.58525     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    27.3451      |    min |
|                      Cumulative refresh time of primary shards |                                        |    89.7276      |    min |
|                     Cumulative refresh count of primary shards |                                        | 14144           |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0293833   |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0993667   |    min |
|              Max cumulative refresh time across primary shards |                                        |    29.8377      |    min |
|                        Cumulative flush time of primary shards |                                        |   111.547       |    min |
|                       Cumulative flush count of primary shards |                                        |  6206           |        |
|                Min cumulative flush time across primary shards |                                        |     0.713667    |    min |
|             Median cumulative flush time across primary shards |                                        |     2.71242     |    min |
|                Max cumulative flush time across primary shards |                                        |     7.9553      |    min |
|                                        Total Young Gen GC time |                                        |   253.605       |      s |
|                                       Total Young Gen GC count |                                        |  8500           |        |
|                                          Total Old Gen GC time |                                        |     0           |      s |
|                                         Total Old Gen GC count |                                        |     0           |        |
|                                                   Dataset size |                                        |    71.7906      |     GB |
|                                                     Store size |                                        |    71.7906      |     GB |
|                                                  Translog size |                                        |     3.85838     |     GB |
|                                         Heap used for segments |                                        |     0           |     MB |
|                                       Heap used for doc values |                                        |     0           |     MB |
|                                            Heap used for terms |                                        |     0           |     MB |
|                                            Heap used for norms |                                        |     0           |     MB |
|                                           Heap used for points |                                        |     0           |     MB |
|                                    Heap used for stored fields |                                        |     0           |     MB |
|                                                  Segment count |                                        |  1350           |        |
|                                    Total Ingest Pipeline count |                                        |     4.88622e+08 |        |
|                                     Total Ingest Pipeline time |                                        | 30250.2         |      s |
|                                   Total Ingest Pipeline failed |                                        |     0           |        |
|                                                 Min Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                                Mean Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                              Median Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                                 Max Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                       100th percentile latency |                       insert-pipelines |  1911.34        |     ms |
|                                  100th percentile service time |                       insert-pipelines |  1911.34        |     ms |
|                                                     error rate |                       insert-pipelines |     0           |      % |
|                                                 Min Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                                Mean Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                              Median Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                                 Max Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                       100th percentile latency |                             insert-ilm |    47.5897      |     ms |
|                                  100th percentile service time |                             insert-ilm |    47.5897      |     ms |
|                                                     error rate |                             insert-ilm |     0           |      % |
|                                                 Min Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                                Mean Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                              Median Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                                 Max Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                       100th percentile latency | validate-package-template-installation |    14.7938      |     ms |
|                                  100th percentile service time | validate-package-template-installation |    14.7938      |     ms |
|                                                     error rate | validate-package-template-installation |     0           |      % |
|                                                 Min Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                                Mean Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                              Median Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                                 Max Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                       100th percentile latency |        update-custom-package-templates |   500.494       |     ms |
|                                  100th percentile service time |        update-custom-package-templates |   500.494       |     ms |
|                                                     error rate |        update-custom-package-templates |     0           |      % |
|                                                 Min Throughput |                             bulk-index |  1290.94        | docs/s |
|                                                Mean Throughput |                             bulk-index | 31518.1         | docs/s |
|                                              Median Throughput |                             bulk-index | 31546.7         | docs/s |
|                                                 Max Throughput |                             bulk-index | 33875.7         | docs/s |
|                                        50th percentile latency |                             bulk-index |  1685.81        |     ms |
|                                        90th percentile latency |                             bulk-index |  3011.78        |     ms |
|                                        99th percentile latency |                             bulk-index |  4668.77        |     ms |
|                                      99.9th percentile latency |                             bulk-index |  6193.05        |     ms |
|                                     99.99th percentile latency |                             bulk-index |  7657.43        |     ms |
|                                       100th percentile latency |                             bulk-index |  9168.88        |     ms |
|                                   50th percentile service time |                             bulk-index |  1692.69        |     ms |
|                                   90th percentile service time |                             bulk-index |  2995.72        |     ms |
|                                   99th percentile service time |                             bulk-index |  4659.98        |     ms |
|                                 99.9th percentile service time |                             bulk-index |  6193.08        |     ms |
|                                99.99th percentile service time |                             bulk-index |  7660.23        |     ms |
|                                  100th percentile service time |                             bulk-index |  9168.88        |     ms |
|                                                     error rate |                             bulk-index |     0           |      % |

@jordan-powers changed the title from "Prototype to skip UTF8 to UTF16 conversion" to "Skip UTF8 to UTF16 conversion during document indexing" Apr 18, 2025
@jordan-powers marked this pull request as ready for review April 18, 2025 20:12
@jordan-powers requested a review from a team as a code owner April 18, 2025 20:12
@jordan-powers (Contributor, Author)

Ok, I merged XBytesRef and EncodedString into a single XContentString. I started working on updating XContentParser#text() to return XContentString, but (unsurprisingly) XContentParser#text() is used everywhere, so that's a pretty big change. Instead, I added a separate XContentParser#xContentText(). If we want to update XContentParser#text() to return XContentString, I think that should be a separate follow-up so that it doesn't overwhelm this PR.

@ldematte self-requested a review April 24, 2025 06:47
@jordan-powers (Contributor, Author)

So I just noticed that we have the class org.elasticsearch.common.Text, which is basically the same thing as the XContentString that I added. But it's in the :server project, while I would need to use it in the :libs:xcontent and :libs:xcontent:impl projects, which are upstream in the dependency graph. I could try to move it up to the :libs:xcontent project, but that would be tough because it depends on org.elasticsearch.common.bytes.BytesReference, which is also in the :server project. And I can't just move that, because it depends on Lucene's BytesRef.

I also noticed we have a method XContentParser#charBuffer() that purports to return "a CharBuffer holding UTF-8 bytes"; however, I'm pretty sure that CharBuffers store chars internally, which are always UTF-16, so I can't just override that method.
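A small standalone example of that point: decoding always produces UTF-16 chars, so a CharBuffer cannot actually carry UTF-8 bytes, whatever the javadoc says.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class CharBufferIsUtf16 {
    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);                   // 5 bytes: 'é' takes 2 bytes in UTF-8
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8)); // 4 chars: UTF-16 code units
        System.out.println(utf8.length + " UTF-8 bytes -> " + chars.remaining() + " UTF-16 chars");
    }
}
```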

The way I see it, there are a few options:

  1. I can try and split out the relevant parts of Text and BytesReference into separate base classes that I then move into the :libs:xcontent package.
  2. I can update XContentParser#charBuffer() to instead return a ByteBuffer. The api consumer would then use that ByteBuffer to construct a Text.
  3. We can just live with having 2 classes that do similar things: XContentString and Text.

@jordan-powers (Contributor, Author)

Ok, following option 1 from my previous comment, I split out BytesReference, BytesArray, and Text into BaseBytesReference, BaseBytesArray, and BaseText, which I moved to :libs:core. I then updated my code to use these classes instead of the XContentString class I added earlier.

@jordan-powers (Contributor, Author)

Ok, as discussed, I reverted the BaseBytesRef and BaseText splitting. I also moved Text, although I had to put it in :libs:x-content instead of :libs:core because it implements the ToXContentFragment interface. I updated XContentParser#optimizedText() to return a Text.

@ldematte (Contributor) left a comment

I finally got around to reviewing this; it looks promising! Great work!

I have just a couple of comments:

That said, I don't know for certain that Elasticsearch won't parse JSON from another source (such as a configuration file) that's not UTF-8 encoded, so I'll leave the check in.

I think this is a very sensible decision, ++

Instead, I added a separate XContentParser#xContentText(). If we want to update XContentParser#text() to return XContentString, I think that should be a separate follow-up so that it doesn't overwhelm this PR.

++ for a follow-up on this, I think it should be done as soon as possible but separately. I have a further comment related to this.

I know I'm a bit late :) but I'm not sure about using Text; I think that using an interface instead of a concrete type here would give us greater flexibility in the future if we want to perform further optimizations.

I recently worked on changing JSON parsing to use different libraries (not Jackson), and for similar reasons I really wanted text() to not return a String. Text seems better, but it still ties us to a concrete type.

Just off the top of my head: have you considered using CharSequence? This would probably make "merging" text() and optimizedText() easier.

Or alternatively, introduce XContentString as an interface, and make Text derive from it (or extract an interface from Text -- same thing).
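A minimal sketch of that suggestion (method names are hypothetical): XContentString becomes a small interface in :libs:x-content, Text implements it, and the parser methods can return the interface so callers are not tied to a concrete type.

```java
import java.nio.ByteBuffer;

// Hypothetical interface extracted from Text; the existing Text class would implement it,
// and XContentParser#text()/optimizedText() could then both return this type.
public interface XContentString {
    String string();    // UTF-16 view, possibly decoded lazily
    ByteBuffer bytes(); // underlying UTF-8 bytes
}
```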

In any case, I would split all the Text related changes to a separate PR -- see my comment.

@@ -36,31 +31,37 @@ public static Text[] convertFromStringArray(String[] strings) {
         return texts;
     }
 
-    private BytesReference bytes;
+    private ByteBuffer bytes;
Contributor

I think this is a good change, but also that it might be better if it gets done in a smaller, self-contained preliminary PR.
That would reduce the complexity of this one, could be merged quickly, and would help pinpoint and/or reduce the blast radius in case of problems.

@jordan-powers (Contributor, Author)

@ldematte Thanks for the review!

Just off the top of my head: have you considered using CharSequence? This would probably make "merging" text() and optimizedText() easier.

The problem with using any char-related classes is that chars are UTF-16 encoded, which would defeat the purpose of this effort. There actually already is a method XContentParser#charBuffer() that purports to return a UTF-8-encoded value; however, the return type is a CharBuffer, which is always internally UTF-16.

Or alternatively, introduce XContentString as an interface, and make Text derive from it (or extract an interface from Text -- same thing).

That makes sense to me. I'll split out a separate PR that updates Text to use ByteBuffers, moves Text to :libs:xcontent, adds an XContentString interface, and has Text implement the XContentString interface.

@jordan-powers (Contributor, Author)

Ok, I opened #127666 to split out the updates to Text into another PR.

@jordan-powers added the auto-backport and v8.19.0 labels May 6, 2025
Labels: auto-backport, :Core/Infra/Core, >non-issue, :StorageEngine/Mapping, Team:Core/Infra, Team:StorageEngine, v8.19.0, v9.1.0
5 participants