Skip UTF8 to UTF16 conversion during document indexing #126492
Conversation
Ok, I figured out a way to avoid cloning the file. Basically, the problem was that detecting the encoding may consume some bytes from the beginning of the input if a BOM is present, and the property keeping track of how many bytes had been consumed was internal to […].
Also, looking at the API docs, it seems elasticsearch only supports UTF-8 inputs anyway, so maybe I don't need to check the character encoding? Although I don't know for certain that elasticsearch won't parse JSON from another source (such as a configuration file or something) that's not UTF-8 encoded, so I'll leave it in.
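To illustrate the mechanics (a minimal sketch, not the PR's code): a UTF-8 BOM is the three bytes `EF BB BF`, and a detector has to read them before it knows whether to discard them, which shifts the parser's notion of how many bytes have been consumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Minimal sketch, not the PR's code: encoding detection must read the first
// bytes of the stream. If they are a UTF-8 BOM (EF BB BF) they are discarded,
// so the "bytes consumed" offset shifts by up to three.
final class BomSkipper {
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.read(head, 0, 3);
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pushback.unread(head, 0, n); // no BOM: restore every byte we peeked
        }
        return pushback;
    }
}
```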
This looks great, Jordan! Happy to see we don't need to fork entire files to get this.
Did you also see an increase in bulk indexing median throughput when running the elastic/logs track?
```java
/**
 * Class that holds either a UTF-16 String or a UTF-8 BytesRef, and lazily converts between the two.
 */
private static class RawString {
```
Can this be a record?
Possibly, but it would be a bit tough since records are immutable. The lazy conversion works by only setting the `bytesValue` or `stringValue` once they've been converted, which wouldn't work if the properties were `final`.
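To make the constraint concrete, here is a sketch of the lazy-conversion pattern being described (constructor shapes and method names are my assumptions, not the PR's actual code): both cache fields start out possibly-null and are assigned at most once, which a record's final components cannot express.

```java
import org.apache.lucene.util.BytesRef;

// Sketch of the lazy conversion described above; names are assumptions.
// The mutable cache fields are exactly why this can't be a record.
final class LazyString {
    private BytesRef bytesValue;  // UTF-8 form; null until first requested
    private String stringValue;   // UTF-16 form; null until first requested

    LazyString(BytesRef bytes) { this.bytesValue = bytes; }
    LazyString(String string) { this.stringValue = string; }

    BytesRef bytes() {
        if (bytesValue == null) {
            bytesValue = new BytesRef(stringValue); // encode UTF-16 -> UTF-8 once
        }
        return bytesValue;
    }

    String string() {
        if (stringValue == null) {
            stringValue = bytesValue.utf8ToString(); // decode UTF-8 -> UTF-16 once
        }
        return stringValue;
    }
}
```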
Ok, I falsely assumed that making `RawString`'s properties `final` would be straightforward. In that case, turning this class into a record doesn't make sense.
```java
/**
 * Method that will try to get underlying UTF-8 encoded bytes of the current string token.
 * This is only a best-effort attempt; if there is some reason the bytes cannot be retrieved, this method will return null.
 * Currently, this is only implemented for ascii-only strings that do not contain escaped characters.
```
I suspect we need to support escaped characters? I don't think we have a test for this. Maybe a REST test (YAML or Java) that tries to index an escaped JSON string field using the keyword field mapper?
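Something along these lines, perhaps (a hedged sketch of a Java REST test; the index name, setup, and assertion are assumptions, not the test that was eventually added):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.test.rest.ESRestTestCase;

// Hedged sketch of the suggested test; setup details are assumptions.
// The field value contains an escaped quote and a unicode escape, so the
// optimized ASCII-only path should bail out and the regular path must work.
public class EscapedKeywordIT extends ESRestTestCase {
    public void testEscapedStringIndexesCorrectly() throws Exception {
        Request index = new Request("POST", "/test-index/_doc/1?refresh=true");
        index.setJsonEntity("{\"field\": \"quote: \\\" accent: \\u00e9\"}");
        Response response = client().performRequest(index);
        assertEquals(201, response.getStatusLine().getStatusCode());
    }
}
```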
I'll definitely add a test.
I was thinking that support for escaped characters and/or possibly full unicode (not just ascii) could happen in a follow-up after this is merged. The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to `getValueAsString()`).
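A sketch of that fallback shape, with hypothetical names (the real interfaces in the PR differ):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical shape of the fallback described above; names are illustrative.
interface StringTokenSource {
    byte[] optimizedBytes();     // raw UTF-8, or null when the fast path can't apply
    String getValueAsString();   // always available: decodes UTF-8 -> UTF-16
}

final class Utf8Values {
    static byte[] utf8ValueOf(StringTokenSource parser) {
        byte[] raw = parser.optimizedBytes();
        if (raw != null) {
            return raw; // fast path: reuse the parser's bytes, no transcoding
        }
        // slow path: UTF-8 -> UTF-16 String -> UTF-8 again, as happens today
        return parser.getValueAsString().getBytes(StandardCharsets.UTF_8);
    }
}
```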
> The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).
That makes sense to me.
++ it makes sense to have it in a follow-up, this PR is already big enough :)
The important thing is to test nothing is broken, and it takes the non-optimized path.
No, it was actually 0.27% slower, but that's such a small change I think it's noise. Latency was much lower, though. I plan to rerun the benchmarks with logsdb; hopefully that shows some improvement to throughput.
Baseline:
Contender:
Ok, I ran another benchmark, this time a multi-node logsdb run similar to the nightlies. In this run, cumulative indexing time, latency, service time, and indexing throughput all increased. I'm not sure why indexing time, latency, and service time went up; maybe those values are just noisy?
Baseline:
Contender:
Ok, I merged […].
So I just noticed that we have the class […]. I also noticed we have a method […]. The way I see it, there are a few options: […]
Ok, following option 1 from my previous comment, I split out […].
This reverts commit deefb81.
Ok, as discussed I reverted the `Text` changes.
I finally got around to reviewing this; it looks promising! Great work!
I have just a couple of comments:
> Although I don't know for certain that elasticsearch won't parse json from another source (such as a configuration file or something) that's not UTF-8 encoded, so I'll leave it in
I think this is a very sensible decision, ++
> Instead I added a separate XContentParser#xContentText(). If we want to update XContentParser#text() to return XContentString, I think that should be a separate follow-up so that it doesn't overwhelm this PR.
++ for a follow-up on this, I think it should be done as soon as possible but separately. I have a further comment related to this.
I know I'm a bit late :) but I'm not sure about using Text; I think that using an interface instead of a concrete type here would give us greater flexibility in the future if we want to perform further optimizations.
I recently worked on changing the parsing of JSON to use different libraries (not Jackson), and I really wanted `text()` to not return a `String`, for similar reasons; `Text` seems better, but it is still tying us to a concrete type.
Just off the top of my mind: have you considered using CharSequence? This would probably make "merging" `text()` and `optimizedText()` easier.
Or alternatively, introduce `XContentString` as an interface, and make `Text` derive from it (or extract an interface from `Text` -- same thing).
In any case, I would split all the `Text`-related changes into a separate PR -- see my comment.
```diff
@@ -36,31 +31,37 @@ public static Text[] convertFromStringArray(String[] strings) {
         return texts;
     }

-    private BytesReference bytes;
+    private ByteBuffer bytes;
```
I think this is a good change, but also that it might be better if this gets done in a smaller, self-contained preliminary PR.
It will reduce the complexity of this one, can be merged quickly, and it would help pinpoint and/or reduce the blast radius in case of problems.
@ldematte Thanks for the review!
The problem with using any […]
That makes sense to me. I'll split out a separate PR that updates `Text`.
Ok, I opened #127666 to split out the updates to `Text`.
Initial implementation of avoiding the UTF8 to UTF16 back to UTF8 conversion that currently happens when parsing JSON.
Currently, the optimization only applies to ASCII-only strings that do not contain escaped characters (e.g. `\"`). Also, the optimization is currently only implemented for keyword fields.
The core logic to return the underlying bytes is implemented in `ESUTF8StreamJsonParser`. The other "ES" versions of Jackson classes are just the glue logic required to make Jackson create instances of `ESUTF8StreamJsonParser` instead of the superclass `UTF8StreamJsonParser`.
.I ran the prototype in esbench using the logging-indexing challenge on a standard-mode index, and I got these results:
Baseline:
Contender: