Skip to content

Skip UTF8 to UTF16 conversion during document indexing #126492

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 57 commits into from
Jun 6, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
57 commits
Select commit Hold shift + click to select a range
2153322
Prototype avoid UTF8 to UTF16 conversion
jordan-powers Apr 7, 2025
eaa66bb
Rename to ESBytesRef
jordan-powers Apr 8, 2025
c5d71f7
Apply spotless
jordan-powers Apr 8, 2025
a277aea
Some cleanup and comments
jordan-powers Apr 8, 2025
74822c2
Remove unnecessary throws IOException
jordan-powers Apr 8, 2025
a9ee991
Fix missing bytesValue call
jordan-powers Apr 8, 2025
ae432e1
Fix subsequent calls to parser.getText() after a call to parser.getVa…
jordan-powers Apr 8, 2025
3d1bec4
Use cached stringEnd on subsequent calls to getValueAsByteRef()
jordan-powers Apr 8, 2025
702ada2
Spotless
jordan-powers Apr 8, 2025
e2ebb92
Add textRefOrNull to DotExpandingXContentParser
jordan-powers Apr 8, 2025
4a6fe60
Add textRefOrNull() override to MultiFieldParser
jordan-powers Apr 8, 2025
240fbdb
Avoid cloning ByteSourceJsonBootstrapper
jordan-powers Apr 15, 2025
c1affcf
Rename ESBytesRef to XBytesRef
jordan-powers Apr 15, 2025
ddf7495
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 15, 2025
e285349
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 16, 2025
b0f3336
Add tests for ESJsonFactory
jordan-powers Apr 16, 2025
f2f106f
Add tests for ESUTF8StreamJsonParser
jordan-powers Apr 16, 2025
b0f701c
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 16, 2025
20616e6
Move RawString class into separate file and rename to EncodedString
jordan-powers Apr 17, 2025
1f66ff2
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 17, 2025
fd4ec6c
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 18, 2025
8913ca5
Combine XBytesRef and EncodedString into XContentString
jordan-powers Apr 22, 2025
082ffeb
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 22, 2025
3526556
Add missing override for new xContentText()
jordan-powers Apr 22, 2025
0ca2b60
Fix override in DotExpandingXContentParser
jordan-powers Apr 23, 2025
c12f1b1
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 23, 2025
9c88362
Fix GeoPointFieldMapper geohash
jordan-powers Apr 24, 2025
0d7ff66
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 24, 2025
d413cd3
Add some more tests
jordan-powers Apr 28, 2025
deefb81
Split Text and BytesReference and move base api to libs/core
jordan-powers Apr 28, 2025
681ce38
Use new BaseText and BaseBytesReference types
jordan-powers Apr 28, 2025
b31c2e0
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 28, 2025
603af45
Implement TODO UTF8 to UTF16 conversion
jordan-powers Apr 28, 2025
68d6c2d
Rename xContentText to optimizedText
jordan-powers Apr 30, 2025
1364683
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Apr 30, 2025
70202da
Rename xContentText in tests too
jordan-powers Apr 30, 2025
b3f4e04
Revert "Split Text and BytesReference and move base api to libs/core"
jordan-powers May 1, 2025
a40bee3
Move Text to :libs:x-content
jordan-powers May 1, 2025
84921aa
Use Text instead of XContentString
jordan-powers May 1, 2025
dbbdbb1
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers May 1, 2025
9380a0b
Fix missed reference to XContentString
jordan-powers May 1, 2025
ca03f87
Fix CI
jordan-powers May 5, 2025
180078c
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers May 5, 2025
b3a0bbf
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers May 8, 2025
8e36b5a
Rename test in BaseXContentTestCase to match
jordan-powers May 8, 2025
fb2394f
Update optimizedText to return XContentString interface
jordan-powers May 8, 2025
b9dc1da
Fix renamed length to stringLength
jordan-powers May 8, 2025
c38ff8a
Add OptimizedTextBenchmark
jordan-powers May 9, 2025
6c6b11e
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers May 9, 2025
de33cb6
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Jun 4, 2025
d75f180
Use new UTF8Bytes class
jordan-powers Jun 4, 2025
bcb195e
Use unsigned comparison in UTF8Bytes#compareTo
jordan-powers Jun 5, 2025
4c76525
Use encoded value when recording array offsets
jordan-powers Jun 5, 2025
986d1f6
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Jun 5, 2025
d3b9496
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Jun 5, 2025
9b53320
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Jun 6, 2025
91313b5
Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16
jordan-powers Jun 6, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Use new BaseText and BaseBytesReference types
  • Loading branch information
jordan-powers committed Apr 28, 2025
commit 681ce382f471fe8d9019e968ca37e96176b9f5c7
12 changes: 12 additions & 0 deletions libs/core/src/main/java/org/elasticsearch/core/BaseText.java
Original file line number Diff line number Diff line change
Expand Up @@ -21,11 +21,17 @@
public class BaseText {
protected BaseBytesReference bytes;
private String text;
private int length = -1;

public BaseText(BaseBytesReference bytes) {
this.bytes = bytes;
}

public BaseText(BaseBytesReference bytes, int length) {
this.bytes = bytes;
this.length = length;
}

public BaseText(String text) {
this.text = text;
}
Expand Down Expand Up @@ -69,4 +75,10 @@ public String toString() {
return string();
}

public int length() {
if (length < 0) {
length = string().length();
}
return length;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@ public class BaseBytesArray implements BaseBytesReference {
private final int offset;
private final int length;

int hash = 0;
boolean hashIsZero = false;

public BaseBytesArray(byte[] bytes) {
this(bytes, 0, bytes.length);
}
Expand Down Expand Up @@ -48,4 +51,36 @@ public byte[] array() {
public int arrayOffset() {
return offset;
}

@Override
public boolean equals(Object other) {
if (this == other) {
return true;
}
if (other instanceof final BaseBytesReference otherRef) {
if (length != otherRef.length()) {
return false;
}
for (int i = 0; i < length; i++) {
if (get(i) != otherRef.get(i)) {
return false;
}
}
}
return true;
}

@Override
public int hashCode() {
if (hash == 0 && hashIsZero == false) {
hash = 1;
for (int i = offset; i < length; i++) {
hash = 31 * hash + bytes[i];
}
if (hash == 0) {
hashIsZero = true;
}
}
return hash;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,8 @@
import com.fasterxml.jackson.core.json.UTF8StreamJsonParser;
import com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer;

import org.elasticsearch.xcontent.XContentString;
import org.elasticsearch.core.BaseText;
import org.elasticsearch.core.bytes.BaseBytesArray;

import java.io.IOException;
import java.io.InputStream;
Expand Down Expand Up @@ -44,20 +45,20 @@ public ESUTF8StreamJsonParser(
* This is only a best-effort attempt; if there is some reason the bytes cannot be retrieved, this method will return null.
* Currently, this is only implemented for ascii-only strings that do not contain escaped characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we need to support escaped characters? I don't think we have a test for this. Maybe a rest test (yaml or java) that tries to index a escaped json string field using keyword field mapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll definitely add a test.

I was thinking that support for escaped characters and/or possibly full unicode (not just ascii) could happen in a follow-up after this is merged. The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).

That makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ it makes sense to have it in a follow-up, this PR is already big enough :)
The important thing is to test nothing is broken, and it takes the non-optimized path.

*/
public XContentString getValueAsByteRef() throws IOException {
public BaseText getValueAsText() throws IOException {
if (_currToken == JsonToken.VALUE_STRING && _tokenIncomplete) {
if (stringEnd > 0) {
final int len = stringEnd - 1 - _inputPtr;
// For now, we can use `len` for `charCount` because we only support ascii-encoded unescaped strings,
// which means each character uses exactly 1 byte.
return new XContentString(new XContentString.ByteRef(_inputBuffer, _inputPtr, len), len);
return new BaseText(new BaseBytesArray(_inputBuffer, _inputPtr, len), len);
}
return _finishAndReturnByteRef();
return _finishAndReturnText();
}
return null;
}

protected XContentString _finishAndReturnByteRef() throws IOException {
protected BaseText _finishAndReturnText() throws IOException {
int ptr = _inputPtr;
if (ptr >= _inputEnd) {
_loadMoreGuaranteed();
Expand All @@ -76,7 +77,7 @@ protected XContentString _finishAndReturnByteRef() throws IOException {
final int len = ptr - startPtr;
// For now, we can use `len` for `charCount` because we only support ascii-encoded unescaped strings,
// which means each character uses exactly 1 byte.
return new XContentString(new XContentString.ByteRef(inputBuffer, startPtr, len), len);
return new BaseText(new BaseBytesArray(inputBuffer, startPtr, len), len);
}
return null;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,12 @@
import com.fasterxml.jackson.core.exc.InputCoercionException;
import com.fasterxml.jackson.core.io.JsonEOFException;

import org.elasticsearch.core.BaseText;
import org.elasticsearch.core.IOUtils;
import org.elasticsearch.xcontent.XContentEOFException;
import org.elasticsearch.xcontent.XContentLocation;
import org.elasticsearch.xcontent.XContentParseException;
import org.elasticsearch.xcontent.XContentParserConfiguration;
import org.elasticsearch.xcontent.XContentString;
import org.elasticsearch.xcontent.XContentType;
import org.elasticsearch.xcontent.provider.XContentParserConfigurationImpl;
import org.elasticsearch.xcontent.support.AbstractXContentParser;
Expand Down Expand Up @@ -117,17 +117,17 @@ public String text() throws IOException {
}

@Override
public XContentString xContentText() throws IOException {
public BaseText xContentText() throws IOException {
if (currentToken().isValue() == false) {
throwOnNoText();
}
if (parser instanceof ESUTF8StreamJsonParser esParser) {
var bytesRef = esParser.getValueAsByteRef();
if (bytesRef != null) {
return bytesRef;
var text = esParser.getValueAsText();
if (text != null) {
return text;
}
}
return new XContentString(text());
return new BaseText(text());
}

private void throwOnNoText() {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,15 +15,15 @@

import org.elasticsearch.common.Strings;
import org.elasticsearch.core.CheckedConsumer;
import org.elasticsearch.core.bytes.BaseBytesArray;
import org.elasticsearch.core.bytes.BaseBytesReference;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentString;
import org.elasticsearch.xcontent.json.JsonXContent;
import org.hamcrest.Matchers;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ESUTF8StreamJsonParserTests extends ESTestCase {

Expand All @@ -36,9 +36,8 @@ private void testParseJson(String input, CheckedConsumer<ESUTF8StreamJsonParser,
test.accept((ESUTF8StreamJsonParser) parser);
}

private void assertTextRef(XContentString.ByteRef textRef, String expectedValue) {
var data = Arrays.copyOfRange(textRef.bytes(), textRef.offset(), textRef.offset() + textRef.length());
assertThat(data, Matchers.equalTo(StandardCharsets.UTF_8.encode(expectedValue).array()));
private void assertTextBytes(BaseBytesReference textRef, String expectedValue) {
assertThat(textRef, Matchers.equalTo(new BaseBytesArray(expectedValue.getBytes(StandardCharsets.UTF_8))));
}

public void testGetValueAsByteRef() throws IOException {
Expand All @@ -47,14 +46,14 @@ public void testGetValueAsByteRef() throws IOException {
assertThat(parser.nextFieldName(), Matchers.equalTo("foo"));
assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));

XContentString.ByteRef textRef = parser.getValueAsByteRef().getBytes();
var textRef = parser.getValueAsText().bytes();
assertThat(textRef, Matchers.notNullValue());
assertThat(textRef.offset(), Matchers.equalTo(9));
assertThat(textRef.arrayOffset(), Matchers.equalTo(9));
assertThat(textRef.length(), Matchers.equalTo(3));
assertTextRef(textRef, "bar");
assertTextBytes(textRef, "bar");

assertThat(parser.getValueAsString(), Matchers.equalTo("bar"));
assertThat(parser.getValueAsByteRef(), Matchers.nullValue());
assertThat(parser.getValueAsText(), Matchers.nullValue());

assertThat(parser.nextToken(), Matchers.equalTo(JsonToken.END_OBJECT));
});
Expand All @@ -64,7 +63,7 @@ public void testGetValueAsByteRef() throws IOException {
assertThat(parser.nextFieldName(), Matchers.equalTo("foo"));
assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));

assertThat(parser.getValueAsByteRef(), Matchers.nullValue());
assertThat(parser.getValueAsText(), Matchers.nullValue());
assertThat(parser.getValueAsString(), Matchers.equalTo("bar\"baz\""));
});

Expand All @@ -73,7 +72,7 @@ public void testGetValueAsByteRef() throws IOException {
assertThat(parser.nextFieldName(), Matchers.equalTo("foo"));
assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));

assertThat(parser.getValueAsByteRef(), Matchers.nullValue());
assertThat(parser.getValueAsText(), Matchers.nullValue());
assertThat(parser.getValueAsString(), Matchers.equalTo("bår"));
});

Expand All @@ -84,29 +83,29 @@ public void testGetValueAsByteRef() throws IOException {

assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));
{
XContentString.ByteRef textRef = parser.getValueAsByteRef().getBytes();
var textRef = parser.getValueAsText().bytes();
assertThat(textRef, Matchers.notNullValue());
assertThat(textRef.offset(), Matchers.equalTo(10));
assertThat(textRef.arrayOffset(), Matchers.equalTo(10));
assertThat(textRef.length(), Matchers.equalTo(5));
assertTextRef(textRef, "lorem");
assertTextBytes(textRef, "lorem");
}

assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));
{
XContentString.ByteRef textRef = parser.getValueAsByteRef().getBytes();
var textRef = parser.getValueAsText().bytes();
assertThat(textRef, Matchers.notNullValue());
assertThat(textRef.offset(), Matchers.equalTo(19));
assertThat(textRef.arrayOffset(), Matchers.equalTo(19));
assertThat(textRef.length(), Matchers.equalTo(5));
assertTextRef(textRef, "ipsum");
assertTextBytes(textRef, "ipsum");
}

assertThat(parser.nextValue(), Matchers.equalTo(JsonToken.VALUE_STRING));
{
XContentString.ByteRef textRef = parser.getValueAsByteRef().getBytes();
var textRef = parser.getValueAsText().bytes();
assertThat(textRef, Matchers.notNullValue());
assertThat(textRef.offset(), Matchers.equalTo(28));
assertThat(textRef.arrayOffset(), Matchers.equalTo(28));
assertThat(textRef.length(), Matchers.equalTo(5));
assertTextRef(textRef, "dolor");
assertTextBytes(textRef, "dolor");
}

assertThat(parser.nextToken(), Matchers.equalTo(JsonToken.END_ARRAY));
Expand Down Expand Up @@ -151,9 +150,9 @@ public void testGetValueRandomized() throws IOException {

String currVal = values[i];
if (validForTextRef(currVal)) {
assertTextRef(parser.getValueAsByteRef().getBytes(), currVal);
assertTextBytes(parser.getValueAsText().bytes(), currVal);
} else {
assertThat(parser.getValueAsByteRef(), Matchers.nullValue());
assertThat(parser.getValueAsText(), Matchers.nullValue());
assertThat(parser.getValueAsString(), Matchers.equalTo(currVal));
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@

package org.elasticsearch.xcontent;

import org.elasticsearch.core.BaseText;
import org.elasticsearch.core.CheckedFunction;
import org.elasticsearch.core.RestApiVersion;

Expand Down Expand Up @@ -100,11 +101,11 @@ public String textOrNull() throws IOException {
return delegate().textOrNull();
}

public XContentString xContentText() throws IOException {
public BaseText xContentText() throws IOException {
return delegate().xContentText();
}

public XContentString xContentTextOrNull() throws IOException {
public BaseText xContentTextOrNull() throws IOException {
return delegate().xContentTextOrNull();
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@

package org.elasticsearch.xcontent;

import org.elasticsearch.core.BaseText;
import org.elasticsearch.core.CheckedFunction;
import org.elasticsearch.core.Nullable;
import org.elasticsearch.core.RestApiVersion;
Expand Down Expand Up @@ -109,9 +110,9 @@ <T> Map<String, T> map(Supplier<Map<String, T>> mapFactory, CheckedFunction<XCon

String textOrNull() throws IOException;

XContentString xContentText() throws IOException;
BaseText xContentText() throws IOException;

XContentString xContentTextOrNull() throws IOException;
BaseText xContentTextOrNull() throws IOException;

CharBuffer charBufferOrNull() throws IOException;

Expand Down

This file was deleted.

Loading