-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Add recursive chunker #126866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add recursive chunker #126866
Conversation
Hi @dan-rubinstein, I've created a changelog YAML for you. |
|
||
public class RecursiveChunkerTests extends ESTestCase { | ||
|
||
private final List<String> TEST_SEPARATORS = List.of("\n\n", "\n", "\f", "\t", "#"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some tests that split by regex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, I'll add tests that split by regex.
var chunkOffsets = new ArrayList<ChunkOffset>(); | ||
int chunkStart = 0; | ||
int searchStart = 0; | ||
while (matcher.find(searchStart)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the docs you can use matcher.find()
here without the searchStart
parameter. The matcher has internal state that tracks the position, find(int)
resets that state then jumps to searchStart
.
https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/regex/Matcher.html#find()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some tests for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll remove the searchStart
. Can you clarify what you mean by adding tests for this function? It's a private function in the class so the overall chunking tests should be covering this?
import java.util.List; | ||
import java.util.regex.Pattern; | ||
|
||
public class RecursiveChunker implements Chunker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add Java doc explaining how and what the RecursiveChunker does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add this in.
var mergedChunk = chunkOffsets.getFirst(); | ||
for (int i = 1; i < chunkOffsets.size(); i++) { | ||
var potentialMergedChunk = new ChunkOffset(mergedChunk.start(), chunkOffsets.get(i).end()); | ||
if (isChunkWithinMaxSize(input, potentialMergedChunk, maxChunkSize)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of word counting going on here, each merged chunk will be recounted from mergedChunk.start()
again.
isChunkWithinMaxSize
is called again on line 53 on the output of this function. Rather than using ChunkOffset
create a record ChunkOffsetAndCount(int wordCount, int start, int end)
to track the word counts for each chunk, when merging the chunks sum the word counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll clean this logic up to minimize word counting operations. I think we can even just keep the chunk offset in tact as part of the ChunkOffsetAndCount
object that way we don't have to rebuild it when building our return objects.
Pinging @elastic/ml-core (Team:ML) |
Description
This change adds the recursive chunking strategy (see Javadocs) of the RecursiveChunker.java to learn how it works. The recursive chunker comes with a default separator set for plaintext along with one for markdown documents.
Testing
CONTRIBUTING.md
document.