Skip to content

Add recursive chunker #126866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

dan-rubinstein
Copy link
Member

@dan-rubinstein dan-rubinstein commented Apr 15, 2025

Description

This change adds the recursive chunking strategy (see Javadocs) of the RecursiveChunker.java to learn how it works. The recursive chunker comes with a default separator set for plaintext along with one for markdown documents.

Testing

  • Unit testing
  • Manually tested chunking the CONTRIBUTING.md document.

@dan-rubinstein dan-rubinstein added >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.19.0 v9.1.0 labels Apr 15, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @dan-rubinstein, I've created a changelog YAML for you.


public class RecursiveChunkerTests extends ESTestCase {

private final List<String> TEST_SEPARATORS = List.of("\n\n", "\n", "\f", "\t", "#");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some tests that split by regex

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, I'll add tests that split by regex.

var chunkOffsets = new ArrayList<ChunkOffset>();
int chunkStart = 0;
int searchStart = 0;
while (matcher.find(searchStart)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the docs you can use matcher.find() here without the searchStart parameter. The matcher has internal state that tracks the position, find(int) resets that state then jumps to searchStart.

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/regex/Matcher.html#find()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some tests for this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll remove the searchStart. Can you clarify what you mean by adding tests for this function? It's a private function in the class so the overall chunking tests should be covering this?

import java.util.List;
import java.util.regex.Pattern;

public class RecursiveChunker implements Chunker {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Java doc explaining how and what the RecursiveChunker does

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll add this in.

var mergedChunk = chunkOffsets.getFirst();
for (int i = 1; i < chunkOffsets.size(); i++) {
var potentialMergedChunk = new ChunkOffset(mergedChunk.start(), chunkOffsets.get(i).end());
if (isChunkWithinMaxSize(input, potentialMergedChunk, maxChunkSize)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of word counting going on here, each merged chunk will be recounted from mergedChunk.start() again.

isChunkWithinMaxSize is called again on line 53 on the output of this function. Rather than using ChunkOffset create a record ChunkOffsetAndCount(int wordCount, int start, int end) to track the word counts for each chunk, when merging the chunks sum the word counts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll clean this logic up to minimize word counting operations. I think we can even just keep the chunk offset in tact as part of the ChunkOffsetAndCount object that way we don't have to rebuild it when building our return objects.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants