Add recursive chunker #126866

dan-rubinstein · 2025-04-15T18:37:13Z

Description

This change adds the recursive chunking strategy (see the Javadoc in RecursiveChunker.java to learn how it works). The recursive chunker comes with a default separator set for plaintext, along with one for markdown documents.
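
For readers unfamiliar with the technique, the following is a minimal sketch of recursive chunking in general, not the code from this PR. It assumes whitespace-delimited word counting and literal separators ordered from coarse to fine. The names RecursiveChunkSketch, chunk, and countWords are hypothetical. The real RecursiveChunker works with offsets and merges pieces, as the review comments on this PR show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RecursiveChunkSketch {

    // Assumption: a "word" is a maximal run of non-whitespace characters.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Splits text on separators.get(level); any piece that is still too large
    // is split again with the next (finer) separator. Separators are dropped.
    static List<String> chunk(String text, List<String> separators, int level, int maxWords) {
        List<String> chunks = new ArrayList<>();
        if (countWords(text) <= maxWords || level >= separators.size()) {
            // Small enough, or no separators left. A real implementation needs
            // a fallback for the oversized case; this sketch emits the piece as-is.
            if (!text.isBlank()) {
                chunks.add(text);
            }
            return chunks;
        }
        for (String piece : text.split(Pattern.quote(separators.get(level)))) {
            chunks.addAll(chunk(piece, separators, level + 1, maxWords));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String doc = "one two three\n\nfour five\nsix seven eight nine";
        // Split on blank lines first, then on single newlines, with at most 3 words per chunk.
        System.out.println(chunk(doc, List.of("\n\n", "\n"), 0, 3));
    }
}
```

In this sketch the last line of the sample input has four words and no finer separator remains, so it is emitted oversized. That case is where an actual chunker would need a fallback strategy.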

Testing

Unit testing
Manually tested chunking the CONTRIBUTING.md document.

elasticsearchmachine · 2025-04-15T18:37:37Z

Hi @dan-rubinstein, I've created a changelog YAML for you.

davidkyle · 2025-04-25T13:01:11Z

...nference/src/test/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunkerTests.java

+
+public class RecursiveChunkerTests extends ESTestCase {
+
+    private final List<String> TEST_SEPARATORS = List.of("\n\n", "\n", "\f", "\t", "#");


Please add some tests that split by regex

Definitely, I'll add tests that split by regex.

davidkyle · 2025-04-25T13:09:08Z

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

+        var chunkOffsets = new ArrayList<ChunkOffset>();
+        int chunkStart = 0;
+        int searchStart = 0;
+        while (matcher.find(searchStart)) {


Looking at the docs, you can use matcher.find() here without the searchStart parameter. The matcher has internal state that tracks the position; find(int) resets that state and then searches from searchStart.

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/regex/Matcher.html#find()

Please add some tests for this function

Good catch, I'll remove the searchStart. Can you clarify what you mean by adding tests for this function? It's a private function in the class so the overall chunking tests should be covering this?
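
To illustrate the point about Matcher state, here is a small standalone sketch (not code from the PR; the class and method names are hypothetical). Both loops visit the same matches for the sample pattern, but find() continues from the matcher's internal position, while find(int) resets the matcher on every call and makes the caller advance the index by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherFindSketch {

    // Collects the end offset of every separator match using find(),
    // which resumes after the previous match.
    static List<Integer> endsWithFind(String input, Pattern separator) {
        List<Integer> ends = new ArrayList<>();
        Matcher matcher = separator.matcher(input);
        while (matcher.find()) {
            ends.add(matcher.end());
        }
        return ends;
    }

    // Equivalent loop with find(int): each call resets the matcher and then
    // searches from searchStart, so the position is tracked manually.
    static List<Integer> endsWithFindFrom(String input, Pattern separator) {
        List<Integer> ends = new ArrayList<>();
        Matcher matcher = separator.matcher(input);
        int searchStart = 0;
        // find(int) throws IndexOutOfBoundsException past the end of the input.
        while (searchStart <= input.length() && matcher.find(searchStart)) {
            ends.add(matcher.end());
            // Step past zero-length matches to avoid looping forever.
            searchStart = matcher.end() > matcher.start() ? matcher.end() : matcher.end() + 1;
        }
        return ends;
    }

    public static void main(String[] args) {
        Pattern newlines = Pattern.compile("\n+");
        String input = "a\n\nb\nc";
        System.out.println(endsWithFind(input, newlines));
        System.out.println(endsWithFindFrom(input, newlines));
    }
}
```

The first variant is shorter and has no index bookkeeping, which is the simplification suggested in the review.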

davidkyle · 2025-04-25T13:11:12Z

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

+import java.util.List;
+import java.util.regex.Pattern;
+
+public class RecursiveChunker implements Chunker {


Please add Javadoc explaining what the RecursiveChunker does and how it works.

Sure, I'll add this in.

davidkyle · 2025-04-25T13:20:05Z

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

+        var mergedChunk = chunkOffsets.getFirst();
+        for (int i = 1; i < chunkOffsets.size(); i++) {
+            var potentialMergedChunk = new ChunkOffset(mergedChunk.start(), chunkOffsets.get(i).end());
+            if (isChunkWithinMaxSize(input, potentialMergedChunk, maxChunkSize)) {


There is a lot of word counting going on here: each merged chunk will be recounted from mergedChunk.start() again.

isChunkWithinMaxSize is called again on line 53 on the output of this function. Rather than using ChunkOffset, create a record ChunkOffsetAndCount(int wordCount, int start, int end) to track the word count for each chunk; when merging chunks, sum the word counts.

Sure, I'll clean this logic up to minimize word counting operations. I think we can even keep the chunk offset intact as part of the ChunkOffsetAndCount object, so that we don't have to rebuild it when building our return objects.
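
A minimal sketch of the idea discussed in this thread, assuming the shape proposed in the reply (the offset kept intact, plus a word count). It is not the code from the PR: the record layout, the mergeChunks name, and the greedy merge policy are assumptions for illustration. It also assumes that no word spans a chunk boundary, so that summing per-chunk counts equals recounting the merged text.

```java
import java.util.ArrayList;
import java.util.List;

class MergeChunksSketch {

    // Mirrors the accessors used in the diff (start(), end()); defined here
    // so the sketch compiles on its own.
    record ChunkOffset(int start, int end) {}

    // Hypothetical shape: the offset stays intact and carries its word count.
    record ChunkOffsetAndCount(ChunkOffset chunkOffset, int wordCount) {}

    // Greedily merges adjacent chunks while the summed word count stays within
    // maxChunkSize. No text is recounted; counts are only added together.
    static List<ChunkOffsetAndCount> mergeChunks(List<ChunkOffsetAndCount> chunks, int maxChunkSize) {
        List<ChunkOffsetAndCount> merged = new ArrayList<>();
        if (chunks.isEmpty()) {
            return merged;
        }
        ChunkOffsetAndCount current = chunks.get(0);
        for (int i = 1; i < chunks.size(); i++) {
            ChunkOffsetAndCount next = chunks.get(i);
            int combined = current.wordCount() + next.wordCount();
            if (combined <= maxChunkSize) {
                // Extend the current chunk to the end of the next one.
                ChunkOffset span = new ChunkOffset(current.chunkOffset().start(), next.chunkOffset().end());
                current = new ChunkOffsetAndCount(span, combined);
            } else {
                merged.add(current);
                current = next;
            }
        }
        merged.add(current);
        return merged;
    }

    public static void main(String[] args) {
        List<ChunkOffsetAndCount> chunks = List.of(
            new ChunkOffsetAndCount(new ChunkOffset(0, 10), 2),
            new ChunkOffsetAndCount(new ChunkOffset(10, 25), 3),
            new ChunkOffsetAndCount(new ChunkOffset(25, 40), 4)
        );
        // With a limit of 5 words, the first two chunks merge and the third stays separate.
        System.out.println(mergeChunks(chunks, 5));
    }
}
```

Each input chunk is counted once up front, so merging costs one addition per pair instead of a recount from the start of the merged span.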

elasticsearchmachine · 2025-04-28T18:03:13Z

Pinging @elastic/ml-core (Team:ML)

Add recursive chunker

5167b21

dan-rubinstein added the >enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, and v9.1.0 labels Apr 15, 2025

Update docs/changelog/126866.yaml

7d9e07c

github-actions bot deployed to docs-preview April 15, 2025 18:38 View deployment

Merge branch 'main' into recursive-chunking-strategy

8418223

github-actions bot deployed to docs-preview April 16, 2025 18:28 View deployment

Clean up separator sets and add asMap function for RecrusiveChunkingSettings

0685124

github-actions bot deployed to docs-preview April 23, 2025 19:48 View deployment

davidkyle reviewed Apr 25, 2025

Add javadoc for chunker, add tests, reduce word counting operations

f40947a

dan-rubinstein marked this pull request as ready for review April 28, 2025 18:02

github-actions bot deployed to docs-preview April 28, 2025 18:03 View deployment
