Skip to content

Tsdb doc values inline building jump table #126499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 17, 2025

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Apr 9, 2025

Build jump table (disi) while iterating over SortedNumericDocValues for encoding the values, instead of separately iterating over SortedNumericDocValues just to build the jump table.

In case when indexing sorting is active, this requires an additional sorting of segments while merging.

Follow up from #125403

Build jump table (disi) when iterating over SortedNumericDocValues, instead of separately iterating over SortedNumericDocValues.

In case when indexing sorting is active, this requires an additional merge sort.

Follow up from elastic#125403
@martijnvg
Copy link
Member Author

martijnvg commented Apr 11, 2025

The latest micro benchmark result on top of this PR.

Benchmark                                                          (deltaTime)   (nDocs)  (seed)  Mode  Cnt      Score   Error  Units
TSDBDocValuesMergeBenchmark.forceMergeDenseWithOptimizedMerge             1000  20431204      42    ss       12322.078          ms/op
TSDBDocValuesMergeBenchmark.forceMergeDenseWithoutOptimizedMerge          1000  20431204      42    ss       18732.804          ms/op
TSDBDocValuesMergeBenchmark.forceMergeSparseWithOptimizedMerge            1000  20431204      42    ss       10461.210          ms/op
TSDBDocValuesMergeBenchmark.forceMergeSparseWithoutOptimizedMerge         1000  20431204      42    ss       13807.956          ms/op

Two new benchmark methods were added that test the sparse case. The dense benchmark methods were the same as before. The benchmark method changed to Mode.SingleShotTime, which makes more sense given that only one force merge is executed per benchmark method. In both sparse and dense case the optimized merge is faster.

@dnhatn dnhatn self-requested a review April 11, 2025 20:53
Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am good with this approach. Thanks Martijn!

encoder.encode(buffer, data);
IndexOutput disiTempOutput = null;
String skipListTempFileName = null;
IndexedDISIBuilder docIdSetBuilder = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make IndexedDISIBuilder (or maybe Accumulator is a better name?) Closable, pass a Directory in its constructor, and pass the IndexOutput in the build (or flush) method to copy the temporary output to the data output? This way, we only need a single reference here, making the code more manageable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really made the code more manageable! I also forked the original IndexedDISI tests and adapted that for DISIAccumulator.

values = valuesProducer.getSortedNumeric(field);
final int bitsPerOrd = maxOrd >= 0 ? PackedInts.bitsRequired(maxOrd - 1) : -1;
if (enableOptimizedMerge && numDocsWithValue < maxDoc) {
// TODO: which IOContext should be used here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use MERGE for this IOContext?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a MERGE constant for io context. But I did the following instead: 45308e4

I think this way we always get most appropriate io context? In case of merge it will be an io context for merging?

@dnhatn dnhatn self-requested a review April 15, 2025 19:09
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks Martijn!

this.denseRankPower = denseRankPower;

this.origo = disiTempOutput.getFilePointer(); // All jumps are relative to the origo
if ((denseRankPower < 7 || denseRankPower > 15) && denseRankPower != -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move this check above dir.createTempOutput to avoid leaks if the check fails.

@martijnvg martijnvg added the auto-backport Automatically create backport pull requests when merged label Apr 17, 2025
@martijnvg martijnvg enabled auto-merge (squash) April 17, 2025 07:45
@martijnvg
Copy link
Member Author

Recent run on top of latest commit of this pr:

Benchmark                                                          (deltaTime)   (nDocs)  (seed)  Mode  Cnt      Score   Error  Units
TSDBDocValuesMergeBenchmark.forceMergeDenseWithOptimizedMerge             1000  20431204      42    ss        8507.824          ms/op
TSDBDocValuesMergeBenchmark.forceMergeDenseWithoutOptimizedMerge          1000  20431204      42    ss       12509.560          ms/op
TSDBDocValuesMergeBenchmark.forceMergeSparseWithOptimizedMerge            1000  20431204      42    ss       10777.702          ms/op
TSDBDocValuesMergeBenchmark.forceMergeSparseWithoutOptimizedMerge         1000  20431204      42    ss       17307.381          ms/op

@martijnvg martijnvg merged commit 0d41e9a into elastic:main Apr 17, 2025
16 of 17 checks passed
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Apr 17, 2025
Build jump table (disi) while iterating over SortedNumericDocValues for encoding the values, instead of separately iterating over SortedNumericDocValues just to build the jump table.

In case when indexing sorting is active, this requires an additional merge sort. Follow up from elastic#125403
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Apr 17, 2025
Build jump table (disi) while iterating over SortedNumericDocValues for encoding the values, instead of separately iterating over SortedNumericDocValues just to build the jump table.

In case when indexing sorting is active, this requires an additional merge sort. Follow up from #125403
jordan-powers added a commit that referenced this pull request Apr 24, 2025
Applies the merge optimizations from #126499 and #126732 to binary field
types for the ES819 codec.
elasticsearchmachine pushed a commit that referenced this pull request Apr 24, 2025
#127346)

Applies the merge optimizations from #126499 and #126732 to binary field
types for the ES819 codec.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants