Optimize segment merging in the tsdb doc value codec #126111

martijnvg · 2025-04-02T09:27:39Z

The doc values codec iterates several times over each doc values instance that it writes to disk. During a merge with index sorting enabled this is much more expensive, because every iteration over the doc values instance performs a merge sort (in order to return the doc ids from the different segments in index sort order).

There are several reasons why the doc value instance is iterated multiple times:

To compute the stats (number of values, number of docs with a value) required for writing values to disk.
To write the bitset that indicates which documents have a value (IndexedDISI, jump table).
To write the actual values to disk.
To write the addresses to disk (in case docs have multiple values).

This applies to numeric doc values, and also to the ordinals of sorted (set) doc values.

The following changes should be made to address this performance issue:

Change the tsdb doc values format to allow storing numDocsWithField as metadata and to store the jump table after the values (Prepare tsdb doc values format for merging optimizations. #125933).
Reuse the statistics stored in the metadata during merging instead of computing them on the fly by creating a merged SortedNumericDocValues (First step optimizing tsdb doc values codec merging. #125403).
Keep track of the documents with a value while iterating over the values and use that to write the jump table later (Tsdb doc values inline building jump table #126499).
Keep track of docValueCount while iterating over the values and use it later to write the address offsets (Coalesce getSortedNumeric calls for ES819 doc values merging #126732).
Optimize merging of binary doc values by accumulating the offsets and the DISI, so that we iterate only once (Apply TSDB jump table and offset construction optimizations to binary doc values #127278).
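
To illustrate the idea behind these steps, here is a minimal sketch of collecting the stats, the documents with a value, and the address offsets in one iteration. It is not the Elasticsearch implementation: the class `SinglePassDocValuesSketch`, the `Result` holder and the `long[][]` input are hypothetical stand-ins for the codec's consumer and for a merged SortedNumericDocValues.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical single-pass accumulator; all names here are illustrative assumptions. */
final class SinglePassDocValuesSketch {

    /** Everything the later write phases need, gathered during one iteration. */
    static final class Result {
        int numDocsWithField;                            // stat that could be stored as metadata
        long numValues;                                  // stat that could be stored as metadata
        final BitSet docsWithField = new BitSet();       // input for the bitset / jump table
        final List<Long> values = new ArrayList<>();     // stand-in for the encoded value stream
        final List<Long> addresses = new ArrayList<>();  // start offset per doc plus a final end offset
    }

    /**
     * valuesPerDoc[docId] holds that document's values in merged order;
     * an empty array means the document has no value for the field.
     */
    static Result consumeOnce(long[][] valuesPerDoc) {
        Result r = new Result();
        r.addresses.add(0L);
        for (int docId = 0; docId < valuesPerDoc.length; docId++) {
            long[] docValues = valuesPerDoc[docId];
            if (docValues.length == 0) {
                continue; // nothing to record for documents without a value
            }
            r.numDocsWithField++;
            r.docsWithField.set(docId);
            for (long v : docValues) {
                r.values.add(v);
            }
            r.numValues += docValues.length;
            r.addresses.add(r.numValues); // docValueCount folded into a running offset
        }
        return r;
    }
}
```

With index sorting, each iteration of the real merged instance costs a merge sort, so folding the four passes listed above into one is where the saving would come from.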


elasticsearchmachine · 2025-04-02T10:09:41Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

When writing the doc values addresses, we currently perform an iteration over all the sorted numeric doc values to calculate the addresses. When merging sorted segments, this iteration is expensive as it requires performing a merge sort. This patch removes this iteration by instead calculating the addresses while we are writing the values, writing the addresses to a temporary file. Afterwards, they are copied from the temporary file into the merged segment. Relates to #126111
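
A rough sketch of that approach, under simplifying assumptions: two in-memory streams stand in for the merged segment file and the temporary file, values and addresses are written as plain longs rather than in the codec's encoded form, and the names `TempAddressesSketch`, `writeValuesThenAddresses` and `readLongs` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

/** Hypothetical illustration only; not the ES819 doc values consumer. */
final class TempAddressesSketch {

    /** Writes all values, then appends the addresses that were buffered while writing them. */
    static byte[] writeValuesThenAddresses(long[][] valuesPerDoc) {
        ByteArrayOutputStream segmentBytes = new ByteArrayOutputStream(); // stand-in for the merged segment
        ByteArrayOutputStream tempBytes = new ByteArrayOutputStream();    // stand-in for the temporary file
        try (DataOutputStream segment = new DataOutputStream(segmentBytes);
             DataOutputStream temp = new DataOutputStream(tempBytes)) {
            long address = 0;
            temp.writeLong(address);
            for (long[] docValues : valuesPerDoc) {
                if (docValues.length == 0) {
                    continue; // documents without a value get no address entry
                }
                for (long v : docValues) {
                    segment.writeLong(v); // the single pass over the values
                }
                address += docValues.length;
                temp.writeLong(address);  // address computed during the same pass
            }
            temp.flush();
            segment.write(tempBytes.toByteArray()); // copy the addresses after the values
            segment.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return segmentBytes.toByteArray();
    }

    /** Decodes the big-endian longs written above, for inspection. */
    static long[] readLongs(byte[] bytes) {
        LongBuffer buffer = ByteBuffer.wrap(bytes).asLongBuffer();
        long[] out = new long[buffer.remaining()];
        buffer.get(out);
        return out;
    }
}
```

For the input `{{7, 8}, {}, {9}}` the output holds the values 7, 8, 9 followed by the addresses 0, 2, 3. The extra copy from the temporary buffer is a sequential read and write, with no second merge sort involved.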

felixbarny · 2025-04-22T08:42:18Z

Looks like all sub tasks have been implemented already. Is there anything left to do?

martijnvg · 2025-04-22T08:50:53Z

Is there anything left to do?

Almost done here. There is one item left to do; I updated the task list.

Additionally, the plan is to also implement bulk merging. If during merging, the next 128 values (or more) originate from one segment, then we can just copy the underlying bytes of a block instead of decoding and encoding the numbers or ordinals. I will open a separate issue for this.
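
As a rough illustration of the check this would need, the sketch below counts how many 128-value output blocks draw all of their values from a single source segment. It is only an assumption about how such a check could look: `BulkMergeSketch`, `countCopyableBlocks` and the `sourceSegmentPerValue` array are hypothetical, and the sketch ignores whether the run also lines up with a block boundary in the source segment, which a real byte-level copy would have to take into account.

```java
/** Hypothetical illustration of spotting bulk-copy opportunities during a merge. */
final class BulkMergeSketch {

    static final int BLOCK_SIZE = 128; // block size assumed from the "next 128 values" above

    /**
     * sourceSegmentPerValue[i] is the index of the segment that supplies the i-th value
     * in merged order. Returns the number of full output blocks fed by one segment only.
     */
    static int countCopyableBlocks(int[] sourceSegmentPerValue) {
        int copyable = 0;
        for (int start = 0; start + BLOCK_SIZE <= sourceSegmentPerValue.length; start += BLOCK_SIZE) {
            int segment = sourceSegmentPerValue[start];
            boolean singleSource = true;
            for (int i = start + 1; i < start + BLOCK_SIZE; i++) {
                if (sourceSegmentPerValue[i] != segment) {
                    singleSource = false; // mixed block: would still be decoded and re-encoded
                    break;
                }
            }
            if (singleSource) {
                copyable++;
            }
        }
        return copyable;
    }
}
```

Blocks that pass the check are candidates for copying the underlying bytes, and every other block would go through the usual decode and encode path.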

martijnvg · 2025-04-25T12:29:51Z

All tasks have been implemented.

martijnvg self-assigned this Apr 2, 2025

elasticsearchmachine added the needs:triage label Apr 2, 2025

martijnvg added Meta :StorageEngine/Codec labels Apr 2, 2025

elasticsearchmachine added the Team:StorageEngine label Apr 2, 2025

elasticsearchmachine removed the needs:triage label Apr 2, 2025

jordan-powers mentioned this issue Apr 11, 2025

Coalesce getSortedNumeric calls for ES819 doc values merging #126732

Merged

martijnvg assigned jordan-powers and unassigned martijnvg Apr 23, 2025

jordan-powers mentioned this issue Apr 23, 2025

Apply TSDB jump table and offset construction optimizations to binary doc values #127278

Merged

martijnvg closed this as completed Apr 25, 2025
