Optimize segment merging in the tsdb doc value codec #126111

Closed
5 tasks done
martijnvg opened this issue Apr 2, 2025 · 4 comments
Comments

@martijnvg
Member

martijnvg commented Apr 2, 2025

The doc values codec iterates several times over the doc values instance that needs to be written to disk. When merging with index sorting enabled, this is much more expensive, because each iteration over the doc values instance performs a merge sort (in order to return the doc ids from the different segments in index-sort order).

There are several reasons why the doc value instance is iterated multiple times:

  • To compute stats (number of values, number of docs with a value) required for writing values to disk.
  • To write the bitset that indicates which documents have a value (indexed DISI, jump table).
  • To write the actual values to disk.
  • To write the addresses to disk (in case docs have multiple values).

This applies to numeric doc values, and also to the ordinals of sorted (set) doc values.
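To make the cost concrete, here is a minimal toy model in Python (the real codec is Java, and none of these names come from Elasticsearch or Lucene). `MergedView`, `write_multi_pass`, and `write_single_pass` are hypothetical helpers: the view redoes a k-way merge on every iteration, the multi-pass writer walks it once per item in the list above, and the single-pass writer gathers the same outputs in one walk. It only illustrates the general idea and is not how the actual change is structured.

```python
import heapq


class MergedView:
    """Toy stand-in for a doc values instance seen through a sorted merge."""

    def __init__(self, segments):
        # Each segment is a list of (sort_key, values), sorted by sort_key.
        self.segments = segments
        self.merge_passes = 0

    def __iter__(self):
        # Every iteration redoes the k-way merge across all segments.
        self.merge_passes += 1
        return heapq.merge(*self.segments, key=lambda doc: doc[0])


def write_multi_pass(view):
    num_docs_with_value = 0
    num_values = 0
    for _, vals in view:  # pass 1: stats
        if vals:
            num_docs_with_value += 1
            num_values += len(vals)
    # pass 2: which merged doc ids have a value
    docs_with_value = [i for i, (_, vals) in enumerate(view) if vals]
    # pass 3: the values themselves
    values = [v for _, vals in view for v in vals]
    # pass 4: start offsets of each doc's values
    addresses = [0]
    for _, vals in view:
        if vals:
            addresses.append(addresses[-1] + len(vals))
    return (num_docs_with_value, num_values, docs_with_value, values, addresses)


def write_single_pass(view):
    docs_with_value, values, addresses = [], [], [0]
    for doc_id, (_, vals) in enumerate(view):  # the only pass
        if vals:
            docs_with_value.append(doc_id)
            values.extend(vals)
            addresses.append(len(values))
    return (len(docs_with_value), len(values), docs_with_value, values, addresses)
```

Both writers produce the same result; the difference is only in how many times `MergedView.__iter__` runs, which is the part that becomes expensive when index sorting is enabled.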

The following changes should be made to address this performance issue:

martijnvg self-assigned this Apr 2, 2025
elasticsearchmachine added the needs:triage label (requires assignment of a team area label) Apr 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine removed the needs:triage label (requires assignment of a team area label) Apr 2, 2025
jordan-powers added a commit that referenced this issue Apr 18, 2025
When writing the doc values addresses, we currently perform an iteration
over all the sorted numeric doc values to calculate the addresses. When
merging sorted segments, this iteration is expensive as it requires
performing a merge sort.

This patch removes this iteration by instead calculating the addresses
while we are writing the values, writing the addresses to a temporary
file. Afterwards, they are copied from the temporary file into the
merged segment.

Relates to #126111
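
The approach described in this commit message can be sketched roughly as follows. This is a toy model in Python, not the actual Java patch: `write_values_and_addresses` is a hypothetical helper, the fixed 8-byte little-endian encoding is an assumption made for brevity, and a `tempfile.TemporaryFile` stands in for the codec's temporary file.

```python
import shutil
import struct
import tempfile


def write_values_and_addresses(docs, data_out):
    """Write values and addresses with a single pass over `docs`.

    docs: iterable of per-document value lists, already in merged order.
    data_out: a binary, seekable output stream for the merged segment.
    Returns the offset in data_out at which the addresses start.
    """
    with tempfile.TemporaryFile() as tmp:
        offset = 0
        tmp.write(struct.pack("<q", offset))
        for vals in docs:  # the only iteration over the doc values
            if not vals:
                continue
            for v in vals:
                data_out.write(struct.pack("<q", v))
            # Record where the next document's values begin.
            offset += len(vals)
            tmp.write(struct.pack("<q", offset))
        addresses_start = data_out.tell()
        # Copy the buffered addresses behind the values.
        tmp.seek(0)
        shutil.copyfileobj(tmp, data_out)
    return addresses_start
```

The point of the temporary file is that the addresses are only complete once all values have been seen, yet they can be accumulated during the same pass instead of requiring a second merge sort.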
elasticsearchmachine pushed a commit that referenced this issue Apr 18, 2025
@felixbarny
Member

Looks like all sub tasks have been implemented already. Is there anything left to do?

@martijnvg
Member Author

Is there anything left to do?

Almost done here. There is one item left to do; I updated the task list.

Additionally, the plan is to implement bulk merging: if, during merging, the next 128 values (or more) originate from one segment, we can copy the underlying bytes of a block instead of decoding and re-encoding the numbers or ordinals. I will open a separate issue for this.
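
As a rough illustration of the bulk merging idea (not the planned implementation), the decision of when a block can be copied might look like the toy Python sketch below. `plan_output_blocks` is a hypothetical helper, and it assumes fixed-size blocks of 128 values in which a source block is only copyable when it lands on a block boundary in both the source segment and the merged output.

```python
def plan_output_blocks(merged_order, block_size=128):
    """Decide, per output block, whether source bytes can be copied as-is.

    merged_order: list of (segment_id, value_index) pairs in merged order.
    Returns a list of ("copy", segment_id, source_block) or
    ("reencode", num_values) entries, one per output block.
    """
    plan = []
    for start in range(0, len(merged_order), block_size):
        chunk = merged_order[start:start + block_size]
        seg, first = chunk[0]
        # The chunk must be one uninterrupted run from a single segment.
        contiguous = all(
            item == (seg, first + k) for k, item in enumerate(chunk)
        )
        if len(chunk) == block_size and first % block_size == 0 and contiguous:
            # A whole, aligned source block: its bytes can be copied.
            plan.append(("copy", seg, first // block_size))
        else:
            # Mixed or partial block: decode and encode the values again.
            plan.append(("reencode", len(chunk)))
    return plan
```

For example, if the first 128 merged values all come from segment 0 in order and the remaining few alternate between segments, the plan contains one `copy` entry followed by one `reencode` entry.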

@martijnvg
Member Author

All tasks have been implemented.


4 participants