Optimize segment merging in the tsdb doc value codec #126111
Comments
Pinging @elastic/es-storage-engine (Team:StorageEngine)
When writing the doc values addresses, we currently perform an iteration over all the sorted numeric doc values to calculate the addresses. When merging sorted segments, this iteration is expensive as it requires performing a merge sort. This patch removes this iteration by instead calculating the addresses while we are writing the values, writing the addresses to a temporary file. Afterwards, they are copied from the temporary file into the merged segment. Relates to #126111
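The single-pass idea described above can be illustrated with a minimal sketch. This is not the Elasticsearch codec: the class `SinglePassAddressSketch`, the in-memory buffers standing in for the data file and the temporary file, and the address layout (one cumulative value count per document plus a final entry) are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical sketch, not the Elasticsearch codec: plain byte buffers stand in
// for the segment data file and the temporary addresses file.
final class SinglePassAddressSketch {

    // Appends one big-endian long to the buffer.
    private static void putLong(ByteArrayOutputStream out, long value) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        out.write(bytes, 0, bytes.length);
    }

    // Writes all values, then the addresses, while visiting each document once.
    // Assumed layout: docs.size() + 1 cumulative value counts, starting at 0.
    static byte[] write(List<long[]> docs) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        ByteArrayOutputStream tempAddresses = new ByteArrayOutputStream();
        long address = 0;
        for (long[] values : docs) {
            putLong(tempAddresses, address); // start offset of this document
            for (long v : values) {
                putLong(data, v);            // the value itself
            }
            address += values.length;
        }
        putLong(tempAddresses, address);     // end offset of the last document

        // Copy the temporary addresses after the values; no second iteration.
        byte[] addressBytes = tempAddresses.toByteArray();
        data.write(addressBytes, 0, addressBytes.length);
        return data.toByteArray();
    }

    public static void main(String[] args) {
        byte[] out = write(List.of(new long[] {10, 20}, new long[] {30}, new long[] {40, 50, 60}));
        System.out.println(out.length + " bytes written");
    }
}
```

The point of the sketch is that the documents are consumed exactly once; during a merge with index sorting, that avoids a second merge sort just to compute addresses.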
Looks like all subtasks have been implemented already. Is there anything left to do?
Almost done here. There is one item left to do; I updated the task list. Additionally, the plan is to implement bulk merging: if, during merging, the next 128 values (or more) originate from one segment, we can just copy the underlying bytes of a block instead of decoding and re-encoding the numbers or ordinals. I will open a separate issue for this.
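A rough sketch of how such a bulk-merge decision might look, under stated assumptions: the class `BulkMergePlanSketch` and its method are hypothetical, the merge order is given as an array of source segment indices, and a block is only treated as copyable when the run also starts on a 128-value boundary in its source segment. That alignment condition is an assumption of this sketch and is not stated in the comment above.

```java
// Hypothetical sketch: decides per output block whether the encoded bytes of a
// source block could be copied instead of decoding and re-encoding the values.
final class BulkMergePlanSketch {
    static final int BLOCK_SIZE = 128;

    // sourceSegmentPerValue[i] is the segment that supplies the i-th merged value.
    // Returns one flag per full output block; a trailing partial block is ignored.
    static boolean[] planBlocks(int[] sourceSegmentPerValue, int numSegments) {
        int numBlocks = sourceSegmentPerValue.length / BLOCK_SIZE;
        boolean[] copyable = new boolean[numBlocks];
        int[] consumed = new int[numSegments]; // values already read per segment

        for (int block = 0; block < numBlocks; block++) {
            int start = block * BLOCK_SIZE;
            int first = sourceSegmentPerValue[start];
            boolean singleSource = true;
            for (int i = start; i < start + BLOCK_SIZE; i++) {
                if (sourceSegmentPerValue[i] != first) {
                    singleSource = false;
                    break;
                }
            }
            // Assumption: copying needs the run to coincide with a source block.
            copyable[block] = singleSource && consumed[first] % BLOCK_SIZE == 0;

            for (int i = start; i < start + BLOCK_SIZE; i++) {
                consumed[sourceSegmentPerValue[i]]++;
            }
        }
        return copyable;
    }

    public static void main(String[] args) {
        int[] order = new int[256]; // first 128 values from segment 1, rest from segment 0
        java.util.Arrays.fill(order, 0, 128, 1);
        System.out.println(java.util.Arrays.toString(planBlocks(order, 2)));
    }
}
```

Blocks flagged `false` would still go through the regular decode and encode path, so the sketch only models the fast-path decision, not the byte copy itself.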
All tasks have been implemented.
The doc values codec iterates several times over the doc values instance that needs to be written to disk. When merging with index sorting enabled, this is much more expensive, because each iteration over the doc values instance performs a merge sort (in order to return the doc ids from the different segments in index-sort order).
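To make the cost concrete, here is a small sketch of a k-way merge over already sorted segments. It is an illustration only: `MergedIterationSketch` and its `polls` counter are hypothetical, and a `java.util.PriorityQueue` stands in for whatever merge structure the real code uses. Every call to `mergeOnce` repeats the full heap work, which is why each extra pass over the merged doc values is expensive.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: one pass over k sorted segments in global sort order.
final class MergedIterationSketch {
    static long polls = 0; // counts heap polls across all passes

    static long[] mergeOnce(long[][] segments) {
        // Queue entries are {segmentIndex, positionInSegment}, ordered by value.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> Long.compare(segments[a[0]][a[1]], segments[b[0]][b[1]]));
        int total = 0;
        for (int s = 0; s < segments.length; s++) {
            total += segments[s].length;
            if (segments[s].length > 0) {
                queue.add(new int[] {s, 0});
            }
        }
        long[] merged = new long[total];
        int out = 0;
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            polls++;
            merged[out++] = segments[top[0]][top[1]];
            if (top[1] + 1 < segments[top[0]].length) {
                queue.add(new int[] {top[0], top[1] + 1}); // advance that segment
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] segments = {{1, 4, 9}, {2, 3, 10}, {5}};
        mergeOnce(segments); // e.g. one pass to count docs with the field
        mergeOnce(segments); // e.g. another pass to write the values
        System.out.println("heap polls after two passes: " + polls);
    }
}
```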
There are several reasons why the doc value instance is iterated multiple times:
This applies to numeric doc values, but also to the ordinals of sorted (set) doc values.
The following changes should be made to address this performance issue:
- numDocsWithField as metadata and store jump table after the values (Prepare tsdb doc values format for merging optimizations. #125933).
- SortedNumericDocValues (First step optimizing tsdb doc values codec merging. #125403).
- docValueCount while iterating over values and write to later for the address offsets. (Coalesce getSortedNumeric calls for ES819 doc values merging #126732)