Skip to content

Push down field extraction to time-series source #127445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 28, 2025

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Apr 27, 2025

This change pushes down field extractions to the time-series source operator, providing these advantages:

  • Avoids building DocVector and its forward/backward maps.
  • Leverages the DocValues cache (i.e., blocks that are already decompressed/decoded) when loading values, which can be lost when reading blocks with the ValuesSourceReaderOperator.
  • Eliminates the need to rebuild blocks with backward mappings after reading values.

The following query against the TSDB track previously took 19 seconds but was reduced to 13 seconds with this change:

TS tsdb 
| STATS sum(rate(kubernetes.container.memory.pagefaults)) by bucket(@timestamp, 5minute)

Note that with this change:

TS tsdb 
| STATS sum(rate(kubernetes.container.memory.pagefaults)) by bucket(@timestamp, 5minute)

now performs as well as:

FROM tsdb 
| STATS sum(last_over_time(kubernetes.container.memory.pagefaults)) by bucket(@timestamp, 5minute)

when using the shard level data partitioning. This means the performance of the TS command is comparable to the FROM command, except that it does not yet support segment-level or doc-level concurrency. I will try to add support for segment-level concurrency, as document-level partitioning is not useful when iterating over documents in order.

@dnhatn dnhatn force-pushed the time-series-field-extraction branch from cda5378 to f68492b Compare April 27, 2025 23:55
@dnhatn dnhatn changed the title Push field extraction to time-series source Push down field extraction to time-series source Apr 27, 2025
@dnhatn dnhatn force-pushed the time-series-field-extraction branch 4 times, most recently from db193c6 to e48b3e2 Compare April 28, 2025 04:44
@dnhatn dnhatn force-pushed the time-series-field-extraction branch from e48b3e2 to aa331a5 Compare April 28, 2025 15:05
@dnhatn dnhatn added :StorageEngine/TSDB You know, for Metrics >non-issue labels Apr 28, 2025
@dnhatn dnhatn marked this pull request as ready for review April 28, 2025 15:08
@dnhatn dnhatn requested a review from kkrik-es April 28, 2025 15:08
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn dnhatn requested a review from not-napoleon April 28, 2025 15:08
if (plan.anyMatch(p -> p instanceof EsQueryExec q && q.indexMode() == IndexMode.TIME_SERIES) == false) {
return plan;
}
final List<FieldExtractExec> pushDownExtracts = new ArrayList<>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change in planning, where field extractions are pushed down to the time-series source

}
}

static final class ShardLevelFieldsReader implements Releasable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we cannot reuse the ValuesSourceReaderOperator and must duplicate some logic here.

timestampsBuilder = blockFactory.newLongVectorBuilder(Math.min(remainingDocs, maxPageSize));
tsids = tsHashesBuilder.build();
int blockIndex = 0;
if (emitDocIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (docCollector != null) {

for consistency?

import java.util.Set;

/**
* A rule that pushes down field extractions to occur before filter/limit/topN in the time-series source plan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want filtering to happen before field extraction? Or combine them, at least? Maybe I misunderstood this comment..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the query TS index | WHERE host = 'a' | STATS max(rate(counter)) BY host, bucket(1minute), we should extract only the host field (and tsid and timestamp) in the time-series source command. The counter field should be extracted later by the ValuesSourceReaderOperator because we don't know how many rows of will be filtered out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, what if we have

TS index | WHERE host = 'a' AND TRANGE(1hour) | STATS max(rate(counter)) BY host, bucket(1minute)

I'd think we want to push down the filters on host and @timestamp to Lucene, to run before extracting the fields. If this is the case, maybe clarify in the comment that filters on the extracted fields are still pushed down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expanded the javadoc: 164d788

@dnhatn dnhatn requested a review from kkrik-es April 28, 2025 17:55
} else {
sourceLoader = null;
}
this.storedFieldsSpec = storedFieldsSpec;
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, removed 144f13d

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be lying if I claimed I fully understand this change.. It'd be great if Mark can also take a look, or maybe Nik. That said, we'll be iterating on this code, I expect it to be extended and updated iteratively.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 28, 2025

Thanks Kostas!

@dnhatn dnhatn merged commit d65f34d into elastic:main Apr 28, 2025
16 of 17 checks passed
@dnhatn dnhatn deleted the time-series-field-extraction branch April 28, 2025 21:41
@dnhatn dnhatn mentioned this pull request Apr 28, 2025
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants