-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Use a competitive iterator in FiltersAggregator
#98360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The iterator is used to combine filtering with querying in leaf collection. Its benefit is that rangers with docs that are filtered out by all filters are skipped from doc collection. The competitive iterator is restricted to FiltersAggregator, not used in FilterByFilterAggregator that's already optimized. It only applies to top-level filter aggregations with no "other" bucket defined; the latter leads to collecting all docs so there's no point in skipping doc ranges. Fixes elastic#97544
# Conflicts: # server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
@elasticsearchmachine run elasticsearch-ci/part-2 |
Hi @kkrik-es, I've created a changelog YAML for you. |
Pinging @elastic/es-analytics-geo (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few question. This looks in the right direction.
Did you already check what the speedup of this change is?
I think the filter-aggs challenge in the noaa rally track can give us this information.
@@ -107,7 +107,6 @@ public void testLookbackOnly() throws Exception { | |||
startDatafeed(datafeedConfig.getId(), 0L, now); | |||
assertBusy(() -> { | |||
DataCounts dataCounts = getDataCounts(job.getId()); | |||
assertThat(dataCounts.getProcessedRecordCount(), equalTo(numDocs + numDocs2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these changes to ML tests needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the competitive iterator kicks in sometimes and changess these metadata. We probably need someone who knows more about these tests to review the changes and, more importantly, assert that the actual output doesn't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this needs double checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether filters aggs get used by data feeds, but even it is being used the counts that are returned shouldn't change. How did these assertions fail?
server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
Outdated
Show resolved
Hide resolved
final int maxDocId; | ||
|
||
// Tracks next doc matches as they get returned from the per-filter iterators. | ||
final PriorityQueue<Integer> nextDocs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use Lucene's PriorityQueue
instead here and that contains DocIdSetIterator
instances?
(PriorityQueue#lessThan()
would compare DocIdSetIterartor#docId()
and each time a DocIdSetIterartor #advance(...)
is invoked PriorityQueue#updateTop(...)
should be invoked )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this pattern is used in other places as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm that may be somewhat tricky to use as we want to keep the last doc per filter to update the corresponding bucket. Let's discuss this offline in our sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can extract the last docid from this priority queue and use it in several places. But let's chat more about this in the sync.
|
||
final DocIdSetIterator[] filteredDocIdSetIterators; | ||
final int[] lastMatches; | ||
final DocId currentDocID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the DocId
class? This can just be a mutable int field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait I see, this used in collector as well...
server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
Outdated
Show resolved
Hide resolved
Hi @kkrik-es, I've updated the changelog YAML for you. |
Rally reports a decent improvement, up to 18% win in 50 pct latency:
|
Adding @jpountz to take a look on search side, please feel free to reassign if too busy. |
final TwoPhaseIterator twoPhase = scorer.twoPhaseIterator(); | ||
final DocIdSetIterator iterator = (twoPhase != null) ? twoPhase.approximation() : scorer.iterator(); | ||
|
||
return new DocIdSetIterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI Lucene already has this via TwoPhaseIterator#asDocIdSetIterator
. That said, I don't think we should use it because it defeats the point of using a TwoPhaseIterator
? For instance, if the main query matches doc IDs 3 and 10 only, and we call advance(3)
on the filter of the filter
aggregation: if 3 doesn't match, it will then try out 4 then 5... until it finds one that matches, but we don't care about them since they don't match the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, the competitive aggregator would ignore the TwoPhaseIterator
and just look at the approximation part, so that TwoPhaseIterator#matches
would only be called when both the main query and the filter match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Adrien. I updated FiltersAggregator
, is this what you had in mind? If so, I'll go ahead and revert changes to Lucene.java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly. I pushed a small change that shows what I have in mind athttps://github.com/elastic/elasticsearch/compare/main...jpountz:elasticsearch:filter_competitive_iterator_poc?expand=1. Beware it's not tested and has at least one bug, but hopefully it helps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is much cleaner. Ptal
server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/aggregations/bucket/filter/FiltersAggregator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good! The only remaining issue is the changes to the ml data feed integration tests.
@@ -107,7 +107,6 @@ public void testLookbackOnly() throws Exception { | |||
startDatafeed(datafeedConfig.getId(), 0L, now); | |||
assertBusy(() -> { | |||
DataCounts dataCounts = getDataCounts(job.getId()); | |||
assertThat(dataCounts.getProcessedRecordCount(), equalTo(numDocs + numDocs2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether filters aggs get used by data feeds, but even it is being used the counts that are returned shouldn't change. How did these assertions fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left minor suggestions but this looks good to me. Is my understanding correct that this change also optimizes the filter
aggregation, in addition to the filters
aggregation? I'd be interested in looking at updated benchmark numbers since the logic changed since you first looked into performance numbers.
.../src/main/java/org/elasticsearch/search/aggregations/bucket/filter/QueryToFilterAdapter.java
Show resolved
Hide resolved
IntPredicate[] docFilters = new IntPredicate[filters().size()]; | ||
for (int filterOrd = 0; filterOrd < filters().size(); filterOrd++) { | ||
docFilters[filterOrd] = filters().get(filterOrd).matchingDocIds(aggCtx.getLeafReaderContext()); | ||
final boolean useCompetitiveIterator = (parent == null && otherBucketKey == null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should also not use the competitive iterator if the sum of the costs (scorer.iterator().cost()
) of filters is greater than or equal to maxDoc
, which suggests that most documents would match one filter, ie. configuring a competitive iterator would add a bit of overhead but unlikely save evaluating some documents.
Yes, the |
public DocIdSetIterator competitiveIterator() throws IOException { | ||
if (useCompetitiveIterator) { | ||
// A DocIdSetIterator view of the filterIterators heap | ||
return new DisjunctionDISIApproximation(filterIterators); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code of DisjunctionDISIApproximation
, it will not like if filterIterators
is empty. Maybe check for emptiness and return a DocIdSetIterator.empty()
in that case? It should be possible to write a test case by using filters on terms that do not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, yeah that's why I was returning a scorer with an empty DISI instead of null. I updated the code to address this.
Note that, if filterIterators is empty, we won't be using the competitive iterator in the latest version.
The numbers I got from my latest run are more of a mixed bag. Granted, I'm using a different (better) laptop (M2 vs M1, 32GB vs 16GB), but I'd hope there would be a more clear benefit across the board. Still, there's a win in tail latency up to 40%:
|
A potential idea for speeding this up further would be to specialize for the case when there is a single filter, which I assume is quite common. In this case we can save the overhead of the heap in collect() and in the |
👍 Let's create a followup issue for this. |
I opened #99202. |
The iterator is used to combine filtering with querying in leaf
collection. Its benefit is that rangers with docs that are filtered out
by all filters are skipped from doc collection.
The competitive iterator is restricted to FiltersAggregator, not used in
FilterByFilterAggregator that's already optimized. It only applies to
top-level filter aggregations with no "other" bucket defined; the latter
leads to collecting all docs so there's no point in skipping doc ranges.
Fixes #97544