Skip to content

Implement competitiveIterator() on FilterAggregator #97544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jpountz opened this issue Jul 11, 2023 · 1 comment · Fixed by #98360
Closed

Implement competitiveIterator() on FilterAggregator #97544

jpountz opened this issue Jul 11, 2023 · 1 comment · Fixed by #98360
Assignees
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2023

Description

We have had this longstanding issue that running a top-level filter aggregation is often (not always, e.g. not if you need the total hit count too) a performance bug, as Elasticsearch would iterate over docs that match the query and would then check each of these docs against the filter of the filter aggregation. It's usually a better idea to put the filter directly into the query, where it can be used to drive iteration if appropriate.

We could do this automatically by implementing the LeafCollector#competitiveIterator API that was introduced for dynamic pruning, in order to automatically start intersecting the query with the filter of the filter aggregator once enough hits have been counted.

@jpountz jpountz added >enhancement needs:triage Requires assignment of a team area label :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Jul 11, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 11, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@kkrik-es kkrik-es self-assigned this Jul 14, 2023
kkrik-es added a commit to kkrik-es/elasticsearch that referenced this issue Aug 10, 2023
The iterator is used to combine filtering with querying in leaf
collection. Its benefit is that rangers with docs that are filtered out
by all filters are skipped from doc collection.

The competitive iterator is restricted to FiltersAggregator, not used in
FilterByFilterAggregator that's already optimized. It only applies to
top-level filter aggregations with no "other" bucket defined; the latter
leads to collecting all docs so there's no point in skipping doc ranges.

Fixes elastic#97544
kkrik-es added a commit that referenced this issue Sep 5, 2023
* Use a competitive iterator in FiltersAggregator.

The iterator is used to combine filtering with querying in leaf
collection. Its benefit is that rangers with docs that are filtered out
by all filters are skipped from doc collection.

The competitive iterator is restricted to FiltersAggregator, not used in
FilterByFilterAggregator that's already optimized. It only applies to
top-level filter aggregations with no "other" bucket defined; the latter
leads to collecting all docs so there's no point in skipping doc ranges.

Fixes #97544

* Fix function name.

* Advance iterator on two-phase mismatch.

* Restore docid tracking.

* Fix failing tests.

* Fix failing test.

* Fix more tests.

* Update docs/changelog/98360.yaml

* More test fixes.

* Update docs/changelog/98360.yaml

* Skip checking useCompetitiveIterator in collect

* Find approximate matches in CompetitiveIterator

* Use DisiPriorityQueue to simplify FiltersAggregator

* Skip competitive iterator when all docs match.

* Check for empty priority queue.
kkrik-es added a commit that referenced this issue Sep 6, 2023
* Use a competitive iterator in FiltersAggregator.

The iterator is used to combine filtering with querying in leaf
collection. Its benefit is that rangers with docs that are filtered out
by all filters are skipped from doc collection.

The competitive iterator is restricted to FiltersAggregator, not used in
FilterByFilterAggregator that's already optimized. It only applies to
top-level filter aggregations with no "other" bucket defined; the latter
leads to collecting all docs so there's no point in skipping doc ranges.

Fixes #97544

* Fix function name.

* Advance iterator on two-phase mismatch.

* Restore docid tracking.

* Fix failing tests.

* Fix failing test.

* Fix more tests.

* Update docs/changelog/98360.yaml

* More test fixes.

* Update docs/changelog/98360.yaml

* Skip checking useCompetitiveIterator in collect

* Find approximate matches in CompetitiveIterator

* Use DisiPriorityQueue to simplify FiltersAggregator

* Skip competitive iterator when all docs match.

* Check for empty priority queue.

* Skip DisiPriorityQueue on single filter agg.

When FiltersAggregator has a single filter, there is no benefit in using
a DisiPriorityQueue as the heap will only contain values from a single
iterator. In such a case, it's preferable to use the filtering
approximation iterator directly as competitive iterator.

Fixes #99202

* Update docs/changelog/99215.yaml

* Use FilterMatchingDisiWrapper in leaf collectors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants