lookup join first draft #123719

Closed · wants to merge 231 commits

Conversation

georgewallace
Contributor

No description provided.

@georgewallace georgewallace added >docs General docs changes Team:Docs Meta label for docs team labels Feb 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

Contributor

@alex-spies alex-spies left a comment


Thank you @georgewallace , this is shaping up nicely!

I have another round of comments.

In case of name collisions, the newly created columns will override existing columns.
Contributor


We should also provide examples of this.
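For illustration, a minimal sketch of the collision behaviour (index and field names are hypothetical):

```esql
FROM app_logs                              // app_logs already has an `env` column
| LOOKUP JOIN host_inventory ON host_name  // host_inventory documents also carry an `env` field
| KEEP host_name, env                      // for matching rows, `env` now holds the value from host_inventory
```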

* `LOOKUP JOIN` can only use a single match field, and can only use a single index. Wildcards, aliases, and data streams are not supported.
* The name of the match field in `LOOKUP JOIN lu_idx ON match_field` must match an existing field in the query. This may require renames or evals to achieve it, as sketched below.
* The query will circuit break if you fetch too much data in a single page. A large heap is needed to manage results of multiple megabytes.
* This limit is per page of data, which is about 10,000 rows.
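For the second point, a minimal sketch (index and field names are hypothetical) of aligning the join key via a rename:

```esql
FROM employees
| RENAME languages AS language_code             // align the column name with the lookup index's field
| LOOKUP JOIN languages_lookup ON language_code
```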
Contributor


I don't know if pages of data are a well-defined concept (at least user-facing). I also don't know if we can say this is about 10k rows. (@nik9000 may have more precise ideas on this.)

Contributor Author


It would be good to know the exact values here; I took a stab based on what I read, but I'm unsure what the limit is. Also, will they get a specific message?

Contributor


I'll check in with Nik to get better, but still precise, wording here.

Contributor


Also, I don't think they will get a specific error message, probably just a generic circuit breaker exception.

Contributor

@alex-spies alex-spies Mar 5, 2025


Suggestion:

The query will circuit break if there are too many matching documents in the lookup index, or if the documents are too large.
More precisely, `LOOKUP JOIN` works in batches of, normally, about 10,000 rows; a large amount of heap space is needed if the matching documents from the lookup index for a batch are multiple megabytes or larger.
This is roughly the same as for `ENRICH`.

@nik9000 , could you please keep me honest here?

Member


I like Alex's proposal, though I might say "about" instead of "normally". It is a bit fuzzy, but that's what you get when you don't describe pages to users precisely.

One thing that we could add is that larger nodes will allow bigger fetches.

This is fairly temporary - we should switch to a stream of results at some point.

Contributor

@alex-spies alex-spies left a comment


Heya, thanks for the iterations. I think this is shaping up nicely.

@@ -17,11 +17,12 @@ For example, you can use `ENRICH` to:

[`ENRICH`](/reference/query-languages/esql/esql-commands.md#esql-enrich) is similar to [`LOOKUP join`](/reference/query-languages/esql/esql-commands.md#esql-lookup-join) in the fact that they both help you join data together. You should use `ENRICH` when:

* Enrichment data doesn't changes frequently
* Enrichment data doesn't change frequently
* You can accept index-time overhead
* You are working with structured enrichment patterns
* You can accept having multiple matches combined into multi-values
* You can accept being limited to predefined match fields
Contributor


The matching logic of "which lookup/enrich document matches a given input row" is also more lenient for ENRICH compared to LOOKUP JOIN. For instance, for LOOKUP JOIN, multivalued join keys like ["foo", "bar"] don't match anything, whether they occur in the main or the lookup index; for ENRICH, there will be a match if any value of the main index's join key is contained in the lookup index's join key.

I'll gather more precise information for this, we can maybe add this in a follow-up PR.
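As a hypothetical sketch of that difference (index and field names are illustrative):

```esql
ROW host_name = ["foo", "bar"]               // multivalued join key on the input row
| LOOKUP JOIN host_inventory ON host_name    // no match, so the looked-up fields would be null for this row
```

An `ENRICH` with an equivalent policy would match a lookup entry for either "foo" or "bar".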

Contributor

@leemthompo leemthompo left a comment


Thanks for driving this @georgewallace 🙏 , I have a few ideas and suggestions.

Other than my comments, some other thoughts:

* We need `applies_to` tags
  * Makes me wonder whether there are any notable differences between serverless and hosted or self-managed deployments for `LOOKUP JOIN`?
* Might be helpful to add example responses where possible

`LOOKUP JOIN` is useful for any scenario where you need to pull in information from a lookup index to streamline data enrichment and analysis.

**Syntax**

Contributor


I guess we should mention the raw syntax at the top:

FROM <source_index>
| LOOKUP JOIN <lookup_index> ON <field_name>

TBD
The `LOOKUP JOIN` command adds new columns to your {esql} query results table by finding documents in a lookup index that share the same join field value as your result rows.

For each row in your results table that matches a document in the lookup index based on the join field, all fields from the matching document are added as new columns to that row.
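For illustration, a minimal sketch of such a query (index, field, and column names are hypothetical):

```esql
FROM system_metrics
| LOOKUP JOIN host_inventory ON host_name    // host_inventory is the lookup index
| KEEP host_name, cpu_usage, region, owner   // region and owner come from the matching host_inventory document
```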
Contributor


nit: all fields except the join key are added from the lookup index; the join key column is preserved.

Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: Alexander Spies <[email protected]>
thecoop and others added 20 commits March 10, 2025 18:59
This isn't even saving any lines of code and is a measurable
source of both cache and branch-prediction misses on the hot
and critical transport-thread path.
=> inline it
This replaces the usage of a defaultDistribution for javaRestTest with the integTestDistribution.

This has a few advantages:
1. The overall dependencies for running the Java REST tests are minimized.
   By using the default distribution we rely on building the whole default distribution (including all modules and plugins)
   before we can run these tests. This a) takes time and b) dramatically reduces the likelihood of avoiding test task execution at all, as we
   basically declare the whole distro as an input. By using the integTest distro we reduce the surface of the inputs dramatically, which also results in faster
   execution of these tests.
2. It's more idiomatic, as one pattern we see is that we can e.g. disable the security settings that we would not need once we use the integTest distro without
   the security plugin.
3. It makes test setup and its dependencies more explicit.

Point 3 might sound like a downside at first, but for understanding what those tests are doing and what they are relying on, I think it's worth the three extra lines of code.

Here are two build scans task executions:
- before the `javaRestTest` task requires `995 tasks, 2 transforms executed in 155 projects`: https://gradle-enterprise.elastic.co/s/drj5mfzsfx7ra/timeline
- after we only rely on `275 tasks, 2 transforms executed in 56 projects`: https://gradle-enterprise.elastic.co/s/jr5sblhppn4fg/timeline?page=2
Removes testing compatibility switches for versions we no longer test
against.
Currently, we use NamedWriteable for serializing blocks. While 
convenient, it incurs a noticeable performance penalty when pages 
contain thousands of blocks. Since block types are small and already
centered in ElementType, we can safely switch from NamedWriteable to
typed code. For example, the NamedWriteable alone of a small page with
10K fields would be 180KB, whereas the new method reduces it to 10KB.
Below are the serialization improvements with FROM idx | LIMIT 10000
where the target index has 10K fields:

- write_exchange_response executed 173 times took: 73.2ms -> 26.7ms
- read_exchange_response executed 173 times took: 49.4ms -> 25.8ms
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140

* [CI] Auto commit changes from spotless

* iter

* iter

* spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
…tic#124053)

The old lucene versions plugin allows users to read indices created by ancient
Elasticsearch versions, starting from 5.0. Especially for 5.x, which relied on Lucene
6.x, some special logic is required around postings format support. That revolves
around reading of FSTs, but has the consequence of requiring quite a few forks of
other Lucene classes due to their visibility.

This commit attempts to add javadocs to clarify the intent of some of these classes.
It also includes some simplifications, in that Lucene50PostingsReader can be reused as-is
and some other classes are only needed in tests, hence are moved to the test folder.
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
Inter-segment concurrency is disabled whenever sort by field, including script sorting, is used in a search request.

The reason why sort by field does not use concurrency is that there are some performance implications, given that the hit queue in Lucene is built per slice and the different search threads don't share information about the documents they have already visited etc.

The reason why script sort has concurrency disabled is that the script sorting implementation is not thread safe. This commit addresses this concurrency issue and re-enables search concurrency for search requests that use script sorting. In addition, missing tests are added to cover sort scripts that rely on _score being available and top_hits aggregation with a scripted sort clause.
…ntYamlTestSuiteIT test {yaml=data_stream/190_failure_store_redirection/Redirect ingest failure in data stream to failure store} elastic#124518
This PR makes a change to the existing Java REST test DetectionRulesIT.testCondition such that it updates detection rules for a running job. Previously it had relied on closing and re-opening the job for the update to take effect.

Relates elastic/ml-cpp#2821
* ESQL: Lazy collection copying during node transform

A set of optimizations for tree traversal:
1. perform lazy copying during children transform
2. use long hashing to avoid object creation
3. perform type checks before collection checks

Relates elastic#124395
* Introduce IndexReshardingMetadata

This adds to IndexMetadata the persistent state
we will need to track while a split is in progress.
Nothing outside test code sets it yet, so it doesn't
introduce any wire changes yet.

Followups will consult this to make routing decisions and
handle backward compatibility if the object is present
in metadata.
@georgewallace georgewallace requested review from a team as code owners March 11, 2025 01:41
Labels
>docs General docs changes · Team:Docs Meta label for docs team · v9.1.0