lookup join first draft #123719

Closed · wants to merge 231 commits

Conversation

georgewallace
Contributor

No description provided.

@georgewallace georgewallace added >docs General docs changes Team:Docs Meta label for docs team labels Feb 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

Contributor

@alex-spies alex-spies left a comment


Thank you @georgewallace , this is shaping up nicely!

I have another round of comments.

In case of name collisions, the newly created columns will override existing columns.
Contributor


We should also provide examples of this.
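For illustration, a minimal sketch of the collision behaviour (index and field names are hypothetical):

```esql
FROM app_logs                              // app_logs already has an `env` column
| LOOKUP JOIN host_inventory ON host_name  // host_inventory documents also carry an `env` field
| KEEP host_name, env                      // for matching rows, `env` now holds the value from host_inventory
```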

* `LOOKUP JOIN` can only use a single match field, and can only use a single index. Wildcards, aliases, and data streams are not supported.
* The name of the match field in `LOOKUP JOIN lu_idx ON match_field` must match an existing field in the query. This may require renames or evals to achieve it, as sketched below.
* The query will circuit break if you fetch too much data in a single page. A large heap is needed to manage results of multiple megabytes.
* This limit is per page of data, which is about 10,000 rows.
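For the second point, a minimal sketch (index and field names are hypothetical) of aligning the join key via a rename:

```esql
FROM employees
| RENAME languages AS language_code             // align the column name with the lookup index's field
| LOOKUP JOIN languages_lookup ON language_code
```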
Contributor


I don't know if pages of data are a well-defined concept (at least user-facing). I also don't know if we can say this is about 10k rows. (@nik9000 may have more precise ideas on this.)

Contributor Author


It would be good to know the exact values here; I took a stab based on what I read, but I'm unsure what the limit is. Also, will they get a specific message?

Contributor


I'll check in with Nik to get better, but still precise, wording here.

Contributor


Also, I don't think they will get a specific error message, probably just a generic circuit breaker exception.

Contributor

@alex-spies alex-spies Mar 5, 2025


Suggestion:

The query will circuit break if there are too many matching documents in the lookup index, or if the documents are too large.
More precisely, `LOOKUP JOIN` works in batches of, normally, about 10,000 rows; a large amount of heap space is needed if the matching documents from the lookup index for a batch are multiple megabytes or larger.
This is roughly the same as for `ENRICH`.

@nik9000 , could you please keep me honest here?

Member


I like Alex's proposal, though I might say "about" instead of "normally". It is a bit fuzzy, but that's what you get when you don't describe pages to users precisely.

One thing that we could add is that larger nodes will allow bigger fetches.

This is fairly temporary - we should switch to a stream of results at some point.

Contributor

@alex-spies alex-spies left a comment


Heya, thanks for the iterations. I think this is shaping up nicely.

@@ -17,11 +17,12 @@ For example, you can use `ENRICH` to:

[`ENRICH`](/reference/query-languages/esql/esql-commands.md#esql-enrich) is similar to [`LOOKUP join`](/reference/query-languages/esql/esql-commands.md#esql-lookup-join) in the fact that they both help you join data together. You should use `ENRICH` when:

* Enrichment data doesn't changes frequently
* Enrichment data doesn't change frequently
* You can accept index-time overhead
* You are working with structured enrichment patterns
* You can accept having multiple matches combined into multi-values
* You can accept being limited to predefined match fields
Contributor


The matching logic of "which lookup/enrich document matches a given input row" is also more lenient for ENRICH compared to LOOKUP JOIN. For instance, for LOOKUP JOIN, multivalued join keys like ["foo", "bar"] don't match anything, whether they occur in the main or the lookup index; for ENRICH, there will be a match if any value of the main index's join key is contained in the lookup index's join key.

I'll gather more precise information for this, we can maybe add this in a follow-up PR.
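As a hypothetical sketch of that difference (index and field names are illustrative):

```esql
ROW host_name = ["foo", "bar"]               // multivalued join key on the input row
| LOOKUP JOIN host_inventory ON host_name    // no match, so the looked-up fields would be null for this row
```

An `ENRICH` with an equivalent policy would match a lookup entry for either "foo" or "bar".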

Contributor

@leemthompo leemthompo left a comment


Thanks for driving this @georgewallace 🙏 , I have a few ideas and suggestions.

Other than my comments, some other thoughts:

* We need `applies_to` tags
  * Makes me wonder whether there are any notable differences between serverless and hosted or self-managed deployments for `LOOKUP JOIN`?
* Might be helpful to add example responses where possible

`LOOKUP JOIN` is useful for any scenario where you need to pull in information from a lookup index to streamline data enrichment and analysis.

**Syntax**

Contributor


I guess we should mention the raw syntax at the top:

FROM <source_index>
| LOOKUP JOIN <lookup_index> ON <field_name>

TBD
The `LOOKUP JOIN` command adds new columns to your {esql} query results table by finding documents in a lookup index that share the same join field value as your result rows.

For each row in your results table that matches a document in the lookup index based on the join field, all fields from the matching document are added as new columns to that row.
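For illustration, a minimal sketch of such a query (index, field, and column names are hypothetical):

```esql
FROM system_metrics
| LOOKUP JOIN host_inventory ON host_name    // host_inventory is the lookup index
| KEEP host_name, cpu_usage, region, owner   // region and owner come from the matching host_inventory document
```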
Contributor


nit: all fields except the join key are added from the lookup index; the join key column is preserved.

Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: Alexander Spies <[email protected]>
thecoop and others added 20 commits March 10, 2025 18:59
This isn't even saving any lines of code and is a measurable
source of both cache and branch-prediction misses on the hot
and critical transport-thread path.
=> inline it
This replaces the usage of a defaultDistribution for javaRestTest with the integTestDistribution.

This has a few advantages:
1. The overall dependencies for running the Java REST tests are minimized.
   By using the default distribution we rely on building the whole default distribution (including all modules and plugins)
   before we can run these tests. This a) takes time and b) dramatically reduces the likelihood of avoiding test task execution at all, as we
   basically declare the whole distro as an input. By using the integTest distro we reduce the surface of the inputs dramatically, which also results in faster
   execution of these tests.
2. It's more idiomatic, as one pattern we see is that we can e.g. disable the security settings that we would not need once we use the integTest distro without
   the security plugin.
3. It makes test setup and its dependencies more explicit.

Point 3 might sound like a downside at first, but for understanding what those tests are doing and what they are relying on, I think it's worth the three extra lines of code.

Here are two build scans task executions:
- before the `javaRestTest` task requires `995 tasks, 2 transforms executed in 155 projects`: https://gradle-enterprise.elastic.co/s/drj5mfzsfx7ra/timeline
- after we only rely on `275 tasks, 2 transforms executed in 56 projects`: https://gradle-enterprise.elastic.co/s/jr5sblhppn4fg/timeline?page=2
Removes testing compatibility switches for versions we no longer test
against.
Currently, we use NamedWriteable for serializing blocks. While 
convenient, it incurs a noticeable performance penalty when pages 
contain thousands of blocks. Since block types are small and already
centered in ElementType, we can safely switch from NamedWriteable to
typed code. For example, the NamedWriteable alone of a small page with
10K fields would be 180KB, whereas the new method reduces it to 10KB.
Below are the serialization improvements with FROM idx | LIMIT 10000
where the target index has 10K fields:

- write_exchange_response executed 173 times took: 73.2ms -> 26.7ms
- read_exchange_response executed 173 times took: 49.4ms -> 25.8ms
The test that is verifying timeout handling when pulling a scorer supplier is
going to always score the entire segment. The test needs to be adjusted
accordingly. While at it, I added docs and clarified the tests a bit, as well
as added a couple new tests that cover for timeout handling when retrieving a
scorer from the scorer supplier.

Closes elastic#124140

* [CI] Auto commit changes from spotless

* iter

* iter

* spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
…tic#124053)

The old lucene versions plugin allows users to read indices created by ancient
Elasticsearch versions, starting from 5.0. Especially for 5.x, which relied on Lucene
6.x, some special logic is required around postings format support. That revolves
around reading of FSTs, but has the consequence of requiring quite a few forks of
other Lucene classes due to their visibility.

This commit attempts to add javadocs to clarify the intent of some of these classes.
It also includes some simplifications, in that Lucene50PostingsReader can be reused as-is
and some other classes are only needed in tests, hence are moved to the test folder.
This was added as a result of merging elastic#124327, via a bad merge. Fix that with this commit.
Inter-segment concurrency is disabled whenever sort by field, including script sorting, is used in a search request.

The reason why sort by field does not use concurrency is that there are some performance implications, given that the hit queue in Lucene is built per slice and the different search threads don't share information about the documents they have already visited etc.

The reason why script sort has concurrency disabled is that the script sorting implementation is not thread safe. This commit addresses this concurrency issue and re-enables search concurrency for search requests that use script sorting. In addition, missing tests are added to cover sort scripts that rely on _score being available and top_hits aggregation with a scripted sort clause.
…ntYamlTestSuiteIT test {yaml=data_stream/190_failure_store_redirection/Redirect ingest failure in data stream to failure store} elastic#124518
This PR makes a change to the existing Java REST test DetectionRulesIT.testCondition such that it updates detection rules for a running job. Previously it had relied on closing and re-opening the job for the update to take effect.

Relates elastic/ml-cpp#2821
* ESQL: Lazy collection copying during node transform

A set of optimizations for tree traversal:
1. perform lazy copying during children transform
2. use long hashing to avoid object creation
3. perform type checks before collection checks

Relates elastic#124395
* Introduce IndexReshardingMetadata

This adds to IndexMetadata the persistent state
we will need to track while a split is in progress.
Nothing outside test code sets it yet, so it doesn't
introduce any wire changes yet.

Followups will consult this to make routing decisions and
handle backward compatibility if the object is present
in metadata.
@georgewallace georgewallace requested review from a team as code owners March 11, 2025 01:41
Labels
>docs General docs changes · Team:Docs Meta label for docs team · v9.1.0