lookup join first draft #123719
Conversation
Pinging @elastic/es-docs (Team:Docs)
Force-pushed from 2fc0dfd to f0cdcec
Force-pushed from 981267d to c64bfb9
Thank you @georgewallace , this is shaping up nicely!
I have another round of comments.
In case of name collisions, the newly created columns will override existing columns.
We should also provide examples of this.
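Something along these lines could work as a starting point (index and field names are made up for illustration; it assumes both indices carry a `language_name` column):

```esql
// Hypothetical: both `employees` and `languages_lookup` have a `language_name` column.
// After the join, `language_name` on each matched row holds the value from
// `languages_lookup`, overriding the value that came from `employees`.
FROM employees
| KEEP emp_no, language_code, language_name
| LOOKUP JOIN languages_lookup ON language_code
```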
* `LOOKUP JOIN` can only use a single match field, and can only use a single index. Wildcards, aliases, and data streams are not supported.
* The name of the match field in `LOOKUP JOIN lu_idx ON match_field` must match an existing field in the query. This may require renames or evals to achieve it.
* The query will circuit break if you fetch too much data in a single page. A large heap is needed to manage results of multiple megabytes.
* This limit is per page of data, which is about 10,000 rows.
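For the rename point in the second bullet above, a sketch of the kind of example the docs could include (all names are hypothetical):

```esql
// Hypothetical: the lookup index `languages_lookup` joins on `language_code`,
// but the source index calls that field `lang_id`, so align the names first.
FROM employees
| RENAME lang_id AS language_code
| LOOKUP JOIN languages_lookup ON language_code
```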
I don't know if pages of data are a well-defined concept (at least user-facing). I also don't know if we can say this is about 10k rows. (@nik9000 may have more precise ideas on this.)
It would be good to know the exacts here; I took a stab based on what I read, but I'm unsure what the limit is. Also, will they get a specific message?
I'll check in with Nik to get a better, but still precise wording here.
Also, I don't think they will get a specific error message, probably just a generic circuit breaker exception.
Suggestion:
The query will circuit break if there are too many matching documents in the lookup index, or if the documents are too large.
More precisely, `LOOKUP JOIN` works in batches of, normally, about 10,000 rows; a large amount of heap space is needed if the matching documents from the lookup index for a batch are multiple megabytes or larger.
This is roughly the same as for `ENRICH`.
@nik9000 , could you please keep me honest here?
I like Alex's proposal though I might say "about" instead of "normally". It is a bit fuzzy, but that's what you get when you don't describe pages to users precisely.
One thing that we could add is that larger nodes will allow bigger fetches.
This is fairly temporary - we should switch to a stream of results at some point.
Heya, thanks for the iterations. I think this is shaping up nicely.
@@ -17,11 +17,12 @@ For example, you can use `ENRICH` to:

[`ENRICH`](/reference/query-languages/esql/esql-commands.md#esql-enrich) is similar to [`LOOKUP join`](/reference/query-languages/esql/esql-commands.md#esql-lookup-join) in that they both help you join data together. You should use `ENRICH` when:

- * Enrichment data doesn't changes frequently
+ * Enrichment data doesn't change frequently
* You can accept index-time overhead
* You are working with structured enrichment patterns
* You can accept having multiple matches combined into multi-values
* You can accept being limited to predefined match fields
The matching logic of "which lookup/enrich document matches a given input row" is also more lenient for ENRICH compared to LOOKUP JOIN. For instance, for `LOOKUP JOIN`, multivalued join keys like `["foo", "bar"]` don't match anything if they occur in the main or lookup index; for `ENRICH`, there'll be a match if any value of the main index's join key is contained in the lookup index's join key.
I'll gather more precise information for this, we can maybe add this in a follow-up PR.
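If it helps, here's a sketch of a possible workaround we could show alongside that explanation (index and field names are hypothetical, and it assumes the multivalue behavior described above): expand a multivalued key before the join so each value is matched on its own.

```esql
// Hypothetical: `tags` may hold multiple values like ["foo", "bar"], which
// LOOKUP JOIN would not match directly; MV_EXPAND emits one row per value first.
FROM events
| MV_EXPAND tags
| LOOKUP JOIN tag_metadata ON tags
```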
Thanks for driving this @georgewallace 🙏 , I have a few ideas and suggestions.
Other than my comments, some other thoughts:
- We need applies_to tags
- Makes me wonder whether there are any notable differences between serverless and hosted or self-managed deployments for LOOKUP JOIN?
- Might be helpful to add example responses where possible
`LOOKUP JOIN` is useful for any scenario where you need to pull in information from a lookup index to streamline data enrichment and analysis.

**Syntax**
I guess we should mention the raw syntax at the top:
FROM <source_index>
| LOOKUP JOIN <lookup_index> ON <field_name>
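And maybe a concrete query right under it, e.g. (index and field names are invented for illustration):

```esql
// Hypothetical: join firewall events against a lookup-mode index of known bad IPs.
FROM firewall_logs
| LOOKUP JOIN threat_list ON source.ip
| WHERE threat_level IS NOT NULL
```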
- TBD
+ The `LOOKUP JOIN` command adds new columns to your {esql} query results table by finding documents in a lookup index that share the same join field value as your result rows.

For each row in your results table that matches a document in the lookup index based on the join field, all fields from the matching document are added as new columns to that row.
nit: all fields except the join key are added from the lookup index; the join key column is preserved.
Force-pushed from 67ee17d to fa6c1d3
No description provided.