ES|QL Fork Command #121652

Open
Enhancement
@ChrisHegarty

Description

Fork is a foundational building block for supporting multiple subqueries, RRF (Reciprocal Rank Fusion), and more.

What is FORK?

Conceptually, fork is:

  1. a bifurcation of the stream, with all data going to each fork branch, followed by
  2. a merge of the branches, enhanced with a discriminator column

The name fork is loosely inspired by Unix fork and by stream-processing frameworks, where the concept of forked execution is familiar. Other names considered and rejected: union, merge, combine, tee, tpipe. Although conceptually similar, those names would likely be confused with similar (but different) concepts in other languages, e.g. SQL UNION.

Example:

FROM test
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content
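
To make the two conceptual steps concrete, here is a minimal Python model of the example above. It is only a sketch of the semantics, not how Elasticsearch implements FORK. The helper names (`fork`, `where_content_matches`), the sample rows, and the branch names `fork1`, `fork2` are assumptions made for illustration, and the full-text match `content:"fox"` is approximated by a whole-word test.

```python
from typing import Any, Callable, Iterable

Row = dict[str, Any]
Branch = Callable[[list[Row]], list[Row]]


def fork(rows: Iterable[Row], *branches: Branch) -> list[Row]:
    """Conceptual FORK: every branch sees all input rows; the branch outputs
    are merged and tagged with a `_fork` discriminator column."""
    source = list(rows)  # materialise once so each branch gets the full input
    merged: list[Row] = []
    for index, branch in enumerate(branches, start=1):
        # Branch names "fork1", "fork2", ... are an assumption of this sketch.
        for row in branch(list(source)):
            merged.append({"_fork": f"fork{index}", **row})
    return merged


def where_content_matches(term: str) -> Branch:
    # Crude stand-in for the full-text match `content:"term"`.
    return lambda rows: [r for r in rows if term in str(r["content"]).split()]


# Hypothetical contents of the `test` index.
test_index = [
    {"id": 1, "content": "the quick brown fox"},
    {"id": 2, "content": "the lazy dog"},
    {"id": 3, "content": "fox and dog"},
]

result = fork(test_index, where_content_matches("fox"), where_content_matches("dog"))
result.sort(key=lambda r: r["_fork"])  # SORT _fork
result = [{k: r[k] for k in ("_fork", "id", "content")} for r in result]  # KEEP
```

Note that the row with `id` 3 matches both branches, so it appears twice in the merged output, once per `_fork` value. The discriminator column is what lets a downstream command tell the two copies apart.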

Conceptual data flow:

[Image: conceptual data flow diagram]

Actual execution flow:
The planner and execution engine are free to reorganise things as long as they adhere to the conceptual flow of data.

Building upon the previous example, now with a common pre-filter:

FROM test
| WHERE id > 1  // common pre-filter
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content

When the FORK is “pushable”, the common pre-filter and the WHERE of each fork branch are pushed down together, so that each branch becomes an effective subquery.

[Image: execution flow with the FORK pushed down]

When the FORK is not pushable, e.g. after a STATS, the fork implementation will “fan out” and merge within the compute engine. That is, the implementation will look more like the initial conceptual diagram above.
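
The two execution shapes can be compared with a small Python model. This is a sketch under stated assumptions, not the planner's actual logic: `fan_out`, `pushed_down`, the sample rows, and the `fork1`/`fork2` branch names are all hypothetical. It illustrates that AND-ing the common pre-filter into each branch yields the same rows as filtering first and then fanning out.

```python
from typing import Any, Callable

Row = dict[str, Any]
Predicate = Callable[[Row], bool]


def fan_out(rows: list[Row], branches: list[Predicate]) -> list[Row]:
    """Non-pushable shape: the input rows already exist in the compute
    engine; each branch filters the same rows and the results are merged."""
    return [
        {"_fork": f"fork{i}", **row}
        for i, keep in enumerate(branches, start=1)
        for row in rows
        if keep(row)
    ]


def pushed_down(index: list[Row], pre_filter: Predicate,
                branches: list[Predicate]) -> list[Row]:
    """Pushable shape: each branch becomes its own subquery over the index,
    with the common pre-filter AND-ed into the branch filter."""
    merged: list[Row] = []
    for i, keep in enumerate(branches, start=1):
        subquery = lambda row, keep=keep: pre_filter(row) and keep(row)
        merged.extend({"_fork": f"fork{i}", **row} for row in index if subquery(row))
    return merged


# Hypothetical contents of the `test` index.
index = [
    {"id": 1, "content": "the quick brown fox"},
    {"id": 2, "content": "the lazy dog"},
    {"id": 3, "content": "fox and dog"},
]
pre_filter = lambda row: row["id"] > 1                 # WHERE id > 1
branches = [
    lambda row: "fox" in row["content"].split(),       # WHERE content:"fox"
    lambda row: "dog" in row["content"].split(),       # WHERE content:"dog"
]

conceptual = fan_out([row for row in index if pre_filter(row)], branches)
pushed = pushed_down(index, pre_filter, branches)
```

Both `conceptual` and `pushed` hold the same rows in the same order, which is the sense in which the planner may reorganise execution while adhering to the conceptual data flow.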

Initial restrictions

A number of initial restrictions have been put in place in order to make progress and unblock other development efforts dependent on Fork, e.g. RRF.

The restrictions are:

  1. First-level data retrieval only; not yet general-purpose bifurcation of the stream. This is enough to support multiple different subqueries. To support general bifurcation, the planner will have to determine that the fork is actually being performed in second-stage retrieval. This is a pragmatic limitation that we can lift later.
  2. All branches of the fork must return the same data schema (same columns). For this reason, only WHERE, SORT, and LIMIT are supported within fork subqueries. This is a pragmatic limitation that we can lift later.
  3. No fork within a fork. This is a pragmatic limitation that we can lift later.
  4. Lucene queries are independent; there is no point-in-time. We can add this later.
  5. Fork branches are automatically named. We can provide the ability to name the branches later.
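
As an illustration of restrictions 2 and 3, the following Python sketch checks a list of fork branches against the allowed commands. It is a hypothetical checker, not the Elasticsearch analyzer: the function name `check_fork_branches`, the representation of a branch as a list of command strings, and the wording of the messages are all assumptions.

```python
ALLOWED_IN_BRANCH = {"WHERE", "SORT", "LIMIT"}


def check_fork_branches(branches: list[list[str]]) -> list[str]:
    """Return human-readable violations of the initial FORK restrictions.

    Each branch is given as the list of commands it contains, e.g.
    ['WHERE content:"fox"', 'LIMIT 10'].
    """
    problems: list[str] = []
    for number, commands in enumerate(branches, start=1):
        for command in commands:
            if not command.strip():
                continue  # ignore empty entries
            keyword = command.split(maxsplit=1)[0].upper()
            if keyword == "FORK":
                problems.append(f"branch {number}: no fork within a fork")
            elif keyword not in ALLOWED_IN_BRANCH:
                problems.append(
                    f"branch {number}: {keyword} is not supported in a fork branch"
                )
    return problems


ok = check_fork_branches([
    ['WHERE content:"fox"'],
    ['WHERE content:"dog"', "SORT id", "LIMIT 5"],
])
bad = check_fork_branches([
    ["EVAL x = 1"],
    ["fork (where true)"],
])
```

Restricting branches to WHERE, SORT, and LIMIT is what guarantees the same-columns rule here, since none of those commands adds or removes columns.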

Development outline and evolution

We will lift all the restrictions as outlined above, but not all at once and not necessarily in the outlined order.

Since FORK is a significant feature, its development will be broken down into several smaller PRs and issues. This section is intended to capture the current state and future plans as we progress towards a complete implementation. Consider this section "live": as new PRs and issues are filed, they can be linked here.

Bugs:

These are all the known bugs with FORK:

FROM date_nanos, date_nanos_union_types
| EVAL a = nanos::date_nanos::datetime
| WHERE millis < "2023-10-23T13:00:00"
| LIMIT 1
| FORK (where true) (where true)

Follow ups from #121950:

Metadata

Labels

:Search Relevance/Search, Meta, Team:Search Relevance, priority:high, v9.2.0
