ES|QL Fork Command #121652

FORK is a foundational building block for supporting multiple subqueries, RRF, and more.

### What is FORK? 

Conceptually, fork is:
1. a bifurcation of the stream, with all data going to each fork branch, followed by
2. a merge of the branches, enhanced with a discriminator column

The name, fork, is loosely inspired by Unix fork and by other stream-processing frameworks, where the concept of forked execution is already familiar. Other names considered and rejected are: union, merge, combine, tee, tpipe. While conceptually similar, these names would likely cause confusion with similar (but different) concepts in other languages, e.g. SQL union.

Example:
```
FROM test
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content
```
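
To make the two conceptual steps concrete, here is a minimal Python sketch of the semantics of the example query. It is a toy model, not how ES|QL executes FORK. The sample rows, the helper names (`fork`, `contains`), the `fork1`/`fork2` branch labels, and the whole-word test standing in for the `content:"fox"` match are all assumptions made for illustration.

```python
from typing import Callable

Row = dict
Branch = Callable[[list[Row]], list[Row]]


def fork(rows: list[Row], branches: list[Branch]) -> list[Row]:
    """Toy model of FORK: every branch sees all rows, and the branch
    outputs are merged with a `_fork` discriminator column."""
    merged = []
    for i, branch in enumerate(branches, start=1):
        # Step 1: bifurcation, each branch receives the full input.
        for row in branch(rows):
            # Step 2: merge, tagging each row with the branch it came from.
            # The "fork<N>" label format is an assumption of this sketch.
            merged.append({"_fork": f"fork{i}", **row})
    return merged


def contains(term: str) -> Branch:
    # Hypothetical stand-in for `WHERE content:"<term>"` (whole-word test).
    return lambda rows: [r for r in rows if term in r["content"].split()]


# Made-up sample data standing in for the `test` index.
test = [
    {"id": 1, "content": "the quick brown fox"},
    {"id": 2, "content": "the lazy dog"},
    {"id": 3, "content": "the fox chased the dog"},
    {"id": 4, "content": "a sleepy cat"},
]

result = fork(test, [contains("fox"), contains("dog")])
result.sort(key=lambda r: r["_fork"])                                      # SORT _fork
result = [{k: r[k] for k in ("_fork", "id", "content")} for r in result]  # KEEP

for row in result:
    print(row)
```

Note that the row with `id` 3 matches both branches, so it appears twice in the merged output, once per branch. The discriminator column is what tells the two copies apart.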

Conceptual data flow:

<img width="1026" alt="Image" src="https://github.com/user-attachments/assets/519d8fb8-f9a3-4d50-8ece-d5883e044a85" />

Actual execution flow:
The planner and the execution engine are free to reorganise things as long as they adhere to the conceptual flow of data.
 
Building upon the previous example, now with a common pre-filter:

```
FROM test
| WHERE id > 1  // common pre-filter
| FORK
    ( WHERE content:"fox" )
    ( WHERE content:"dog" )
| SORT _fork
| KEEP _fork, id, content
```

When the `FORK` is “pushable”, the common pre-filter and the `WHERE` of each fork branch are pushed down together, so that each branch effectively becomes its own subquery.

<img width="749" alt="Image" src="https://github.com/user-attachments/assets/e4d60ae5-77f3-4d70-98ca-fdf0ac959dda" />

When the `FORK` is not pushable, e.g. after a `STATS`, the fork implementation will “fan-out” and merge within the compute engine. That is, the implementation will be more like the initial conceptual diagram above.
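
The two execution strategies can be compared in a small Python sketch. It is only a toy model under stated assumptions: the function names `run_fan_out` and `run_pushed_down`, the sample rows, and the `fork1`/`fork2` labels are hypothetical and do not reflect the actual planner or compute engine code. The point is that both strategies must yield the same rows, which is what lets the planner choose between them.

```python
def run_fan_out(source, pre_filter, branch_filters):
    """Non-pushable shape: evaluate the common pre-filter once, then fan
    the resulting stream out to every branch and merge the outputs."""
    stream = [r for r in source if pre_filter(r)]
    return [
        {"_fork": f"fork{i}", **r}
        for i, branch in enumerate(branch_filters, start=1)
        for r in stream
        if branch(r)
    ]


def run_pushed_down(source, pre_filter, branch_filters):
    """Pushable shape: each branch becomes its own subquery whose filter is
    (common pre-filter AND branch filter), evaluated against the source."""
    out = []
    for i, branch in enumerate(branch_filters, start=1):
        def subquery(r, branch=branch):
            return pre_filter(r) and branch(r)
        out.extend({"_fork": f"fork{i}", **r} for r in source if subquery(r))
    return out


# Made-up sample data standing in for the `test` index.
test = [
    {"id": 1, "content": "the quick brown fox"},
    {"id": 2, "content": "the lazy dog"},
    {"id": 3, "content": "the fox chased the dog"},
]

pre_filter = lambda r: r["id"] > 1                    # WHERE id > 1
branches = [
    lambda r: "fox" in r["content"].split(),          # stand-in for content:"fox"
    lambda r: "dog" in r["content"].split(),          # stand-in for content:"dog"
]

fan_out = run_fan_out(test, pre_filter, branches)
pushed = run_pushed_down(test, pre_filter, branches)
print(fan_out == pushed)
```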


### Initial Restrictions

A number of initial restrictions have been put in place in order to make progress and unblock other development efforts dependent on Fork, e.g. RRF.

The restrictions are:
1. First-level data retrieval only - not yet general-purpose bifurcation of the stream. This allows us to support multiple different subqueries. To support general bifurcation of the stream, the planner will have to determine that the fork is actually being performed in second-stage retrieval. This is a pragmatic limitation that we can lift later.
2. All branches of the fork must return the same data schema (same columns). For this reason, only `WHERE`, `SORT`, and `LIMIT` are supported within fork subqueries. This is a pragmatic limitation that we can lift later.
3. No fork within a fork. This is a pragmatic limitation that we can lift later.
4. Lucene queries are independent - no point-in-time. We can add this later.
5. Fork branches are automatically named. We can provide the ability to name the branches later.

### Development outline and evolution

We *will* lift all the restrictions as outlined above, but not all at once and not necessarily in the outlined order.  

Since FORK is a significant feature, its development will be broken down into several smaller PRs and issues. This section is intended to capture the current state and future plans as we progress towards a complete implementation. As such, consider this section "live": as new PRs and issues are filed, they can be linked here.

 - [x] https://github.com/elastic/elasticsearch/issues/121950
 - [x] https://github.com/elastic/elasticsearch/issues/126553
 - [ ] Use union types to resolve schema conflicts between fork branches when possible
 - [ ] Nested Fork commands
 - [ ] Support FORK after FORK

### Bugs:
These are all the known bugs with FORK:

- [ ] https://github.com/elastic/elasticsearch/issues/130072
- [ ] FORK can add an extra warning header when implicit date nanos conversion fails for some rows (when the values indicate a pre-1970 time). This will likely be fixed when we improve the field caps resolution. This caused the following test failures:
  -  https://github.com/elastic/elasticsearch/issues/129229
  -  https://github.com/elastic/elasticsearch/issues/129228
  -  https://github.com/elastic/elasticsearch/issues/129210
- [ ] Queries that use the implicit date nanos conversion might return extra columns. Note that this only happens when using union types with date nanos/datetime. The date nanos/datetime conversion is under snapshot, which makes this less urgent to fix. (see https://github.com/elastic/elasticsearch/pull/130026)
The following query returns an extra `$$nanos$converted_to$date_nanos` column:
```
FROM date_nanos, date_nanos_union_types
| EVAL a = nanos::date_nanos::datetime
| WHERE millis < "2023-10-23T13:00:00"
| LIMIT 1
| FORK (where true) (where true)
```

Follow ups from https://github.com/elastic/elasticsearch/issues/121950:

- [ ] improve field resolution https://github.com/elastic/elasticsearch/issues/127208
- [ ] FORK and unmapped fields with INSIST (INSIST is still WIP so this is not urgent for now)
- [ ] fix CCS support https://github.com/elastic/elasticsearch/pull/127309 https://github.com/elastic/elasticsearch/pull/128310
- [ ] Guides/more examples on when to use FORK



