
Stream result pages from sub plans for FORK #126705


Merged: 10 commits merged into elastic:main on Apr 16, 2025

Conversation

@ioanatia ioanatia (Contributor) commented Apr 11, 2025

tracked in #126389

We move away from the INLINESTATS execution model for FORK: instead, when FORK is present, we break the physical plan into sub plans and a main coordinator plan at the ComputeService level.
The sub plans are further broken down into an (optional) data node plan and coordinator plan.

To funnel pages between the sub plans and the main coordinator plan we use ExchangeSink/ExchangeSource.
Take as an example the following query:

FROM test
| FORK
    ( WHERE content:"fox" ) // sub plan 1
    ( WHERE content:"dog" ) // sub plan 2
| SORT _fork
| KEEP _fork, id, content

The execution will be split into the following plans:

[Screenshot 2025-04-07 at 11:26: diagram of the resulting plan split]
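
Roughly, the split looks like this (a sketch based on the description above, not the exact planner output; node names are descriptive rather than the actual physical plan classes):

main coordinator plan:
    SORT _fork | KEEP _fork, id, content
    reads its input pages from an exchange source fed by the sub plans

sub plan 1 (data node plan + coordinator plan):
    FROM test | WHERE content:"fox"   -> writes its result pages to an exchange sink

sub plan 2 (data node plan + coordinator plan):
    FROM test | WHERE content:"dog"   -> writes its result pages to an exchange sink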

I also removed the MergeOperator since it was no longer needed.
Previously the Fork logical plan would be translated to MergeExec which would be planned using MergeOperator.
MergeOperator would simply funnel the pages from the sub plans.

@ioanatia ioanatia added the Team:Search Relevance, :Search Relevance/Search and v9.1.0 labels on Apr 11, 2025
@ChrisHegarty ChrisHegarty (Contributor) left a comment

This worked out very clean. If there is any possibility to refactor common code snippets into smaller functions, that would be good to try, but otherwise I think it is in great shape.

*/
FORK_V2(Build.current().isSnapshot()),
FORK_V3(Build.current().isSnapshot()),
Contributor

It seems obvious now, but I like how this capability can be effectively incremented (without affecting previous releases or excessively polluting). This could be a good pattern to socialise.

Contributor Author (@ioanatia)

I noticed this approach with lookup join and it made sense to reuse here.

@@ -67,6 +69,25 @@

public class PlannerUtils {

public static Tuple<List<PhysicalPlan>, PhysicalPlan> breakPlanIntoSubPlansAndMainPlan(PhysicalPlan plan) {
Contributor
Can you please add a short javadoc description.
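
Something along these lines would do (suggested wording only, based on the PR description, not the final Javadoc):

/**
 * Breaks a physical plan that contains FORK into its sub plans (one per FORK branch)
 * and the remaining main coordinator plan, and returns the sub plans together with
 * the main plan.
 */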

@ioanatia ioanatia requested review from dnhatn and nik9000 April 15, 2025 11:11
@ioanatia ioanatia marked this pull request as ready for review April 15, 2025 11:11
@nik9000 nik9000 (Member) left a comment

Nice. I like it. Probably worth having @costin have a look at too.

try {
    return page.projectBlocks(projections);
} finally {
    page.releaseBlocks();
Member

Related?

@ioanatia ioanatia (Contributor Author) Apr 15, 2025

Yes - when we changed the implementation of FORK, the RRF tests started failing.
In the tests RRF always follows FORK.
Example of a failure:

    java.lang.AssertionError: circuit breakers not reset to 0
    Expected a map containing
    estimated_size_in_bytes: expected <0> but was <8576>
             estimated_size: expected "0b" but was "8.3kb"
                   overhead: <1.0> unexpected but ok
        limit_size_in_bytes: <322122547> unexpected but ok
                 limit_size: "307.1mb" unexpected but ok
                    tripped: <0> unexpected but ok
        at org.elasticsearch.test.MapMatcher.assertMap(MapMatcher.java:85)
        at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.lambda$assertRequestBreakerEmpty$0(EsqlSpecTestCase.java:429)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1519)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1491)
        at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.assertRequestBreakerEmpty(EsqlSpecTestCase.java:421)
        at org.elasticsearch.xpack.esql.qa.rest.EsqlSpecTestCase.assertRequestBreakerEmptyAfterTests(EsqlSpecTestCase.java:417)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)

Previously, the pages processed by the RrfScoreEvalOperator came from the pages stored in LocalSourceExec, which represented the sub plan results we set in EsqlSession.
Now the pages come from an ExchangeSource.
Once I made this change to release the blocks, the tests started passing again.
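
For context, the change is just the shape shown in the hunk above, spelled out here with the reasoning as comments (a sketch; the surrounding operator code is elided):

// Project the blocks we keep into a new page, then release the original page's
// blocks so the request circuit breaker accounting drops back to zero.
try {
    return page.projectBlocks(projections);
} finally {
    page.releaseBlocks();
}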

@dnhatn dnhatn (Member) commented Apr 15, 2025

I wonder if we should break into sub-plans on data nodes instead to avoid these two issues:

  1. Reader contexts: Currently, sub-plans can be executed with different reader contexts, which might lead to inconsistent results between forks.

  2. Overhead: Splitting sub-plans on the coordinator and executing each sub-plan on every cluster separately, including remote clusters, can result in significant overhead.

@ioanatia ioanatia (Contributor Author) commented Apr 15, 2025

@dnhatn

I wonder if we should break into sub-plans on data nodes instead to avoid these two issues:

One reason why it was done this way was because each fork branch would actually need to be split into a data node plan and a coordinator plan.

For example, the following branches are quite different, but both need a data node plan + a coordinator plan:

FROM employees METADATA _score
| WHERE match(first_name, "John")
| FORK
       (SORT _score DESC | LIMIT 10) // keep just the 10 top hits
       (STATS total = COUNT(*)) // return the total

We also add a default LIMIT to each FORK branch, which I guess means that each FORK branch would automatically have a coordinator plan that needs to cap the number of results to the given LIMIT.
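
Concretely, for the two branches above, in the usual two-phase style (a rough sketch, not the actual planner output):

branch 1 (SORT _score DESC | LIMIT 10):
    data node plan:    compute a per-shard top 10 by _score
    coordinator plan:  merge the per-node top 10s and apply the final LIMIT 10

branch 2 (STATS total = COUNT(*)):
    data node plan:    compute partial counts per node
    coordinator plan:  combine the partial counts into the final total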

Reader contexts: Currently, sub-plans can be executed with different reader contexts, which might lead to inconsistent results between forks.

If we create and use a point in time, would this solve the inconsistency issue?
Retrievers in _search are somehow similar, they issue multiple queries to the shards and use PIT.
We have PIT as a requirement for FORK, but not necessarily for tech preview.

@nik9000 nik9000 (Member) commented Apr 15, 2025

I wonder if we should break into sub-plans on data nodes instead to avoid these two issues:

One reason why it was done this way was because each fork branch would actually need to be split into a data node plan and a coordinator plan.

Right. I imagined we were doing it this way because we want to support, eventually, stuff like

FORK
  (FROM a | WHERE foo == 1)
  (FROM b | WHERE bar == 1)
| STATS whatever

The real robot legs of it all.

If we create and use a point in time, would this solve the inconsistency issue?
Retrievers in _search are somehow similar, they issue multiple queries to the shards and use PIT.
We have PIT as a requirement for FORK, but not necessarily for tech preview.

PIT is sort of the public facing version of what _search does. Either way it amounts to using the same IndexSearcher for both operations. _search needs it to be able to fetch at all.

There's a real tension between getting a consistent view and analytic style queries that hit a zillion indices. If you are hitting 09832104983214 indices we just don't have the file handles to keep all the readers open all the time. Right now we batch it to a handful of shard copies per node. Getting a consistent view across both "legs" is going to require.... something.

This probably matters a lot for you too - doing a real fetch phase, like, as a second pass from the coordinator node the way _search does, means you need to leave the IndexSearchers open. You really want to close the ones that don't return any documents. And close the ones that don't have documents in the top n.

Maybe the best thing to do is to break the plans apart like this on the coordinator but send both plans down at once to the data node and start them using the same IndexSearchers. That feels like it wouldn't be that complex.

I'm unsure if now is the right time for it. A plan like the one you've got is the best we're going to get when the indices are separate.

@nik9000 nik9000 (Member) left a comment

I'd love to see a new test in single_node.RestEsqlIT that checks on the output of the profile. It'd be super duper nice if each of the forked drivers could identify themselves in the profile output.

Ideally we'd also have a test for the tasks output of this too. Like in EsqlActionTaskIT - but it's quite a bit more painful because you have to block execution. On the other hand, if we're going to fork stuff it'd be super mega useful to have trustworthy debug information from tasks and profile.

@nik9000 nik9000 (Member) commented Apr 15, 2025

I'd love to see a new test in single_node.RestEsqlIT that checks on the output of the profile. It'd be super duper nice if each of the forked drivers could identify themselves in the profile output.

Ideally we'd also have a test for the tasks output of this too. Like in EsqlActionTaskIT - but it's quite a bit more painful because you have to block execution. On the other hand, if we're going to fork stuff it'd be super mega useful to have trustworthy debug information from tasks and profile.

I'm aware this is several more days of work, but I think it's pretty important.

@dnhatn dnhatn (Member) commented Apr 15, 2025

Maybe the best thing to do is to break the plans apart like this on the coordinator but send both plans down at once to the data node and start them using the same IndexSearchers. That feels like it wouldn't be that complex.

++ I think we can make a fragment contain multiple sub-plans and execute them using the same shard copies.

@dnhatn dnhatn (Member) commented Apr 16, 2025

An alternative is to pass a list of plans to DataNodeRequest and ClusterComputeRequest.

@dnhatn dnhatn (Member) left a comment

@ioanatia Since I am working on something similar for time-series, you can merge this as is, and I will make the changes to dispatch multiple plans in a single data-node and cluster request.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

);

exchangeService.addExchangeSourceHandler(mainSessionId, mainExchangeSource);
try (var ignored = mainExchangeSource.addEmptySink()) {
@ioanatia ioanatia (Contributor Author) Apr 16, 2025

I had to make one last change to get this working.

We had a few failures in CI that I could not replicate locally, which suggested the exchange source was closing too early / not receiving all pages.

A failure example:

java.lang.AssertionError: Expected more data but no more entries found after [2]
...
Actual:
emp_no:integer | first_name:keyword | _fork:keyword
10009          | Sumant             | fork1
10048          | Florian            | fork1

Expected:
emp_no:integer | first_name:keyword | _fork:keyword
10002          | Bezalel            | fork2
10009          | Sumant             | fork1
10014          | Berni              | fork2
10048          | Florian            | fork1
10058          | Berhard            | fork2
10060          | Breannda           | fork2
10094          | Arumugam           | fork2

This meant the results for one fork branch did not make it into the exchange source.
To fix it, I looked at the existing pattern in ComputeService where we add an empty sink on the exchange source. Luckily the Javadoc explains why we need an empty sink:

/**
* Links this exchange source with an empty/dummy remote sink. The purpose of this is to prevent this exchange source from finishing
* until we have performed other async actions, such as linking actual remote sinks.
*
* @return a Releasable that should be called when the caller no longer needs to prevent the exchange source from completing.
*/
public Releasable addEmptySink() {
    outstandingSinks.trackNewInstance();
    return outstandingSinks::finishInstance;
}

So all I did was wrap everything in a try (var ignored = mainExchangeSource.addEmptySink()) {} and also make sure that we execute the main plan first, followed by the sub plans.
This is the exact pattern we have in ComputeService#executePlan.
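
For reference, the resulting shape is roughly the following (a sketch only; startMainPlan/startSubPlan are hypothetical stand-ins for the actual ComputeService code, while addExchangeSourceHandler and addEmptySink are the calls shown above):

exchangeService.addExchangeSourceHandler(mainSessionId, mainExchangeSource);
// The empty sink keeps mainExchangeSource from finishing before the sub plan
// sinks have been linked.
try (var ignored = mainExchangeSource.addEmptySink()) {
    startMainPlan(mainSessionId, mainPlan);          // hypothetical helper
    for (PhysicalPlan subPlan : subPlans) {
        // each sub plan links an exchange sink that feeds pages into mainExchangeSource
        startSubPlan(subPlan, mainExchangeSource);   // hypothetical helper
    }
}   // releasing the empty sink lets the source finish once the real sinks complete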

@ioanatia ioanatia (Contributor Author)

I'd love to see a new test in single_node.RestEsqlIT that checks on the output of the profile. It'd be super duper nice if each of the forked drivers could identify themselves in the profile output.

@nik I agree it's super important - we don't have this at the moment - just checked the output and it's not what we want.
This is now added as a follow up in: #121950

Which means we commit to having proper profile information as a requirement for a tech preview release - we don't release without it.

@ChrisHegarty ChrisHegarty (Contributor) left a comment

This still LGTM.

@ioanatia ioanatia merged commit 5a6509a into elastic:main Apr 16, 2025
17 checks passed
@ioanatia ioanatia deleted the fork_streaming branch April 16, 2025 12:57
Labels
>non-issue, :Search Relevance/Search, Team:Search Relevance, v9.1.0