Skip to content

ESQL: Have CombineProjections propagate references upwards #127264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 24, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,8 @@ protected NodeInfo<Alias> info() {
}

public Alias replaceChild(Expression child) {
return new Alias(source(), name(), child, id(), synthetic());
// these "nop" replacements can occur on attribute map resolutions having the default return same as the lookup key
return child == this.child ? this : new Alias(source(), name(), child, id(), synthetic());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an optimization, unrelated to this PR, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. It's supposed to avoid creating objects while planning, if not needed. The change is small and not sure if a stand-alone PR would make sense, but happy to revert and split, if needed.

}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
import org.elasticsearch.xpack.esql.core.expression.Expression;
import org.elasticsearch.xpack.esql.core.expression.Expressions;
import org.elasticsearch.xpack.esql.core.expression.NamedExpression;
import org.elasticsearch.xpack.esql.core.expression.ReferenceAttribute;
import org.elasticsearch.xpack.esql.expression.function.grouping.GroupingFunction;
import org.elasticsearch.xpack.esql.plan.logical.Aggregate;
import org.elasticsearch.xpack.esql.plan.logical.LogicalPlan;
Expand Down Expand Up @@ -126,6 +127,8 @@ private static List<NamedExpression> combineProjections(List<? extends NamedExpr
if (ne instanceof Alias as) {
Expression child = as.child();
namedExpressionsBuilder.put(ne.toAttribute(), as.replaceChild(aliasesBuilder.build().resolve(child, child)));
} else if (ne instanceof ReferenceAttribute ra) {
namedExpressionsBuilder.put(ra, ra);
Comment on lines +130 to +131
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This captures aliases but now also References - why do we care about them and what about other types of named expression (e.g. FieldAttribute)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this change doesn't affect queries and their computation, it just opportunistically replaces e.g. emp_no{f} by emp_no{r} (with same name id) in some projections if there's something like

FROM employees | KEEP emp_no

but emp_no is missing and ReplaceMissingFieldsWithNull injects an EVAL emp_no = null after the FROM.

I don't think this makes a difference, really, as the attribute type mismatch between emp_no{f} and emp_no{r} remains in most other commands in the query (but this is, currently, just a "cosmetic" issue in the planning).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just opportunistically replaces e.g. emp_no{f} by emp_no{r} (with same name id) in some projections if there's something like

It's not just then. There'll always be two projections to combine in case of a locally missing field, due to the effects of ProjectAwayColumns and ReplaceMissingFieldWithNull both creating Project nodes.

I don't think this makes a difference, really, as the attribute type mismatch between emp_no{f} and emp_no{r} remains in most other commands in the query (but this is, currently, just a "cosmetic" issue in the planning).

IMO the change the PR proposes is a correction, not just "cosmetic". But it's true that it currently makes no practical difference.
But to exemplify why this works now: FROM ... | EVAL x = missing_field + 1 gets planned as Eval - Limit - EsRelation. We're pushing Eval down as much as possible, but we push Limit "harder". With the effect that the LimitExec "breaks" the plan and the local fragment lacks the Eval, which makes the extraction work (i.e. actually skip the field).
But ideally this evaluation would be distributed and one day Eval pushed into the fragment as well. The InsertFieldExtraction only looks at FieldAttributes (and MetadataAttributes), so we we'd want to have the merge generate a ReferenceAttribute instead of a field one.

This captures aliases but now also References - why do we care about them and what about other types of named expression (e.g. FieldAttribute)?

I hope the above clarifies. But happy to give further details, if needed.

But I'm also happy to revert the change and add a comment instead. :)

Copy link
Contributor

@alex-spies alex-spies Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InsertFieldExtraction only looks at FieldAttributes (and MetadataAttributes), so we we'd want to have the merge generate a ReferenceAttribute instead of a field one.

It's true that replacing the field attribute by a ref attr downstream in the plan will make InsertFieldExtrac gloss over the missing attribute altogether.

But InsertFieldExtract's existing mechanism already compares with the upstream output by name id (we use attribute sets), so this is already taken care of. Or should be, anyway.

}
}
List<NamedExpression> replaced = new ArrayList<>(upper.size());
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,8 @@ public LogicalPlan apply(LogicalPlan plan, LocalLogicalOptimizerContext localLog
// Do not use the attribute name, this can deviate from the field name for union types; use fieldName() instead.
// Also retain fields from lookup indices because we do not have stats for these.
Predicate<FieldAttribute> shouldBeRetained = f -> f.field() instanceof PotentiallyUnmappedKeywordEsField
|| (localLogicalOptimizerContext.searchStats().exists(f.fieldName()) || lookupFields.contains(f));
|| localLogicalOptimizerContext.searchStats().exists(f.fieldName())
|| lookupFields.contains(f);

return plan.transformUp(p -> missingToNull(p, shouldBeRetained));
}
Expand All @@ -77,7 +78,7 @@ private LogicalPlan missingToNull(LogicalPlan plan, Predicate<FieldAttribute> sh
for (int i = 0, size = relationOutput.size(); i < size; i++) {
Attribute attr = relationOutput.get(i);
NamedExpression projection;
if (attr instanceof FieldAttribute f && (shouldBeRetained.test(f) == false)) {
if (attr instanceof FieldAttribute f && shouldBeRetained.test(f) == false) {
DataType dt = f.dataType();
Alias nullAlias = nullLiterals.get(dt);
// save the first field as null (per datatype)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@

package org.elasticsearch.xpack.esql.optimizer;

import org.apache.lucene.util.BytesRef;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.util.Maps;
import org.elasticsearch.index.IndexMode;
Expand All @@ -22,6 +23,7 @@
import org.elasticsearch.xpack.esql.core.expression.FieldAttribute;
import org.elasticsearch.xpack.esql.core.expression.FoldContext;
import org.elasticsearch.xpack.esql.core.expression.Literal;
import org.elasticsearch.xpack.esql.core.expression.ReferenceAttribute;
import org.elasticsearch.xpack.esql.core.tree.NodeInfo;
import org.elasticsearch.xpack.esql.core.tree.Source;
import org.elasticsearch.xpack.esql.core.type.DataType;
Expand Down Expand Up @@ -149,10 +151,10 @@ public void testMissingFieldInFilterString() {

/**
* Expects
* Project[[last_name{f}#6]]
* Project[[last_name{r}#7]]
* \_Eval[[null[KEYWORD] AS last_name]]
* \_Limit[10000[INTEGER]]
* \_EsRelation[test][_meta_field{f}#8, emp_no{f}#2, first_name{f}#3, gen..]
* \_Limit[1000[INTEGER],false]
* \_EsRelation[test][_meta_field{f}#9, emp_no{f}#3, first_name{f}#4, gen..]
*/
public void testMissingFieldInProject() {
var plan = plan("""
Expand All @@ -166,7 +168,7 @@ public void testMissingFieldInProject() {
var project = as(localPlan, Project.class);
var projections = project.projections();
assertThat(Expressions.names(projections), contains("last_name"));
as(projections.get(0), FieldAttribute.class);
as(projections.get(0), ReferenceAttribute.class);
var eval = as(project.child(), Eval.class);
assertThat(Expressions.names(eval.fields()), contains("last_name"));
var alias = as(eval.fields().get(0), Alias.class);
Expand All @@ -179,6 +181,39 @@ public void testMissingFieldInProject() {
assertThat(Expressions.names(source.output()), not(contains("last_name")));
}

/*
* Expects a similar plan to testMissingFieldInProject() above, except for the Alias's child value
* Project[[last_name{r}#4]]
* \_Eval[[[66 6f 6f][KEYWORD] AS last_name]]
* \_Limit[1000[INTEGER],false]
* \_EsRelation[test][_meta_field{f}#11, emp_no{f}#5, first_name{f}#6, ge..]
*/
public void testReassignedMissingFieldInProject() {
var plan = plan("""
from test
| keep last_name
| eval last_name = "foo"
""");

var testStats = statsForMissingField("last_name");
var localPlan = localPlan(plan, testStats);

var project = as(localPlan, Project.class);
var projections = project.projections();
assertThat(Expressions.names(projections), contains("last_name"));
as(projections.get(0), ReferenceAttribute.class);
var eval = as(project.child(), Eval.class);
assertThat(Expressions.names(eval.fields()), contains("last_name"));
var alias = as(eval.fields().get(0), Alias.class);
var literal = as(alias.child(), Literal.class);
assertThat(literal.value(), is(new BytesRef("foo")));
assertThat(literal.dataType(), is(DataType.KEYWORD));

var limit = as(eval.child(), Limit.class);
var source = as(limit.child(), EsRelation.class);
assertThat(Expressions.names(source.output()), not(contains("last_name")));
}

/**
* Expects
* EsqlProject[[first_name{f}#4]]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,8 +39,8 @@
import org.elasticsearch.xpack.esql.analysis.Verifier;
import org.elasticsearch.xpack.esql.core.expression.Alias;
import org.elasticsearch.xpack.esql.core.expression.Expressions;
import org.elasticsearch.xpack.esql.core.expression.FieldAttribute;
import org.elasticsearch.xpack.esql.core.expression.Literal;
import org.elasticsearch.xpack.esql.core.expression.ReferenceAttribute;
import org.elasticsearch.xpack.esql.core.tree.Source;
import org.elasticsearch.xpack.esql.core.type.DataType;
import org.elasticsearch.xpack.esql.core.type.EsField;
Expand Down Expand Up @@ -1307,15 +1307,16 @@ private EsQueryExec doTestOutOfRangeFilterPushdown(String query, Analyzer analyz
return luceneQuery;
}

/**
/*
* Expects
* LimitExec[1000[INTEGER]]
* \_ExchangeExec[[],false]
* \_ProjectExec[[_meta_field{f}#8, emp_no{r}#2, first_name{r}#3, gender{f}#4, job{f}#9, job.raw{f}#10, languages{f}#5, first_n
* ame{r}#3 AS last_name, long_noidx{f}#11, emp_no{r}#2 AS salary]]
* \_FieldExtractExec[_meta_field{f}#8, gender{f}#4, job{f}#9, job.raw{f}..]
* \_ExchangeExec[[_meta_field{f}#8, emp_no{f}#2, first_name{f}#3, gender{f}#4, hire_date{f}#9, job{f}#10, job.raw{f}#11, langua
* ges{f}#5, last_name{f}#6, long_noidx{f}#12, salary{f}#7],false]
* \_ProjectExec[[_meta_field{f}#8, emp_no{r}#2, first_name{r}#3, gender{f}#4, hire_date{f}#9, job{f}#10, job.raw{f}#11, langua
* ges{f}#5, first_name{r}#3 AS last_name, long_noidx{f}#12, emp_no{r}#2 AS salary]]
* \_FieldExtractExec[_meta_field{f}#8, gender{f}#4, hire_date{f}#9, job{..]<[],[]>
* \_EvalExec[[null[INTEGER] AS emp_no, null[KEYWORD] AS first_name]]
* \_EsQueryExec[test], query[][_doc{f}#12], limit[1000], sort[] estimatedRowSize[270]
* \_EsQueryExec[test], indexMode[standard], query[][_doc{f}#13], limit[1000], sort[] estimatedRowSize[278]
*/
public void testMissingFieldsDoNotGetExtracted() {
var stats = EsqlTestUtils.statsForMissingField("first_name", "last_name", "emp_no", "salary");
Expand All @@ -1342,9 +1343,9 @@ public void testMissingFieldsDoNotGetExtracted() {
)
);
// emp_no
assertThat(projections.get(1), instanceOf(FieldAttribute.class));
assertThat(projections.get(1), instanceOf(ReferenceAttribute.class));
// first_name
assertThat(projections.get(2), instanceOf(FieldAttribute.class));
assertThat(projections.get(2), instanceOf(ReferenceAttribute.class));

// last_name --> first_name
var nullAlias = Alias.unwrap(projections.get(8));
Expand Down