-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-52575][SQL] Introduce contextIndependentFoldable attribute for Expressions #51282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-52575][SQL] Introduce contextIndependentFoldable attribute for Expressions #51282
Conversation
I will create a follow-up to use this new method in the V2 expression conversion
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be super helpful to simplify expressions before converting them to DSv2. I only have a question about marking expressions context-free by default for all unary and binary expressions.
* | ||
* Default is false to ensure explicit marking of context independence. | ||
*/ | ||
def contextIndependentFoldable: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we could consider splitting contextIndependent
and foldable
into separate methods. Are there use cases where we want to know if the expression is context independent without checking it is foldable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of such special cases. I think we can always check foldable
before checking contextIndependentFoldable
, which seems very safe to me. The syntax for this method is when it is foldable, does it depend on context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess some unfoldable expression could be context independent. Such as: 1 + a', the type of attribute a' is int.
But I don't know should we consider these cases.
I'm +1 for @aokolnychyi if we must consider the case I mentioned.
I'm +1 for @gengliangwang if we not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such as: 1 + a', the type of attribute a' is int
@beliefer in such a case, column a'
is not contextIndependent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I means the context that is related to the Spark context or Session context. The value of attribute is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or the time zone etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key is how to define the context ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context-independent means that the expression is independent of contexts(current user, current database, current time, etc) when it is foldable.
For such expressions, we can evaluate them before storing in column default values, table constraints, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the foldable
is the prerequisite of context-independent
. Thank you for the explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
lazy val contextIndependentFoldable: Boolean = children.forall(_. contextIndependentFoldable)
here ?
So we can avoid change everywhere.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala
Outdated
Show resolved
Hide resolved
@@ -493,6 +493,11 @@ case class Cast( | |||
|
|||
final override def nodePatternsInternal(): Seq[TreePattern] = Seq(CAST) | |||
|
|||
override def contextIndependentFoldable: Boolean = { | |||
child.contextIndependentFoldable && !DataTypeUtils.containsTimestampOrUDT(dataType) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think UDT matters here. The CAST implementation only allows pass-through of UDT casting, there is no operation and no context is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, let's be more specific about LTZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, can't we simply call needsTimeZone
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, updated
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
*/ | ||
def containsTimestampOrUDT(dataType: DataType): Boolean = { | ||
matchesPattern(dataType, | ||
dt => dt.isInstanceOf[TimestampType] || dt.isInstanceOf[UserDefinedType[_]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add another function containsUserDefinedType
and remove the dt.isInstanceOf[UserDefinedType[_]]
from here. It looks weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed containsTimestampOrUDT in the latest code. In the current code, the matchesPattern
method is only used here. I prefer to make it more general.
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.Cast.{toSQLId, toSQLType} | |||
import org.apache.spark.sql.catalyst.expressions.codegen._ | |||
import org.apache.spark.sql.catalyst.expressions.codegen.Block._ | |||
import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_ARITHMETIC, TreePattern} | |||
import org.apache.spark.sql.catalyst.types.{PhysicalDecimalType, PhysicalFractionalType, PhysicalIntegerType, PhysicalIntegralType, PhysicalLongType} | |||
import org.apache.spark.sql.catalyst.types.{DataTypeUtils, PhysicalDecimalType, PhysicalFractionalType, PhysicalIntegerType, PhysicalIntegralType, PhysicalLongType} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import
@@ -415,6 +415,11 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression { | |||
Literal.validateLiteralValue(value, dataType) | |||
|
|||
override def foldable: Boolean = true | |||
|
|||
override def contextIndependentFoldable: Boolean = { | |||
!DataTypeUtils.matchesPattern(dataType, dt => dt.isInstanceOf[UserDefinedType[_]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can a literal be context dependent? It's not an operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at Literal.validateLiteralValue
:
case udt: UserDefinedType[_] => doValidate(v, udt.sqlType)
So the stored value is the same as UDT's underlying SQL type. I don't think it requires any context
What changes were proposed in this pull request?
Introduces a method to determine whether an expression can be folded without relying on any external context (e.g., time zone, session configurations, or catalogs). If an expression is context-independent foldable, it can be safely evaluated during DDL operations such as creating tables, views, or constraints. This enables systems to store the computed value rather than the original expression, simplifying implementation and improving performance. By default, expressions are not considered context-independent foldable to ensure explicit annotation of context independence.
Examples of context-independent foldable:
Examples of not context-independent foldable:
Why are the changes needed?
This allows catalogs and connectors to store the computed value rather than the expression itself, improving both simplicity and performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No