[SPARK-52495][SQL] Allow including partition columns in the single variant column #51206
Conversation
// The year partition column overlaps with the data columns in the source XML file.
// In this case, we should use the value of the partition column.
What is the behaviour when singleVariantColumn is not specified and a partition column overlaps with a data column?
The partition values will overwrite the data values.
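The precedence rule described here can be modeled in plain Python (an illustrative sketch only; the names below are stand-ins, not Spark's actual implementation):

```python
def merge_row(data_row, partition_values):
    """Illustrative model: when a partition column and a data column
    share a name, the partition value replaces the data value."""
    merged = dict(data_row)
    merged.update(partition_values)  # partition values win on overlap
    return merged

# Data file carries year=1999, but the file sits under /year=2021/:
row = merge_row({"year": "1999", "value": 42}, {"year": "2021"})
# row == {"year": "2021", "value": 42}
```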
// If the partition schema overlaps with the data schema, we **OVERRIDE** the
// data with the partition values.
What is the behavior during overlap without singleVariantColumn? Can the variant field be converted to an array in this case?
As mentioned above, the partition values will overwrite the data values.
// Add the partition columns to the variant object
if (partitionSchema.nonEmpty && SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
  partitionSchema.zipWithIndex.foreach { case (field, index) =>
    val value = partitionValues.get(index, field.dataType)
Why are there different implementations for adding the partition columns in CSV, XML, and JSON? Can't there be a utility function that takes partitionSchema and partitionValues as input and adds them to an existing variant builder?
The three parsers have different implementation logic for variant parsing. For example, CSV is simpler as it doesn't have nested fields, and XML is more complicated because of nested fields and special array structure. We can try maximizing the shared logic, but it's hard to have a common function applicable to all.
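Even so, the reviewer's suggested helper can be sketched in outline. Below is a plain-Python model of such a utility (hypothetical: `VariantObjectBuilder` and `append` are illustrative stand-ins, not Spark's variant builder API):

```python
class VariantObjectBuilder:
    """Illustrative stand-in for a variant object builder."""
    def __init__(self):
        self.fields = {}

    def append(self, key, value):
        self.fields[key] = value

def append_partition_columns(builder, partition_schema, partition_values):
    """Hypothetical shared utility: take the partition column names
    (from the schema) and their values, and append each pair to an
    existing variant object builder, regardless of source format."""
    for name, value in zip(partition_schema, partition_values):
        builder.append(name, value)
    return builder

b = VariantObjectBuilder()
b.append("a", 1)  # a data field parsed by a format-specific parser
append_partition_columns(b, ["year", "month"], ["2021", "01"])
# b.fields == {"a": 1, "year": "2021", "month": "01"}
```

The format-specific complexity (nested fields in JSON/XML, XML's array handling) would stay in each parser; only the final "append partition columns" step would be shared.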
test("XML with hive-style partition columns in singleVariantColumn mode") {
  withTempDir { dir =>
    // Create partitioned directory structure and copy file to each partition
    val path = s"${dir.getCanonicalPath}/year=2021/month=01"
Add another file with the following path:
s"${dir.getCanonicalPath}/year=2022/month=01/day=01"
Test two scenarios:
i) both files in the same partition
ii) files in separate partitions
Do the same for the other formats.
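The directory layout this suggestion calls for can be sketched like so (plain Python; paths and file contents are illustrative only):

```python
import os
import tempfile

def make_partitioned_layout(root):
    """Create the two hive-style partition directories the review
    suggests, each holding one small data file (illustrative)."""
    paths = [
        os.path.join(root, "year=2021", "month=01"),
        os.path.join(root, "year=2022", "month=01", "day=01"),
    ]
    for p in paths:
        os.makedirs(p, exist_ok=True)
        with open(os.path.join(p, "part-0.json"), "w") as f:
            f.write('{"a": 1}\n')
    return paths

root = tempfile.mkdtemp()
created = make_partitioned_layout(root)
```

Note the two paths deliberately have different partition depths (`year/month` vs. `year/month/day`), which is exactly the kind of layout the suggested tests should exercise.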
if (fsRelation.options.contains(DataSourceOptions.SINGLE_VARIANT_COLUMN) &&
    SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
  Seq.empty
How will partition filter pushdown work if partitionColumns is set to empty?
@@ -4185,6 +4185,38 @@ abstract class JsonSuite
  test("JSON with hive-style partition columns in singleVariantColumn mode") {
Add tests with different spark.sql.variant.allowDuplicateKeys settings for all 3 formats.
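To make the duplicate-key concern concrete, here is a plain-Python model of why the setting matters when a partition column overlaps a data column (illustrative only; Spark's variant builder is not shown, and last-writer-wins mirrors the partition-overrides-data behavior discussed above):

```python
def build_variant(data_fields, partition_fields, allow_duplicate_keys):
    """Model: data fields are appended first, then partition fields.
    A repeated key either raises (modeling allowDuplicateKeys=false)
    or is overwritten, so the partition value wins."""
    out = {}
    for key, value in data_fields + partition_fields:
        if key in out and not allow_duplicate_keys:
            raise ValueError(f"duplicate key: {key}")
        out[key] = value  # last writer (the partition value) wins
    return out

# Overlapping "year" key: only succeeds when duplicates are allowed.
ok = build_variant([("year", "1999")], [("year", "2021")], True)
# ok == {"year": "2021"}
```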
What changes were proposed in this pull request?
When reading files under a directory with a structure like
/path/to/data/month=01/day=01
with singleVariantColumn enabled, the partition columns are inferred and added to the destination as top-level fields in addition to the single variant column. This behavior is semantically wrong, as it produces additional columns beyond the single variant column the option name promises. It also complicates the customer CUJ, as users have to deal with partition schema evolution even when using Variant.
This PR introduces the ability to include the partition columns in the single variant column for all the supported file formats: JSON, CSV, and XML.
A caveat is that variantAllowDuplicateKeys is required to be true for JSON and CSV when the partition schema overlaps with the data schema in the variant column, so that the partition values can overwrite the data values for the overlapping columns (the current behavior for non-variant ingestion).
Why are the changes needed?
As described above.
Does this PR introduce any user-facing change?
No, new behavior is controlled by a flag and disabled by default.
How was this patch tested?
New UTs.
Was this patch authored or co-authored using generative AI tooling?
No.