Skip to content

[SPARK-52495][SQL] Allow including partition columns in the single variant column #51206

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

xiaonanyang-db
Copy link
Contributor

@xiaonanyang-db xiaonanyang-db commented Jun 17, 2025

What changes were proposed in this pull request?

When reading files under a directory with a structure like /path/to/data/month=01/day=01 and with singleVariantColumn enabled, the partition columns are inferred and added to the destination as top-level fields in addition to the single variant column, e.g.,

root
 |-- var: variant (nullable = true)
 |-- month: int (nullable = true)
 |-- day: int (nullable = true)

This behavior is semantically wrong as it produces additional columns other than a single variant column declared by the option name. Meanwhile, it complicates customer CUJ as users have to deal with partition schema evolution even when using Variant.

This PR introduces the ability to include the partition columns in the single variant column for all the supported file formats: JSON, CSV, and XML.

A caveat is that variantAllowDuplicateKeys is required to be true for JSON and CSV when the partition schema overlaps with the data schema in the variant column so that the partition values can overwrite the data values for the overlapped columns (the current behavior for non-variant ingesiton)

Why are the changes needed?

As described above.

Does this PR introduce any user-facing change?

No, new behavior is controlled by a flag and disabled by default.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 17, 2025
@xiaonanyang-db xiaonanyang-db marked this pull request as ready for review June 21, 2025 00:40
@xiaonanyang-db xiaonanyang-db changed the title [SPARK-52495] Include partition columns in the single variant column [SPARK-52495] Allow including partition columns in the single variant column Jun 21, 2025
@xiaonanyang-db xiaonanyang-db changed the title [SPARK-52495] Allow including partition columns in the single variant column [SPARK-52495][SQL] Allow including partition columns in the single variant column Jul 2, 2025
Comment on lines +994 to +995
// The year partition column overlaps with the data columns in the source XML file.
// In this case, we should use the value of the partition column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behaviour when singleVariantColumn is not specified and partition column overlaps with the data column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The partition values will overwrite the data values.

Comment on lines +1034 to +1035
// If the partition schema overlaps with the data schema, we **OVERRIDE** the
// data with the partition values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior during overlap without singleVariantColumn? Can the variant field be converted to an array in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, the partition values will overwrite the data values

Comment on lines 384 to 387
// Add the partition columns to the variant object
if (partitionSchema.nonEmpty && SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
partitionSchema.zipWithIndex.foreach { case (field, index) =>
val value = partitionValues.get(index, field.dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there different implementations adding partition column for CSV, XML and JSON. Can't there be an utility function that takes as input partitionSchema and partitionValues, and add it to an existing variant builder?

Copy link
Contributor Author

@xiaonanyang-db xiaonanyang-db Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three parsers have different implementation logic for variant parsing. For example, CSV is simpler as it doesn't have nested fields, and XML is more complicated because of nested fields and special array structure. We can try maximizing the shared logic, but it's hard to have a common function applicable to all.

test("XML with hive-style partition columns in singleVariantColumn mode") {
withTempDir { dir =>
// Create partitioned directory structure and copy file to each partition
val path = s"${dir.getCanonicalPath}/year=2021/month=01"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add another file with the following path:
s"${dir.getCanonicalPath}/year=2022/month=01/day=01"
Test two scenarios:
i) Both files in same partition
ii) Files are in separate partition

Do the same for other formats.

Comment on lines +169 to +171
if (fsRelation.options.contains(DataSourceOptions.SINGLE_VARIANT_COLUMN) &&
SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
Seq.empty
Copy link
Contributor

@sandip-db sandip-db Jul 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will push down partition filter work if partitionColumns is set to empty?

@@ -4185,6 +4185,38 @@ abstract class JsonSuite
}
}
}

test("JSON with hive-style partition columns in singleVariantColumn mode") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests with different spark.sql.variant.allowDuplicateKeys settings for all 3 formats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants