[SPARK-52495][SQL] Allow including partition columns in the single variant column #51206

xiaonanyang-db · 2025-06-17T17:45:52Z

What changes were proposed in this pull request?

When reading files under a directory with a structure like /path/to/data/month=01/day=01 and with singleVariantColumn enabled, the partition columns are inferred and added to the destination as top-level fields in addition to the single variant column, e.g.,

root
 |-- var: variant (nullable = true)
 |-- month: int (nullable = true)
 |-- day: int (nullable = true)

This behavior is semantically wrong: it produces columns in addition to the single variant column that the option name promises. It also complicates the customer CUJ, because users still have to deal with partition schema evolution even when using Variant.
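To make the partition inference concrete, here is a minimal sketch of how `key=value` segments could be pulled out of a hive-style path such as /path/to/data/month=01/day=01. It is a simplified model, not Spark's partition discovery code: the object and method names are hypothetical, and the type inference only distinguishes short all-digit values from strings.

```scala
object PartitionPathSketch {
  // Extracts `key=value` segments from a path, in order of appearance.
  // Segments without '=' (for example "path" or "data") are ignored.
  def parsePartitions(path: String): Seq[(String, String)] =
    path.split('/').toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _ => None
      }
    }

  // Simplified inference (an assumption of this sketch): short all-digit
  // values become Int, everything else stays a String.
  def inferValue(raw: String): Any =
    if (raw.nonEmpty && raw.length <= 9 && raw.forall(_.isDigit)) raw.toInt else raw
}
```

With this model, `month=01` yields the key `month` and an integer value, which lines up with the `month: int` field in the schema shown above.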

This PR introduces the ability to include the partition columns in the single variant column for all the supported file formats: JSON, CSV, and XML.

A caveat: for JSON and CSV, variantAllowDuplicateKeys must be true when the partition schema overlaps with the data schema in the variant column, so that the partition values can overwrite the data values for the overlapping columns (this matches the current behavior for non-variant ingestion).
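The overlap rule can be sketched with plain maps. This is an illustrative model only, not the parser code in this PR: `mergePartitionValues` is a hypothetical name, and returning a `Left` when duplicates are disallowed is an assumption about how the failure would surface.

```scala
object VariantMergeSketch {
  // Merges partition values into the fields of one parsed record.
  // On overlap the partition value wins, but only when duplicate keys are allowed.
  def mergePartitionValues(
      dataFields: Map[String, Any],
      partitionValues: Map[String, Any],
      allowDuplicateKeys: Boolean): Either[String, Map[String, Any]] = {
    val overlap = dataFields.keySet.intersect(partitionValues.keySet)
    if (overlap.nonEmpty && !allowDuplicateKeys) {
      // Assumed failure mode for this sketch: report the overlapping keys.
      Left(s"Duplicate keys not allowed: ${overlap.toSeq.sorted.mkString(", ")}")
    } else {
      // `++` keeps the right-hand value for keys present on both sides.
      Right(dataFields ++ partitionValues)
    }
  }
}
```

For example, a record with `year -> 1999` read from a `year=2021` directory would end up with `year -> 2021` when duplicates are allowed.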

Why are the changes needed?

As described above.

Does this PR introduce any user-facing change?

No. The new behavior is controlled by a flag and is disabled by default.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

sandip-db · 2025-07-06T17:52:06Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlVariantSuite.scala

+              // The year partition column overlaps with the data columns in the source XML file.
+              // In this case, we should use the value of the partition column.


What is the behaviour when singleVariantColumn is not specified and partition column overlaps with the data column?

xiaonanyang-db · 2025-07-08

The partition values will overwrite the data values.

sandip-db · 2025-07-06T17:56:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala

+                  // If the partition schema overlaps with the data schema, we **OVERRIDE** the
+                  // data with the partition values.


What is the behavior during overlap without singleVariantColumn? Can the variant field be converted to an array in this case?

xiaonanyang-db · 2025-07-08

As mentioned above, the partition values will overwrite the data values.

sandip-db · 2025-07-06T18:02:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

+      // Add the partition columns to the variant object
+      if (partitionSchema.nonEmpty && SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
+        partitionSchema.zipWithIndex.foreach { case (field, index) =>
+          val value = partitionValues.get(index, field.dataType)


Why are there different implementations for adding partition columns for CSV, XML, and JSON? Couldn't there be a utility function that takes partitionSchema and partitionValues as input and adds them to an existing variant builder?

xiaonanyang-db · 2025-07-08

The three parsers have different implementation logic for variant parsing. For example, CSV is simpler as it doesn't have nested fields, and XML is more complicated because of nested fields and its special array structure. We can try maximizing the shared logic, but it's hard to have a common function applicable to all.
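One way to picture the shared part of the logic is a helper that only walks the partition schema and values, leaving the format-specific writing to a small adapter. The sketch below is hypothetical: `FieldSink`, `appendPartitionFields`, and `MapSink` are illustrative names, not Spark APIs, and real variant builders for JSON, CSV, and XML would each need their own adapter.

```scala
object PartitionAppendSketch {
  // Hypothetical minimal interface that each format's builder would adapt to.
  trait FieldSink {
    def putField(name: String, value: Any): Unit
  }

  // Walks partition names and values in lockstep and hands each pair to the sink.
  def appendPartitionFields(
      partitionNames: Seq[String],
      partitionValues: Seq[Any],
      sink: FieldSink): Unit = {
    require(partitionNames.length == partitionValues.length, "schema and values must align")
    partitionNames.zip(partitionValues).foreach { case (name, value) =>
      sink.putField(name, value)
    }
  }

  // Toy sink for illustration: a later write to the same key overwrites the earlier one.
  final class MapSink extends FieldSink {
    private val fields = scala.collection.mutable.LinkedHashMap.empty[String, Any]
    def putField(name: String, value: Any): Unit = fields.update(name, value)
    def result: Map[String, Any] = fields.toMap
  }
}
```

The traversal is the easy part to share; the difficulty described in the reply lives in how each parser would implement its sink.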

sandip-db · 2025-07-06T18:09:14Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/XmlVariantSuite.scala

+  test("XML with hive-style partition columns in singleVariantColumn mode") {
+    withTempDir { dir =>
+      // Create partitioned directory structure and copy file to each partition
+      val path = s"${dir.getCanonicalPath}/year=2021/month=01"


Add another file with the following path:
s"${dir.getCanonicalPath}/year=2022/month=01/day=01"
Test two scenarios:
i) Both files are in the same partition
ii) The files are in separate partitions

Do the same for the other formats.
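The directory layouts for these scenarios could be prepared with a small helper like the one below. It is a sketch using only `java.nio.file`; `writePartitionFile` is a hypothetical name, and the file names and XML content in the usage note are placeholders rather than the fixtures used by the actual suites.

```scala
import java.nio.file.{Files, Path}

object PartitionLayoutSketch {
  // Creates `relativeDir` under `root` (including parents) and writes one file into it.
  // Returns the path of the written file.
  def writePartitionFile(
      root: Path,
      relativeDir: String,
      fileName: String,
      content: String): Path = {
    val dir = Files.createDirectories(root.resolve(relativeDir))
    Files.write(dir.resolve(fileName), content.getBytes("UTF-8"))
  }
}
```

For scenario i), call it twice with the same `relativeDir` (for example `year=2021/month=01`) and different file names; for scenario ii), use `year=2021/month=01` for one file and `year=2022/month=01/day=01` for the other.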

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

sandip-db · 2025-07-06T19:18:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala

+        if (fsRelation.options.contains(DataSourceOptions.SINGLE_VARIANT_COLUMN) &&
+            SQLConf.get.includePartitionColumnsInSingleVariantColumn) {
+          Seq.empty


How will push down partition filter work if partitionColumns is set to empty?

sandip-db · 2025-07-06T19:24:12Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

@@ -4185,6 +4185,38 @@ abstract class JsonSuite
      }
    }
  }
+
+  test("JSON with hive-style partition columns in singleVariantColumn mode") {


Add tests with different spark.sql.variant.allowDuplicateKeys settings for all 3 formats.

draft

29f1fa0

github-actions bot added the SQL label Jun 17, 2025

xiaonanyang-db added 5 commits June 17, 2025 13:44

more changes

b867dfb

u

bd05c71

more csv tests

2af71b9

u

f1494f8

json

fcc5bef

xiaonanyang-db marked this pull request as ready for review June 21, 2025 00:40

xiaonanyang-db changed the title ~~[SPARK-52495] Include partition columns in the single variant column~~ [SPARK-52495] Allow including partition columns in the single variant column Jun 21, 2025

xiaonanyang-db added 2 commits June 20, 2025 17:57

u

baa6ff9

Merge branch 'master' into SPARK-52495

9b9a4de

xiaonanyang-db changed the title ~~[SPARK-52495] Allow including partition columns in the single variant column~~ [SPARK-52495][SQL] Allow including partition columns in the single variant column Jul 2, 2025

Merge branch 'master' into SPARK-52495

2e8279d

sandip-db suggested changes Jul 6, 2025

View reviewed changes

xiaonanyang-db added 2 commits July 7, 2025 21:57

address comments - 1

e536065

Merge branch 'SPARK-52495' of github.com:xiaonanyang-db/spark into SP…

64548c0

…ARK-52495
