[SPARK-52582][SQL] Improve the memory usage of XML parser #51287


Open · wants to merge 22 commits into base: master

Conversation

@xiaonanyang-db (Contributor) commented Jun 26, 2025

What changes were proposed in this pull request?

Today, the XML parser is not memory efficient: it loads each XML record fully into memory before parsing, which causes an OOM when an input record is large. This PR changes the parser to consume an XML record token by token, avoiding copying the entire record into memory ahead of time.

The optimization also makes the handling of malformed XML files deterministic. Currently, the XML parser doesn't salvage valid records from a malformed file deterministically. After this change, the parser fails at the first corrupt record but returns all valid records that precede it.
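
For context, a minimal sketch of the pull-based StAX pattern this refers to (illustrative only, not the PR's actual code; the inline record is a stand-in):

import java.io.StringReader
import javax.xml.stream.XMLInputFactory

// Pull parsing: the reader yields one event at a time, so only a single token
// needs to be held in memory rather than the whole serialized record.
val factory = XMLInputFactory.newInstance()
val reader = factory.createXMLEventReader(new StringReader("<ROW><b>1</b></ROW>"))
while (reader.hasNext) {
  val event = reader.nextEvent()
  if (event.isStartElement) {
    println(event.asStartElement().getName.getLocalPart) // prints: ROW, b
  }
}
reader.close()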

Why are the changes needed?

Solve the OOM issue in XML ingestion.

Does this PR introduce any user-facing change?

No. The new behavior is disabled by default for now.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xiaonanyang-db xiaonanyang-db marked this pull request as ready for review June 26, 2025 01:46
@github-actions github-actions bot added the SQL label Jun 26, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-52582] Improve the memory usage of XML parser [SPARK-52582][SQL] Improve the memory usage of XML parser Jun 26, 2025
Utils.tryWithResource(
CodecStreams.createInputStreamWithCloseResource(conf, file.toPath)
) { is =>
UTF8String.fromBytes(ByteStreams.toByteArray(is))
Contributor:

This may hit java byte array limit that this PR is trying to address. Limit it to 1GB.
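
A hedged sketch of the suggested cap, assuming Guava's ByteStreams.limit (already on Spark's classpath); the 1 GB constant and the wiring are taken from this comment, not the final code:

Utils.tryWithResource(
  CodecStreams.createInputStreamWithCloseResource(conf, file.toPath)
) { is =>
  // Cap the copy at 1 GB so a huge corrupt record cannot blow past the JVM
  // byte-array limit while being stored in the corrupt-record column.
  val oneGB = 1L << 30
  UTF8String.fromBytes(ByteStreams.toByteArray(ByteStreams.limit(is, oneGB)))
}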

Contributor Author:

Right, but it will only be called for corrupted files.

Contributor:

It shouldn't throw an exception when ignoreCorruptFiles is set.

Comment on lines 51 to 58
// Malformed record handling is slightly different in the optimized XML parser
"DSL test for parsing a malformed XML file",
"DSL test for permissive mode for corrupt records",
"DSL test with malformed attributes",
"DSL test for dropping malformed rows",
"DSL: handle malformed record in singleVariantColumn mode",
// No valid row will be found in `unclosed_tag.xml` by the OptimizedXMLTokenizer
"test FAILFAST with unclosed tag",
Contributor:

Let's update some of these tests to have some valid rowTags at the beginning and a corrupt one towards the end.

Contributor Author (@xiaonanyang-db, Jul 7, 2025):

For this specific test case, the record now fails at the tokenizing stage and never enters the parsing stage; thus it will not fail in FAILFAST mode as originally expected.

@LuciferYang (Contributor):

> The optimization is governed by an SQL conf and disabled by default because it comes with two consequences:
>
>   1. XSD validation is not supported in the optimized parser. The current XSD validation works only on a full XML record string, which defeats the purpose of the optimization here.
>   2. Behavior change in corrupted record handling. Currently, good records in an XML file can be parsed correctly even if there is a corrupted XML record in between, but they won't be in the optimized parser. For example, given the following XML file:
>
> <ROWS>
>   <ROW><b>1</b></ROW>
>   <ROW><b></ROW>
>   <ROW><b>2</b></ROW>
> </ROWS>
>
> In the current parser, both the 1st and 3rd records are parsed correctly, and the second one is moved to the corrupted data column. In the new parser, only the 1st record is parsed, and the rest of the document is moved to the corrupted data column as the second record.

From the description in the PR, enabling the optimization will change some existing behaviors. Have there been similar cases in the past? cc @cloud-fan

@xiaonanyang-db (Contributor Author), replying to the comment above:

Updated the PR to eliminate the behavior gaps between the optimized version and the existing version.

@@ -292,7 +283,9 @@ class XmlSuite
)
}

test("test FAILFAST with unclosed tag") {
// The record in the test XML file doesn't have a valid start row tag, thus no record is tokenized
// and parsed.
Contributor:

What was the behavior before?

@cloud-fan (Contributor) commented Jul 8, 2025:

> No. The new behavior is disabled by default for now.

@xiaonanyang-db can we explain the behavior difference in detail if the conf is enabled?

-convertStream(xmlTokenizer) { tokens =>
-  safeParser.parse(tokens)
+convertStream(xmlTokenizer) { parser =>
+  safeParser.parse(parser)
Contributor:

when will this parser close?

Contributor Author:

It will be closed in two cases (see the sketch below):

  1. In XMLTokenizer, when there are no more records
  2. In doParseColumn, when we run into a corrupted record
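
A minimal sketch of the corrupt-record close path (case 2), with XMLTokenizer reduced to an assumed trait; the names follow the PR but the bodies are illustrative:

import javax.xml.stream.{XMLEventReader, XMLStreamException}

// Assumed contract: next() returns a reader positioned at the next record, or
// None after closing the underlying stream when input is exhausted (case 1).
trait XMLTokenizer {
  def next(): Option[XMLEventReader]
}

def parseAll[T](tokenizer: XMLTokenizer)(parseOne: XMLEventReader => T): Iterator[T] =
  Iterator
    .continually(tokenizer.next())
    .takeWhile(_.isDefined)
    .collect { case Some(reader) =>
      try parseOne(reader)
      catch {
        case e: XMLStreamException =>
          reader.close() // case 2: a corrupt record stops consumption of the file
          throw e
      }
    }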

result
}
} catch {
case e: SparkUpgradeException => throw e
case e@(_: RuntimeException | _: XMLStreamException | _: MalformedInputException
| _: SAXException) =>
// Skip rest of the content in the parser and put the whole XML file in the
// BadRecordException.
parser.close()
Contributor:

When will the non-exception path close the parser?

Contributor Author (@xiaonanyang-db, Jul 8, 2025):

In XMLTokenizer.next(), when there are no more records.
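
A sketch of that non-exception close path, with an assumed body for next(); advanceToNextRowStart is a hypothetical helper, not the PR's actual method:

import javax.xml.stream.XMLEventReader

class XMLTokenizer(reader: XMLEventReader) {
  // When no further row start tag exists, close the reader before returning
  // None, so the normal end-of-file path never leaks the stream.
  def next(): Option[XMLEventReader] =
    if (advanceToNextRowStart()) {
      Some(reader)
    } else {
      reader.close()
      None
    }

  // Hypothetical helper: skip forward until the next start element (a real
  // implementation would also match the configured rowTag).
  private def advanceToNextRowStart(): Boolean = {
    while (reader.hasNext && !reader.peek().isStartElement) reader.nextEvent()
    reader.hasNext
  }
}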

@@ -35,29 +35,39 @@ object StaxXmlParserUtils {
factory
}

val filter = new EventFilter {
Contributor:

Can this filter be reused across multiple readers? If not, define a function that returns the filter.

Contributor Author:

I think it can be reused, as it's stateless.
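
For illustration: a filter whose accept() depends only on its argument carries no state, so one instance can be shared across readers. The events dropped here are an assumption, not necessarily the PR's exact filter:

import javax.xml.stream.{EventFilter, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent

// Stateless: no fields, no mutation, so a single shared instance is safe to
// pass to every XMLInputFactory.createFilteredReader call.
val filter: EventFilter = new EventFilter {
  override def accept(event: XMLEvent): Boolean =
    !event.isProcessingInstruction &&
      event.getEventType != XMLStreamConstants.COMMENT
}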

Comment on lines 67 to 70
case _: RuntimeException | _: IOException if options.ignoreCorruptFiles =>
logWarning("Skipping the rest of the content in the corrupted file", e)
case _: XMLStreamException =>
logWarning("Skipping the rest of the content in the corrupted file", e)
Contributor:

Why is XMLStreamException a separate case? Shouldn't there be a call to close?

Contributor Author:

Oops, those two cases can be combined.

Contributor Author:

Actually, we shouldn't need separate close calls in each case; the parser will be closed in the finally block.
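
A hedged sketch of that consolidated shape; whether the ignoreCorruptFiles guard should also cover XMLStreamException is an assumption here, and readNextRecord, options, parser, and logWarning stand in for the PR's actual members:

import java.io.IOException
import javax.xml.stream.XMLStreamException

try {
  readNextRecord()
} catch {
  // One combined case replaces the two duplicated ones above.
  case e @ (_: RuntimeException | _: IOException | _: XMLStreamException)
      if options.ignoreCorruptFiles =>
    logWarning("Skipping the rest of the content in the corrupted file", e)
} finally {
  parser.close() // single close point; no per-case close calls needed
}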

Comment on lines -2976 to +2965
-val schema = buildSchema(field("a1", LongType), field("a2", LongType))
+val schema = buildSchema(
+  field("_corrupted_record", LongType), field("a1", LongType), field("a2", LongType))
Contributor:

Why was the _corrupted_record field required when writing data?

Comment on lines -395 to -398
assert(
singleVariantColumn.isDefined || schemaDDL.isDefined,
"Either singleVariantColumn or schema must be defined to ingest XML files as variants via DSL"
)
Contributor:

Why is this assert no longer valid?

Contributor Author:

Because singleVariantColumn mode now supports a user-specified schema covering both the single variant column and the corrupted record column.
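
A hypothetical DSL usage that this relaxation enables (the column names and path are made up; spark is an active SparkSession):

// User-specified schema carrying both the single variant column and a
// corrupt-record column, alongside singleVariantColumn mode.
val df = spark.read
  .format("xml")
  .option("rowTag", "ROW")
  .option("singleVariantColumn", "var")
  .schema("var variant, _corrupt_record string")
  .load("/path/to/records.xml")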

@xiaonanyang-db xiaonanyang-db requested a review from sandip-db July 8, 2025 21:56