Add XmlProcessor initial implementation #130337

Open
wants to merge 15 commits into
base: main
from

@marc-gr marc-gr commented Jun 30, 2025

This PR creates a new XML processor that achieves feature parity with Logstash's XML filter.

⚙️ Configuration Options

processors:
  - xml:
      field: "xml_data"
      target_field: "parsed"
      to_lower: false
      # Logstash-compatible options
      xpath:
        "/root/item/@id": "item_id"
        "//product/name/text()": "product_name"
      namespaces:
        "ns": "http://example.com/namespace"
      force_array: true
      force_content: false
      remove_namespaces: false
      ignore_empty_value: true
      parse_options: "strict"
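As a rough illustration of how the `xpath` and `namespaces` options above might interact, the following Java sketch binds a prefix map to an XPath evaluator and extracts one value. The class name `NamespaceXPathSketch` and its methods are hypothetical and are not taken from the PR; only standard `javax.xml` APIs are used, and the actual processor may wire this differently.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not the PR's code: resolves XPath prefixes from a
// prefix -> URI map shaped like the `namespaces` option.
final class NamespaceXPathSketch {

    static NamespaceContext contextOf(Map<String, String> prefixes) {
        return new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                // Unknown prefixes resolve to the "no namespace" URI.
                return prefixes.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }

            @Override
            public String getPrefix(String uri) {
                for (Map.Entry<String, String> e : prefixes.entrySet()) {
                    if (e.getValue().equals(uri)) {
                        return e.getKey();
                    }
                }
                return null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null ? List.<String>of().iterator() : List.of(prefix).iterator();
            }
        };
    }

    static String evaluate(String xml, String expression, Map<String, String> prefixes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps to match
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(contextOf(prefixes));
            return xpath.evaluate(expression, doc); // string value of the first match
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }
}
```

Calling `evaluate` with the prefix map `{"ns": "http://example.com/namespace"}` and an expression such as `/ns:root/ns:name/text()` returns the text of the matching element, which a processor could then store under the configured target field.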

🏗️ Architecture

  • Streaming SAX Parser: Optimal memory usage for large XML documents
  • Selective DOM Building: Only builds DOM when XPath expressions are configured
  • Pre-compiled XPath: XPath expressions compiled at processor creation for performance
  • Security: Enhanced XXE protection with secure parser factory configurations
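The last two points can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: the class `SecureXPathSketch` and its method names are hypothetical, and the specific hardening features chosen here (secure processing plus rejecting DOCTYPE declarations) are one common way to block XXE with the JDK's built-in parser.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration, not the PR's code.
final class SecureXPathSketch {

    // Compiled once and reused for every document. XPathExpression is not
    // thread-safe, so a real processor would need synchronization or
    // per-thread instances.
    private static final XPathExpression ITEM_ID = compile("/root/item/@id");

    static XPathExpression compile(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().compile(expression);
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid XPath: " + expression, e);
        }
    }

    static DocumentBuilder secureBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Rejecting DOCTYPE declarations outright rules out external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    static String itemId(String xml) {
        try {
            Document doc = secureBuilder().parse(new InputSource(new StringReader(xml)));
            return (String) ITEM_ID.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

With this configuration, a document that declares a DOCTYPE (the usual vehicle for XXE payloads) fails to parse instead of resolving external entities, while ordinary documents are evaluated against the already compiled expression.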

📚 Documentation

Documentation includes:

  • Complete configuration reference
  • XPath expression examples
  • Namespace configuration guide

Logstash differences

  • ignore_empty_value behaves a bit differently from Logstash's suppress_empty, but I think it matches the behavior of other processors better. It could be adapted, or both options could be added, but I found having both confusing.

Closes #97364

Contributor

github-actions bot commented Jun 30, 2025

🔍 Preview links for changed docs:

🔔 The preview site may take up to 3 minutes to finish building. These links will become live once it completes.

@marc-gr marc-gr force-pushed the feat/xml-processor branch from 95df637 to 67dd264 Compare June 30, 2025 14:28
@marc-gr marc-gr requested a review from Copilot June 30, 2025 14:29
@marc-gr marc-gr marked this pull request as ready for review June 30, 2025 14:29
@elasticsearchmachine elasticsearchmachine added the needs:triage label (Requires assignment of a team area label) Jun 30, 2025
@marc-gr marc-gr added the Team:Security label (Meta label for security team) Jun 30, 2025
@elasticsearchmachine elasticsearchmachine removed the Team:Security label (Meta label for security team) Jun 30, 2025
Copilot

This comment was marked as outdated.

@marc-gr marc-gr added the :Data Management/Ingest Node label (Execution or management of Ingest Pipelines including GeoIP) Jul 1, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management label (Meta label for data/management team) Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine removed the needs:triage label (Requires assignment of a team area label) Jul 1, 2025
@elasticsearchmachine
Collaborator

Hi @marc-gr, I've created a changelog YAML for you.

- Replace XMLStreamReader with SAX parser + DOM for XPath support
- Add XPath extraction, namespaces, strict parsing, content filtering
- New options: force_array, force_content, remove_namespaces, store_xml
- Enhanced security with XXE protection and pre-compiled XPath expressions
- Full test coverage and updated documentation
Contributor

github-actions bot commented Jul 4, 2025

@marc-gr marc-gr requested a review from Copilot July 4, 2025 13:58
Copilot

This comment was marked as outdated.

marc-gr added 2 commits July 4, 2025 16:22
- Fix test assertion for remove_namespaces feature
- Use StandardCharsets.UTF_8 instead of string literal
- Replace string reference comparison with isEmpty()
- Move regex pattern to static final field for performance
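The three code-level cleanups in this commit message follow well-known Java idioms. The sketch below shows what each one typically looks like; the class `CleanupSketch`, its method names, and the whitespace pattern are hypothetical stand-ins, since the actual changed lines are not shown in this conversation.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration of the idioms named in the commit message.
final class CleanupSketch {

    // Compiled once per class load instead of on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static byte[] toBytes(String xml) {
        // The constant avoids the checked UnsupportedEncodingException and
        // the charset lookup that getBytes("UTF-8") implies.
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    static boolean isBlankContent(String text) {
        // isEmpty() checks content; `text == ""` would compare references
        // and can be false even for an empty string.
        return text == null || text.trim().isEmpty();
    }

    static String normalize(String text) {
        return WHITESPACE.matcher(text.trim()).replaceAll(" ");
    }
}
```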
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds an initial implementation of a new XmlProcessor to parse XML input into JSON-like structures with feature parity to Logstash’s XML filter, alongside configuration options, factory validation, and documentation.

  • Introduce XmlProcessor with streaming SAX parsing, optional DOM building for XPath, and secure defaults.
  • Add end-to-end tests (XmlProcessorTests) and factory validation tests (XmlProcessorFactoryTests).
  • Register the processor in the plugin, update module-info, documentation, and changelog.
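To make the "streaming SAX parsing into JSON-like structures" idea concrete, here is a minimal Java sketch of a SAX handler that builds nested maps as elements open and close. It is an assumption-laden illustration rather than the PR's `XmlProcessor`: the class `SaxToMapSketch`, the `#text` key for mixed content, and the rule that repeated siblings collapse into a list are all hypothetical choices made for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not the PR's code.
final class SaxToMapSketch extends DefaultHandler {
    private final Map<String, Object> root = new LinkedHashMap<>();
    private final Deque<Map<String, Object>> stack = new ArrayDeque<>();
    private final Deque<StringBuilder> text = new ArrayDeque<>();

    SaxToMapSketch() {
        stack.push(root);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Map<String, Object> node = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            node.put(atts.getQName(i), atts.getValue(i));
        }
        stack.push(node);
        text.push(new StringBuilder()); // collects this element's own text only
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.peek().append(ch, start, length);
    }

    @Override
    @SuppressWarnings("unchecked")
    public void endElement(String uri, String localName, String qName) {
        Map<String, Object> node = stack.pop();
        String content = text.pop().toString().trim();
        Object value;
        if (node.isEmpty()) {
            value = content; // leaf element: store its text directly
        } else {
            if (!content.isEmpty()) {
                node.put("#text", content); // assumed key for mixed content
            }
            value = node;
        }
        Map<String, Object> parent = stack.peek();
        Object existing = parent.get(qName);
        if (existing == null) {
            parent.put(qName, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            // Second sibling with the same name: promote to a list.
            List<Object> values = new ArrayList<>();
            values.add(existing);
            values.add(value);
            parent.put(qName, values);
        }
    }

    static Map<String, Object> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SaxToMapSketch handler = new SaxToMapSketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid XML", e);
        }
    }
}
```

Because the handler only keeps the current ancestor chain on its stack while parsing, no DOM is built for this path; a DOM would only be needed when XPath expressions are configured, as the overview describes.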

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/XmlProcessor.java | New XML parsing processor implementation |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorTests.java | End-to-end tests for XML parsing behavior |
| modules/ingest-common/src/test/java/org/elasticsearch/ingest/common/XmlProcessorFactoryTests.java | Tests for factory config and validation |
| modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java | Register XmlProcessor in the plugin registry |
| modules/ingest-common/src/main/java/module-info.java | Add requires java.xml for XML APIs |
| docs/reference/enrich-processor/xml-processor.md | Documentation for XML processor |
| docs/reference/enrich-processor/toc.yml | Add entry for xml-processor.md |
| docs/reference/enrich-processor/index.md | Include xml processor in the index |
| docs/changelog/130337.yaml | Changelog entry for PR |
Comments suppressed due to low confidence (1)

docs/reference/enrich-processor/xml-processor.md:9

  • The implementation actually uses a streaming SAX parser with optional DOM building for XPath. Update this description to reflect the streaming-based approach for accurate documentation.
Parses XML documents and converts them to JSON objects using a DOM parser. This processor efficiently handles XML data with a single-parse architecture that supports both structured output and XPath extraction for optimal performance.

Labels
  • :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
  • >enhancement
  • external-contributor (Pull request authored by a developer outside the Elasticsearch team)
  • Team:Data Management (Meta label for data/management team)
  • v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ingest Pipeline] XML Processor
2 participants