Commit 95df637

Add XmlProcessor initial implementation
1 parent ed071cc commit 95df637

8 files changed, +1337 -1 lines changed

docs/reference/enrich-processor/index.md

Lines changed: 3 additions & 0 deletions
@@ -159,6 +159,9 @@ Refer to [Enrich your data](docs-content://manage-data/ingest/transform-enrich/d
 [`split` processor](/reference/enrich-processor/split-processor.md)
 : Splits a field into an array of values.

+[`xml` processor](/reference/enrich-processor/xml-processor.md)
+: Parses XML documents and converts them to JSON objects.
+
 [`trim` processor](/reference/enrich-processor/trim-processor.md)
 : Trims whitespace from field.

docs/reference/enrich-processor/toc.yml

Lines changed: 1 addition & 0 deletions
@@ -46,3 +46,4 @@ toc:
   - file: urldecode-processor.md
   - file: uri-parts-processor.md
   - file: user-agent-processor.md
+  - file: xml-processor.md

docs/reference/enrich-processor/xml-processor.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
+---
+navigation_title: "XML"
+mapped_pages:
+  - https://www.elastic.co/guide/en/elasticsearch/reference/current/xml-processor.html
+---
+
+# XML processor [xml-processor]
+
+
+Parses XML documents and converts them to JSON objects using a streaming XML parser. This processor efficiently handles XML data by avoiding loading the entire document into memory.
+
+$$$xml-options$$$
+
+| Name | Required | Default | Description |
+| --- | --- | --- | --- |
+| `field` | yes | - | The field containing the XML string to be parsed. |
+| `target_field` | no | `field` | The field that the converted structured object will be written into. Any existing content in this field will be overwritten. |
+| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document. |
+| `ignore_failure` | no | `false` | Ignore failures for the processor. See [Handling pipeline failures](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#handling-pipeline-failures). |
+| `to_lower` | no | `false` | Convert XML element names to lowercase. |
+| `ignore_empty_value` | no | `false` | If `true`, the processor will filter out null and empty values from the parsed XML structure, including empty elements, elements with null values, and elements with whitespace-only content. |
+| `description` | no | - | Description of the processor. Useful for describing the purpose of the processor or its configuration. |
+| `if` | no | - | Conditionally execute the processor. See [Conditionally run a processor](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#conditionally-run-processor). |
+| `on_failure` | no | - | Handle failures for the processor. See [Handling pipeline failures](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md#handling-pipeline-failures). |
+| `tag` | no | - | Identifier for the processor. Useful for debugging and metrics. |
+
+## Configuration
+
+```js
+{
+  "xml": {
+    "field": "xml_field",
+    "target_field": "parsed_xml",
+    "ignore_empty_value": true
+  }
+}
+```
+
+## Examples
+
+### Basic XML parsing
+
+```console
+POST _ingest/pipeline/_simulate
+{
+  "pipeline": {
+    "processors": [
+      {
+        "xml": {
+          "field": "xml_content"
+        }
+      }
+    ]
+  },
+  "docs": [
+    {
+      "_source": {
+        "xml_content": "<catalog><book><author>William H. Gaddis</author><title>The Recognitions</title><review>One of the great seminal American novels.</review></book></catalog>"
+      }
+    }
+  ]
+}
+```
+
+Result:
+
+```console-result
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_version": "-3",
+        "_source": {
+          "xml_content": "<catalog><book><author>William H. Gaddis</author><title>The Recognitions</title><review>One of the great seminal American novels.</review></book></catalog>",
+          "catalog": {
+            "book": {
+              "author": "William H. Gaddis",
+              "title": "The Recognitions",
+              "review": "One of the great seminal American novels."
+            }
+          }
+        },
+        "_ingest": {
+          "timestamp": "2019-03-11T21:54:37.909224Z"
+        }
+      }
+    }
+  ]
+}
+```
+
+### Filtering empty values
+
+When `ignore_empty_value` is set to `true`, the processor will remove empty elements from the parsed XML:
+
+```console
+POST _ingest/pipeline/_simulate
+{
+  "pipeline": {
+    "processors": [
+      {
+        "xml": {
+          "field": "xml_content",
+          "target_field": "parsed_xml",
+          "ignore_empty_value": true
+        }
+      }
+    ]
+  },
+  "docs": [
+    {
+      "_source": {
+        "xml_content": "<catalog><book><author>William H. Gaddis</author><title></title><review>One of the great seminal American novels.</review><empty/><nested><empty_text> </empty_text><valid_content>Some content</valid_content></nested></book><empty_book></empty_book></catalog>"
+      }
+    }
+  ]
+}
+```
+
+Result with empty elements filtered out:
+
+```console-result
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_version": "-3",
+        "_source": {
+          "xml_content": "<catalog><book><author>William H. Gaddis</author><title></title><review>One of the great seminal American novels.</review><empty/><nested><empty_text> </empty_text><valid_content>Some content</valid_content></nested></book><empty_book></empty_book></catalog>",
+          "parsed_xml": {
+            "catalog": {
+              "book": {
+                "author": "William H. Gaddis",
+                "review": "One of the great seminal American novels.",
+                "nested": {
+                  "valid_content": "Some content"
+                }
+              }
+            }
+          }
+        },
+        "_ingest": {
+          "timestamp": "2019-03-11T21:54:37.909224Z"
+        }
+      }
+    }
+  ]
+}
+```
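
Note on `ignore_empty_value`: the filtering shown in this result can be pictured as a recursive prune over the parsed structure. The following is a minimal sketch under stated assumptions, not the `XmlProcessor` code from this commit. The class `EmptyValuePruner` and the method `prune` are hypothetical names, and treating a container that ends up with no entries as empty itself is an assumption that the documentation does not spell out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of this commit): drops null, blank, and empty values. */
final class EmptyValuePruner {

    /** Returns a pruned copy of {@code value}, or null when nothing non-empty remains. */
    static Object prune(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof String) {
            String text = (String) value;
            // Whitespace-only text counts as empty, as the option description states.
            return text.trim().isEmpty() ? null : text;
        }
        if (value instanceof Map) {
            Map<String, Object> kept = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                Object pruned = prune(entry.getValue());
                if (pruned != null) {
                    kept.put(String.valueOf(entry.getKey()), pruned);
                }
            }
            // Assumption: an object left without entries is itself treated as empty.
            return kept.isEmpty() ? null : kept;
        }
        if (value instanceof List) {
            List<Object> kept = new ArrayList<>();
            for (Object item : (List<?>) value) {
                Object pruned = prune(item);
                if (pruned != null) {
                    kept.add(pruned);
                }
            }
            return kept.isEmpty() ? null : kept;
        }
        // Any other value type is kept unchanged.
        return value;
    }
}
```

Applied to the parsed form of the request above, a prune of this shape would drop `title`, `empty`, `empty_text`, and `empty_book`, leaving the `author`, `review`, and `nested.valid_content` entries shown in the result.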
+
+### Converting element names to lowercase
+
+```console
+POST _ingest/pipeline/_simulate
+{
+  "pipeline": {
+    "processors": [
+      {
+        "xml": {
+          "field": "xml_content",
+          "to_lower": true
+        }
+      }
+    ]
+  },
+  "docs": [
+    {
+      "_source": {
+        "xml_content": "<Catalog><Book><Author>William H. Gaddis</Author><Title>The Recognitions</Title></Book></Catalog>"
+      }
+    }
+  ]
+}
+```
+
+Result:
+
+```console-result
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_version": "-3",
+        "_source": {
+          "xml_content": "<Catalog><Book><Author>William H. Gaddis</Author><Title>The Recognitions</Title></Book></Catalog>",
+          "catalog": {
+            "book": {
+              "author": "William H. Gaddis",
+              "title": "The Recognitions"
+            }
+          }
+        },
+        "_ingest": {
+          "timestamp": "2019-03-11T21:54:37.909224Z"
+        }
+      }
+    }
+  ]
+}
+```
+
+### Handling XML attributes
+
+XML attributes are included as properties in the resulting JSON object alongside element content:
+
+```console
+POST _ingest/pipeline/_simulate
+{
+  "pipeline": {
+    "processors": [
+      {
+        "xml": {
+          "field": "xml_content"
+        }
+      }
+    ]
+  },
+  "docs": [
+    {
+      "_source": {
+        "xml_content": "<catalog version=\"1.0\"><book id=\"123\" isbn=\"978-0-684-80335-9\"><title lang=\"en\">The Recognitions</title><author nationality=\"American\">William H. Gaddis</author></book></catalog>"
+      }
+    }
+  ]
+}
+```
+
+Result:
+
+```console-result
+{
+  "docs": [
+    {
+      "doc": {
+        "_index": "_index",
+        "_id": "_id",
+        "_version": "-3",
+        "_source": {
+          "xml_content": "<catalog version=\"1.0\"><book id=\"123\" isbn=\"978-0-684-80335-9\"><title lang=\"en\">The Recognitions</title><author nationality=\"American\">William H. Gaddis</author></book></catalog>",
+          "catalog": {
+            "version": "1.0",
+            "book": {
+              "id": "123",
+              "isbn": "978-0-684-80335-9",
+              "title": {
+                "lang": "en",
+                "#text": "The Recognitions"
+              },
+              "author": {
+                "nationality": "American",
+                "#text": "William H. Gaddis"
+              }
+            }
+          }
+        },
+        "_ingest": {
+          "timestamp": "2019-03-11T21:54:37.909224Z"
+        }
+      }
+    }
+  ]
+}
+```
+
+## XML features
+
+The XML processor supports:
+
+- **Elements with text content**: Converted to key-value pairs where the element name is the key and text content is the value
+- **Nested elements**: Converted to nested JSON objects
+- **Empty elements**: Converted to `null` values (can be filtered with `ignore_empty_value`)
+- **Repeated elements**: Converted to arrays when multiple elements with the same name exist at the same level
+- **XML attributes**: Included as properties in the JSON object alongside element content. When an element has both attributes and text content, the text is stored under a special `#text` key
+- **Mixed content**: Elements with both text and child elements include text under a special `#text` key while attributes and child elements become object properties
+- **Namespaces**: Local names are used, namespace prefixes are ignored
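
The mapping rules listed under `XML features` can be illustrated with a small converter built on the JDK's StAX API (`javax.xml.stream`, from the `java.xml` module). This is a sketch under stated assumptions and not the `XmlProcessor` implementation from this commit. The names `XmlToMapSketch`, `parse`, `readElement`, and `addChild` are hypothetical. Trimming text content, disabling DTDs and external entities, and wrapping parse errors in `IllegalArgumentException` are choices made for the sketch. It does not implement `to_lower` or `ignore_empty_value`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch of the documented XML-to-object mapping rules. */
final class XmlToMapSketch {

    static Map<String, Object> parse(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: input is untrusted, so DTDs and external entities are disabled.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        Map<String, Object> root = new LinkedHashMap<>();
                        String name = reader.getLocalName(); // local name, prefix ignored
                        root.put(name, readElement(reader));
                        return root;
                    }
                }
                throw new IllegalArgumentException("no root element");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
    }

    /** Expects the reader on START_ELEMENT and consumes up to the matching END_ELEMENT. */
    private static Object readElement(XMLStreamReader reader) throws XMLStreamException {
        Map<String, Object> fields = new LinkedHashMap<>();
        // Attributes become properties of the element's object.
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            fields.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String childName = reader.getLocalName(); // read before recursing
                addChild(fields, childName, readElement(reader));
            } else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                text.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                break;
            }
        }
        String trimmed = text.toString().trim();
        if (fields.isEmpty()) {
            // Text-only element becomes a plain value; an empty element becomes null.
            return trimmed.isEmpty() ? null : trimmed;
        }
        if (!trimmed.isEmpty()) {
            fields.put("#text", trimmed); // text alongside attributes or children
        }
        return fields;
    }

    /** Stores a child value, switching to a list when the name repeats at this level. */
    private static void addChild(Map<String, Object> fields, String name, Object value) {
        if (!fields.containsKey(name)) {
            fields.put(name, value);
            return;
        }
        Object existing = fields.get(name);
        if (existing instanceof List) {
            @SuppressWarnings("unchecked")
            List<Object> list = (List<Object>) existing;
            list.add(value);
        } else {
            List<Object> list = new ArrayList<>();
            list.add(existing);
            list.add(value);
            fields.put(name, list);
        }
    }
}
```

For the request in `Handling XML attributes`, calling `XmlToMapSketch.parse` on the `xml_content` string should yield the same nesting as the documented result: attributes as sibling keys and text under `#text`. Because the reader is pull-based, the sketch never builds a DOM tree, which is the usual reason to reach for a streaming parser.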

modules/ingest-common/src/main/java/module-info.java

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@
     requires org.apache.logging.log4j;
     requires org.apache.lucene.analysis.common;
     requires org.jruby.joni;
+
+    requires java.xml;

     exports org.elasticsearch.ingest.common; // for painless

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java

Lines changed: 2 additions & 1 deletion
@@ -74,7 +74,8 @@ public Map<String, Processor.Factory> getProcessors(Processor.Parameters paramet
             entry(TrimProcessor.TYPE, new TrimProcessor.Factory()),
             entry(URLDecodeProcessor.TYPE, new URLDecodeProcessor.Factory()),
             entry(UppercaseProcessor.TYPE, new UppercaseProcessor.Factory()),
-            entry(UriPartsProcessor.TYPE, new UriPartsProcessor.Factory())
+            entry(UriPartsProcessor.TYPE, new UriPartsProcessor.Factory()),
+            entry(XmlProcessor.TYPE, new XmlProcessor.Factory())
         );
     }
