Fix support for writing to nested field partition #2204

geruh · 2025-07-12T18:54:54Z

Closes #2095

Rationale for this change

Currently, writes only work when the partition source column is a top-level field. This PR adds support for partitioning on primitive fields inside a struct type, using dot notation to resolve the partition source against the nested structure.

Are these changes tested?

Yes, I added tests and also tested a write against the problem described in the issue above.

> aws s3 ls s3://myBucket/demo1/nestedPartition/data/
                           PRE timestamp_hour=2024-01-15-10/
                           PRE timestamp_hour=2024-01-15-11/
                           PRE timestamp_hour=2024-04-15-11/
                           PRE timestamp_hour=2024-05-15-10/

Are there any user-facing changes?

No, but data can now be added to tables that are partitioned by a source column nested in a struct.

kevinjqliu

Generally LGTM! Good catch!

Here's the corresponding spec text on partition columns:

The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.

maybe we should also gate on map/list.
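The spec rule quoted above (primitive source, struct nesting allowed, map/list nesting rejected) could be gated roughly as follows. This is only a sketch: `Field`, its `kind` strings, and `check_partition_source` are hypothetical stand-ins for illustration, not PyIceberg types or functions.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    # Hypothetical, simplified schema node (not a PyIceberg class).
    name: str
    kind: str  # "primitive", "struct", "list" or "map"
    children: list["Field"] = field(default_factory=list)


def check_partition_source(schema: list[Field], path: tuple[str, ...]) -> None:
    """Raise ValueError unless path ends at a primitive reached only through structs."""
    if not path:
        raise ValueError("empty path")
    fields = schema
    for depth, name in enumerate(path):
        match = next((f for f in fields if f.name == name), None)
        if match is None:
            raise ValueError(f"unknown field: {'.'.join(path[: depth + 1])}")
        if depth == len(path) - 1:
            # The final segment must be a primitive.
            if match.kind != "primitive":
                raise ValueError(f"{name} is a {match.kind}, not a primitive")
            return
        # Every intermediate segment must be a struct, never a map or list.
        if match.kind != "struct":
            raise ValueError(f"cannot partition through {match.kind} field {name}")
        fields = match.children


schema = [
    Field("id", "primitive"),
    Field("event", "struct", [Field("ts", "primitive")]),
    Field("tags", "list", [Field("element", "primitive")]),
]
check_partition_source(schema, ("event", "ts"))  # accepted: nested in a struct
```

With this shape, `("tags", "element")` is rejected because the path passes through a list, and `("event",)` is rejected because a struct is not a primitive.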

tests/io/test_pyarrow.py

kevinjqliu · 2025-07-15T03:38:19Z

pyiceberg/io/pyarrow.py

+    Returns:
+        The unnested field as a PyArrow Array
+    """
+    if "." not in field_path:


this is fine since we use "." to implicitly reference nested fields

iceberg-python/pyiceberg/expressions/parser.py

Lines 100 to 102 in f475b8e

@column.set_parse_action
def _(result: ParseResults) -> Reference:
    return Reference(".".join(result.column))

iceberg-python/pyiceberg/table/update/schema.py

Lines 167 to 171 in f475b8e

Because "." may be interpreted as a column path separator or may be used in field names, it
is not allowed to add nested column by passing in a string. To add to nested structures or
to add fields with names that contain "." use a tuple instead to indicate the path.

If type is a nested type, its field IDs are reassigned when added to the existing schema.
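To make the ambiguity in that docstring concrete, here is a small sketch of the "string or tuple" convention it describes. The helper name `to_path` and its error message are hypothetical and only illustrate the rule; they are not taken from `schema.py`.

```python
from typing import Union

Path = Union[str, tuple[str, ...]]


def to_path(path: Path) -> tuple[str, ...]:
    """Normalise a column path, rejecting dotted strings as ambiguous."""
    if isinstance(path, str):
        # "a.b" could mean field "b" inside struct "a" or a field literally named "a.b".
        if "." in path:
            raise ValueError(f"ambiguous path {path!r}: pass a tuple instead")
        return (path,)
    return tuple(path)


print(to_path("id"))                 # ('id',)
print(to_path(("location", "lat")))  # ('location', 'lat'): nested field
print(to_path(("some.id",)))         # ('some.id',): one field whose name contains a dot
```

A tuple states the intent explicitly, which is why the two dotted cases above can coexist without guessing.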

kevinjqliu · 2025-07-15T03:39:49Z

pyiceberg/io/pyarrow.py

+    field_array = arrow_table[path_parts[0]]
+    field_array = pc.struct_field(field_array, path_parts[1:])


interesting, so we first select the top-level struct column from the pa.Table and then navigate to the nested field by passing the remaining path names to struct_field

maybe add this as a comment since it was not obvious
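For readers without PyArrow at hand, the two-step lookup in the diff above can be modelled in plain Python. This is an analogue under stated assumptions, not the PR's code: the table is a dict of column name to row values, structs are dicts, `get_field` is a hypothetical name, and checking for a literal column name first is my assumption about how dotted field names could be handled.

```python
from typing import Any

Table = dict[str, list[Any]]  # column name -> one value per row


def get_field(table: Table, field_path: str) -> list[Any]:
    # Assumption: a literal top-level column wins, so names containing "." still resolve.
    if field_path in table:
        return table[field_path]

    # Step 1: take the struct column, like arrow_table[path_parts[0]] in the diff.
    first, *rest = field_path.split(".")
    values = table[first]
    # Step 2: walk the remaining names, like pc.struct_field(field_array, path_parts[1:]).
    for name in rest:
        # A null struct yields a null child value.
        values = [None if row is None else row[name] for row in values]
    return values


table: Table = {
    "id": [1, 2, 3],
    "event": [{"meta": {"hour": 10}}, {"meta": {"hour": 11}}, None],
    "some.id": [7, 8, 9],
}
print(get_field(table, "event.meta.hour"))  # [10, 11, None]
print(get_field(table, "some.id"))          # [7, 8, 9]
```

The point of the split is that only the first segment names a table column; everything after it is navigation inside the struct.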

kevinjqliu

LGTM!


Add support for partitioning by nested columns

4acf26b

geruh changed the title ~~Add support for nested field partitioning~~ Fix support for writing to nested field partition Jul 12, 2025

remove type ignore

cac298b

kevinjqliu approved these changes Jul 15, 2025


allow field names with dots and add test

665d543

kevinjqliu approved these changes Jul 15, 2025


kevinjqliu merged commit ad8263b into apache:main Jul 15, 2025
10 checks passed
