Skip to content

Conversation

geruh
Copy link
Contributor

@geruh geruh commented Jul 12, 2025

Closes #2095

Rationale for this change

Currently, we can only partition on top-level valid field types, but this PR adds support for partitioning on primitive fields in a struct type using dot notation to determine the partitions against the nested structure.

Are these changes tested?

Yes added tests and tested a write against the problem in the above issue.

> aws s3 ls s3://myBucket/demo1/nestedPartition/data/
                           PRE timestamp_hour=2024-01-15-10/
                           PRE timestamp_hour=2024-01-15-11/
                           PRE timestamp_hour=2024-04-15-11/
                           PRE timestamp_hour=2024-05-15-10/

Are there any user-facing changes?

no but now can add data to tables that are partitioned by a source column that's in a struct

@geruh geruh changed the title Add support for nested field partitioning Fix support for writing to nested field partition Jul 12, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM! Good catch!

Heres the corresponding spec on partition columns

The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.

maybe we should also gate on map/list.

Returns:
The unnested field as a PyArrow Array
"""
if "." not in field_path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine since we use "." to implicitly reference nested fields

@column.set_parse_action
def _(result: ParseResults) -> Reference:
return Reference(".".join(result.column))

Because "." may be interpreted as a column path separator or may be used in field names, it
is not allowed to add nested column by passing in a string. To add to nested structures or
to add fields with names that contain "." use a tuple instead to indicate the path.
If type is a nested type, its field IDs are reassigned when added to the existing schema.

Comment on lines 2786 to 2787
field_array = arrow_table[path_parts[0]]
field_array = pc.struct_field(field_array, path_parts[1:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so we first reference the struct field in the pa.Table and then navigate to it using struct_field's indices by name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add this as a comment since it was not obvious

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kevinjqliu kevinjqliu merged commit ad8263b into apache:main Jul 15, 2025
10 checks passed
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
Closes apache#2095 

# Rationale for this change
Currently, we can only partition on top-level valid field types, but
this PR adds support for partitioning on primitive fields in a struct
type using dot notation to determine the partitions against the nested
structure.

# Are these changes tested?
Yes added tests and tested a write against the problem in the above
issue.

```
> aws s3 ls s3://myBucket/demo1/nestedPartition/data/
                           PRE timestamp_hour=2024-01-15-10/
                           PRE timestamp_hour=2024-01-15-11/
                           PRE timestamp_hour=2024-04-15-11/
                           PRE timestamp_hour=2024-05-15-10/
```

# Are there any user-facing changes?

no but now can add data to tables that are partitioned by a source
column that's in a struct
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Can't partition by nested field

2 participants