-
Notifications
You must be signed in to change notification settings - Fork 378
Fix support for writing to nested field partition #2204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM! Good catch!
Heres the corresponding spec on partition columns
The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
maybe we should also gate on map/list.
pyiceberg/io/pyarrow.py
Outdated
Returns: | ||
The unnested field as a PyArrow Array | ||
""" | ||
if "." not in field_path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine since we use "."
to implicitly reference nested fields
iceberg-python/pyiceberg/expressions/parser.py
Lines 100 to 102 in f475b8e
@column.set_parse_action | |
def _(result: ParseResults) -> Reference: | |
return Reference(".".join(result.column)) |
iceberg-python/pyiceberg/table/update/schema.py
Lines 167 to 171 in f475b8e
Because "." may be interpreted as a column path separator or may be used in field names, it | |
is not allowed to add nested column by passing in a string. To add to nested structures or | |
to add fields with names that contain "." use a tuple instead to indicate the path. | |
If type is a nested type, its field IDs are reassigned when added to the existing schema. |
pyiceberg/io/pyarrow.py
Outdated
field_array = arrow_table[path_parts[0]] | ||
field_array = pc.struct_field(field_array, path_parts[1:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, so we first reference the struct field in the pa.Table and then navigate to it using struct_field
's indices by name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add this as a comment since it was not obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Closes apache#2095 # Rationale for this change Currently, we can only partition on top-level valid field types, but this PR adds support for partitioning on primitive fields in a struct type using dot notation to determine the partitions against the nested structure. # Are these changes tested? Yes added tests and tested a write against the problem in the above issue. ``` > aws s3 ls s3://myBucket/demo1/nestedPartition/data/ PRE timestamp_hour=2024-01-15-10/ PRE timestamp_hour=2024-01-15-11/ PRE timestamp_hour=2024-04-15-11/ PRE timestamp_hour=2024-05-15-10/ ``` # Are there any user-facing changes? no but now can add data to tables that are partitioned by a source column that's in a struct
Closes #2095
Rationale for this change
Currently, we can only partition on top-level valid field types, but this PR adds support for partitioning on primitive fields in a struct type using dot notation to determine the partitions against the nested structure.
Are these changes tested?
Yes added tests and tested a write against the problem in the above issue.
Are there any user-facing changes?
no but now can add data to tables that are partitioned by a source column that's in a struct