-
Notifications
You must be signed in to change notification settings - Fork 378
Convert _get_column_projection_values
to use Field-IDs
#2293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
) | ||
|
||
# Translate column names | ||
translated_expr = translate_column_names(bound_expr, file_schema, case_sensitive=True, projected_field_values={2: None}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A partition can be null as well 👍
with transaction.update_snapshot().overwrite() as update: | ||
update.append_data_file(unpartitioned_file) | ||
|
||
schema = pa.schema([("other_field", pa.string()), ("partition_id", pa.int64())]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IdentityTransform
returns the same type as the one in the table:
schema = Schema(
NestedField(1, "other_field", StringType(), required=False), NestedField(2, "partition_id", IntegerType(), required=False)
)
# In the order described by the "Column Projection" section of the Iceberg spec: | ||
# https://iceberg.apache.org/spec/#column-projection | ||
# Evaluate column projection first if it exists | ||
if projected_field_value := self.projected_field_values.get(field.name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing wallrus :=
here since the projected value can also be None
if partition_value := accessors[partition_field.field_id].get(file.partition): | ||
projected_missing_fields[field_id] = partition_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to fail here, rather than suppress the Error. In the case of an IndexError
I think your table is corrupt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually a bug. It always used the current spec, while we should use the spec that it was written with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672
now im worried about all the other places we use .spec()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for cleaning this up
should we also address this comment in the PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I think we can remove test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default
test in tests/expressions/test_visitors.py
since you added a new test in this PR
#2029 (comment)
if partition_value := accessors[partition_field.field_id].get(file.partition): | ||
projected_missing_fields[field_id] = partition_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672
now im worried about all the other places we use .spec()
Yes we can do that here as well, however, I think it is an extreme edge-case. This is only related to when the partition-spec is set, but the actual fields in the struct are not set. This potentially can cause data-incorrectness in V1 tables (because there the partition struct doesn't have field-IDs). I'm pretty sure that any Iceberg client won't produce such data, but it could be in the case of a poorly written add-files script that ignores partitioning. Edit: Let's defer that to another PR 👍 |
Thanks @kevinjqliu for the review 🙌 |
I've created the follow-up PR here: #2295 |
# Rationale for this change This is a refactor of the `_get_column_projection_values` to rely on field-IDs rather than names. Field IDs will never change, while partitions and column names can be updated in a tables' lifetime. # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. -->
Rationale for this change
This is a refactor of the
_get_column_projection_values
to rely on field-IDs rather than names. Field IDs will never change, while partitions and column names can be updated in a tables' lifetime.Are these changes tested?
Are there any user-facing changes?