Skip to content

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Aug 6, 2025

Rationale for this change

This is a refactor of the _get_column_projection_values to rely on field-IDs rather than names. Field IDs will never change, while partitions and column names can be updated in a tables' lifetime.

Are these changes tested?

Are there any user-facing changes?

)

# Translate column names
translated_expr = translate_column_names(bound_expr, file_schema, case_sensitive=True, projected_field_values={2: None})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A partition can be null as well 👍

with transaction.update_snapshot().overwrite() as update:
update.append_data_file(unpartitioned_file)

schema = pa.schema([("other_field", pa.string()), ("partition_id", pa.int64())])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IdentityTransform returns the same type as the one in the table:

    schema = Schema(
        NestedField(1, "other_field", StringType(), required=False), NestedField(2, "partition_id", IntegerType(), required=False)
    )

# In the order described by the "Column Projection" section of the Iceberg spec:
# https://iceberg.apache.org/spec/#column-projection
# Evaluate column projection first if it exists
if projected_field_value := self.projected_field_values.get(field.name):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing wallrus := here since the projected value can also be None

Comment on lines +1417 to +1418
if partition_value := accessors[partition_field.field_id].get(file.partition):
projected_missing_fields[field_id] = partition_value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to fail here, rather than suppress the Error. In the case of an IndexError I think your table is corrupt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a bug. It always used the current spec, while we should use the spec that it was written with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672

now im worried about all the other places we use .spec()

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for cleaning this up

@kevinjqliu
Copy link
Contributor

should we also address this comment in the PR?
#2029 (comment)

@kevinjqliu kevinjqliu added this to the PyIceberg 0.10.0 milestone Aug 6, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I think we can remove test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default test in tests/expressions/test_visitors.py since you added a new test in this PR
#2029 (comment)

Comment on lines +1417 to +1418
if partition_value := accessors[partition_field.field_id].get(file.partition):
projected_missing_fields[field_id] = partition_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, i see its fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672

now im worried about all the other places we use .spec()

@Fokko
Copy link
Contributor Author

Fokko commented Aug 7, 2025

Should we also address this comment in the PR? #2029 (comment)

Yes we can do that here as well, however, I think it is an extreme edge-case. This is only related to when the partition-spec is set, but the actual fields in the struct are not set. This potentially can cause data-incorrectness in V1 tables (because there the partition struct doesn't have field-IDs). I'm pretty sure that any Iceberg client won't produce such data, but it could be in the case of a poorly written add-files script that ignores partitioning.

Edit: Let's defer that to another PR 👍

@Fokko Fokko merged commit 8042d82 into apache:main Aug 7, 2025
10 checks passed
@Fokko Fokko deleted the fd-field-ids branch August 7, 2025 06:18
@Fokko
Copy link
Contributor Author

Fokko commented Aug 7, 2025

Thanks @kevinjqliu for the review 🙌

@Fokko
Copy link
Contributor Author

Fokko commented Aug 7, 2025

I've created the follow-up PR here: #2295

gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
# Rationale for this change

This is a refactor of the `_get_column_projection_values` to rely on
field-IDs rather than names. Field IDs will never change, while
partitions and column names can be updated in a tables' lifetime.

# Are these changes tested?

# Are there any user-facing changes?

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants