Convert `_get_column_projection_values` to use Field-IDs #2293

Fokko · 2025-08-06T18:26:43Z

Rationale for this change

This is a refactor of `_get_column_projection_values` to rely on field IDs rather than names. Field IDs never change, while partitions and column names can be updated during a table's lifetime.
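To illustrate the rationale, here is a minimal sketch of why a name-keyed lookup can silently lose a projected value after a rename, while an ID-keyed lookup does not. The `Field` class and both helper functions are hypothetical stand-ins written for this example; they are not PyIceberg APIs and not the code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not the PyIceberg class)."""

    field_id: int
    name: str


# Schema at the time the data file was written.
old_schema = [Field(1, "other_field"), Field(2, "partition_id")]
# The same table later, after field 2 was renamed. The field ID is unchanged.
new_schema = [Field(1, "other_field"), Field(2, "part_id")]

# The same projected value, keyed by name and keyed by field ID.
values_by_name = {"partition_id": 42}
values_by_id = {2: 42}


def project_by_name(schema, values):
    """Resolve projected values through column names (fragile under renames)."""
    return {f.field_id: values[f.name] for f in schema if f.name in values}


def project_by_id(schema, values):
    """Resolve projected values through field IDs (stable under renames)."""
    return {f.field_id: values[f.field_id] for f in schema if f.field_id in values}


by_name_after_rename = project_by_name(new_schema, values_by_name)
by_id_after_rename = project_by_id(new_schema, values_by_id)
```

With the renamed schema, the name-based lookup finds nothing for field 2, whereas the ID-based lookup still resolves it.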

Are these changes tested?

Are there any user-facing changes?

Fokko · 2025-08-06T18:27:18Z

tests/expressions/test_visitors.py

+    )
+
+    # Translate column names
+    translated_expr = translate_column_names(bound_expr, file_schema, case_sensitive=True, projected_field_values={2: None})


A partition can be null as well 👍

Fokko · 2025-08-06T18:28:28Z

tests/io/test_pyarrow.py

        with transaction.update_snapshot().overwrite() as update:
            update.append_data_file(unpartitioned_file)

-    schema = pa.schema([("other_field", pa.string()), ("partition_id", pa.int64())])


The IdentityTransform returns the same type as the one in the table:

schema = Schema(
    NestedField(1, "other_field", StringType(), required=False),
    NestedField(2, "partition_id", IntegerType(), required=False),
)

Fokko · 2025-08-06T18:29:26Z

pyiceberg/expressions/visitors.py

            # In the order described by the "Column Projection" section of the Iceberg spec:
            # https://iceberg.apache.org/spec/#column-projection
            # Evaluate column projection first if it exists
-            if projected_field_value := self.projected_field_values.get(field.name):


Removing the walrus `:=` here, since the projected value can also be None
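A small sketch of the pitfall, in plain Python: combining `:=` with `dict.get()` tests truthiness, so a projected value of `None` (or `0`) is indistinguishable from a missing key. The dictionary contents and both function names below are made up for illustration, and the membership test is only one possible replacement; the exact replacement used in the PR is not shown in this excerpt.

```python
# Hypothetical projected values keyed by field ID; field 2 is a null partition.
projected_field_values = {2: None, 3: 0, 4: "a"}


def lookup_with_walrus(field_id):
    # Truthiness check: None, 0, "" and a missing key all take the same branch.
    if value := projected_field_values.get(field_id):
        return ("projected", value)
    return ("not projected", None)


def lookup_with_membership(field_id):
    # Membership check: a key that maps to None still counts as projected.
    if field_id in projected_field_values:
        return ("projected", projected_field_values[field_id])
    return ("not projected", None)
```

Here `lookup_with_walrus(2)` falls through to the "not projected" branch even though field 2 is present, while `lookup_with_membership(2)` reports it as projected with a value of `None`.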

Fokko · 2025-08-06T18:32:24Z

pyiceberg/io/pyarrow.py

+                if partition_value := accessors[partition_field.field_id].get(file.partition):
+                    projected_missing_fields[field_id] = partition_value


I think it makes sense to fail here, rather than suppress the error. In the case of an IndexError, I think your table is corrupt.

This was actually a bug. It always used the current spec, while we should use the spec that it was written with.
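The two points above can be sketched with toy types: resolve the partition spec through the data file's own spec ID rather than the table's current spec, and let an out-of-range access raise instead of swallowing it. `PartitionField`, `DataFile`, `specs`, and `projected_values` are hypothetical names for this illustration only and do not mirror the PyIceberg implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical partition field: source column ID and tuple position."""

    source_id: int
    position: int


@dataclass(frozen=True)
class DataFile:
    """Hypothetical data file: the spec it was written with and its values."""

    spec_id: int
    partition: tuple


# All specs the table has ever had, keyed by spec ID (assumed layout).
specs = {
    0: [PartitionField(source_id=2, position=0)],
    1: [PartitionField(source_id=3, position=0)],
}
current_spec_id = 1


def projected_values(file, specs):
    """Map source field IDs to partition values using the file's own spec."""
    spec = specs[file.spec_id]  # KeyError if the spec is unknown
    # No try/except: an IndexError here means the partition tuple does not
    # match the spec, which points at corrupt metadata rather than a normal case.
    return {pf.source_id: file.partition[pf.position] for pf in spec}


old_file = DataFile(spec_id=0, partition=(7,))
correct = projected_values(old_file, specs)

# Using the current spec instead attributes the value to the wrong field.
wrong = {pf.source_id: old_file.partition[pf.position] for pf in specs[current_spec_id]}
```

In this sketch the file written under spec 0 resolves to field 2, while reading it through the current spec would assign the same value to field 3.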

kevinjqliu

LGTM! Thanks for cleaning this up

kevinjqliu · 2025-08-06T20:33:37Z

should we also address this comment in the PR?
#2029 (comment)

kevinjqliu

LGTM!

I think we can remove test_translate_column_names_missing_column_projected_field_fallbacks_to_initial_default test in tests/expressions/test_visitors.py since you added a new test in this PR
#2029 (comment)

kevinjqliu · 2025-08-07T00:38:39Z

pyiceberg/io/pyarrow.py

+                if partition_value := accessors[partition_field.field_id].get(file.partition):
+                    projected_missing_fields[field_id] = partition_value


Good catch, I see it's fixed in https://github.com/apache/iceberg-python/pull/2293/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1672

Now I'm worried about all the other places we use .spec()

Fokko · 2025-08-07T06:15:04Z

Should we also address this comment in the PR? #2029 (comment)

Yes, we can do that here as well; however, I think it is an extreme edge case. It only arises when the partition spec is set but the actual fields in the partition struct are not. This can potentially cause incorrect data in V1 tables (because there the partition struct doesn't have field IDs). I'm pretty sure that no Iceberg client would produce such data, but it could happen with a poorly written add-files script that ignores partitioning.

Edit: Let's defer that to another PR 👍

Fokko · 2025-08-07T06:18:13Z

Thanks @kevinjqliu for the review 🙌

Fokko · 2025-08-07T14:16:43Z

I've created the follow-up PR here: #2295

Convert _get_column_projection_values to use Field-IDs

10a7c41

Fokko commented Aug 6, 2025

kevinjqliu approved these changes Aug 6, 2025

kevinjqliu added this to the PyIceberg 0.10.0 milestone Aug 6, 2025

Fokko added 2 commits August 7, 2025 00:23

Fix the CI

b29d80e

Use the spec that it was written with

cbd297f

kevinjqliu approved these changes Aug 7, 2025

Fokko merged commit 8042d82 into apache:main Aug 7, 2025
10 checks passed

Fokko deleted the fd-field-ids branch August 7, 2025 06:18
