[BUG] scan.filter after reading it as an Arrow table throws #2179

Open
@smaheshwar-pltr

Description

Apache Iceberg version

Most recent PyIceberg

Please describe the bug 🐞

See here and the description below for a failing test.

    table = catalog.load_table(f"default.{identifier}")

    scan = table.scan()
    # assert len(scan.to_arrow()) > 0

    scan = scan.filter("ts >= '2023-03-05T00:00:00+00:00'")
    assert len(scan.to_arrow()) > 0

As written, this code works, but uncommenting the first assertion makes the subsequent filter call raise a TypeError. The stack trace points directly at the cause:

pyiceberg/table/__init__.py:1710: in filter
    return self.update(row_filter=And(self.row_filter, _parse_row_filter(expr)))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyiceberg.table.DataScan object at 0x11c065cd0>
overrides = {'row_filter': GreaterThanOrEqual(term=Reference(name='ts'), literal=literal('2023-03-05T00:00:00+00:00'))}

    def update(self: S, **overrides: Any) -> S:
        """Create a copy of this table scan with updated fields."""
>       return type(self)(**{**self.__dict__, **overrides})
E       TypeError: TableScan.__init__() got an unexpected keyword argument 'partition_filters'

pyiceberg/table/__init__.py:1694: TypeError

DataScan has a cached_property, partition_filters (see here). Once computed, its value is stored in the instance's self.__dict__, so the update method passes it to the constructor as an unexpected keyword argument:

    def update(self: S, **overrides: Any) -> S:
        """Create a copy of this table scan with updated fields."""
        return type(self)(**{**self.__dict__, **overrides})

This happens once the cached property has been accessed, i.e. once plan_files has been called on the scan (essentially, once the scan has been read).
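
The mechanism can be reproduced without PyIceberg. The sketch below uses a hypothetical stand-in class `Scan` (not the real DataScan) with the same `update` pattern, and shows that the call only fails after the cached property has been read. The `safe_update` method is an assumption about one possible workaround, namely forwarding only the names that `__init__` accepts; it is not necessarily the fix the project would adopt.

```python
from functools import cached_property
import inspect
from typing import Any


class Scan:
    """Hypothetical stand-in for a table scan; not PyIceberg's actual class."""

    def __init__(self, row_filter: str = "true") -> None:
        self.row_filter = row_filter

    @cached_property
    def partition_filters(self) -> dict:
        # The first access stores the result in self.__dict__["partition_filters"].
        return {"spec": self.row_filter}

    def update(self, **overrides: Any) -> "Scan":
        # Same pattern as in the issue: every entry in __dict__ goes to __init__.
        return type(self)(**{**self.__dict__, **overrides})

    def safe_update(self, **overrides: Any) -> "Scan":
        # Possible workaround (an assumption): keep only names __init__ accepts.
        accepted = set(inspect.signature(type(self).__init__).parameters) - {"self"}
        current = {k: v for k, v in self.__dict__.items() if k in accepted}
        return type(self)(**{**current, **overrides})


scan = Scan()
scan.update(row_filter="a")        # works: the cache has not been populated yet
_ = scan.partition_filters         # populates the cache inside __dict__
cached_keys = sorted(scan.__dict__)

try:
    scan.update(row_filter="b")    # now __init__ receives partition_filters too
    update_failed = False
except TypeError:
    update_failed = True

fixed = scan.safe_update(row_filter="b")
```

Filtering against the constructor signature drops the cached value, so the copy simply recomputes it on first access. Any real fix would need to be checked against the actual TableScan and DataScan constructors.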

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
