Add support for Bodo DataFrame #2167

Open
wants to merge 10 commits into
base: main

@ehsantn ehsantn commented Jul 3, 2025

Rationale for this change

Adds support for the Bodo DataFrame library, a drop-in replacement for Pandas that accelerates and scales Python code automatically by applying query, compiler, and HPC optimizations.

Are these changes tested?

Added an integration test.

Are there any user-facing changes?

Adds the Table.to_bodo() function. Example code:

df = table.to_bodo()  # equivalent to `bodo.pandas.read_iceberg_table(table)`
df = df[df["trip_distance"] >= 10.0]
df = df[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]
print(df)

@ehsantn ehsantn marked this pull request as ready for review July 4, 2025 02:22
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM!

@kevinjqliu
Contributor

@ehsantn looks like there's an issue with the dependency resolution:

poetry install --all-extras
Installing dependencies from lock file

The current project's supported Python range (3.9.23) is not compatible with some of the required packages Python requirement:
  - numpy requires Python >=3.10, so it will not be installable for Python 3.9.23

Because no versions of pandas match >=1.0.0,<2.3.0 || >2.3.0,<3.0.0
 and pandas (2.3.0) depends on numpy (>=1.22.4), pandas (>=1.0.0,<3.0.0) requires numpy (>=1.22.4).
Because numpy (2.2.6) requires Python >=3.10
 and no versions of numpy match >=1.22.4,<2.2.6 || >2.2.6, numpy is forbidden.
Thus, pandas is forbidden.
So, because pyiceberg depends on pandas (>=1.0.0,<3.0.0), version solving failed.

  * Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For numpy, a possible solution would be to set the `python` property to "<empty>"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers

make: *** [Makefile:63: install-dependencies] Error 1
Error: Process completed with exit code 2.
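
The solver failure above comes down to one check: the interpreter version in the log (3.9.23) does not meet the minimum that numpy 2.2.6 declares (3.10). A minimal sketch of that comparison, using only the standard library. `parse_version` and `meets_minimum` are hypothetical helpers written for illustration, not part of Poetry or PyIceberg, and they handle only plain dotted numeric versions:

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '3.9.23' into (3, 9, 23)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(python_version: str, minimum: str) -> bool:
    """Return True if python_version >= minimum, compared numerically."""
    have = parse_version(python_version)
    need = parse_version(minimum)
    # Pad the shorter tuple with zeros so that "3.10" equals "3.10.0".
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Comparing integers matters: as plain strings, "3.9" sorts after "3.10".
print(meets_minimum("3.9.23", "3.10"))  # False
print(meets_minimum("3.10.4", "3.10"))  # True
```

Real resolvers evaluate full version specifiers and environment markers, so this only illustrates why a 3.9 interpreter is rejected by a `>=3.10` requirement.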

@ehsantn
Author

ehsantn commented Jul 5, 2025

@kevinjqliu Thanks for the quick review. Bodo requires Python >=3.10 because some dependency packages dropped Python 3.9 support quite a while ago. Do all optional dependencies of PyIceberg need to support Python 3.9? What do you recommend?
I can try to package Bodo for 3.9 with some workarounds if there is no other solution.

https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table (Python 3.9 support was dropped on Apr 05, 2024).
https://pypi.org/project/numba (3.10+)

optional = true
python-versions = ">=3.9"
Contributor

it looks like numpy for 3.9 is removed here

Author

Numpy is here: https://github.com/ehsantn/iceberg-python/blob/f36265b8cdc9fa3056ad28784467579514cfc850/poetry.lock#L3424
I'm working on packaging Bodo for Python 3.9 to avoid these Poetry issues: bodo-ai/Bodo#637
Our team will just miss the structural pattern matching and better type hints of Python 3.10 :)

Contributor

Sounds good. 3.9 is also EOL in a few months (2025-10):
https://devguide.python.org/versions/#supported-versions

@ehsantn ehsantn changed the title from "Added support for Bodo DataFrame" to "Add support for Bodo DataFrame" Jul 7, 2025
@ehsantn
Author

ehsantn commented Jul 7, 2025

Ok, updated Bodo to support Python 3.9 so this should work now. Tried poetry install --all-extras in an Ubuntu environment and it works.

@kevinjqliu
Contributor

@ehsantn i merged a few library upgrades. could you rebase this PR?

@ehsantn
Author

ehsantn commented Jul 8, 2025

> @ehsantn i merged a few library upgrades. could you rebase this PR?

Done. I assume the CI failure is not related to this PR? The test doesn't seem relevant.

@kevinjqliu
Contributor

maybe try rebasing on main again, idk what CI is doing

@ehsantn
Author

ehsantn commented Jul 8, 2025

Done. No idea why the CI fails here in unrelated tests. Maybe some dependency got upgraded in the lock file?

@ehsantn
Author

ehsantn commented Jul 8, 2025

Looks like the Bodo test is actually failing (logs are not very visible). Seems to be just a configuration issue. Working on a fix.
https://github.com/apache/iceberg-python/actions/runs/16132597025/job/45522782328#step:5:2479

@ehsantn
Author

ehsantn commented Jul 8, 2025

All tests are passing locally for me now. Hopefully the CI works too.

Comment on lines +457 to +458
under_20_arrow = version.parse(pyarrow.__version__) < version.parse("20.0.0")

Contributor

we should find another way to make these tests pass instead of branching on pyarrow version

Author

@ehsantn ehsantn Jul 8, 2025

Any ideas? Maybe use a range of "safe" values instead of a single file size value? I'd be happy to open another PR if there is more work for this.

Bodo is currently pinned to Arrow 19 since the current release version of PyIceberg supports up to Arrow 19. Bodo uses Arrow C++, which currently requires pinning to a single version for pip wheels to work (conda-forge builds against the 4 latest Arrow versions in this case, but pip doesn't support this yet). It'd be great if PyIceberg didn't set an upper version bound for Arrow, if possible.
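
One way to read the "range of safe values" idea above is a tolerance check in place of an exact file-size assertion, so the test no longer needs to branch on the pyarrow version. A minimal sketch, assuming a hypothetical helper `size_within_tolerance` and made-up sizes; neither comes from the PyIceberg test suite, and the 5% tolerance is an arbitrary illustration:

```python
def size_within_tolerance(actual: int, expected: int, rel_tol: float = 0.05) -> bool:
    """Return True if actual lies within +/- rel_tol of expected."""
    lower = expected * (1 - rel_tol)
    upper = expected * (1 + rel_tol)
    return lower <= actual <= upper


# Hypothetical numbers: a file size that drifts slightly between
# Arrow versions still passes, while a large change is flagged.
assert size_within_tolerance(actual=1020, expected=1000)
assert not size_within_tolerance(actual=1100, expected=1000)
```

The trade-off is that a wide tolerance can hide real regressions, so the bound would need to be chosen from sizes actually observed across the supported Arrow versions.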

@ehsantn
Author

ehsantn commented Jul 11, 2025

@kevinjqliu please advise on next steps. This PR looks ready to merge to me, and the flakiness of existing unit tests should be addressed separately (I'd be happy to contribute if it's a priority).
