Skip to content

Conversation

jayceslesar
Copy link
Contributor

@jayceslesar jayceslesar commented Jun 21, 2025

Should help with #2130 and #2132

Modifies Table.add_files to explicitly use inspect.data_files and also parallelize inspect._files

I didn't see anywhere else where looping over manifest entries was parallelized, so seems better to parallelize across manifests than within.

No changes here but should be faster.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding the optimization

Co-authored-by: Kevin Liu <[email protected]>
@kevinjqliu
Copy link
Contributor

@jayceslesar looks like the linter isnt happy, could you run make lint?

@jayceslesar
Copy link
Contributor Author

jayceslesar commented Jun 22, 2025

I wonder if we can eventually replace this portion with rust -- unsure what that implementation supports currently but would be fun to try it out https://github.com/apache/iceberg-rust/tree/main/crates/iceberg/src/inspect

@kevinjqliu
Copy link
Contributor

yep thats the goal eventually. I also think the biggest perf bottleneck is reading the avro metadata files. Once we move that part to rust (apache/iceberg-rust#1328), Im curious to see how much the entire process improves.

@kevinjqliu kevinjqliu changed the title perf: table.add_files and inspect.files perf: optimize table.add_files and inspect.files Jun 22, 2025
@kevinjqliu kevinjqliu merged commit e937f6a into apache:main Jun 22, 2025
10 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @jayceslesar

btw i change the PR description so we dont close the 2 issues automatically

amitgilad3 pushed a commit to amitgilad3/iceberg-python that referenced this pull request Jul 7, 2025
Should help with apache#2130 and apache#2132

Modifies `Table.add_files` to explicitly use `inspect.data_files` and
also parallelize `inspect._files`

I didn't see anywhere else where looping over manifest entries was
parallelized, so seems better to parallelize across manifests than
within.

No changes here but should be faster.

---------

Co-authored-by: Kevin Liu <[email protected]>
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
Should help with apache#2130 and apache#2132

Modifies `Table.add_files` to explicitly use `inspect.data_files` and
also parallelize `inspect._files`

I didn't see anywhere else where looping over manifest entries was
parallelized, so seems better to parallelize across manifests than
within.

No changes here but should be faster.

---------

Co-authored-by: Kevin Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants