parallelize add_files
#1717
Conversation
Great work!!! I think you need to run `make lint`.
tbl.add_files(file_paths=file_paths)

@pytest.mark.integration
I'm not sure this test is any different from the existing tests (but maybe I'm mistaken). The test verifies that the files are correctly added to the table, by checking that the manifest counts match the expected values, but it does not directly assert that the file additions are processed concurrently.
We already have tests that add multiple files.
Added a better test that asserts that files are processed within different threads.
Let me know if this is sufficient.
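A minimal sketch of that idea (not the exact test in this PR): wrap the per-file conversion so each call records the thread it ran on, then assert that more than one worker thread was observed. The `tbl` and `file_paths` fixtures are assumed, and where exactly to patch depends on how `add_files` resolves the helper.

```python
import threading
from unittest import mock

import pytest

import pyiceberg.io.pyarrow as pyarrow_io


@pytest.mark.integration
def test_add_files_uses_multiple_threads(tbl, file_paths):  # assumed fixtures
    seen_threads = set()
    original = pyarrow_io.parquet_file_to_data_file

    def recording_wrapper(io, table_metadata, file_path):
        # Record which thread performed this file's conversion.
        seen_threads.add(threading.get_ident())
        return original(io, table_metadata, file_path)

    with mock.patch.object(pyarrow_io, "parquet_file_to_data_file", side_effect=recording_wrapper):
        tbl.add_files(file_paths=file_paths)

    # With enough input files, the executor should fan work out to more than one thread.
    assert len(seen_threads) > 1
```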
Tests in CI/CD were failing. It should be fixed now.
Really cool way to check that files are processed within different threads!!
pyiceberg/io/pyarrow.py
f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" | ||
) | ||
schema = table_metadata.schema() | ||
_check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) |
We're converting the schema multiple times. Since we're optimizing for performance now, we probably want to store this in a variable to reduce GIL congestion.
I assume you mean `schema = table_metadata.schema()`; I made that change.
If instead you meant `_check_pyarrow_schema_compatible`, I don't think it is possible to compute that only once, since it relies on `parquet_metadata`, which is technically unique to each parquet file.
`to_arrow_schema()` is called on lines 2483 and 2479. It would be good to call this just once.
Got it. Should be fixed now.
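For illustration, a hedged sketch of the resulting shape (the function name `_convert_files` and the surrounding code are illustrative assumptions; `_check_pyarrow_schema_compatible` is the private helper from the diff above): the table-side Iceberg schema is derived once, the Arrow schema is read into a local variable and reused instead of calling `to_arrow_schema()` twice, while the compatibility check itself stays per file because each file's Parquet footer is different.

```python
import pyarrow.parquet as pq

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible
from pyiceberg.table.metadata import TableMetadata


def _convert_files(io: FileIO, table_metadata: TableMetadata, file_paths: list[str]) -> None:
    # Derived once: the table-side Iceberg schema is the same for every file.
    schema = table_metadata.schema()

    for file_path in file_paths:
        # Each file has its own Parquet footer, so this genuinely runs per file.
        with io.new_input(file_path).open() as input_stream:
            parquet_metadata = pq.read_metadata(input_stream)

        # Convert to an Arrow schema once per file and reuse the variable.
        arrow_schema = parquet_metadata.schema.to_arrow_schema()
        _check_pyarrow_schema_compatible(schema, arrow_schema)
```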
equality_ids=None,
key_metadata=None,
**statistics.to_serialized_dict(),
def parquet_file_to_data_file(io: FileIO, table_metadata: TableMetadata, file_path: str) -> DataFile:
We're changing a public API here. We either have to go through the deprecation cycle, or add a new method `parquet_file_to_data_file` that's used by `parquet_files_to_data_files`.
I re-added `parquet_files_to_data_files` (plural); internally it uses `parquet_file_to_data_file` (singular).
Another alternative is to make `parquet_file_to_data_file` private. What do you think?
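For reference, a hedged sketch of that backwards-compatible shape (the signature is assumed from the previous public API): the plural helper stays public and simply delegates each path to the new single-file function.

```python
from typing import Iterator

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import parquet_file_to_data_file
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata


def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    """Keep the existing plural API, delegating each file to the singular helper."""
    for file_path in file_paths:
        yield parquet_file_to_data_file(io, table_metadata, file_path)
```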
If we want to make `parquet_file_to_data_file` private, then we have to go through the deprecation cycle as well. I think we can leave it in for now.
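If that deprecation route were ever taken, a minimal standard-library sketch could look like the following (the private variant `_parquet_file_to_data_file` is hypothetical; in the codebase itself pyiceberg's own deprecation utilities would be the more idiomatic choice):

```python
import warnings

from pyiceberg.io import FileIO
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata


def parquet_file_to_data_file(io: FileIO, table_metadata: TableMetadata, file_path: str) -> DataFile:
    # Hypothetical: warn callers before the public helper is made private in a later release.
    warnings.warn(
        "parquet_file_to_data_file will become private in a future release; "
        "use parquet_files_to_data_files instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _parquet_file_to_data_file(io, table_metadata, file_path)  # hypothetical private variant
```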
- add a better test which checks that multiple threads are used during execution
- re-add `parquet_files_to_data_files` and let it use `parquet_file_to_data_file`
- move `schema = table_metadata.schema()` outside of the function it is used in
fix integration test to be more robust
Thanks @vtk9 for adding this, and thanks @amitgilad3 for the review 🚀
- `parquet_files_to_data_files` changed to `parquet_file_to_data_file`, which processes a single parquet file and returns a `DataFile`
- `_parquet_files_to_data_files` uses the internal `ExecutorFactory`

resolves apache#1335
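A hedged sketch of what "uses the internal `ExecutorFactory`" looks like in practice (the private helper name is taken from the description above; the exact submission pattern is an assumption):

```python
from typing import List

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import parquet_file_to_data_file
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata
from pyiceberg.utils.concurrent import ExecutorFactory


def _parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: List[str]) -> List[DataFile]:
    # Reuse pyiceberg's shared thread pool instead of creating a new one per call.
    executor = ExecutorFactory.get_or_create()
    futures = [
        executor.submit(parquet_file_to_data_file, io, table_metadata, file_path)
        for file_path in file_paths
    ]
    # Collect results in input order; any worker exception is re-raised here.
    return [future.result() for future in futures]
```

Reusing the shared executor keeps the thread-pool sizing governed by pyiceberg's existing executor configuration rather than introducing a separate pool for `add_files`.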