parallelize add_files
#1717
Conversation
Great work!!! I think you need to run `make lint`.
tbl.add_files(file_paths=file_paths)

@pytest.mark.integration
I'm not sure this test is any different from the existing tests (but maybe I'm mistaken). The test verifies that the files are correctly added to the table, by checking that the manifest counts match the expected values, but it does not directly assert that the file additions are processed concurrently.
We already have tests that add multiple files.
Added a better test that asserts that files are processed within different threads.
Let me know if this is sufficient.
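A minimal sketch of that idea (not the exact test in this PR): wrap the per-file conversion so each call records the thread it ran on, then assert that more than one worker thread was observed. The `tbl` and `file_paths` fixtures are assumed, and where exactly to patch depends on how `add_files` resolves the helper.

```python
import threading
from unittest import mock

import pytest

import pyiceberg.io.pyarrow as pyarrow_io


@pytest.mark.integration
def test_add_files_uses_multiple_threads(tbl, file_paths):  # assumed fixtures
    seen_threads = set()
    original = pyarrow_io.parquet_file_to_data_file

    def recording_wrapper(io, table_metadata, file_path):
        # Record which thread performed this file's conversion.
        seen_threads.add(threading.get_ident())
        return original(io, table_metadata, file_path)

    with mock.patch.object(pyarrow_io, "parquet_file_to_data_file", side_effect=recording_wrapper):
        tbl.add_files(file_paths=file_paths)

    # With enough input files, the executor should fan work out to more than one thread.
    assert len(seen_threads) > 1
```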
Tests in CI/CD were failing. It should be fixed now.
Really cool way to check that files are processed within different threads!!
pyiceberg/io/pyarrow.py
f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" | ||
) | ||
schema = table_metadata.schema() | ||
_check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) |
We're converting the schema multiple times. Since we're optimizing for performance now, we probably want to store this in a variable to reduce GIL congestion.
I assume you mean `schema = table_metadata.schema()`; I made that change.
If instead you meant `_check_pyarrow_schema_compatible`, I don't think it is possible to compute that only once, since it relies on `parquet_metadata`, which is technically unique to each parquet file.
`to_arrow_schema()` is called on lines 2483 and 2479. It would be good to call this just once.
Got it. Should be fixed now.
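For illustration, a hedged sketch of the resulting shape (the function name `_convert_files` and the surrounding code are illustrative assumptions; `_check_pyarrow_schema_compatible` is the private helper from the diff above): the table-side Iceberg schema is derived once, the Arrow schema is read into a local variable and reused instead of calling `to_arrow_schema()` twice, while the compatibility check itself stays per file because each file's Parquet footer is different.

```python
import pyarrow.parquet as pq

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible
from pyiceberg.table.metadata import TableMetadata


def _convert_files(io: FileIO, table_metadata: TableMetadata, file_paths: list[str]) -> None:
    # Derived once: the table-side Iceberg schema is the same for every file.
    schema = table_metadata.schema()

    for file_path in file_paths:
        # Each file has its own Parquet footer, so this genuinely runs per file.
        with io.new_input(file_path).open() as input_stream:
            parquet_metadata = pq.read_metadata(input_stream)

        # Convert to an Arrow schema once per file and reuse the variable.
        arrow_schema = parquet_metadata.schema.to_arrow_schema()
        _check_pyarrow_schema_compatible(schema, arrow_schema)
```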
equality_ids=None,
key_metadata=None,
**statistics.to_serialized_dict(),
def parquet_file_to_data_file(io: FileIO, table_metadata: TableMetadata, file_path: str) -> DataFile:
We're changing a public API here. We either have to go through the deprecation cycle, or add a new method `parquet_file_to_data_file` that's used by `parquet_files_to_data_files`.
I re-added `parquet_files_to_data_files` (plural); internally it uses `parquet_file_to_data_file` (singular).
Another alternative is to make `parquet_file_to_data_file` private. What do you think?
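For reference, a hedged sketch of that backwards-compatible shape (the signature is assumed from the previous public API): the plural helper stays public and simply delegates each path to the new single-file function.

```python
from typing import Iterator

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import parquet_file_to_data_file
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata


def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
    """Keep the existing plural API, delegating each file to the singular helper."""
    for file_path in file_paths:
        yield parquet_file_to_data_file(io, table_metadata, file_path)
```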
If we want to make `parquet_file_to_data_file` private, then we have to go through the deprecation cycle as well. I think we can leave it in for now.
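If that deprecation route were ever taken, a minimal standard-library sketch could look like the following (the private variant `_parquet_file_to_data_file` is hypothetical; in the codebase itself pyiceberg's own deprecation utilities would be the more idiomatic choice):

```python
import warnings

from pyiceberg.io import FileIO
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata


def parquet_file_to_data_file(io: FileIO, table_metadata: TableMetadata, file_path: str) -> DataFile:
    # Hypothetical: warn callers before the public helper is made private in a later release.
    warnings.warn(
        "parquet_file_to_data_file will become private in a future release; "
        "use parquet_files_to_data_files instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _parquet_file_to_data_file(io, table_metadata, file_path)  # hypothetical private variant
```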
- add a better test which checks that multiple threads are used during execution
- re-add `parquet_files_to_data_files` and let it use `parquet_file_to_data_file`
- move `schema = table_metadata.schema()` outside of the function it is used in
fix integration test to be more robust
Thanks @vtk9 for adding this, and thanks @amitgilad3 for the review 🚀
- `parquet_files_to_data_files` changed to `parquet_file_to_data_file`, which processes a single parquet file and returns a `DataFile`
- `_parquet_files_to_data_files` uses the internal `ExecutorFactory`

resolves apache#1335
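A hedged sketch of what "uses the internal `ExecutorFactory`" looks like in practice (the private helper name is taken from the description above; the exact submission pattern is an assumption):

```python
from typing import List

from pyiceberg.io import FileIO
from pyiceberg.io.pyarrow import parquet_file_to_data_file
from pyiceberg.manifest import DataFile
from pyiceberg.table.metadata import TableMetadata
from pyiceberg.utils.concurrent import ExecutorFactory


def _parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: List[str]) -> List[DataFile]:
    # Reuse pyiceberg's shared thread pool instead of creating a new one per call.
    executor = ExecutorFactory.get_or_create()
    futures = [
        executor.submit(parquet_file_to_data_file, io, table_metadata, file_path)
        for file_path in file_paths
    ]
    # Collect results in input order; any worker exception is re-raised here.
    return [future.result() for future in futures]
```

Reusing the shared executor keeps the thread-pool sizing governed by pyiceberg's existing executor configuration rather than introducing a separate pool for `add_files`.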