-
Notifications
You must be signed in to change notification settings - Fork 50
refactor: remove file-handle from object store GET operations #382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
5a107b5
to
afc7f6c
Compare
afc7f6c
to
21531f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry about this change -- in theory it is sound, but given how widely used object_store
is I am not sure I have any great idea of how it is used
If we want to proceed with removing the GetResultPayload
I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.
At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile
Do we have any way to measure this? object_store
has 175 reverse dependencies on crates.io (which is significantly more than just parquet and direct dependents): https://crates.io/crates/object_store/reverse_dependencies?page=13
FYI @Xuanwo |
I don't think there's a way to make this non-breaking and TBH I think we have to face one reality: the |
I am not convinced that this particular API needs to be changed and I think we need a better justification than that the API is complicated. The original idea as I understood it, for this API, is that special casing File operations could make significant performance improvements so the more complicated API was justified.
For example, looking at just the DataFusion code, I see quite a few different code paths for https://github.com/search?q=repo%3Aapache%2Fdatafusion%20GetResultPayload%3A%3AFile&type=code I am not claiming that all those cases couldn't be made as performant using a Stream, but I also haven't done the diligence. |
Specifically, I am concerned that this PR removes functionality that there is no alternative to implement with the |
What I see in DataFusion is a bunch of complicated code that is there to deal with the "file" case, not a true optimization for it. |
Which issue does this PR close?
Rationale for this change
The file handle has several drawbacks as it does NOT specify the following properties:
read_at
would be required)At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like
sendfile
.All object store users and integrating libraries potentially have to deal with streams AND file handles.
What changes are included in this PR?
Remove
GetResultPayload::File
(and the entire enum).Are there any user-facing changes?
Files are no longer exposed.