Skip to content

refactor: remove file-handle from object store GET operations #382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

crepererum
Copy link
Contributor

@crepererum crepererum commented May 26, 2025

Which issue does this PR close?

Rationale for this change

The file handle has several drawbacks as it does NOT specify the following properties:

  • mode: Is the file opened read-only or does it also have write access (which would be bad)?
  • seek position: At which position is the current file "read head"? Are users expected to seek to the file start if they want to read the entire file?
  • direct IO: Is the file opened using direct IO? (in which case the user would have to read in aligned blocks)
  • exclusive access: Does the API user have exclusive access to the said file handle (and hence can use seek-based operations) or not (in which case only non-seeking operations like read_at would be required)

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile.

All object store users and integrating libraries potentially have to deal with streams AND file handles.

What changes are included in this PR?

Remove GetResultPayload::File (and the entire enum).

Are there any user-facing changes?

Files are no longer exposed.

@crepererum crepererum added the next-major-release the PR has API changes and it waiting on the next major version label May 26, 2025
@crepererum crepererum force-pushed the crepererum/issue18 branch 2 times, most recently from 5a107b5 to afc7f6c Compare May 26, 2025 14:16
@crepererum crepererum force-pushed the crepererum/issue18 branch from afc7f6c to 21531f3 Compare June 16, 2025 08:17
@crepererum crepererum marked this pull request as ready for review June 16, 2025 08:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about this change -- in theory it is sound, but given how widely used object_store is I am not sure I have any great idea of how it is used

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile

Do we have any way to measure this? object_store has 175 reverse dependencies on crates.io (which is significantly more than just parquet and direct dependents): https://crates.io/crates/object_store/reverse_dependencies?page=13

Screenshot 2025-06-17 at 6 35 33 AM

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

FYI @Xuanwo

@crepererum
Copy link
Contributor Author

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

I don't think there's a way to make this non-breaking and TBH I think we have to face one reality: the object_store interface is messy at the moment and to make it 1.0 proof, there will be breaking changes; and not all of them can be phased.

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

I don't think there's a way to make this non-breaking and TBH I think we have to face one reality: the object_store interface is messy at the moment and to make it 1.0 proof, there will be breaking changes; and not all of them can be phased.

I am not convinced that this particular API needs to be changed and I think we need a better justification than that the API is complicated. The original idea as I understood it, for this API, is that special casing File operations could make significant performance improvements so the more complicated API was justified.

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile.

For example, looking at just the DataFusion code, I see quite a few different code paths for File vs Stream:

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20GetResultPayload%3A%3AFile&type=code

I am not claiming that all those cases couldn't be made as performant using a Stream, but I also haven't done the diligence.

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

Specifically, I am concerned that this PR removes functionality that there is no alternative to implement with the object_store crate, rather than just making the API easier to work with

@crepererum
Copy link
Contributor Author

What I see in DataFusion is a bunch of complicated code that is there to deal with the "file" case, not a true optimization for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove file-handle from object store GET operations
2 participants