feat: allow default scheme and netloc for schemeless path #2291
Conversation
pyiceberg/io/pyarrow.py (outdated diff):
```python
return uri.scheme, uri.netloc, uri.path

# Load defaults from environment
default_scheme = os.getenv("DEFAULT_SCHEME", "file")
```
Can we use the central config instead of direct usage of env variables? See pyiceberg/utils/config.py.
This would enable configuration via file OR env variables, which is how most other configs are documented and exposed to catalog construction.
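For illustration, a minimal sketch of what that could look like, reading the defaults from the FileIO properties that the catalog passes in rather than from os.getenv(). The property keys here are assumptions, not the final names:

```python
from typing import Dict, Tuple
from urllib.parse import urlparse

# Hypothetical property keys, for illustration only.
DEFAULT_SCHEME = "default-scheme"
DEFAULT_NETLOC = "default-netloc"


def parse_location(location: str, properties: Dict[str, str]) -> Tuple[str, str, str]:
    """Split a location into (scheme, netloc, path), falling back to configured defaults."""
    uri = urlparse(location)
    scheme = uri.scheme or properties.get(DEFAULT_SCHEME, "file")
    netloc = uri.netloc or properties.get(DEFAULT_NETLOC, "")
    return scheme, netloc, uri.path
```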
Thanks Tom! Overall excited for this PR, because I've had to hack around this with an overridden PyArrowFileIO, e.g.:

```python
from urllib.parse import urlparse

from typing_extensions import override

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFile, PyArrowFileIO


class HDFSFileIO(PyArrowFileIO):
    """Simple PyArrowFileIO that defaults paths without a scheme to HDFS."""

    @override
    def new_input(self, location: str) -> PyArrowFile:
        """Fix paths without a scheme to use HDFS."""
        if not urlparse(location).scheme and location.startswith("/"):
            hdfs_host = self.properties.get(HDFS_HOST, "localhost")
            hdfs_port = self.properties.get(HDFS_PORT, "9000")
            location = f"hdfs://{hdfs_host}:{hdfs_port}{location}"
        return super().new_input(location)
```
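For context, hypothetical usage of that workaround (host, port, and path are placeholders):

```python
from pyiceberg.io import HDFS_HOST, HDFS_PORT

io = HDFSFileIO(properties={HDFS_HOST: "namenode.example.com", HDFS_PORT: "9000"})

# The schemeless path is rewritten to
# hdfs://namenode.example.com:9000/warehouse/db/tbl/metadata/v1.metadata.json
infile = io.new_input("/warehouse/db/tbl/metadata/v1.metadata.json")
```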
Thanks for the PR! I added a comment about passing the properties and structuring the HDFS-specific code.
pyiceberg/io/pyarrow.py (outdated diff):
```python
# Apply logic
scheme = uri.scheme or default_scheme
netloc = uri.netloc or default_netloc

if scheme in ("hdfs", "viewfs"):
    return scheme, netloc, uri.path
else:
    # For non-HDFS URIs, include netloc in the path if present
    path = uri.path if uri.scheme else os.path.abspath(location)
    if netloc and not path.startswith(netloc):
        path = f"{netloc}{path}"
    return scheme, netloc, path
```
I actually really want to get rid of this `if scheme in ("hdfs", "viewfs")` logic here.
Is there a way to refactor these changes down into `_initialize_hdfs_fs`, so we can keep the HDFS logic in the same place?
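For reference, a rough sketch of what `_initialize_hdfs_fs` does today, paraphrased rather than quoted from the source:

```python
from typing import Any, Dict, Optional

from pyarrow.fs import FileSystem, HadoopFileSystem

from pyiceberg.io import HDFS_HOST, HDFS_KERB_TICKET, HDFS_PORT, HDFS_USER


def _initialize_hdfs_fs(self, scheme: str, netloc: Optional[str]) -> FileSystem:
    # A netloc parsed from the location takes precedence over configured properties.
    if netloc:
        return HadoopFileSystem.from_uri(f"{scheme}://{netloc}")
    hdfs_kwargs: Dict[str, Any] = {}
    if host := self.properties.get(HDFS_HOST):
        hdfs_kwargs["host"] = host
    if port := self.properties.get(HDFS_PORT):
        hdfs_kwargs["port"] = int(port)
    if user := self.properties.get(HDFS_USER):
        hdfs_kwargs["user"] = user
    if kerb_ticket := self.properties.get(HDFS_KERB_TICKET):
        hdfs_kwargs["kerb_ticket"] = kerb_ticket
    return HadoopFileSystem(**hdfs_kwargs)
```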
I don't see a nice way to do this, since the path used in the PyArrowFile is actually different in the different cases. I tried to see if we could use the same path with the netloc in it for HDFS, but it doesn't seem to work; see #2291 (comment).
That comment shows that setting the netloc on filesystem creation and also having it in the path (as is done for the other fs types) doesn't work for HDFS.
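Concretely, the failing pattern per that comment looks something like this (host and paths are placeholders):

```python
from pyarrow.fs import HadoopFileSystem

fs = HadoopFileSystem.from_uri("hdfs://namenode:9000")

# Works: plain absolute path, netloc only on the filesystem object.
fs.open_input_file("/warehouse/db/tbl/data/file.parquet")

# Fails: embedding the netloc in the path, the way PyArrowFileIO builds
# paths for the other filesystem types, is rejected for HDFS.
fs.open_input_file("namenode:9000/warehouse/db/tbl/data/file.parquet")
```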
What is the proper way to address an absolute path in HadoopFileSystem? Your example shows that … I'm trying to see what the requirements are here. I only found examples with … Also, I'm curious if …
It seems to not like the URI passed in.
This is actually a syntax problem; I also ran into it while testing. Instead of …
Summarizing my thoughts a bit. For context, here's the way a URL is parsed today when interacting with the PyArrow FileIO:
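Roughly, paraphrasing the current `PyArrowFileIO.parse_location` rather than quoting it:

```python
import os
from typing import Tuple
from urllib.parse import urlparse


def parse_location(location: str) -> Tuple[str, str, str]:
    uri = urlparse(location)
    if not uri.scheme:
        # Schemeless paths are assumed to live on the local filesystem.
        return "file", uri.netloc, os.path.abspath(location)
    elif uri.scheme in ("hdfs", "viewfs"):
        # HDFS keeps the netloc out of the path.
        return uri.scheme, uri.netloc, uri.path
    else:
        # Other filesystems (s3, gs, ...) keep the netloc (e.g. bucket) in the path.
        return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"
```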
What would be helpful here is to figure out the expected format for the location; the two candidates parse quite differently (see the illustration below).
If it includes the HDFS host name and port, i.e. something like hdfs://namenode:9000/path, …
If it does not include the HDFS host name and port, i.e. just /path, …
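For illustration (placeholder host and path), how the two formats parse:

```python
from urllib.parse import urlparse

# With host name and port, the netloc is populated by parsing:
urlparse("hdfs://namenode:9000/warehouse/tbl")
# -> ParseResult(scheme='hdfs', netloc='namenode:9000', path='/warehouse/tbl', ...)

# Without them, scheme and netloc are empty and must come from configuration:
urlparse("/warehouse/tbl")
# -> ParseResult(scheme='', netloc='', path='/warehouse/tbl', ...)
```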
Right now there are two places in the codebase where we can make changes to the FileIO behavior.
I would like to avoid adding more switch cases in …
Caught up with @mccormickt12 offline. Here's the issue right now: pyiceberg/io/pyarrow.py, lines 395 to 399 in a7f6c08.
We need to also support …
Force-pushed from 48f201a to b40e4e8.
CI errored with something unrelated, let me dig into it.
It's because importing bodo will monkey patch …
Rationale for this change: Relates to #2400. Disables importing bodo to allow us to make changes to PyArrowFileIO in #2291.
#2401 is merged. This should fix the CI issue. Could you pull that into this PR?
Force-pushed "…pecified in file path" from 8c4af81 to 35f4cdc.
@mccormickt12 could you run …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I took the liberty of fixing up the title and description.
Thanks for the contribution @mccormickt12, and thank you @Fokko @cbb330 for the review!
Rationale for this change
For HDFS it's common to get the scheme and netloc from config and have paths be just the URI path. This PR adds properties to configure DEFAULT_SCHEME and DEFAULT_NETLOC.
For example:
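A minimal sketch using a catalog; the property keys and all values are assumptions for illustration:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "uri": "thrift://metastore:9083",   # placeholder catalog URI
        "default-scheme": "hdfs",           # assumed property key
        "default-netloc": "namenode:9000",  # assumed property key
    },
)
```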
Or, if not using a catalog:
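Again with assumed property keys and placeholder values:

```python
from pyiceberg.io.pyarrow import PyArrowFileIO

io = PyArrowFileIO(properties={"default-scheme": "hdfs", "default-netloc": "namenode:9000"})

# A schemeless path now resolves to
# hdfs://namenode:9000/warehouse/db/tbl/metadata/v1.metadata.json
infile = io.new_input("/warehouse/db/tbl/metadata/v1.metadata.json")
```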
Previously, schemeless paths were assumed to be for the local filesystem only. This PR allows schemeless paths to be passed to the HDFS filesystem.
Are these changes tested?
Tested in a test environment at LinkedIn and with unit tests.
Are there any user-facing changes?
No user-facing changes by default. If you set these properties and a file path doesn't have a scheme/netloc, it'll use the specified defaults.