
Conversation

@mccormickt12 (Contributor) commented Aug 6, 2025

Rationale for this change

For HDFS it's common to get the scheme and netloc from config and to have paths be just the path portion of the URI. This PR adds properties to configure DEFAULT_SCHEME and DEFAULT_NETLOC.

For example:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", {
    ...,
    "DEFAULT_SCHEME": "hdfs",
    "DEFAULT_NETLOC": "ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000",
})

or, if not using a catalog:

static_table = StaticTable.from_metadata(
    "/warehouse/wh/nyc.db/taxis/metadata/00002-6ea51ce3-62aa-4197-9cf8-43d07c3440ca.metadata.json",
    properties={
        'DEFAULT_SCHEME': 'hdfs',
        'DEFAULT_NETLOC': 'ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000',
    },
)

Previously, schemeless paths were assumed to be for the local filesystem only. This PR allows schemeless paths to be passed to the HDFS filesystem.

Are these changes tested?

Tested in a test environment at LinkedIn and with unit tests.

Are there any user-facing changes?

No user-facing changes by default. If you add these properties and a file path doesn't have a scheme/netloc, it will use the specified defaults.
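Roughly, the intended resolution looks like the following minimal sketch (the property keys follow the examples above; the hostname is illustrative and the merged implementation may differ in details):

from urllib.parse import urlparse

def resolve(location: str, properties: dict):
    """Fall back to the configured defaults when the location has no scheme or netloc."""
    uri = urlparse(location)
    scheme = uri.scheme or properties.get("DEFAULT_SCHEME", "file")
    netloc = uri.netloc or properties.get("DEFAULT_NETLOC", "")
    return scheme, netloc, uri.path

# resolve("/warehouse/wh/nyc.db/taxis/metadata/v1.metadata.json",
#         {"DEFAULT_SCHEME": "hdfs", "DEFAULT_NETLOC": "namenode.example.com:9000"})
# -> ("hdfs", "namenode.example.com:9000", "/warehouse/wh/nyc.db/taxis/metadata/v1.metadata.json")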

            return uri.scheme, uri.netloc, uri.path

        # Load defaults from environment
        default_scheme = os.getenv("DEFAULT_SCHEME", "file")

Can we use the central config (pyiceberg/utils/config.py) instead of using env variables directly?

This would enable configuration via file or env variables, which is how most other configs are documented and exposed to catalog construction.

@cbb330 commented Aug 6, 2025

Thanks Tom! Overall I'm excited for this PR, because I've had to hack around this with an overridden PyArrowFileIO, e.g.:

from typing import override  # Python 3.12+; on older versions use typing_extensions.override
from urllib.parse import urlparse

from pyiceberg.io import HDFS_HOST, HDFS_PORT
from pyiceberg.io.pyarrow import PyArrowFile, PyArrowFileIO


class HDFSFileIO(PyArrowFileIO):
    """Simple PyArrowFileIO that defaults paths without a scheme to HDFS."""

    @override
    def new_input(self, location: str) -> PyArrowFile:
        """Fix paths without a scheme to use HDFS."""
        if not urlparse(location).scheme and location.startswith('/'):
            hdfs_host = self.properties.get(HDFS_HOST, 'localhost')
            hdfs_port = self.properties.get(HDFS_PORT, '9000')
            location = f'hdfs://{hdfs_host}:{hdfs_port}{location}'
        return super().new_input(location)
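For completeness, a sketch of how such an override could be wired in through catalog properties; "my_project.io.HDFSFileIO" is a hypothetical module path for the class above, while "py-io-impl", "hdfs.host", and "hdfs.port" are existing pyiceberg property keys:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", {
    "py-io-impl": "my_project.io.HDFSFileIO",  # hypothetical import path for the subclass above
    "hdfs.host": "namenode.example.com",       # assumed host, for illustration
    "hdfs.port": "9000",
})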

@kevinjqliu (Contributor) left a comment

Thanks for the PR! I added a comment about passing the properties and structuring the HDFS-specific code.

Comment on lines 404 to 413
        # Apply logic
        scheme = uri.scheme or default_scheme
        netloc = uri.netloc or default_netloc

        if scheme in ("hdfs", "viewfs"):
            return scheme, netloc, uri.path
        else:
            # was: return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"
            # For non-HDFS URIs, include netloc in the path if present
            path = uri.path if uri.scheme else os.path.abspath(location)
            if netloc and not path.startswith(netloc):
                path = f"{netloc}{path}"
            return scheme, netloc, path
@kevinjqliu (Contributor) commented Aug 7, 2025

I actually really want to get rid of this if {scheme} logic here.

Is there a way to refactor these changes down into _initialize_hdfs_fs, so we can keep the HDFS logic in the same place?

@mccormickt12 (Contributor, Author) replied

I don't see a nice way to do this, since the path used in the PyArrowFile is actually different in the different cases. I tried to see if we could use the same path with the netloc in it for HDFS, but it doesn't seem to work:
#2291 (comment)

@mccormickt12 (Contributor, Author)

This shows that setting the netloc on filesystem creation and also having it in the path (as is done for the other filesystem types) doesn't work for HDFS:

>>> hdfs = fs.HadoopFileSystem(host='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com', port=9000)
25/08/07 17:21:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> table_base = "/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> long_table_base = "ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567"
>>> hdfs.get_file_info(fs.FileSelector(table_base))
25/08/07 17:22:00 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
[<FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00000-3ec53886-ceae-46f2-a926-050afb7f95b9.metadata.json': type=FileType.File, size=2900>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json': type=FileType.File, size=4366>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data': type=FileType.Directory>, <FileInfo for '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata': type=FileType.Directory>]
>>> hdfs.get_file_info(fs.FileSelector(long_table_base))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 582, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] HDFS list directory failed. Detail: [errno 2] No such file or directory

@kevinjqliu (Contributor) commented Aug 10, 2025

What is the proper way to address an absolute path in HadoopFileSystem? Your example shows that /path/to/file/ works but {host}/path/to/file does not work. Should {host}/path/to/file also work?

I'm trying to see what the requirements are here. I only found examples with hdfs://

Also, I'm curious if HadoopFileSystem.from_uri will work for long_table_base

@mccormickt12 (Contributor, Author)

> What is the proper way to address an absolute path in HadoopFileSystem? Your example shows that /path/to/file/ works but {host}/path/to/file does not work. Should {host}/path/to/file also work?
>
> I'm trying to see what the requirements are here. I only found examples with hdfs://
>
> Also, I'm curious if HadoopFileSystem.from_uri will work for long_table_base

It seems to not like the URI passed in

>>> fs = HadoopFileSystem.from_uri("hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000")
>>> path = "/user/tmccormi"
>>> fs.get_file_info(fs.FileSelector(path, recursive=False))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow._hdfs.HadoopFileSystem' object has no attribute 'FileSelector'
>>> fs.get_file_info(path)
25/08/12 18:16:38 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
<FileInfo for '/user/tmccormi': type=FileType.Directory>
>>> print(fs.get_file_info(path))
<FileInfo for '/user/tmccormi': type=FileType.Directory>
>>> path = "hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi"
>>> print(fs.get_file_info(path))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: GetFileInfo must not be passed a URI, got: hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi

@kevinjqliu (Contributor)

> fs.get_file_info(fs.FileSelector(path, recursive=False))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow._hdfs.HadoopFileSystem' object has no attribute 'FileSelector'

This is actually a syntax problem; I also ran into it while testing. Instead of pyarrow.fs, which conflicts with the fs object we defined for the HadoopFileSystem, can you try:

from pyarrow.fs import FileSelector
fs.get_file_info(FileSelector(path, recursive=False))

@kevinjqliu (Contributor)

Summarizing my thoughts a bit.

For context, here's the way a location URL is parsed today when interacting with the PyArrow FileIO:

  1. A location string is passed to the FileIO. This is assumed to be an absolute path, as required by the Iceberg spec; however, "absolute" can mean different things depending on the file system.
  2. The location string is then parsed by PyArrowFileIO's parse_location function. This function returns the scheme, netloc, and path of the location string. The scheme and netloc are used to determine the FS implementation; the path is used by the FS implementation to access the file (see the short example after this list).
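A hedged illustration of those two steps, using the non-HDFS branch of the current code shown earlier in this thread (the exact return values depend on the implementation):

from pyiceberg.io.pyarrow import PyArrowFileIO

# scheme "s3" selects the S3 filesystem implementation;
# the returned path "bucket/wh/db/table/metadata/v1.metadata.json" is what that filesystem is asked to open.
scheme, netloc, path = PyArrowFileIO.parse_location("s3://bucket/wh/db/table/metadata/v1.metadata.json")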

What would be helpful here is to figure out the expected format of the location string for HadoopFileSystem.

If it includes the HDFS host name and port, i.e. hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi, parse_location will return /user/tmccormi as the path:

from urllib.parse import urlparse
urlparse("hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi")

ParseResult(scheme='hdfs', netloc='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000', path='/user/tmccormi', params='', query='', fragment='')

If it does not include the HDFS host name and port, i.e. /user/tmccormi, parse_location will still return /user/tmccormi as the path, but the scheme is empty:

urlparse("/user/tmccormi")

ParseResult(scheme='', netloc='', path='/user/tmccormi', params='', query='', fragment='')

Right now there are two places in the codebase where we can make changes to the FileIO behavior:

  1. parse_location
  2. Overload the Filesystem implementation

I would like to avoid adding more switch cases in parse_location for ease of maintenance. Perhaps, as an alternative, we can let the specific FileSystem implementation handle how it would like to parse the location (a rough sketch of that idea is below).
I would also like to keep the current way of configuring the FileIO with just properties, not properties and configs.
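Purely for illustration, one shape that alternative could take; the class and method names here are hypothetical and not pyiceberg's actual internals:

from urllib.parse import urlparse

from pyiceberg.io.pyarrow import PyArrowFileIO


class SchemeAwareFileIO(PyArrowFileIO):  # hypothetical subclass, for illustration only
    def parse_location_for(self, location: str):
        """Dispatch to a per-filesystem parser instead of a central switch."""
        uri = urlparse(location)
        scheme = uri.scheme or self.properties.get("DEFAULT_SCHEME", "file")
        if scheme in ("hdfs", "viewfs"):
            # HDFS wants a bare path, with host/port supplied at filesystem creation
            netloc = uri.netloc or self.properties.get("DEFAULT_NETLOC", "")
            return scheme, netloc, uri.path
        # other filesystems keep the netloc as part of the path
        return scheme, uri.netloc, f"{uri.netloc}{uri.path}"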

@kevinjqliu (Contributor)

Caught up with @mccormickt12 offline

Here's the issue right now:
HadoopFileSystem currently supports location strings with the hdfs:// prefix. However, /path/to/file is also a valid location string. Right now, parse_location assumes that any location string without a scheme is handled by the local filesystem, and we even add the file scheme:

def parse_location(location: str) -> Tuple[str, str, str]:
    """Return the path without the scheme."""
    uri = urlparse(location)
    if not uri.scheme:
        return "file", uri.netloc, os.path.abspath(location)

We also need to support /path/to/file as a valid path for the HadoopFileSystem 👍
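To make the requirement concrete, here is a hedged before/after (the "after" is what this PR adds; the property names follow the PR description and the exact plumbing may differ):

# today: a schemeless path is always treated as local
PyArrowFileIO.parse_location("/path/to/file")
# -> ("file", "", "/path/to/file")

# desired: with DEFAULT_SCHEME=hdfs and DEFAULT_NETLOC=namenode.example.com:9000 configured,
# the same schemeless path should resolve to the HDFS filesystem instead
# -> ("hdfs", "namenode.example.com:9000", "/path/to/file")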

@kevinjqliu (Contributor)

CI errored with something unrelated; let me dig into it:

ERROR tests/integration/test_writes/test_writes.py::test_nanosecond_support_on_catalog - TypeError: BodoPyArrowFileIO.parse_location() takes 1 positional argument but 2 were given

@kevinjqliu (Contributor) commented Aug 28, 2025

It's because importing bodo will monkey-patch PyArrowFileIO. I'm going to disable that test for now: #2400

@kevinjqliu mentioned this pull request Aug 28, 2025
@kevinjqliu added a commit that referenced this pull request Aug 28, 2025:
Rationale for this change: Relates to #2400. Disables importing bodo to allow us to make changes to PyArrowFileIO in #2291.
@kevinjqliu (Contributor)

#2401 is merged. This should fix the CI issue. Could you pull that into this PR?

@kevinjqliu (Contributor)

@mccormickt12 could you run make lint locally?
Could you also update the title and description of the PR? For title, wdyt of "feat: allow default scheme and netloc for schemeless path"?

@kevinjqliu changed the title from "Fix filesystem" to "feat: allow default scheme and netloc for schemeless path" Aug 29, 2025
@kevinjqliu (Contributor) left a comment
LGTM!

@kevinjqliu (Contributor)

I took the liberty of fixing up the title and description.

@kevinjqliu merged commit 19ba343 into apache:main Aug 29, 2025 (10 checks passed)
@kevinjqliu (Contributor)

Thanks for the contribution @mccormickt12 and thank you @Fokko @cbb330 for the review!
