Skip to content

.glob("**/filename") returns incorrect results #1380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mariosasko opened this issue Sep 28, 2023 · 7 comments · Fixed by #1382
Closed

.glob("**/filename") returns incorrect results #1380

mariosasko opened this issue Sep 28, 2023 · 7 comments · Fixed by #1382

Comments

@mariosasko
Copy link
Contributor

Reproducer (requires fsspec>=2023.9.0):

>>> import fsspec
>>> print(fsspec.filesystem("github", org="gpanders", repo="vim-medieval").glob("**/eval[-._ 0-9]*"))
['autoload/medieval.vim', 'doc/medieval.txt', 'ftplugin/markdown/medieval.vim']

**/ is converted to .* in the AbstactFileSystem.glob, so the result contains filenames that don't start with eval, making it incorrect.

This bug can only be reproduced on filesystems that don't resolve the glob path internally, as prefixing the path with <base_path>/ leads to /** taking precedence over **/ in the "glob to regex" conversion, which stops **/ from being replaced by .*.

Maybe the most robust solution would be to align the implementation with glob.glob (behaves as expected in this instance) to stop such inconsistencies from happening in the future. The non-posix behavior is error-prone, so dropping it should be fine, no? I'm happy to submit a PR if this sounds good to you (based on python/cpython#106703).

@martindurant
Copy link
Member

This should be fixed as explained above.

However, I was thinking of changing the URL scheme (or adding an alternative) as /user-or-org/repo/branch-tag-or-commit/path/file rather than using special characters (:, @) and/or kwargs. This is long-winded, but given dirFS, we can always make a sub filesystem at a specific repo/version without dealing with complex URL deconstruction. It would also fix this specific issue for githubFS.

@martindurant
Copy link
Member

I thought glob.glob doesn't support "**" at all?

@mariosasko
Copy link
Contributor Author

However, I was thinking of changing the URL scheme (or adding an alternative) as /user-or-org/repo/branch-tag-or-commit/path/file rather than using special characters (:, @) and/or kwargs. This is long-winded, but given dirFS, we can always make a sub filesystem at a specific repo/version without dealing with complex URL deconstruction. It would also fix this specific issue for githubFS.

I used githubFS for the sake of example - this bug was initially spotted here. Since there are many more such filesystems (I assume), fixing the conversion in glob should be the simplest solution.

I thought glob.glob doesn't support "**" at all?

It does with recursive=True :)

@martindurant
Copy link
Member

there are many more such filesystems

Actually, no. Most (all?) filesystems not requiring an initial "/" are the remote key-value stores (s3fs, gcsfs, abfs), where you are not permitted to do a glob/find at the top level, because that would imply considering all files in all buckets.

@mariosasko
Copy link
Contributor Author

Actually, no. Most (all?) filesystems not requiring an initial "/" are the remote key-value stores (s3fs, gcsfs, abfs), where you are not permitted to do a glob/find at the top level, because that would imply considering all files in all buckets.

But this is allowed for the archive filesystems such as zip FS, etc. Changing their schema is probably not a good idea.

@mariosasko
Copy link
Contributor Author

Opened a PR that relies on glob.translate to fix the issue and align glob with glob.glob.

@martindurant
Copy link
Member

Thanks, will look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants