-
Notifications
You must be signed in to change notification settings - Fork 65
Add git to filesystem b 301 #350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 1 commit
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
764ce42
support gitpythonfs fsspec implementation
deanja 64d1d9d
dynamically create test git repo
ff9228a
Use dlt from feature branch
fbbec91
improve readability of test parameters
229f7a5
use distinct filesystem instances in pipeline threads
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
support gitpythonfs fsspec implementation
- also introduces support for passing keyword argument to fsspec. gitpythonfs is the first use case where the kwargs are needed. - requires future version of dlt that supports gitpython and dynamic registration of fssepc implementations. - requires manual creation of a git repo for testing gitpythonfs - if kwargs are used to construct filesystem instances then FileItemDict must be instantiated with an existing instance of AbstractFileSystem. Instantiating FileItemDict with FileSystemCredentials will omit the fs kwargs and cause unexpected behaviour.
- Loading branch information
commit 764ce42e36cd2bd0f56ff14569f9e64659eb9b3e
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Git repo for testing | ||
|
||
The `./git`folder contains a bare repo used for running tests for the `filesystem` dlt Source. | ||
|
||
# Usage | ||
|
||
For example, use it to test a pipeline that reads files using the `gitpythonfs` fsspec implementation. | ||
|
||
The repo is not needed for regular use of dlt. | ||
|
||
For the tests to pass, use the tag (aka `ref`) called `unmodified-samples`. Using HEAD (the default) is intended to fail tests due to modifications such as a file not having the expected file name. It allows testing of the `ref` functionality of git-based fsspec implementations. | ||
|
||
Some features of the repo are intentionally different to the containing repo (eg verified-sources repo) to help prevent mistakenly testing against (or modifying!) the wrong repo: | ||
|
||
- The default branch is `cases-master` | ||
- the sample files root folder is `samples`, not `tests`. | ||
|
||
# Configuration | ||
|
||
When to configure (build?): | ||
- When setting up an environment - CI, local dev etc. (Unless it's now committed in verified-sources repo) | ||
- After modifying any content in `../samples folder` | ||
|
||
Ideally the repo will be created idempotently by a pytest fixture, `make` script or similar. Until then, these are the manual steps to idempotently create/recreate: | ||
|
||
1. Set working directory to `tests/filesystem/cases/git` | ||
2. Check the current folder contains only `.git`. eg `ls -a`. It's also ok if | ||
the current folder is empty. | ||
3. Delete `.git` and all subfolders. ie, delete the repo. `rm -rf` | ||
4. Make a fresh repo using: | ||
|
||
``` | ||
git init | ||
git checkout -b cases-master | ||
``` | ||
|
||
5. Copy in the folder `../../samples`. ie samples folder and all its contents. eg `cp -r ../../samples .` | ||
6. Put some object in the repo: | ||
|
||
``` | ||
git add --all | ||
git commit -m "add standard sample files for tests" | ||
git tag -a unmodified-samples -m "The sample test files with no modifications" | ||
git mv samples/sample.txt samples/sample_renamed.txt | ||
git commit -m "rename samples.txt to make tests fail" | ||
``` | ||
|
||
5. Delete all working files, except `.git`. eg with `rm -rf samples`. (ToDo: that's not officially not a bare repo. Use `git clone --bare path/to/repo` instead. Maybe we create the repo in a temp folder and then bare clone it into `cases/git` folder, discard the temp folder.) | ||
|
||
# Developing | ||
|
||
Note that at least one IDE - VSCode - does not recognise this repo in the Explorer and Source Control tabs. Likely because it is a repo inside another repo. | ||
|
||
If you are considering committing the repo to its containing repo - eg the verified-sources repo - consider the effects on the size of the containing repo: | ||
|
||
`du -sh ./.git/*` | ||
|
||
That's about the same as the samples folder itself. BUT consider that the largest file, `./.git/objects` might change often if the repo is regenerated. So it might be better to ensure the repo is gitignored (so meta!) and "build" it in each environment where needed. | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
from typing import Any, Dict, List | ||
|
||
|
||
def unpack_factory_args(factory_args: Dict[str, Any]) -> List[Any]: | ||
"""Unpacks filesystem factory arguments from pytest parameters.""" | ||
return [factory_args.get(k) for k in ("bucket_url", "kwargs")] |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is way better if we pass
credentials
here. it often happens that file items are processed in different threads and I'd avoid sharing a single instance of filesystem. if you need to pass additional kwargs I think we have support in credentials for those. so you can just setkwargs
to credentialsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok now I see: we should not pass
credentials
toFileItemDict
, we should passFilesystemConfiguration(protocol, credentials, kwargs=kwargs, client_kwargs=client_kwargs)
so that would require changing your branch to the core and in the
FileDict
implementation change:to use:
fsspec_from_config
looks like big improvement: we can handle parametrized filesystems now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great idea. I'll try it
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudolfix Note that fsspec instances are cacheable by default. (It is separate to the dir listings cache). Is that a problem , or will each thread get it's own cache?
It can be disabled by passing
skip_instance_cache=True
on instantiation. I can disable the cache in this PR, but it if needs testing I suggest a separate github issue.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have got this working. A question though, about the imports in verified-sources
sources/filesystem/__init__.py
I added the last two lines. But should I be importing them via
dlt.sources.<some_module>
instead?What is the reason for
dlt.sources.<some_module>
? Is it some kind of wrapper to expose code to dlt Sources?Also I no longer need to import
fsspec_filesystem
(line 2) . Do I remove it in dlt.sources.filesystem too?