UnicodeDecodeError in diff filename #1081

Closed
ishepard opened this issue Nov 9, 2020 · 3 comments · Fixed by #1082

Comments

@ishepard
Contributor

ishepard commented Nov 9, 2020

Hey @Byron,
just another UnicodeDecodeError here :) this time in the filename of a diff object. Originally opened here.

This is the culprit.
You can repro with this code:

from git import Repo
r = Repo("/tmp/brackets")
commit = r.commit("8b3ae041041dfeecd059c2b19c72e76223e501d3")

diff = commit.parents[0].diff(commit, create_patch=True)

The error:

  File ".../tmp.py", line 13, in <module>
    diff = commit.parents[0].diff(commit, create_patch=True)
  File ".../venv/lib/python3.8/site-packages/git/diff.py", line 145, in diff
    index = diff_method(self.repo, proc)
  File ".../venv/lib/python3.8/site-packages/git/diff.py", line 455, in _index_from_patch_format
    index.append(Diff(repo,
  File ".../venv/lib/python3.8/site-packages/git/diff.py", line 282, in __init__
    if submodule.path == a_rawpath.decode("utf-8"):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 54: invalid continuation byte

The error is on this line. We don't decode the filename correctly.
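
For illustration, the same failure can be triggered without cloning the repository. This is only a minimal sketch: the filename below is made up (it is not the path from the brackets commit), and it assumes the offending name was stored in a single-byte encoding such as Latin-1, where "é" is the byte 0xe9.

```python
# Hypothetical filename, encoded as Latin-1 so that "é" becomes the byte 0xe9.
raw_path = "caf\u00e9.txt".encode("latin-1")

try:
    # Strict UTF-8 decoding, as in a_rawpath.decode("utf-8")
    raw_path.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xe9 starts a multi-byte UTF-8 sequence, but the next byte (".")
    # is not a continuation byte, so decoding fails.
    error_message = str(exc)

print(error_message)
```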

Now, there are several possible fixes. Which one do you prefer?
I saw that the same file uses the 'replace' error handler in many places (like here). I can open a PR with this change if you think it's a good idea.
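
A rough sketch of what the 'replace' handler does, using a made-up filename rather than the one from the failing commit. It shows the general Python behaviour only, not the exact change made in diff.py.

```python
# Hypothetical non-UTF-8 filename bytes ("é" encoded as Latin-1).
raw_path = b"caf\xe9.txt"

# With errors="replace", undecodable bytes become U+FFFD (the Unicode
# replacement character) instead of raising UnicodeDecodeError.
decoded = raw_path.decode("utf-8", errors="replace")

print(decoded)
```

The call no longer raises, at the cost of losing the original byte.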

@Byron
Member

Byron commented Nov 9, 2020

I see, one more encoding issue indeed 😅.

A PR with a similar fix would be appreciated as at least it would be consistent. However, if there is a better way (without breaking backwards compatibility), that should certainly be considered as well. To my mind the right way to deal with this is to not actually assume any encoding but to work with bytes only. gitoxide is getting that right as you can imagine :D.

@ishepard
Contributor Author

Indeed, working with bytes instead of unicode strings would be best, especially now that we dropped py2 support. Also, git returns bytes, I think, so instead of decoding them, we could just give bytes back to the users.

Though I think this will break pretty much everything, tests and users (adding a 'b' in front of all strings of all tests might not be super complicated, but still... 😄 ). For now I will just add the "replace" flag in the decoding.
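
To make the trade-off concrete, here is a small sketch with two made-up filenames. It illustrates why bytes are the more faithful representation: decoding with "replace" is lossy, so distinct paths can end up as the same string.

```python
# Two different hypothetical filenames, both invalid as UTF-8.
path_a = b"caf\xe9.txt"
path_b = b"caf\xe8.txt"

# After lossy decoding, both invalid bytes turn into U+FFFD.
text_a = path_a.decode("utf-8", errors="replace")
text_b = path_b.decode("utf-8", errors="replace")

# The decoded strings collide, while the raw bytes stay distinct.
strings_collide = text_a == text_b
bytes_differ = path_a != path_b

print(strings_collide, bytes_differ)
```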

For the future this might be a cool side project.

@Byron Byron linked a pull request Nov 10, 2020 that will close this issue
@Byron Byron added this to the v3.1.12 - Bugfixes milestone Nov 10, 2020
@Byron
Member

Byron commented Nov 10, 2020

Great, thanks for the fix.
Indeed, changing everything to bytes would be best, but probably also force quite a lot of rework of any user of GitPython, making me hesitant to consider going down that road.
If everything breaks, maybe replace GitPython with something considerably better.

@Byron Byron closed this as completed Nov 10, 2020