Skip to content

Python 3 compatibility #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from
Closed

Python 3 compatibility #5

wants to merge 17 commits into from

Conversation

kevin-brown
Copy link
Contributor

This is a WIP pull request to get gitdb working under Python 3. This depends on my pull request in async in order for the dependencies to be compatible.

This fixes most of the import errors that came from using the
implicit relative imports that Python 2 supports.  This also fixes
the use of `xrange`, which has replaced `range` in Python 3.  The
same has happened for `izip`, which is also being aliased.

The octal number syntax changed in Python 3, so we are now
converting from strings using the `int` built-in function, which
will produce the same output across both versions of Python.
This will be switched back when the pull request for Python 3
support has been merged into the central async repository.
If it works in Python 3.3, it should also work in Python 3.4.
Considering it is the latest stable release, gitdb should be tested
against it.
In Python 3, the default string type is now the Python 2 unicode
strings.  The unicode strings cannot be converted to a byte stream,
so we have to convert it before writing to the streams.
This uses memoryview by default, which is supported in Python 3
and Python 2.7, but not Python 2.6, and falls back to the old
`buffer` type in Python 2.6 and when the memoryview does not
support the type, such as when mmap instaces are passed in.
This adds a `byte_ord` version of `ord` which will let `bytes`
safely pass through in Python 3.

`cmp` was also swapped out as it has been dropped in Python 3.
This changes the internals to use BytesIO over StringIO, which
fixed a few of the failing tests in Python 3.  We are only
importing from `io` now, instead of the entire chain, as this is
available in Python 2.6+.
There were a few things which were being reused consistently for
compatibility purposes, such as the `buffer`/`memoryview` functions
as well as the `izip` method which needed to be aliased for
Python 3.  The `buffer` function was taken from `smmap` [1] and
reworked slightly to handle the optional third parameter.

This also adds a compatibility file dedicated entirely to encoding
issues, which seem to be the biggest problem.  The main functions
were taken in part from the Django project [2] and rewritten
slightly because our needs are a bit more narrow.

A constants file has been added to consistently handle the
constants which are required for the gitdb project in the core
and the tests.  This is part of a greater plan to reorganize
the `util.py` file included in this project.

This points the async extension back at the original repository
and points it to the latest commit.

[1]: https://github.com/Byron/smmap/blob/1af4b42a2354acbb53c7956d647655922658fd80/smmap/util.py#L20-L26
[2]: https://github.com/django/django/blob/b8d255071ead897cf68120cd2fae7c91326ca2cc/django/utils/encoding.py
Bytes should always be returned from the streams, so the tests
should be checking against byte strings instead of text strings.

This also fixes the `sha_iter` as it relied on the Python 2
`iterkeys` which has been renamed to `keys` in Python 3.
@kevin-brown
Copy link
Contributor Author

For the most part, the most difficult part of getting this to work with Python 3 has been encoding issues. Previously, gitdb passed data around internally switching between text and bytes at random and most of the tests use StringIO. While I can make most of this consistent by passing around bytes internally and switching to BytesIO in the tests, I don't think this is a decision I should be making. So that leaves me with a few questions:

  • Should we be using bytes internally and rely on users passing in byte data?
  • When we encounter text, should we convert it to bytes? If yes, using what encoding?
  • The types are internally stored as text, but they are mixed in with byte data. Should we do anything about it?

Most of the remaining failing tests are encoding issues which have been difficult to trace for some reason. A general code review (from anyone) would be appreciated so we can iron out issues early.

@Byron
Copy link
Member

Byron commented Jul 17, 2014

Hi Kevin,

Thanks for all the hard work - I will try to answer your questions based on my best judgement, which clearly won't be the ultimate truth that should be discussed if you see any problems.

  • Should we be using bytes internally and rely on users passing in byte data?
    • If the bytes refer to data, it seems like the only sensible thing to do. Whenever something is coming into gitdb, we must assure it is bytes. This answer flows right into the next one.
  • When we encounter text, should we convert it to bytes? If yes, using what encoding?
    • A conversion seems like the most user-friendly, yet I would be alright with hard assertions. In case you prefer to help the user, UTF-8 encoding would be what I would try. In cases where text is required, like commit messages, we should do the opposite and require unicode/py3str. The more I think about it, the more sensible it seems to assert the type and fail without even trying to decode strings, as it really is a task the user has to perform. He passes in the string, and needs to know it's encoding, forcing clients to fix their issues as well, and possibly breaking them.
  • The types are internally stored as text, but they are mixed in with byte data. Should we do anything about it?
    • These types are also used when building objects, if I remember correctly, which will be bytes after all. Making them bytes seems to be the right thing to do.

From my experience, when a library has been made 2/3 compatible, python 2 clients will always have trouble of some kind. For that reason, I would prefer clear assertions to indicate the problem right where it happens, instead of either waiting until things explode or trying to auto-encode/decode and fail to guess the right encoding.
Maybe a mix of the latter, along with a clear custom exception, could be the best middle-ground. After all, more clients would keep running right away, whereas those who don't will see that we tried to guess the encoding, and failed, along with instructions on what to do in the client's code.

This makes it easier to deal with things internally as now
everything is passed as bytes.
@kevin-brown kevin-brown mentioned this pull request Sep 30, 2014
@Byron Byron mentioned this pull request Nov 12, 2014
@Byron Byron self-assigned this Nov 13, 2014
@Byron Byron added this to the v0.5.5 milestone Nov 13, 2014
@Byron
Copy link
Member

Byron commented Nov 13, 2014

All your changes in this pull request have been merged into the py2n3 branch.
Thanks again for your effort, you must have brought it 90% of the way already !

@Byron Byron closed this Nov 13, 2014
@Byron
Copy link
Member

Byron commented Nov 13, 2014

The good news is that at least on OSX, there is just one test failing consistently across interpreters. On linux, as run by travis ci, there are two additional issues that I should be able to debug with a virtual machine locally.
Lets hope all the remaining issues will be resolved tomorrow.

@kevin-brown kevin-brown deleted the issue_4 branch November 13, 2014 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants