Python 3 compatibility #5

kevin-brown · 2014-07-13T19:40:20Z

This is a WIP pull request to get gitdb working under Python 3. This depends on my pull request in async in order for the dependencies to be compatible.

This fixes most of the import errors that came from using the implicit relative imports that Python 2 supports. This also fixes the use of `xrange`, which has replaced `range` in Python 3. The same has happened for `izip`, which is also being aliased. The octal number syntax changed in Python 3, so we are now converting from strings using the `int` built-in function, which will produce the same output across both versions of Python.

This will be switched back when the pull request for Python 3 support has been merged into the central async repository.

If it works in Python 3.3, it should also work in Python 3.4. Considering it is the latest stable release, gitdb should be tested against it.

In Python 3, the default string type is now the Python 2 unicode strings. The unicode strings cannot be converted to a byte stream, so we have to convert it before writing to the streams.

This uses memoryview by default, which is supported in Python 3 and Python 2.7, but not Python 2.6, and falls back to the old `buffer` type in Python 2.6 and when the memoryview does not support the type, such as when mmap instaces are passed in.

This adds a `byte_ord` version of `ord` which will let `bytes` safely pass through in Python 3. `cmp` was also swapped out as it has been dropped in Python 3.

This changes the internals to use BytesIO over StringIO, which fixed a few of the failing tests in Python 3. We are only importing from `io` now, instead of the entire chain, as this is available in Python 2.6+.

There were a few things which were being reused consistently for compatibility purposes, such as the `buffer`/`memoryview` functions as well as the `izip` method which needed to be aliased for Python 3. The `buffer` function was taken from `smmap` [1] and reworked slightly to handle the optional third parameter. This also adds a compatibility file dedicated entirely to encoding issues, which seem to be the biggest problem. The main functions were taken in part from the Django project [2] and rewritten slightly because our needs are a bit more narrow. A constants file has been added to consistently handle the constants which are required for the gitdb project in the core and the tests. This is part of a greater plan to reorganize the `util.py` file included in this project. This points the async extension back at the original repository and points it to the latest commit. [1]: https://github.com/Byron/smmap/blob/1af4b42a2354acbb53c7956d647655922658fd80/smmap/util.py#L20-L26 [2]: https://github.com/django/django/blob/b8d255071ead897cf68120cd2fae7c91326ca2cc/django/utils/encoding.py

Bytes should always be returned from the streams, so the tests should be checking against byte strings instead of text strings. This also fixes the `sha_iter` as it relied on the Python 2 `iterkeys` which has been renamed to `keys` in Python 3.

kevin-brown · 2014-07-17T15:15:44Z

For the most part, the most difficult part of getting this to work with Python 3 has been encoding issues. Previously, gitdb passed data around internally switching between text and bytes at random and most of the tests use StringIO. While I can make most of this consistent by passing around bytes internally and switching to BytesIO in the tests, I don't think this is a decision I should be making. So that leaves me with a few questions:

Should we be using bytes internally and rely on users passing in byte data?
When we encounter text, should we convert it to bytes? If yes, using what encoding?
The types are internally stored as text, but they are mixed in with byte data. Should we do anything about it?

Most of the remaining failing tests are encoding issues which have been difficult to trace for some reason. A general code review (from anyone) would be appreciated so we can iron out issues early.

Byron · 2014-07-17T15:35:11Z

Hi Kevin,

Thanks for all the hard work - I will try to answer your questions based on my best judgement, which clearly won't be the ultimate truth that should be discussed if you see any problems.

Should we be using bytes internally and rely on users passing in byte data?
- If the bytes refer to data, it seems like the only sensible thing to do. Whenever something is coming into gitdb, we must assure it is bytes. This answer flows right into the next one.
When we encounter text, should we convert it to bytes? If yes, using what encoding?
- A conversion seems like the most user-friendly, yet I would be alright with hard assertions. In case you prefer to help the user, UTF-8 encoding would be what I would try. In cases where text is required, like commit messages, we should do the opposite and require unicode/py3str. The more I think about it, the more sensible it seems to assert the type and fail without even trying to decode strings, as it really is a task the user has to perform. He passes in the string, and needs to know it's encoding, forcing clients to fix their issues as well, and possibly breaking them.
The types are internally stored as text, but they are mixed in with byte data. Should we do anything about it?
- These types are also used when building objects, if I remember correctly, which will be bytes after all. Making them bytes seems to be the right thing to do.

From my experience, when a library has been made 2/3 compatible, python 2 clients will always have trouble of some kind. For that reason, I would prefer clear assertions to indicate the problem right where it happens, instead of either waiting until things explode or trying to auto-encode/decode and fail to guess the right encoding.
Maybe a mix of the latter, along with a clear custom exception, could be the best middle-ground. After all, more clients would keep running right away, whereas those who don't will see that we tried to guess the encoding, and failed, along with instructions on what to do in the client's code.

This makes it easier to deal with things internally as now everything is passed as bytes.

Byron · 2014-11-13T09:46:51Z

All your changes in this pull request have been merged into the py2n3 branch.
Thanks again for your effort, you must have brought it 90% of the way already !

Byron · 2014-11-13T18:06:17Z

The good news is that at least on OSX, there is just one test failing consistently across interpreters. On linux, as run by travis ci, there are two additional issues that I should be able to debug with a virtual machine locally.
Lets hope all the remaining issues will be resolved tomorrow.

kevin-brown added 16 commits July 13, 2014 14:21

Update async to the Python 2 / 3 compatible version

d79c655

Update smmap to a Python 3 compatible version

7216749

Automated PEP 8 fixes

1c6f4c1

Temporarily switch out async for testing

5bfa7e6

This will be switched back when the pull request for Python 3 support has been merged into the central async repository.

Test against Python 3.4

85f2b9b

If it works in Python 3.3, it should also work in Python 3.4. Considering it is the latest stable release, gitdb should be tested against it.

Convert strings to bytes for PY3

b881134

In Python 3, the default string type is now the Python 2 unicode strings. The unicode strings cannot be converted to a byte stream, so we have to convert it before writing to the streams.

Use memoryview instead of buffer

01e40b5

This uses memoryview by default, which is supported in Python 3 and Python 2.7, but not Python 2.6, and falls back to the old `buffer` type in Python 2.6 and when the memoryview does not support the type, such as when mmap instaces are passed in.

Better handling of bytes

0269405

This adds a `byte_ord` version of `ord` which will let `bytes` safely pass through in Python 3. `cmp` was also swapped out as it has been dropped in Python 3.

More bytes handling

d8405ee

Bytes for everyone!

f0b3e7b

Can't compare memoryview instances, convert to bytes

a19a169

Making a bit of progress...

087803e

This changes the internals to use BytesIO over StringIO, which fixed a few of the failing tests in Python 3. We are only importing from `io` now, instead of the entire chain, as this is available in Python 2.6+.

Fix Python 2 failures

0cf09d3

Fixed a few more encoding issues

ecdae96

Bytes should always be returned from the streams, so the tests should be checking against byte strings instead of text strings. This also fixes the `sha_iter` as it relied on the Python 2 `iterkeys` which has been renamed to `keys` in Python 3.

Convert types to bytes

c63dcac

This makes it easier to deal with things internally as now everything is passed as bytes.

kevin-brown mentioned this pull request Sep 30, 2014

Support for Python3 #9

Closed

hickford mentioned this pull request Oct 6, 2014

Python 3 support gitpython-developers/GitPython#114

Closed

Byron mentioned this pull request Nov 12, 2014

Python3 compat #13

Closed

Byron self-assigned this Nov 13, 2014

Byron added this to the v0.5.5 milestone Nov 13, 2014

Byron closed this Nov 13, 2014

kevin-brown deleted the issue_4 branch November 13, 2014 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python 3 compatibility #5

Python 3 compatibility #5

kevin-brown commented Jul 13, 2014

kevin-brown commented Jul 17, 2014

Byron commented Jul 17, 2014

Byron commented Nov 13, 2014

Byron commented Nov 13, 2014

Python 3 compatibility #5

Python 3 compatibility #5

Conversation

kevin-brown commented Jul 13, 2014

kevin-brown commented Jul 17, 2014

Byron commented Jul 17, 2014

Byron commented Nov 13, 2014

Byron commented Nov 13, 2014