Response to ERS' "Pratical Python porting for systems programmers"

This text is a response to Eris S. Raymond and Peter A. Donis HOWTO on Python2 to Python3 porting for system programmers. (For brevity, I'll be using the phrase "ESR's howto" in the rest of this document. No offence, Peter)

Go read the original text here first.

Introduction

I'm going to talk here about my own porting experience on the project I've been working on for several years now.

It's a build framework written in Python called qiBuild.

See the documentation, the github repo and the demo on asciinema for more details.

The gist of it is that this is a program that reads XML config files, and runs some commands like git or cmake to fetch sources, configure and build some complex C++ code, so the scope is a bit different than the type of programs (written by "system programmer") that ESR is talking about in his HOWTO.

Main differences are:

We don't care about performance (the bottlenecks are the network when we do git stuff, and forking other executables when we build stuff)
We don't really care if bytes in the range 0x80..0xFF are modified, because we rarely parse binary data and, as stated above, we don't mind the overhead of encoding and decoding strings to bytes.
The tests are written in Python using pytest and have a lot of dependencies on external packages. (reposurgeon and src are not using tests written in Python, and have no dependencies outside the stdlib)

There are some similarities, though:

The software has a large suite of tests (85% line coverage for qiBuild)
The main goal is the same : we want the code to work both under Python3 and Python2.7

Where I agree

The "Why is this difficult" section is a good introduction to the real problems that occur when Porting to Python3.

The "What doesn't work" also contains solid advice.

"Make you change testable": I can't stress this enough. Without a rigorous test suite, that checks behavior on input containing non-ASCII characters, you are going to be in a lot of trouble when trying to port to Python3, and will be faced with mysterious UnicodeDecodeError or can't concat bytes to string errors.

"Fix up string/unicode mixing":

> The art here is in doing as little work as possible. Your encode() and
> decode() calls should intercept your binary I/O close to where it happens, so
> the bulk of your code is just seeing Unicode strings.
> This is also the stage at which you may need to tag some literals with a
> prefix b for byte-buffer. Beware, if you have a lot of these it may mean you
> have not put encode/decode calls near enough to the natural choke points where
> your binary I/O is happening.

All good advice. In qiBuild for instance we have a method to read output of git commands (Git.call()). Data is read from the subprocess.Popen object as bytes-buffer and is immediately encoded as UTF-8 string (Yes, UTF-8 and not Latin-1, more on this later)

Missing points

Here are some points that are not at all covered by ESR's HOWTO, but that I still find useful:

Don't target Python between 3.0 and 3.2

You won't be able to use u"foo" to prefix your Unicode string litterals, (among other things) which really is a PITA.

The only case you may want it is for old distributions, such as Ubuntu 12.04, but users of these systems can still use Python2.7 since you're writing Python2 compatible code.

Of course, when dropping Python2 support you should also drop support for these old distributions too.

Fixing print

I find it odd that ESR's howto does not mention it, it's the most known problem when switching to Python3

If you use 2to3, code will be converted from:

print foo, bar

print(foo, bar)

The problem is that for Python2, this statement means "Print the tuple (foo, bar)", so the result is not what you expect.

The fix is simple, just add

from __future__ import print_function

before any of your imports

Fixing division

ESR does not talk about the case when you really want a float result, even when both arguments are ints.

There's a way to fix this too.

Use:

from __future__ import division

This makes it possible to avoid using things like:

i_really_need_a_float = float(a) / b

Of course, if you really need truncating divison, use //

Fixing exceptions testing

In qiBuild I use a lot of exceptions, and thus a lot of tests are checking exceptions for their message.

When you have a exception derived from the basic Exception class, you should make sure when porting to Python3, to not use the message member, but the args member:

# Fails on Python3: Exception has no attribute named 'message'
with pytest.raises(MyException) as e:
    test_something_that_should_throw()
assert "something" in e.message

# Works both for Python3 and Python2

with pytest.raises(MyException) as e:
    test_something_that_should_throw()
assert "something" in e.args[0]

Dictionaries views

Let's say you have some code like this.

my_dict = { "a" : 1 }
keys = my_dict.keys()

By default, when you run 2to3, your code will be changed to:

my_dict = { "a" : 1 }
keys = list(my_dict.keys())

This is because in Python3, keys() returns a dictionary view, which is different from the list you get in Python2, and is also different from the iterator you get with iterkeys() on Python2

But in most cases, you just want to iterate over the keys, so I recommend using 2to3 with --nofix=dict.

Be careful though, code will blow up if you have something like:

my_dict = { "a" : 1 }
keys = my_dict.keys()
keys.sort()

That's because dictionary views do not have a sort() method.

Instead, write something like:

my_dict = { "a" : 1 }
keys = my_dict.keys()
keys = sorted(keys)

An other gotcha is when you change the dictionary:

for key in my_dict.keys():
    if something(key):
        del my_dict[key]

Here there's no choice but converting to a list:

for key in list(my_dict.keys()):
    if something(key):
        del my_dict[key]

A note on continuous integration

On both reposurgeon and src, the port to Python3 was done while no other development was done. On qiBuild, the development continued without waiting for the Python3 port to be over and merged, so the port had to be done on an other branch. (I called it 'six')

So, how to cope with that?

Well, use continuous integration. In my case I'm using Jenkins.

Whenever a commit is merged on the development branch, the following happens:

The 'six' branch gets rebased
The test suite is ran both for Python2 and Python3
The branch gets "pushed forced" to the main repository.

If any of this steps go wrong (for instance, the rebase failed because of conflicts, or one of the test suite failed), a mail is sent and appropriate action can be taken.

This means the 'six' branch continues to be "alive" and can be trivially and safely merged to the development branch when ready.

Where I don't agree

Come on, I know this is the part you've all been waiting for :)

A little disclaimer first.

These are my own opinions, and your mileage may vary. I'm not saying that ESR is wrong, I'm just offering an other point of view on a topic I care about, based on my own experience on a somewhat similar project.

About the steps

Here are the steps ESR recommends:

Run 2to3 and apply the patch it generates
Partially revert it to make sure it still runs under Python2
Change the shebangs to be #!/usr/bin/env python3
Fix Python3 issues
Change the shebangs again to be #!/usr/bin/env python
Tweak the test suite to run twice, once for Python2 and once for Python3

The steps I've followed are a bit different:

Run 2to3 and apply the patch it generates (no changes here)
Make the whole test suite pass on Python3
Then make the whole test suite pass on Python2. But this time, instead of manually writing compatibly code, I used the excellent six library. (More on six later)
When Python2 test suite passes again, check with Python3
Last step is the same: make sure the test suite runs twice, once for Python2 and once for Python3. I recommend using tox for this, especially if you are using Jenkins to run your test suite.

Note that if we wish to drop Python2 compatibily, all we have to do is revert the patch that uses six

Also, there's no need to manually amend the patch generated by 2to3, which means it's easy to redo the port once the changes are rebased (see above)

About the porting itself

Why `six` ?

reposurgeon and src do not use six to help Python3 porting, probably because the author did not want to depend on anything other than the stdlib.

In qiBuild we already depend on third-party libraries, so adding an other one was no being deal.

Also, six is the choice for a lot of projects that wish to achieve Python2/Python3 compatibility with the same code base (Sphinx and Django, to only name a few)

I also thinks that using six leads to cleaner code.

It takes care of libraries whose name changed, so you can write
```
from six.moves import input
```
and then use input everywhere, instead of
```
input = raw_input
except NameError:
    my_input = input
```
which looks like a hack to me.

Same thing for import changes:

from six.move import configparser

Instead of:

try:
    import configparser
except ImportError:
    import ConfigParser as configparser

Lastly, it's the best way I know to handle code that use metaclasses while keeping a syntax compatible with Python2 and Python3

Alternatives to six

Here are two alternatives I found, unfortunately after the port to Python3 started...

I did not use them so I can't really comment on them. They seem to be far less used than six though.

pies is an alternative to six you may want to consider. See pie's README on github for the details.
python-future is also interesting, since it contains tools that contrary to 2to3, will generate Python2/Python3 compatible code directly.

Use `UTF-8` everywhere

I chose to always encode in UTF-8 instead of Latin-1.

Rationale:

UTF-8 has become the 'standard' when it comes to encoding, and can handle things than Latin-1 can't.
We do a lot of XML parsing and writing, and UTF-8 is the default encoding for XML
As stated above, we don't care about the high-byte-preserving stuff since we don't write binary data.

Re-assigning `sys.stdout` and `sys.stderr`

I also don't recommend the trick that re-assigns sys.stdout and sys.stdin to use io.TextWrapper. Instead, make sure that your string is UTF-8 encoded before sending it to sys.stdout or sys.stderr.

If you have to mock sys.stdout in your tests, do something like:

@pytest.fixture
def stdout_wrapper():
    if six.PY3:
        return io.StringIO()
    else:
        return io.BytesIO()

def test_something(stdout_wrapper):
    something_that_writes_to_stdout()
    assert stdout_wrapper.getvalue() == "42"

Messing with shebangs

There is a better way, that may seem overkill for single-file projects like reposurgeon or src, but is quite handy for a project like qiBuild which has a bunch of command-line scripts (qibuild, qisrc, and so on)

Write a setup.py and declare an entry point (Usually a main method from one of your modules):

# setup.py

from setuptools import setup
# Yes, you need setuptools and not distutils

setup(
    name = "foo",
    py_modules=["foo"],
    entry_points = {
      "console_scripts" : [
          "foo = "foo:main",
      ]
    }
)

Create two virtualenvs, one for each version of Python

mkdir -p ~/.venvs
virtualenv-2 ~/.venvs/foo-py2
virtualenv-3 ~/.venvs/foo-py3

Then in both env, run pip install --editable . from the sources of your porject:

source ~/.venvs/foo-py2/bin/activate
pip install --editable .
deactivate # exit the virtualenv for Python2
source ~/.venvs/foo-py3/bin/activate
pip install --editable .

Done. setuptools will generate a foo script with the correct shebang in both virtualenvs that gets inserted into your PATH when you switch virtualenvs when sourcing the activate script.

For extra convenience you can use virtualenvwrapper to quickly switch from one virtualenv to an other.

Messing with xrange

I also disagree with the following snippet:

try:
    xrange
except NameError:
    xrange = range

I think it's a bad idea to use a deprecated name in the code. Remember, even if the goal is to be Python2/Python3 compatible, you are going to drop Python2 support at some point ....

As expected, six has a solution:

import six

my_iterator = six.moves.range()

Note that I personally prefer using the built-in range() everywhere. There will be a small performance cost on Python2, of course, but I'm fine with it.

An other note, by default 2to3 will convert code looking like

r = range(0, 1)

r = list(range(0, 1))

I think this is a bad idea. It's very rare to do something other than iterating over a range.

You can use 2to3 with --nofix range to prevent this change from being automatically performed.

Conclusion

Thanks to ESR for giving me the idea of writing my own porting guide, it was a fun exercise.

I've left a comment in his blog post, discussion can continues on his blog.

If you are curious, the six branch is available on my personal fork on github, but please don't use it as history on this branch is frequently rewritten.

Also, note that there is just one big commit where all the porting happens.

Initially there was one per step, but it's more convenient to have them squashed when rebasing.