jazzband / pathlib2

Backport of pathlib aiming to support the full stdlib Python API.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unusable on Windows/Python2.7 with unicode paths

Suor opened this issue · comments

The issue is that sys.getfilesystemencoding() == "mcbs" on Windows/Python 2.7, which is lossy. So one looses info when passes path as unicode to Path.

I'm sorry, but I cannot reproduce your problem. Unicode paths work fine as long as they can be encoded in the underlying file system encoding. See also the advice offered at

https://www.python.org/dev/peps/pep-0529/

Here is a test script:

# encoding: utf-8
import pathlib2
import os
import sys
print 'filesystemencoding', sys.getfilesystemencoding()

test = u'тест' # This is "test" in russian in utf-8

open(test, 'w').close()
p = pathlib2.Path(test)
print 'path', p
print 'os.path.exists', os.path.exists(test)
print 'pathlib.Path.exists', p.exists()

And its output:

filesystemencoding mbcs
path ????
os.path.exists True
pathlib.Path.exists
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print 'pathlib.Path.exists', p.exists()
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 1549, in exists
    self.stat()
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 1356, in stat
    return self._accessor.stat(self)
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 541, in wrapped
    return strfunc(str(pathobj), *args)
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '????'

I use Python 2.7.14 on Windows 10 Dev Evaluation image. As you can see open() and os.path work fine with utf-8 path, but not pathlib2.

See also the advice offered at

What advice do you mean?

Thanks for the test case, that really helps.

Can you check the result of u'тест'.encode("mbcs") (I do not immediately have access to a windows machine so I cannot test it right now myself)? If that works, then the issue is not caused by using a path that cannot be encoded in mbcs in Windows (as you mentioned in the issue originally), but rather, it looks like for some reason the accessor functions aren't properly working.

If the encode works, can you also see if open(u'тест'.encode("mbcs"), 'w').close() works and creates the correct file?

Well, from the tests I've done so far it looks like u'тест' is not encodable by sys.filesystemencoding() on Windows. Why os.path methods work on these paths when passed as unicode is a mystery. To make things even worse, encode with mbcs does not respect the "strict" argument.

This is not advertised behaviour (see https://docs.python.org/2/howto/unicode.html#unicode-filenames ...), and it looks like Python 2.7 is broken in this respect, at least on Windows. :-(

u'тест'.encode('mbcs') returns '????', so data is lost after that.

Thanks for testing - I've seen the same on appveyor. I'm afraid this looks like a bug in Python 2.7.

Why are you encoding strings anyway? This is not done in Python 3 and is weird to see in a backport. I was quite surprised when my tests failed because Path.name is an encoded string in Python 2 - a real headache for a cross-python app.

The original authors designed pathlib to store file paths as byte strings in the file system encoding. I'm afraid I do not know why. From a user perspective, it would make much more sense for unicode to be used for filenames consistently everywhere.

The fact that for instance open(u'тест'.encode(sys.getfilesystemencoding()), 'w') is not the same as open(u'тест', 'w') seems to me to be a bug in Python 2.7. The backport was designed with the assumption that these would be equivalent.

The way to fix this is to figure out which encoding to use in place of mbcs on Windows. I'm sorry to say that I have no idea on how to do this correctly. From what I understand, the whole purpose of "mbcs" is for Python to do this correctly for us. But it doesn't...

Or you can provide a way for users to set this encoding. I would just say something like:

pathlib.setfilesystemencoding('utf-8')

and be ok with it. And then drop it in a half of the year)

Well, you can just piggy back sys.getfilesystemencoding() to return the correct encoding for your file system. For example:

import sys
sys.getfilesystemencoding = lambda: "utf16"

if your filesystem uses utf16 encoding (which, I understand, is the standard used in windows). Obviously, if you specify the wrong encoding, things will go horribly wrong, so do this at your own risk.

pathlib2 uses sys.getfilesystemencoding() to get the encoding everywhere, so this hack should work as foreseen (although I have not tested it).

Does this help?

I don't want to hack that globally. I've already worked around it by fixing my PathInfo subclass.

Sure. I'll close this as it's an upstream bug in Python 2.7 (which is fixed in Python 3.6 as per the PEP I linked above in my first reply). In retrospect, I wish I had been aware earlier of how big of a problem this was.