pytest-dev / iniconfig

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

strip unicode whitespace

gaborbernat opened this issue · comments

At the moment the parser uses the built-in string strip function (https://github.com/RonnyPfannschmidt/iniconfig/blob/master/iniconfig.py#L134). Nowadays is more and more common that people inadvertently introduce Unicode whitespaces in their configuration. The string strip function does not remove this. When we detect such we should at least warn (especially on the left-hand side), or fail. Failing to strip such characters is never the desired effect, as it effectively generates a new key that visually seems the same leaving the user in confusion of why his config does not work.

Note the py package uses this package vendored, and via that pytest/tox too (https://github.com/pytest-dev/py/blob/master/py/_vendored_packages/iniconfig.py).

Issues generated by this omission:

Detecting unicode spaces https://stackoverflow.com/questions/8921365/in-python-how-to-list-all-characters-matched-by-posix-extended-regex-space/37903375#37903375

as is iniconfig is not unicode aware, so this one is a really tricky one to fix
its def not on my own roadmap

@RonnyPfannschmidt what about doing strip as (in case the content read from the file is of type unicode):

>>> from unicodedata import name
>>> import sys
>>> import re
>>> spaces = u''.join(re.findall(r'\s', u''.join(unichr(c) for c in xrange(sys.maxunicode+1)), re.UNICODE))
>>> 'envlist '.strip(spaces)
u'envlist'
>>> name(u'\xA0')
'NO-BREAK SPACE
>>> u'envlist\xA0'.strip(spaces)
u'envlist'
>>> 

An alternative solution would be to try to avoid such configs by either warning when such lines are detected.

i am fine with any consistent solution - i just don't dont have the time/motivation to do the complete dance of ensuring its consistent

right now iniconfig is not unicode aware and works in terms of native strings and ascii whitespace

note that your example generates the spaces in a pretty expensive manner

@gaborbernat with the new release unicode support is in but unicode whitespace is not yet considered