kebabCase returns an empty string on Unicode strings
mervynlee94 opened this issue
Lodash in JavaScript has a kebabCase that handles both ASCII and Unicode strings as expected. In pydash, a Unicode input string returns an empty string for many functions such as kebabCase, upperCase, lowerCase, etc. I looked into the source code and realized this is because the regex does not consider Unicode. Could I contribute to this project? Or could someone else implement it? Thanks!
Contributions are welcome!
Feel free to submit a PR to fix this. It would be most appreciated! 👍
@mervynlee94 BTW, do you have an example to illustrate the issue? Also, what result do you get when you run the same with Lodash?
Hi @dgilland, sorry for the late reply. Sure, one example is "你好,世界", the translation of "Hello, World" into simplified Chinese. I expected to get 你好-世界 from the kebabCase function, but got an empty string instead.
Here's the lodash implementation:
- https://github.com/lodash/lodash/blob/e0029485ab4d97adea0cb34292afb6700309cf16/words.js#L32
- https://github.com/lodash/lodash/blob/e0029485ab4d97adea0cb34292afb6700309cf16/.internal/unicodeWords.js
Looks like they have a special Unicode version of the word splitter. Seems reasonable to implement something similar in pydash to achieve the desired results.
Yes, agreed. I will try to spare some time to implement it.
Referencing the Lodash implementation: the RE_WORDS in the existing pydash differs from lodash's, resulting in a few failing test cases.
Example of a failing test case:
```
case = 'enable 24h format', expected = ['enable', '24', 'h', 'format']

    @parametrize(
        "case,expected",
        [  # noqa
            ("hello world!", ["hello", "world"]),  # noqa
            ("hello_world", ["hello", "world"]),
            ("hello!@#$%^&*()_+{}|:\"<>?-=[]\\;\\,.'/world", ["hello", "world"]),
            ("hello 12345 world", ["hello", "12345", "world"]),
            ("enable 24h format", ["enable", "24", "h", "format"]),
            ("tooLegit2Quit", ["too", "Legit", "2", "Quit"]),
            ("xhr2Request", ["xhr", "2", "Request"]),
            (" ", []),
            ("", []),
            (None, []),
        ],
    )
    def test_words(case, expected):
>       assert _.words(case) == expected
E       AssertionError: assert ['enable', '24h', 'format'] == ['enable', '2...'h', 'format']
E         At index 1 diff: '24h' != '24'
E         Right contains one more item: 'format'
E         Full diff:
E         - ['enable', '24', 'h', 'format']
E         ?            ----
E         + ['enable', '24h', 'format']

tests/test_strings.py:1189: AssertionError
```
This is due to RE_WORDS.
Existing:
```python
UPPER = "[A-Z\\xC0-\\xD6\\xD8-\\xDE]"
LOWER = "[a-z\\xDF-\\xF6\\xF8-\\xFF]+"
RE_WORDS = "/{upper}+(?={upper}{lower})|{upper}?{lower}|{upper}+|[0-9]+/g".format(
    upper=UPPER, lower=LOWER
)
```
Lodash's implementation:
```
RE_WORDS = "/[^\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+/g"
```
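For comparison, that pattern can be tried directly in Python (again dropping the `/.../g` delimiters). It treats any run of characters that is not ASCII punctuation, whitespace, or control characters as a word, which is also what produces the '24h' grouping in the failing test above:

```python
import re

# Lodash-style ASCII word pattern: a word is any run of characters
# outside the ASCII punctuation/control/whitespace ranges.
RE_WORDS = re.compile(r"[^\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")

print(RE_WORDS.findall("--foo.bar;baz"))      # ['foo', 'bar', 'baz']
print(RE_WORDS.findall("你好,世界"))           # ['你好', '世界'] (ASCII comma)
print(RE_WORDS.findall("enable 24h format"))  # ['enable', '24h', 'format']
```

Note that a fullwidth comma ',' is above \x7f, so this simplified pattern alone would not split on it; lodash's unicodeWords handles Unicode punctuation separately.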
Which one should we follow?
We should follow lodash's version. The version in pydash is somewhat out of date and hasn't been updated in a while. So any failing test cases should be updated to match the new regex behavior that's in the latest lodash.
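One possible shape for such a splitter is sketched below. This is a hypothetical sketch, not pydash's actual implementation: it keeps the camel-case and digit rules from the existing pattern and adds a class for non-ASCII letters. The names `RE_UNICODE_WORDS` and `words` are illustrative only.

```python
import re

# Hypothetical sketch of a Unicode-aware word splitter (not pydash's
# actual code). [^\W\da-zA-Z_] matches word characters that are neither
# ASCII letters, digits, nor underscore -- i.e. non-ASCII letters.
RE_UNICODE_WORDS = re.compile(
    r"[A-Z]{2,}(?=[A-Z][a-z])"  # acronym followed by a capitalized word
    r"|[A-Z]?[a-z]+"            # capitalized or lowercase word
    r"|[A-Z]+"                  # remaining acronym run
    r"|[0-9]+"                  # digit run
    r"|[^\W\da-zA-Z_]+"         # run of non-ASCII letters (CJK, accented, etc.)
)

def words(text):
    return RE_UNICODE_WORDS.findall(text or "")

print(words("tooLegit2Quit"))      # ['too', 'Legit', '2', 'Quit']
print(words("enable 24h format"))  # ['enable', '24', 'h', 'format']
print(words("你好,世界"))           # ['你好', '世界'] (fullwidth comma splits)
```

Because the last alternative only matches Unicode *letters*, fullwidth punctuation such as ',' and '(' acts as a separator, unlike the plain `[^\x00-\x7f]`-style class.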
Noted. I have implemented Unicode support and passed all the existing and new test cases on the Python 3 versions. There are a lot of issues on Python 2, as it has a different representation and implementation of strings, and of regexes for Unicode. Should we support Python 2.7?
Implementation repo
https://github.com/mervynlee94/pydash/tree/unicode-ver-for-word-splitter
What types of errors are happening on Python 2 that aren't on 3? Is it just getting the test cases represented correctly or is there behavior difference in the function itself?
Yes, it is just getting the test cases represented correctly, by the way.
Here are some examples of the errors on py27; all the other versions passed all the test cases.
```
_______________________ test_separator_case[case10-foo_bar_baz] _______________________

case = ('--foo.bar;baz', '_'), expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            (("foo bar baz", "-"), "foo-bar-baz"),
            (("foo__bar_baz", "-"), "foo-bar-baz"),
            (("foo-_bar-_-baz", "-"), "foo-bar-baz"),
            (("foo!bar,baz", "-"), "foo-bar-baz"),
            (("--foo.bar;baz", "-"), "foo-bar-baz"),
            (("Foo Bar", "-"), "foo-bar"),
            (("foo bar baz", "_"), "foo_bar_baz"),
            (("foo__bar_baz", "_"), "foo_bar_baz"),
            (("foo-_bar-_-baz", "_"), "foo_bar_baz"),
            (("foo!bar,baz", "_"), "foo_bar_baz"),
            (("--foo.bar;baz", "_"), "foo_bar_baz"),
            (("Foo Bar", "_"), "foo_bar"),
            (("", "_"), ""),
            ((None, "_"), ""),
        ],
    )
    def test_separator_case(case, expected):
>       assert _.separator_case(*case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:718: AssertionError
______________________ test_snake_case[foo__bar_baz-foo_bar_baz] ______________________

case = 'foo__bar_baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo____bar__b_az' == 'foo_bar_baz'
E         - foo____bar__b_az
E         ?    ---      - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_____________________ test_snake_case[foo-_bar-_-baz-foo_bar_baz] _____________________

case = 'foo-_bar-_-baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo__b_ar___baz' == 'foo_bar_baz'
E         - foo__b_ar___baz
E         ?    - -   --
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_____________________ test_snake_case[--foo.bar;baz-foo_bar_baz] ______________________

case = '--foo.bar;baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
____ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] ____

case = '\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c

tests/test_strings.py:806: AssertionError
____ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] ____

case = '\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c
```
After investigation, there are some strange issues on Python 2.7. I had to use the u'' prefix to solve one of the issues, which I raised on Stack Overflow.
Another is that the representation of strings differs in nature between Python 2 and Python 3. In Python 3 all strings are Unicode, but not in Python 2, so the test cases won't match: the output contains raw byte escapes ('\xe4', '\xe5', '\xef', '\xe7', ...) instead of a readable Unicode string.