kebabCase returns an empty string on Unicode strings
mervynlee94 opened this issue
Lodash in JavaScript has a kebabCase that handles both ASCII and Unicode strings as expected. In pydash, a Unicode input string returns an empty string for many functions such as kebabCase, upperCase, lowerCase, etc. I looked into the source code and realized this is because the regex does not consider Unicode. Could I contribute to this project? Or could someone else implement it? Thanks!
Contributions are welcome!
Feel free to submit a PR to fix this. It would be most appreciated! 👍
@mervynlee94 BTW, do you have an example to illustrate the issue? Also, what result do you get when you run the same with Lodash?
Hi @dgilland, sorry for the late reply. Sure, one example is "你好,世界", the translation of "Hello, World" into simplified Chinese. I expected to get 你好-世界 from the kebabCase function, but got an empty string instead.
Here's the lodash implementation:
- https://github.com/lodash/lodash/blob/e0029485ab4d97adea0cb34292afb6700309cf16/words.js#L32
- https://github.com/lodash/lodash/blob/e0029485ab4d97adea0cb34292afb6700309cf16/.internal/unicodeWords.js
Looks like they have a special Unicode version of the word splitter. Seems reasonable to implement something similar in pydash to achieve the desired results.
Yes, agreed. I will try to spare some time to implement it.
Referencing the Lodash implementation: the RE_WORDS in the existing pydash differs from lodash's, resulting in a few failing test cases.
Example of a failing test case:
```
case = 'enable 24h format', expected = ['enable', '24', 'h', 'format']

    @parametrize(
        "case,expected",
        [  # noqa
            ("hello world!", ["hello", "world"]),  # noqa
            ("hello_world", ["hello", "world"]),
            ("hello!@#$%^&*()_+{}|:\"<>?-=[]\\;\\,.'/world", ["hello", "world"]),
            ("hello 12345 world", ["hello", "12345", "world"]),
            ("enable 24h format", ["enable", "24", "h", "format"]),
            ("tooLegit2Quit", ["too", "Legit", "2", "Quit"]),
            ("xhr2Request", ["xhr", "2", "Request"]),
            (" ", []),
            ("", []),
            (None, []),
        ],
    )
    def test_words(case, expected):
>       assert _.words(case) == expected
E       AssertionError: assert ['enable', '24h', 'format'] == ['enable', '2...'h', 'format']
E         At index 1 diff: '24h' != '24'
E         Right contains one more item: 'format'
E         Full diff:
E         - ['enable', '24', 'h', 'format']
E         ?            ----
E         + ['enable', '24h', 'format']

tests/test_strings.py:1189: AssertionError
```
This is due to RE_WORDS.
Existing:
```python
UPPER = "[A-Z\\xC0-\\xD6\\xD8-\\xDE]"
LOWER = "[a-z\\xDF-\\xF6\\xF8-\\xFF]+"
RE_WORDS = "/{upper}+(?={upper}{lower})|{upper}?{lower}|{upper}+|[0-9]+/g".format(
    upper=UPPER, lower=LOWER
)
```
Lodash's implementation:
```
RE_WORDS = "/[^\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+/g"
```
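For comparison, that pattern can be tried directly in Python (again dropping the `/.../g` delimiters). It treats any run of characters that is not ASCII punctuation, whitespace, or control characters as a word, which is also what produces the '24h' grouping in the failing test above:

```python
import re

# Lodash-style ASCII word pattern: a word is any run of characters
# outside the ASCII punctuation/control/whitespace ranges.
RE_WORDS = re.compile(r"[^\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+")

print(RE_WORDS.findall("--foo.bar;baz"))      # ['foo', 'bar', 'baz']
print(RE_WORDS.findall("你好,世界"))           # ['你好', '世界'] (ASCII comma)
print(RE_WORDS.findall("enable 24h format"))  # ['enable', '24h', 'format']
```

Note that a fullwidth comma ',' is above \x7f, so this simplified pattern alone would not split on it; lodash's unicodeWords handles Unicode punctuation separately.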
Which one should we follow?
We should follow lodash's version. The version in pydash is somewhat out of date and hasn't been updated in a while. So any failing test cases should be updated to match the new regex behavior that's in the latest lodash.
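One possible shape for such a splitter is sketched below. This is a hypothetical sketch, not pydash's actual implementation: it keeps the camel-case and digit rules from the existing pattern and adds a class for non-ASCII letters. The names `RE_UNICODE_WORDS` and `words` are illustrative only.

```python
import re

# Hypothetical sketch of a Unicode-aware word splitter (not pydash's
# actual code). [^\W\da-zA-Z_] matches word characters that are neither
# ASCII letters, digits, nor underscore -- i.e. non-ASCII letters.
RE_UNICODE_WORDS = re.compile(
    r"[A-Z]{2,}(?=[A-Z][a-z])"  # acronym followed by a capitalized word
    r"|[A-Z]?[a-z]+"            # capitalized or lowercase word
    r"|[A-Z]+"                  # remaining acronym run
    r"|[0-9]+"                  # digit run
    r"|[^\W\da-zA-Z_]+"         # run of non-ASCII letters (CJK, accented, etc.)
)

def words(text):
    return RE_UNICODE_WORDS.findall(text or "")

print(words("tooLegit2Quit"))      # ['too', 'Legit', '2', 'Quit']
print(words("enable 24h format"))  # ['enable', '24', 'h', 'format']
print(words("你好,世界"))           # ['你好', '世界'] (fullwidth comma splits)
```

Because the last alternative only matches Unicode *letters*, fullwidth punctuation such as ',' and '(' acts as a separator, unlike the plain `[^\x00-\x7f]`-style class.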
Noted. I have implemented Unicode support and passed all the existing and new test cases on the Python 3 versions. There are a lot of issues on Python 2, as it has a different representation and implementation of strings, and of regexes for Unicode. Should we support Python 2.7?
Implementation repo
https://github.com/mervynlee94/pydash/tree/unicode-ver-for-word-splitter
What types of errors are happening on Python 2 that aren't on 3? Is it just getting the test cases represented correctly or is there behavior difference in the function itself?
Yes, it is just getting the test cases represented correctly, by the way.
Here are some examples of the errors on py27; all the other versions passed all the test cases.
```
_______________________ test_separator_case[case10-foo_bar_baz] _______________________

case = ('--foo.bar;baz', '_'), expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            (("foo bar baz", "-"), "foo-bar-baz"),
            (("foo__bar_baz", "-"), "foo-bar-baz"),
            (("foo-_bar-_-baz", "-"), "foo-bar-baz"),
            (("foo!bar,baz", "-"), "foo-bar-baz"),
            (("--foo.bar;baz", "-"), "foo-bar-baz"),
            (("Foo Bar", "-"), "foo-bar"),
            (("foo bar baz", "_"), "foo_bar_baz"),
            (("foo__bar_baz", "_"), "foo_bar_baz"),
            (("foo-_bar-_-baz", "_"), "foo_bar_baz"),
            (("foo!bar,baz", "_"), "foo_bar_baz"),
            (("--foo.bar;baz", "_"), "foo_bar_baz"),
            (("Foo Bar", "_"), "foo_bar"),
            (("", "_"), ""),
            ((None, "_"), ""),
        ],
    )
    def test_separator_case(case, expected):
>       assert _.separator_case(*case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:718: AssertionError
______________________ test_snake_case[foo__bar_baz-foo_bar_baz] ______________________

case = 'foo__bar_baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo____bar__b_az' == 'foo_bar_baz'
E         - foo____bar__b_az
E         ?    ---      - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_____________________ test_snake_case[foo-_bar-_-baz-foo_bar_baz] _____________________

case = 'foo-_bar-_-baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo__b_ar___baz' == 'foo_bar_baz'
E         - foo__b_ar___baz
E         ?    - -   --
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_____________________ test_snake_case[--foo.bar;baz-foo_bar_baz] ______________________

case = '--foo.bar;baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
____ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] ____

case = '\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c

tests/test_strings.py:806: AssertionError
____ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] ____

case = '\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c
```
After investigation, there are some strange issues on Python 2.7. I had to use the u'' prefix to solve one of the issues, which I raised on Stack Overflow.
Another is that the representation of strings differs in nature between Python 2 and Python 3. In Python 3 all strings are Unicode, but not in Python 2, so the test cases won't match: the output contains raw byte escapes ('\xe4', '\xe5', '\xef', '\xe7', ...) instead of a readable Unicode string.