dgilland / pydash

The kitchen sink of Python utility libraries for doing "stuff" in a functional way. Based on the Lo-Dash Javascript library.

Home Page: http://pydash.readthedocs.io

kebabCase returns empty string on Unicode strings

mervynlee94 opened this issue

Lodash in Javascript has kebabCase that handles both ASCII and Unicode strings as expected. In pydash, a Unicode string as input returns an empty string from many functions such as kebabCase, upperCase, lowerCase, etc. I looked into the source code and realized this is because the regex doesn't consider Unicode. Am I able to contribute to this project? Or could someone else implement it? Thanks!

Contributions are welcome!

Feel free to submit a PR to fix this. It would be most appreciated! 👍

@mervynlee94 BTW, do you have an example to illustrate the issue? Also, what result do you get when you run the same with Lodash?

Hi @dgilland, sorry for the late reply. Sure, one of the examples is "你好,世界", which is the translation of "Hello, World" into simplified Chinese. I expected to get a result of 你好-世界 from the kebabCase function, but got an empty string instead.
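For reference, a minimal reproduction of the behavior described above (the empty-string result is the reported pydash output; lodash's kebabCase returns 你好-世界 for the same input):

import pydash

print(pydash.kebab_case("hello world"))  # "hello-world" -- ASCII input works as expected
print(pydash.kebab_case("你好,世界"))      # "" -- reported result; expected "你好-世界"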

Here's the lodash implementation:

Looks like they have a special Unicode version of the word splitter. Seems reasonable to implement something similar in pydash to achieve the desired results.
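For example, a rough sketch of a Unicode-aware splitter (illustrative only, not pydash's or lodash's actual code), relying on Python 3's default Unicode-aware \w character class:

import re

# Runs of Unicode word characters, excluding the underscore; a full port would
# also need the camelCase/number splitting covered by the existing tests.
RE_UNICODE_WORDS = re.compile(r"[^\W_]+")

print(RE_UNICODE_WORDS.findall("hello_world"))  # ['hello', 'world']
print(RE_UNICODE_WORDS.findall("你好,世界"))      # ['你好', '世界']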

Yes, I agree. I will try to spare some time to implement it.

@dgilland

Referencing the Lodash implementation, the RE_WORDS in existing pydash is different from lodash's, resulting in a few failed test cases.

Example of failing test case:

case = 'enable 24h format', expected = ['enable', '24', 'h', 'format']

    @parametrize(
        "case,expected",
        [  # noqa
            ("hello world!", ["hello", "world"]),  # noqa
            ("hello_world", ["hello", "world"]),
            ("hello!@#$%^&*()_+{}|:\"<>?-=[]\\;\\,.'/world", ["hello", "world"]),
            ("hello 12345 world", ["hello", "12345", "world"]),
            ("enable 24h format", ["enable", "24", "h", "format"]),
            ("tooLegit2Quit", ["too", "Legit", "2", "Quit"]),
            ("xhr2Request", ["xhr", "2", "Request"]),
            (" ", []),
            ("", []),
            (None, []),
        ],
    )
    def test_words(case, expected):
>       assert _.words(case) == expected
E       AssertionError: assert ['enable', '24h', 'format'] == ['enable', '2...'h', 'format']
E         At index 1 diff: '24h' != '24'
E         Right contains one more item: 'format'
E         Full diff:
E         - ['enable', '24', 'h', 'format']
E         ?               ----
E         + ['enable', '24h', 'format']

tests/test_strings.py:1189: AssertionError


It is due to RE_WORDS.
Existing:

UPPER = "[A-Z\\xC0-\\xD6\\xD8-\\xDE]"
LOWER = "[a-z\\xDf-\\xF6\\xF8-\\xFF]+"
RE_WORDS = "/{upper}+(?={upper}{lower})|{upper}?{lower}|{upper}+|[0-9]+/g".format(
    upper=UPPER, lower=LOWER
)

Lodash's implementation:
RE_WORDS = "/[^\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+/g"
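As a quick check (hypothetical snippet; the JS-style /.../g delimiters are dropped so the patterns work with Python's re), the two patterns split the failing case differently:

import re

UPPER = "[A-Z\\xC0-\\xD6\\xD8-\\xDE]"
LOWER = "[a-z\\xDF-\\xF6\\xF8-\\xFF]+"
PYDASH_WORDS = "{upper}+(?={upper}{lower})|{upper}?{lower}|{upper}+|[0-9]+".format(
    upper=UPPER, lower=LOWER
)
LODASH_ASCII_WORDS = "[^\\x00-\\x2f\\x3a-\\x40\\x5b-\\x60\\x7b-\\x7f]+"

print(re.findall(PYDASH_WORDS, "enable 24h format"))        # ['enable', '24', 'h', 'format']
print(re.findall(LODASH_ASCII_WORDS, "enable 24h format"))  # ['enable', '24h', 'format']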

Which one should we follow?

We should follow lodash's version. The version in pydash is somewhat out of date and hasn't been updated in a while. So any failing test cases should be updated to match the new regex behavior that's in the latest lodash.

Noted. I have implemented Unicode support and passed all the existing and new test cases on the Python 3 versions. There are a lot of issues on Python 2, since it has a different representation and implementation of strings and of Unicode in regexes. Should we support Python 2.7?

Implementation repo
https://github.com/mervynlee94/pydash/tree/unicode-ver-for-word-splitter

What types of errors are happening on Python 2 that aren't on 3? Is it just getting the test cases represented correctly or is there behavior difference in the function itself?

Yes, it is just getting the test cases represented correctly, by the way.
Here are some examples of the errors on py27. All the other versions passed all the test cases.

_________________________________________________________________________________________________ test_separator_case[case10-foo_bar_baz] _________________________________________________________________________________________________

case = ('--foo.bar;baz', '_'), expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            (("foo  bar baz", "-"), "foo-bar-baz"),
            (("foo__bar_baz", "-"), "foo-bar-baz"),
            (("foo-_bar-_-baz", "-"), "foo-bar-baz"),
            (("foo!bar,baz", "-"), "foo-bar-baz"),
            (("--foo.bar;baz", "-"), "foo-bar-baz"),
            (("Foo Bar", "-"), "foo-bar"),
            (("foo  bar baz", "_"), "foo_bar_baz"),
            (("foo__bar_baz", "_"), "foo_bar_baz"),
            (("foo-_bar-_-baz", "_"), "foo_bar_baz"),
            (("foo!bar,baz", "_"), "foo_bar_baz"),
            (("--foo.bar;baz", "_"), "foo_bar_baz"),
            (("Foo Bar", "_"), "foo_bar"),
            (("", "_"), ""),
            ((None, "_"), ""),
        ],
    )
    def test_separator_case(case, expected):
>       assert _.separator_case(*case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:718: AssertionError
________________________________________________________________________________________________ test_snake_case[foo__bar_baz-foo_bar_baz] ________________________________________________________________________________________________

case = 'foo__bar_baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo  bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo____bar__b_az' == 'foo_bar_baz'
E         - foo____bar__b_az
E         ?    ---     - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_______________________________________________________________________________________________ test_snake_case[foo-_bar-_-baz-foo_bar_baz] _______________________________________________________________________________________________

case = 'foo-_bar-_-baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo  bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo__b_ar___baz' == 'foo_bar_baz'
E         - foo__b_ar___baz
E         ?     - -  --
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
_______________________________________________________________________________________________ test_snake_case[--foo.bar;baz-foo_bar_baz] ________________________________________________________________________________________________

case = '--foo.bar;baz', expected = 'foo_bar_baz'

    @parametrize(
        "case,expected",
        [
            ("foo  bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'foo_bar_;b_az' == 'foo_bar_baz'
E         - foo_bar_;b_az
E         ?         - -
E         + foo_bar_baz

tests/test_strings.py:806: AssertionError
__________________________________________________________ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] ___________________________________________________________

case = '\xe4\xbd\xa0\xe5\xa5\xbd,\xe4\xb8\x96\xe7\x95\x8c', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo  bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c

tests/test_strings.py:806: AssertionError
__________________________________________________________ test_snake_case[\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)-\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c] __________________________________________________________

case = '\xe4\xbd\xa0\xe5\xa5\xbd(\xe4\xb8\x96\xe7\x95\x8c)', expected = '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'

    @parametrize(
        "case,expected",
        [
            ("foo  bar baz", "foo_bar_baz"),
            ("foo__bar_baz", "foo_bar_baz"),
            ("foo-_bar-_-baz", "foo_bar_baz"),
            ("foo!bar,baz", "foo_bar_baz"),
            ("--foo.bar;baz", "foo_bar_baz"),
            ("FooBar", "foo_bar"),
            ("fooBar", "foo_bar"),
            ("你好,世界", "你好_世界"),
            ("你好(世界)", "你好_世界"),
            ("你好(世界)", "你好(世界)"),
            ("你好,世界", "你好,世界"),
            ("", ""),
            (None, ""),
            (5, "5"),
        ],
    )
    def test_snake_case(case, expected):
>       assert _.snake_case(case) == expected
E       AssertionError: assert 'a_a_a_c' == '\xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c'
E         - a_a_a_c
E         + \xe4\xbd\xa0\xe5\xa5\xbd_\xe4\xb8\x96\xe7\x95\x8c


After investigation, there are some strange issues on Python 2.7. I had to use the u'' prefix to solve one of the issues, which I raised on Stack Overflow.

Another is that the representation of strings in Python 2 and Python 3 differs in nature. In Python 3 all strings are Unicode, but that's not the case in Python 2. Therefore the test cases won't match, as the output contains raw bytes like '\xe4', '\xe5', '\xef', '\xe4', '\xe7', '\xef' instead of a readable Unicode string.
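A minimal illustration of that difference (assuming a UTF-8 source file; the u'' prefix is also accepted again on Python 3.3+, which is what makes shared test cases workable):

# -*- coding: utf-8 -*-
import sys

s_plain = "你好世界"   # Python 2: byte string (str); Python 3: text (str)
s_text = u"你好世界"   # unicode on Python 2, str on Python 3

if sys.version_info[0] == 2:
    print(len(s_plain))   # 12 -- UTF-8 byte length; regexes see bytes like '\xe4', '\xbd', ...
else:
    print(len(s_plain))   # 4 -- character count
print(len(s_text))        # 4 on both versions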