str_split not splitting correctly on Unicode character

Question

alexanderbeatson opened this issue 2 months ago

Alexander Beatson commented 2 months ago

I am trying to split a Burmese Unicode string into characters with stringr::str_split(), but it does not return the values I expect.

str_split("စမ်းသပ်မှု", "")[[1]]

It returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit, strsplit("စမ်းသပ်မှု", "")[[1]], it returns a character-level split:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but the result of stringr::str_split() is neither a character-level nor a syllable-level split. A syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So I don't think this is actually a feature like the one in Issue:88.

For further study, could someone point me to where this splitting behavior comes from? I found that other services, such as Google, also use this incorrect splitting. TIA.
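A minimal sketch of where the difference likely comes from, assuming stringr is installed and the session runs in a UTF-8 locale. The stringr documentation describes an empty pattern as equivalent to boundary("character"), which splits on character boundaries as defined by ICU (user-perceived characters, so combining marks stay attached to their base letter), whereas base strsplit() with "" splits into individual code points. The variable names below are illustrative only.

```r
library(stringr)

x <- "စမ်းသပ်မှု"

# stringr: an empty pattern is documented as equivalent to boundary("character")
by_empty    <- str_split(x, "")[[1]]
by_boundary <- str_split(x, boundary("character"))[[1]]

# Both calls should give the same pieces
same_split <- identical(by_empty, by_boundary)
print(same_split)

# Base R: splitting on "" yields one element per code point
by_codepoint <- strsplit(x, "")[[1]]
print(length(by_empty))
print(length(by_codepoint))
```

If the two stringr calls agree, the output in the report is the documented boundary("character") behavior rather than a regex quirk.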

Marek Gagolewski · Answer 1 · Tue Apr 02 2024 21:08:01 GMT+0800 (China Standard Time)

... and what would be the correct result?

Alexander Beatson · Answer 2 · Thu Apr 04 2024 14:09:55 GMT+0800 (China Standard Time)

The correct result should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"
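One way to get this code-point-level result without relying on any splitting pattern is sketched below, using only base R. It assumes the input is a single UTF-8 string; the variable names are illustrative, and this is a workaround sketch rather than anything stringr itself recommends.

```r
x <- "စမ်းသပ်မှု"

# Convert the string to integer code points, then back to
# one string per code point
code_points <- intToUtf8(utf8ToInt(x), multiple = TRUE)

print(code_points)
print(length(code_points))

# Joining the pieces again should reproduce the original string
round_trip <- paste(code_points, collapse = "")
print(round_trip == x)
```

Because utf8ToInt() accepts only a single string, a vector of inputs would need to be processed element by element, for example with lapply().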