str_split not splitting correctly on Unicode character
alexanderbeatson opened this issue · comments
I am trying to split Burmese Unicode characters with stringr::str_split(), but it does not return the correct values.
str_split("စမ်းသပ်မှု", "")[[1]]
it returns:
[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"
If I use the built-in strsplit: strsplit("စမ်းသပ်မှု", "")[[1]]
it returns a character-level split:
[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"
I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither a character-level nor a syllable-level split. A syllable-level split would be:
[1] "စမ်း" "သပ်" "မှု"
So, I don't think it is actually a feature like Issue:88
For further study, if possible, could someone point me to where this splitting comes from? I found that other services like Google also use this incorrect splitting format. TIA.
... and what would be the correct result?
Correct return should be:
[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"
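For anyone who needs that code-point-level result regardless of how stringr handles the empty pattern, here is a minimal base R sketch. It assumes the input is valid UTF-8; the variable names `x` and `code_points` are only illustrative. As far as I can tell, the str_split() output above is consistent with splitting on grapheme clusters (the stringr documentation describes an empty pattern as equivalent to boundary("character")), which would explain why combining marks stay attached to their base consonant.

```r
# Split a string into individual Unicode code points using base R only.
# The \u escapes spell the word from this issue, so the example does not
# depend on the encoding of the source file.
x <- "\u1005\u1019\u103a\u1038\u101e\u1015\u103a\u1019\u103e\u102f"

# utf8ToInt() returns one integer per code point; intToUtf8() with
# multiple = TRUE turns each integer back into its own string.
code_points <- intToUtf8(utf8ToInt(x), multiple = TRUE)

# One element per code point, combining marks included.
print(length(code_points))

# Show the code point values to check which marks were separated.
print(sprintf("U+%04X", utf8ToInt(x)))
```

This sidesteps regex and boundary rules entirely, so the result does not depend on the stringr or stringi version installed.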