tidyverse / stringr

A fresh approach to string manipulation in R

Home Page: https://stringr.tidyverse.org

str_split not splitting correctly on Unicode character

alexanderbeatson opened this issue

I am trying to split Burmese Unicode text with stringr::str_split(), but it does not return the correct values.

str_split("စမ်းသပ်မှု", "")[[1]]

It returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit, strsplit("စမ်းသပ်မှု", "")[[1]], it returns a character-level split:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither a character-level nor a syllable-level split. The syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So I don't think this is actually a feature, as in issue #88.

For further study, if possible, could someone point me to where this splitting behavior comes from? I found that other services, such as Google, also use this incorrect splitting format. TIA.
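On where the splitting comes from: as far as the stringr documentation describes it, an empty pattern "" is treated as equivalent to boundary("character"), which splits on grapheme-cluster boundaries (via stringi and ICU) rather than on individual code points. The following is a minimal sketch, assuming stringr is installed, to check that the two calls agree on this string. It illustrates the documented behavior and is not a statement about stringr's internals.

```r
library(stringr)

x <- "စမ်းသပ်မှု"

# Split with an empty pattern, as in the report above.
by_empty <- str_split(x, "")[[1]]

# Split explicitly on "character" (grapheme-cluster) boundaries.
by_boundary <- str_split(x, boundary("character"))[[1]]

# If the documented equivalence holds, the two results are the same.
identical(by_empty, by_boundary)
```

If the comparison is TRUE, the six-element result shown earlier is a grapheme-cluster split, which would explain why combining marks stay attached to their base consonants.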

... and what would be the correct result?

The correct return value should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"
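If a code-point-level split like the one above is what is needed, one possible approach is to go through the integer code points in base R. This is only a sketch, not a stringr feature; the variable name code_points is made up for illustration.

```r
x <- enc2utf8("စမ်းသပ်မှု")

# utf8ToInt() gives one integer per Unicode code point;
# intToUtf8(..., multiple = TRUE) turns each integer back into
# its own one-code-point string.
code_points <- intToUtf8(utf8ToInt(x), multiple = TRUE)

print(code_points)
length(code_points)
```

This sidesteps grapheme-cluster segmentation entirely, so combining marks such as "်" come out as separate elements, in line with the strsplit() output quoted in the report.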