juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/


Burmese syllabification error in ngram > unnest_tokens

alexanderbeatson opened this issue · comments

The following is a reproducible example.

library(tidytext)
library(dplyr)
x <- data.frame(n = 1:10, txt = "နှင့် အ တူ")
x %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2) %>% glimpse()

In Burmese, "နှင့်" is a single syllable and cannot be broken into two. But when I use unnest_tokens with token = "ngrams", it breaks "နှင့်" into two ("နှ" and "င့်").

"နှင့်" is just an example. It actually breaks into two syllables if the second consonant with ် (asat/killer) has the suffix ့ (dot below) or း (visarga).

If tidytext needs Burmese syllabification, I can probably help you with that. I have written a regex for Burmese syllabification (it comes with spelling-error tolerance, which can be removed).

In the meantime, is there any way to skip the built-in Burmese syllabification? I checked the collapse parameter, but it does not work.

Hmmmm, if I am understanding you correctly, I don't think I can reproduce the problem you are reporting:

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(n = 1, txt = "နှင့် အ တူ")
df %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2)
#> # A tibble: 2 × 2
#>       n ngram
#>   <dbl> <chr>
#> 1     1 နှင့် အ 
#> 2     1 အ တူ

Created on 2022-04-25 by the reprex package (v2.0.1)

The default tokenizer in tidytext comes from the tokenizers package, which you can also use directly to see how this is happening:

library(tokenizers)
txt <- "နှင့် အ တူ"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှင့် အ" "အ တူ"

Created on 2022-04-25 by the reprex package (v2.0.1)

I don't think this is a tidytext, or even an R, issue? This is probably an encoding and/or maybe a C locale issue? The tokenizers package uses some underlying C++ handling code.
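If it is an encoding or locale issue, a couple of base R checks can help narrow it down (just a rough starting point, using only base R functions):

Sys.getlocale("LC_CTYPE")  # the character-handling locale this R session is using
l10n_info()                # whether this R session is running in a UTF-8 locale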

Can you create a reprex (a minimal reproducible example) for your problem? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. Specifically, we will want the session info to see what platform, etc., you are using.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

I found that we have a different "နှင့်" here even though the visual rendering is the same. You wrote "နှင့်" as "န ှ င ့ ်", but the correct spelling order in Burmese orthography is "န ှ င ် ့". ("နှင့်" and "နှင့်" have the same visual rendering but different orthographic orders. Or you might have just copied and pasted my string and the OS somehow reordered it.) So I will post the exact Unicode code points with a reprex here soon.
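One way to see which ordering a given string actually contains is to inspect its code points in base R, for example:

codepoints <- function(x) sprintf("U+%04X", utf8ToInt(x))
codepoints("\U1014\U103E\U1004\U103A\U1037")
#> [1] "U+1014" "U+103E" "U+1004" "U+103A" "U+1037"

Here ် (asat, U+103A) comes before ့ (dot below, U+1037), which is the ordering I mean.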

I tried copy/pasting your "correct" characters and I still see this:

library(tokenizers)
txt <- "နှင့် အ တူ"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှင့် အ" "အ တူ"

Created on 2022-05-01 by the reprex package (v2.0.1)

If you can create a reprex that more accurately captures the Unicode, that would be great. Still, this isn't something specific to tidytext; it's about how R in general is handling your characters.

Sorry, I cannot use reprex because of the known R-on-Windows locale issue. I hope the following code is helpful.
library(tokenizers)
txt <- "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030"
tokenize_ngrams(txt, n = 2)

Ah great! I do now see the problem you see:

library(tokenizers)
txt <- "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှ င့်" "င့် အ" "အ တူ"

Created on 2022-05-03 by the reprex package (v2.0.1)

I would suggest that you open an issue on the tokenizers package so you can see how this can be resolved.

In the meantime, if you have a function for tokenization that works the way you need it to, notice that the token argument can be a function. You can use it like this:

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) %>%
    unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")
#> # A tibble: 124,032 × 1
#>    word       
#>    <chr>      
#>  1 "pride"    
#>  2 "and"      
#>  3 "prejudice"
#>  4 ""         
#>  5 "by"       
#>  6 "jane"     
#>  7 "austen"   
#>  8 ""         
#>  9 ""         
#> 10 ""         
#> # … with 124,022 more rows

Created on 2022-05-03 by the reprex package (v2.0.1)
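For example, if you already have a regex for Burmese syllables, you could wrap it in a small function and pass that function as the token argument. This is only a rough sketch: the pattern below is a placeholder (it just splits on spaces), not a real Burmese syllabification regex.

library(tidytext)
library(dplyr)
library(stringr)

# Substitute your own Burmese syllabification regex for this placeholder pattern.
burmese_syllable_split <- function(x, pattern = " ") {
  str_split(x, pattern)
}

tibble(txt = "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030") %>%
  unnest_tokens(syllable, txt, token = burmese_syllable_split)

The custom function just needs to take a character vector and return a list of character vectors, the same shape the tokenizers functions return.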

Thank you so much, Julia. I'll open an issue on the tokenizers package and close this one, as it is not directly a tidytext problem.

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.