juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:

Home Page: https://juliasilge.github.io/tidytext/


Burmese syllabification error in ngram > unnest_tokens

alexanderbeatson opened this issue · comments

The following is a reproducible example.

library(tidytext)
library(dplyr)
x <- data.frame(n = 1:10, txt = "နှင့် အ တူ")
x %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2) %>% glimpse()

In Burmese, "နှင့်" is a single syllable and cannot be broken into two. But when I use unnest_tokens with token = "ngrams", it breaks "နှင့်" into two ("နှ" and "င့်").

"နှင့်" is just an example. It actually breaks into two syllables if the second consonant with ် (asat/killer) has the suffix ့ (dot below) or း (visarga).

If tidytext needs Burmese syllabification, I can probably help you with that. I have written a regex for Burmese syllabification (it comes with spelling-error tolerance, which can be removed).

In the meantime, is there any way to skip the built-in Burmese syllabification? I checked the collapse parameter, but it does not work.

Hmmmm, if I am understanding you correctly, I don't think I can reproduce the problem you are reporting:

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(n = 1, txt = "နှင့် အ တူ")
df %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2)
#> # A tibble: 2 × 2
#>       n ngram
#>   <dbl> <chr>
#> 1     1 နှင့် အ 
#> 2     1 အ တူ

Created on 2022-04-25 by the reprex package (v2.0.1)

The default tokenizer in tidytext comes from the tokenizers package, which you can also use directly to see how this is happening:

library(tokenizers)
txt <- "နှင့် အ တူ"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှင့် အ" "အ တူ"

Created on 2022-04-25 by the reprex package (v2.0.1)

I don't think this is a tidytext, or even an R, issue? This is probably an encoding and/or maybe a C locale issue? The tokenizers package uses some underlying C++ handling code.
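If it is an encoding or locale issue, a couple of base R checks can help narrow it down (just a rough starting point, using only base R functions):

Sys.getlocale("LC_CTYPE")  # the character-handling locale this R session is using
l10n_info()                # whether this R session is running in a UTF-8 locale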

Can you create a reprex (a minimal reproducible example) for your problem? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. Specifically, we will want the session info to see what platform, etc., you are using.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

I found that we have a different "နှင့်" here even though the visual rendering is the same. You wrote "နှင့်" as "န ှ င ့ ်", but the correct spelling order in Burmese orthography is "န ှ င ် ့". ("နှင့်" and "နှင့်" have the same visual rendering but different orthographic orders. Or you might have just copied and pasted my string and the OS somehow reordered it.) So I will post the exact Unicode code points with a reprex here soon.
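One way to see which ordering a given string actually contains is to inspect its code points in base R, for example:

codepoints <- function(x) sprintf("U+%04X", utf8ToInt(x))
codepoints("\U1014\U103E\U1004\U103A\U1037")
#> [1] "U+1014" "U+103E" "U+1004" "U+103A" "U+1037"

Here ် (asat, U+103A) comes before ့ (dot below, U+1037), which is the ordering I mean.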

I tried copy/pasting your "correct" characters and I still see this:

library(tokenizers)
txt <- "နှင့် အ တူ"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှင့် အ" "အ တူ"

Created on 2022-05-01 by the reprex package (v2.0.1)

If you can create a reprex that more accurately captures the Unicode, that would be great. Still, this isn't something specific to tidytext; it's about how R in general is handling your characters.

Sorry, I cannot use reprex because of the known R-on-Windows locale issue. I hope the following code is helpful.
library(tokenizers)
txt <- "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030"
tokenize_ngrams(txt, n = 2)

Ah great! I do now see the problem you see:

library(tokenizers)
txt <- "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030"
tokenize_ngrams(txt, n = 2)
#> [[1]]
#> [1] "နှ င့်" "င့် အ" "အ တူ"

Created on 2022-05-03 by the reprex package (v2.0.1)

I would suggest that you open an issue on the tokenizers package so you can see how this can be resolved.

In the meantime, if you have a function for tokenization that works the way you need it to, notice that the token argument can be a function. You can use it like this:

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) %>%
    unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")
#> # A tibble: 124,032 × 1
#>    word       
#>    <chr>      
#>  1 "pride"    
#>  2 "and"      
#>  3 "prejudice"
#>  4 ""         
#>  5 "by"       
#>  6 "jane"     
#>  7 "austen"   
#>  8 ""         
#>  9 ""         
#> 10 ""         
#> # … with 124,022 more rows

Created on 2022-05-03 by the reprex package (v2.0.1)
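For example, if you already have a regex for Burmese syllables, you could wrap it in a small function and pass that function as the token argument. This is only a rough sketch: the pattern below is a placeholder (it just splits on spaces), not a real Burmese syllabification regex.

library(tidytext)
library(dplyr)
library(stringr)

# Substitute your own Burmese syllabification regex for this placeholder pattern.
burmese_syllable_split <- function(x, pattern = " ") {
  str_split(x, pattern)
}

tibble(txt = "\U1014\U103E\U1004\U103A\U1037\U0020\U1021\U0020\U1010\U1030") %>%
  unnest_tokens(syllable, txt, token = burmese_syllable_split)

The custom function just needs to take a character vector and return a list of character vectors, the same shape the tokenizers functions return.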

Thank you so much, Julia. I'll open an issue on the tokenizers package and close this one, as it is not directly a tidytext problem.

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.