Tomeriko96 / polyglotr

R package to translate text

Home Page: https://tomeriko96.github.io/polyglotr/

Encoding issue

lbajemon opened this issue

Hi,

Thank you for making this package, it is very useful. However, I encountered an issue when translating into English from a language with accented or non-Latin characters, for example French or Arabic. The returned translation is partly or totally unreadable.

The issue:

# in French
text_fr = "La Saône prend sa source à Vioménil dans les pré-Vosges à 405 m d'altitude. La rivière conflue avec le Rhône 473,3 km plus loin."
fr_to_en = polyglotr::google_translate(text_fr, "en", "fr")

The result is "The Saône has its source at Vioménil in the Pré-Vosges at an altitude of 405 m. The river confluences with the Rhône 473.3 km further." while I'm expecting Saône instead of Saône

# in Arabic
text_ar = "يتدفقُ النيل عبر الصحراء السودانية إلى مصر باتجاه الشمال ويمر في مدينةُ القاهرة الواقعة على دلتا النهر الكبيرة (دلتا النيل)، ثم يعبر النهر مدينتي دمياط ورشيد ويصب ..."
ar_to_en = polyglotr::google_translate(text_ar, "en", "ar")

The result is "اÙÙÙ٠عبر اÙصØراء اÙسÙداÙÙØ© Ùصر ... (دÙتا اÙÙÙÙ)Ø ±".

What I've tried:
My guess is that it is an encoding issue. I tried enc2utf8(mytext) and Encoding<-(value = mytext, enc2utf8) to mark the text as UTF-8, but neither worked. The encoding of the returned translation is "unknown" (as reported by Encoding(ar_to_en)).
Have you ever encountered this problem?
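
The garbled French output above is consistent with UTF-8 bytes being read as a single-byte encoding somewhere along the way. The following is a minimal offline sketch of that effect, not a diagnosis of the package itself. It assumes latin1 is the encoding the bytes were misread as, and the variable names are illustrative only.

```r
# Offline sketch: reproduce the kind of garbling seen in the French example,
# then undo it. No network access or extra packages are needed.
original <- "Sa\u00f4ne"                 # "Saône", marked as UTF-8
utf8_bytes <- charToRaw(original)        # 53 61 c3 b4 6e 65

# Assumption: the UTF-8 bytes are interpreted as latin1. Each byte of the
# two-byte sequence c3 b4 then becomes a character of its own.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
# garbled is now "SaÃ´ne"

# Reverse the misreading: convert back to the latin1 bytes, then mark those
# bytes as UTF-8 instead.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
repaired <- rawToChar(charToRaw(repaired))
Encoding(repaired) <- "UTF-8"

identical(charToRaw(repaired), utf8_bytes)
```

This kind of repair only works when no bytes were lost. In the Arabic output above, some bytes appear to be missing, so the original text may not be recoverable this way.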

Thank you.

Desktop:

  • OS: Windows
  • R version : 4.3.1
  • RStudio version : 2023.09.0

Additional context:
I also tried mymemory_translate, but I reached its character limit, and given the languages it supports, I'd prefer to use Google Translate anyway.

Hi @lbajemon,

Thank you for reporting this issue. I have tested the code and can confirm the problem you've described.

I'll be looking into it and will provide an update as soon as I have more information.

Hi @lbajemon,

Could you try the following function?

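# Outside the package, this needs the pipe operator %>% (e.g. library(magrittr))
# and the urltools, httr, purrr and rvest packages to be installed.
google_translate_test <- function(text, target_language = "en", source_language = "auto") {
  is_vector <- is.vector(text) && length(text) > 1

  # Percent-encode the input so non-ASCII characters are sent as escaped bytes
  formatted_text <- urltools::url_encode(text)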
  
  formatted_link <- paste0(
    "https://translate.google.com/m?tl=",
    target_language, "&sl=", source_language,
    "&q=",
    formatted_text
  )
  
  if (is_vector) {
    responses <- purrr::map(formatted_link, httr::GET)
    
    translations <- purrr::map(responses, ~ {
      translation <- httr::content(.x) %>%
        rvest::html_nodes("div.result-container") %>%
        rvest::html_text()
      
      translation <- urltools::url_decode(translation)
      translation <- gsub("\n", "", translation)
      
      translation
    })
    
    return(translations)
  } else {
    response <- httr::GET(formatted_link)
    
    translation <- httr::content(response) %>%
      rvest::html_nodes("div.result-container") %>%
      rvest::html_text()
    
    translation <- urltools::url_decode(translation)
    translation <- gsub("\n", "", translation)
    
    return(translation)
  }
}

Running your examples now returns the following:

> text_fr = "La Saône prend sa source à Vioménil dans les pré-Vosges à 405 m d'altitude. La rivière conflue avec le Rhône 473,3 km plus loin."
> fr_to_en = google_translate_test(text_fr, "en", "fr")
> fr_to_en
[1] "The Saône has its source at Vioménil in the pre-Vosges at an altitude of 405 m. The river confluences with the Rhône 473.3 km further."
> # in Arabic
> text_ar = "يتدفقُ النيل عبر الصحراء السودانية إلى مصر باتجاه الشمال ويمر في مدينةُ القاهرة الواقعة على دلتا النهر الكبيرة (دلتا النيل)، ثم يعبر النهر مدينتي دمياط ورشيد ويصب ..."
> ar_to_en = google_translate_test(text_ar, "en", "ar")
> ar_to_en
[1] "The Nile flows through the Sudanese desert to Egypt towards the north and passes through the city of Cairo, located on the large river delta (Nile Delta), then the river crosses the cities of Damietta and Rosetta and flows..."
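
The test function above percent-encodes the input with urltools::url_encode before building the request URL. As a rough illustration of what percent-encoding does to non-ASCII text, here is an offline sketch using base R's utils::URLencode and utils::URLdecode. These are stand-ins chosen to avoid extra dependencies (an assumption of this example); they are not the calls the function makes.

```r
# Offline sketch of percent-encoding with base R (stand-ins for urltools).
text <- "Sa\u00f4ne"                      # "Saône", marked as UTF-8

# Each byte of a non-ASCII character becomes a %XX escape.
encoded <- unname(utils::URLencode(text, reserved = TRUE))
# encoded is "Sa%C3%B4ne"

# Decoding restores the raw bytes, but the result carries no encoding mark
# (Encoding() typically reports "unknown"), so mark it as UTF-8 explicitly.
decoded <- unname(utils::URLdecode(encoded))
Encoding(decoded) <- "UTF-8"

identical(charToRaw(decoded), charToRaw(text))
```

The explicit Encoding<- step matters mostly on systems whose native encoding is not UTF-8. There, an unmarked string holding UTF-8 bytes can be displayed as garbled text.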

Hi @tin900, it works with this function. Thank you very much.

Awesome!

Commit 72994d2 implements the bugfix.

You can install the development version of the package to use the improved google_translate() function.
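
One common way to install a development version from GitHub is sketched below. It assumes the remotes package is available and that the fix is on the default branch of the Tomeriko96/polyglotr repository; the thread does not state the exact installation command.

```r
# Sketch: install the development version from GitHub (requires network access).
# install.packages("remotes")   # uncomment if remotes is not installed yet
remotes::install_github("Tomeriko96/polyglotr")
```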

I will update the package on CRAN in the coming days as well.

The new version of the package is now available on CRAN.