bnosac / word2vec

Distributed Representations of Words using word2vec


Avoid re-encoding when writing out files

jwijffels opened this issue · comments

See issue #6

Switch to using writeLines(text = x, con = filehandle_train, useBytes = TRUE); otherwise, Windows re-encodes the file.
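
A minimal sketch of the fix (the file name and the x object here are placeholders, not the package's actual code):

filehandle_train <- file("train.txt", open = "wt")
## useBytes = TRUE writes the raw bytes as-is instead of translating them
## to the native encoding (e.g. the Windows locale code page)
writeLines(text = x, con = filehandle_train, useBytes = TRUE)
close(filehandle_train)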

Thank you! But a new problem appears.

v=terminology[10:11]

v
[1] "桂" "姻"
predict(model, newdata = v, type = "nearest")
$桂
term1 term2 similarity rank
1 桂 重光 0.9973573 1
2 桂 苦思 0.9973214 2
3 桂 答 0.9972725 3
4 桂 雪景 0.9972241 4
5 桂 再看 0.9972237 5
6 桂 暮 0.9971988 6
7 桂 咄 0.9971318 7
8 桂 倏 0.9971008 8
9 桂 虬 0.9970832 9
10 桂 跪 0.9970744 10

$姻
term1 term2 similarity rank
1 姻 千金 0.9959285 1
2 姻 尹 0.9956685 2
3 姻 搭 0.9956171 3
4 姻 近代 0.9955212 4
5 姻 昼寝 0.9955055 5
6 姻 服 0.9954844 6
7 姻 娇态 0.9954329 7
8 姻 尺寸 0.9954248 8
9 姻 矮 0.9954141 9
10 姻 中间 0.9954020 10

But if I use this code, the result is an error:
v=c("桂" ,"姻" )
predict(model, newdata = v, type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 桂

The two versions of v look equal (v == v), but the second one gives an error. Why? Can you help me?

Below is all the code:
library(readr)
library(word2vec)
path <- "cookbooks1.txt"
y <- readr::read_lines(path)
## clean the text but keep non-ASCII (Chinese) characters
x <- txt_clean_word2vec(y, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 1) ## you need to change hyperparameters to your own
terminology <- summary(model)   ## the vocabulary of the model
example <- sample(terminology, size = 2)
class(example)
class(terminology)
## taking the words from the model vocabulary works
v <- terminology[10:11]
predict(model, newdata = v, type = "nearest")
## typing the same words as literals gives the dictionary error
v <- c("桂", "姻")
predict(model, newdata = v, type = "nearest")

You need to provide the words you type in yourself in UTF-8 encoding.
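
For example, a sketch of converting the typed literals before predicting, assuming your session runs in a non-UTF-8 native locale (iconv with from = "" converts from the native encoding; enc2utf8 does the same):

v <- c("桂", "姻")
v <- iconv(v, from = "", to = "UTF-8")   ## from = "" means the native encoding
predict(model, newdata = v, type = "nearest")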

Hello,
the code is below:
v=terminology[which(terminology=="山")]
predict(model, newdata = v, type = "nearest")
$山
term1 term2 similarity rank
1 山 水 0.9285445 1
2 山 林 0.9080331 2
3 山 仙 0.9059756 3
4 山 远 0.9023114 4
5 山 右 0.8829441 5
6 山 主 0.8819309 6
7 山 隐 0.8805258 7
8 山 宽 0.8800192 8
9 山 洪 0.8799168 9
10 山 江 0.8798564 10

Following your advice, I set the encoding explicitly with the following code:
v=c("山" )
Encoding(v) <- 'UTF-8'
predict(model, newdata = v, type = "nearest")
The result is below:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 山

Look at the help of the function iconv: Encoding(v) <- 'UTF-8' only declares an encoding, it does not convert the underlying bytes; iconv does the actual conversion.
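
A small illustration of that difference, not from the thread (charToRaw shows the underlying bytes of a string):

v <- "山"
Encoding(v) <- "UTF-8"                       ## relabels the string, the bytes stay the same
charToRaw(v)
v2 <- iconv("山", from = "", to = "UTF-8")   ## actually converts the bytes
charToRaw(v2)                                ## UTF-8 bytes of 山: e5 b1 b1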

Do you have any papers or instructions on the use of the word2vec package? I would like to know how this package is used in practice. Thank you.

山 is different from 山 depending on your encoding

Encoding(terminology)
v <- c("山")
Encoding(v)
vutf8 <- iconv(v, from = <put here the encoding you have your v in>, to = "UTF-8")
Encoding(vutf8)
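
If you do not know which encoding your literals are in, a sketch for inspecting the native encoding of your session (Sys.getlocale and l10n_info are base R; CP936 is just an example code page):

Sys.getlocale("LC_CTYPE")   ## e.g. "Chinese (Simplified)_China.936" on Windows
l10n_info()                 ## $`UTF-8` is FALSE in a non-UTF-8 locale
vutf8 <- iconv(v, from = "CP936", to = "UTF-8")   ## conversion for a CP936 locale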

You can look here for details on how to use the package: https://www.bnosac.be/index.php/blog/100-word2vec-in-r

Since version 0.4.0 of this R package, the default in word2vec.character is useBytes = TRUE, which should solve this issue. Let me know if that is the case.
You can now also build a word2vec model on a list of tokenised sentences, so that you can use your own tokeniser (see the sketch below).
Hopefully one of these two solutions fixes your issue. as.matrix.word2vec also has an encoding argument, as does predict.word2vec.
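
A minimal sketch of the list interface (the toy sentences and the hyperparameters are invented for illustration):

library(word2vec)
## each list element is one sentence, already split into tokens by your own tokeniser
toks <- list(c("山", "水", "林", "江"),
             c("山", "仙", "远", "隐"))
model <- word2vec(x = toks, dim = 15, iter = 20, min_count = 1)
predict(model, newdata = "山", type = "nearest", top_n = 3)
emb <- as.matrix(model, encoding = "UTF-8")   ## the encoding argument mentioned above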

Closing. Feel free to reopen in case the issue is not solved.