avoid reencoding when writing out files
jwijffels opened this issue
See issue #6
- https://github.com/bnosac/word2vec/blob/master/R/word2vec.R#L141
- https://github.com/bnosac/word2vec/blob/master/R/word2vec.R#L127
Switch to using writeLines(text = x, con = filehandle_train, useBytes = TRUE); otherwise Windows re-encodes the text.
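A minimal sketch of that suggestion (file name and variable names are illustrative, not the package's actual internals): opening the connection in binary mode and passing useBytes = TRUE makes writeLines() emit the UTF-8 bytes as-is instead of translating them into the native Windows encoding first.

```r
# Illustrative sketch: write UTF-8 text without locale re-encoding.
x <- "桂 姻 山"                                     # UTF-8 encoded text
filehandle_train <- file("train.txt", open = "wb")  # binary mode: no translation
writeLines(text = x, con = filehandle_train, useBytes = TRUE)
close(filehandle_train)
readLines("train.txt", encoding = "UTF-8")          # round-trips unchanged
```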
Thank you! But a new problem appears.
v=terminology[10:11]
v
[1] "桂" "姻"
predict(model, newdata = v, type = "nearest")
$桂
term1 term2 similarity rank
1 桂 重光 0.9973573 1
2 桂 苦思 0.9973214 2
3 桂 答 0.9972725 3
4 桂 雪景 0.9972241 4
5 桂 再看 0.9972237 5
6 桂 暮 0.9971988 6
7 桂 咄 0.9971318 7
8 桂 倏 0.9971008 8
9 桂 虬 0.9970832 9
10 桂 跪 0.9970744 10
$姻
term1 term2 similarity rank
1 姻 千金 0.9959285 1
2 姻 尹 0.9956685 2
3 姻 搭 0.9956171 3
4 姻 近代 0.9955212 4
5 姻 昼寝 0.9955055 5
6 姻 服 0.9954844 6
7 姻 娇态 0.9954329 7
8 姻 尺寸 0.9954248 8
9 姻 矮 0.9954141 9
10 姻 中间 0.9954020 10
But if I use this code, the result is an error:
v=c("桂" ,"姻" )
predict(model, newdata = v, type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 桂
v is the same in both cases, but the second call gives an error. Why? Can you help me?
Below is all the code:
library(readr)
library(word2vec)
path="cookbooks1.txt"
y=readr::read_lines(path)
x <- txt_clean_word2vec(y, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 1) ## you need to change hyperparameters to your own
terminology <- summary(model)
example <- sample(terminology, size = 2)
class(example)
class(terminology)
v=terminology[10:11]
predict(model, newdata = v, type = "nearest")
v=c("桂" ,"姻" )
predict(model, newdata = v, type = "nearest")
You need to supply the words you type yourself in UTF-8 encoding.
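A hedged sketch of what that means in base R (enc2utf8() is base R; no model object is needed for the conversion itself): words typed at the console inherit the session's native encoding, and enc2utf8() converts them to UTF-8 so they can match terms the model stored as UTF-8.

```r
v <- "山"
Encoding(v)           # depends on your locale (e.g. not UTF-8 on Chinese Windows)
vutf8 <- enc2utf8(v)  # convert from the native encoding to UTF-8
Encoding(vutf8)       # "UTF-8"
```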
Hello,
the code is below:
v=terminology[which(terminology=="山")]
predict(model, newdata = v, type = "nearest")
$山
term1 term2 similarity rank
1 山 水 0.9285445 1
2 山 林 0.9080331 2
3 山 仙 0.9059756 3
4 山 远 0.9023114 4
5 山 右 0.8829441 5
6 山 主 0.8819309 6
7 山 隐 0.8805258 7
8 山 宽 0.8800192 8
9 山 洪 0.8799168 9
10 山 江 0.8798564 10
Following your suggestion, I set the encoding explicitly with the following code:
v=c("山" )
Encoding(v) <- 'UTF-8'
predict(model, newdata = v, type = "nearest")
The result is below:
v=c("山" )
Encoding(v) <- 'UTF-8'
predict(model, newdata = v, type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 山
Look at the help of the iconv function.
Do you have any papers on the use of the word2vec package, or a tutorial? I would like to know how this package is used in practice. Thank you.
山 is different from 山 depending on your encoding: the glyph looks the same, but the underlying bytes differ.
Encoding(terminology)
v=c("山" )
Encoding(v)
vutf8 <- iconv(v, from = <put here the encoding you have your v in>, to = "UTF-8")
Encoding(vutf8)
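To make the snippet above concrete, here is a hedged illustration using escaped code points so it is locale-independent (the terminology value is a stand-in, not real model output):

```r
# Hedged illustration: normalise a typed query word to UTF-8 before predict().
terminology <- "\u5c71"   # stand-in for a term from summary(model)
v <- "\u5c71"             # the same character typed at the console
# from = "" means "the session's native encoding"; on Chinese Windows that is
# often GBK (check Sys.getlocale()), on Linux/macOS it is usually UTF-8.
vutf8 <- iconv(v, from = "", to = "UTF-8")
identical(vutf8, terminology)   # TRUE once both are in UTF-8
```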
You can look here for details on how to use the package: https://www.bnosac.be/index.php/blog/100-word2vec-in-r
Since version 0.4.0 of this R package, the default in word2vec.character is useBytes = TRUE, which should solve this issue. Let me know if that is the case.
You can now also make a word2vec model based on a list of tokenised sentences. So that you can use your own tokeniser.
Hopefully one of these two solutions fixes your issue. as.matrix.word2vec and predict.word2vec also have an encoding argument.
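A short sketch of the tokenised-sentences option mentioned above (hyperparameters here are illustrative and assume word2vec >= 0.4.0): training directly on a list of tokens bypasses writing a training file, and hence file encoding, entirely.

```r
library(word2vec)
# Each list element is one sentence, already split by your own tokeniser.
toks <- list(c("山", "水", "林"),
             c("桂", "姻", "山"))
model <- word2vec(x = toks, dim = 15, iter = 20, min_count = 1)
predict(model, newdata = c("山"), type = "nearest", top_n = 2)
```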
Closing. Feel free to reopen in case the issue seems not solved.