bnosac / word2vec

Distributed Representations of Words using word2vec


Avoid re-encoding when writing out files

jwijffels opened this issue · comments

See issue #6

Switch to using writeLines(text = x, con = filehandle_train, useBytes = TRUE); otherwise, Windows re-encodes the file.
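
A minimal sketch of the fix (the file name and the x object here are placeholders, not the package's actual code):

filehandle_train <- file("train.txt", open = "wt")
## useBytes = TRUE writes the raw bytes as-is instead of translating them
## to the native encoding (e.g. the Windows locale code page)
writeLines(text = x, con = filehandle_train, useBytes = TRUE)
close(filehandle_train)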

Thank you! But a new problem appears.

v=terminology[10:11]

v
[1] "桂" "姻"
predict(model, newdata = v, type = "nearest")
$桂
term1 term2 similarity rank
1 桂 重光 0.9973573 1
2 桂 苦思 0.9973214 2
3 桂 答 0.9972725 3
4 桂 雪景 0.9972241 4
5 桂 再看 0.9972237 5
6 桂 暮 0.9971988 6
7 桂 咄 0.9971318 7
8 桂 倏 0.9971008 8
9 桂 虬 0.9970832 9
10 桂 跪 0.9970744 10

$姻
term1 term2 similarity rank
1 姻 千金 0.9959285 1
2 姻 尹 0.9956685 2
3 姻 搭 0.9956171 3
4 姻 近代 0.9955212 4
5 姻 昼寝 0.9955055 5
6 姻 服 0.9954844 6
7 姻 娇态 0.9954329 7
8 姻 尺寸 0.9954248 8
9 姻 矮 0.9954141 9
10 姻 中间 0.9954020 10

But if I use this code, the result is an error:
v=c("桂" ,"姻" )
predict(model, newdata = v, type = "nearest")
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 桂

The two versions of v look equal (v == v), but the second one gives an error. Why? Can you help me?

Below is all the code:
library(readr)
library(word2vec)
path <- "cookbooks1.txt"
y <- readr::read_lines(path)
## clean the text but keep non-ASCII (Chinese) characters
x <- txt_clean_word2vec(y, ascii = FALSE, alpha = FALSE, tolower = TRUE, trim = TRUE)
write_lines(x, file = "test.txt")
model <- word2vec(x = "test.txt", min_count = 1) ## you need to change hyperparameters to your own
terminology <- summary(model)   ## the vocabulary of the model
example <- sample(terminology, size = 2)
class(example)
class(terminology)
## taking the words from the model vocabulary works
v <- terminology[10:11]
predict(model, newdata = v, type = "nearest")
## typing the same words as literals gives the dictionary error
v <- c("桂", "姻")
predict(model, newdata = v, type = "nearest")

You need to provide the words you type in yourself in UTF-8 encoding.
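
For example, a sketch of converting the typed literals before predicting, assuming your session runs in a non-UTF-8 native locale (iconv with from = "" converts from the native encoding; enc2utf8 does the same):

v <- c("桂", "姻")
v <- iconv(v, from = "", to = "UTF-8")   ## from = "" means the native encoding
predict(model, newdata = v, type = "nearest")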

Hello,
the code is below:
v=terminology[which(terminology=="山")]
predict(model, newdata = v, type = "nearest")
$山
term1 term2 similarity rank
1 山 水 0.9285445 1
2 山 林 0.9080331 2
3 山 仙 0.9059756 3
4 山 远 0.9023114 4
5 山 右 0.8829441 5
6 山 主 0.8819309 6
7 山 隐 0.8805258 7
8 山 宽 0.8800192 8
9 山 洪 0.8799168 9
10 山 江 0.8798564 10

Following your advice, I set the encoding explicitly with the following code:
v=c("山" )
Encoding(v) <- 'UTF-8'
predict(model, newdata = v, type = "nearest")
The result is below:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 山

Look at the help of the function iconv: Encoding(v) <- 'UTF-8' only declares an encoding, it does not convert the underlying bytes; iconv does the actual conversion.
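
A small illustration of that difference, not from the thread (charToRaw shows the underlying bytes of a string):

v <- "山"
Encoding(v) <- "UTF-8"                       ## relabels the string, the bytes stay the same
charToRaw(v)
v2 <- iconv("山", from = "", to = "UTF-8")   ## actually converts the bytes
charToRaw(v2)                                ## UTF-8 bytes of 山: e5 b1 b1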

Do you have any papers or instructions on the use of the word2vec package? I would like to know how this package is used in practice. Thank you.

山 is different from 山 depending on your encoding

Encoding(terminology)
v <- c("山")
Encoding(v)
vutf8 <- iconv(v, from = <put here the encoding you have your v in>, to = "UTF-8")
Encoding(vutf8)
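
If you do not know which encoding your literals are in, a sketch for inspecting the native encoding of your session (Sys.getlocale and l10n_info are base R; CP936 is just an example code page):

Sys.getlocale("LC_CTYPE")   ## e.g. "Chinese (Simplified)_China.936" on Windows
l10n_info()                 ## $`UTF-8` is FALSE in a non-UTF-8 locale
vutf8 <- iconv(v, from = "CP936", to = "UTF-8")   ## conversion for a CP936 locale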

You can look here for details on how to use the package: https://www.bnosac.be/index.php/blog/100-word2vec-in-r

Since version 0.4.0 of this R package, the default in word2vec.character is useBytes = TRUE, which should solve this issue. Let me know if that is the case.
You can now also build a word2vec model on a list of tokenised sentences, so that you can use your own tokeniser (see the sketch below).
Hopefully one of these two solutions fixes your issue. as.matrix.word2vec also has an encoding argument, as does predict.word2vec.
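
A minimal sketch of the list interface (the toy sentences and the hyperparameters are invented for illustration):

library(word2vec)
## each list element is one sentence, already split into tokens by your own tokeniser
toks <- list(c("山", "水", "林", "江"),
             c("山", "仙", "远", "隐"))
model <- word2vec(x = toks, dim = 15, iter = 20, min_count = 1)
predict(model, newdata = "山", type = "nearest", top_n = 3)
emb <- as.matrix(model, encoding = "UTF-8")   ## the encoding argument mentioned above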

Closing. Feel free to reopen in case the issue is not solved.