brightmart / text_classification

All kinds of text classification models, and more, with deep learning.


sample data, pre-trained word embedding

hpduong opened this issue · comments


I'm getting this issue when I run training on the a08 entity network and a06 seq2seq models.

How can I get or train this file?

zhihu-word2vec-title-desc.bin-100

(screenshot of the error attached)

Also, do you have sample datasets compatible with these models?

1. For zhihu-word2vec-title-desc.bin-100, please find it at:
   https://pan.baidu.com/s/1jIP9e6q

2. [OLD, to be deleted]
   For sample data (multi-label; file name: test-zhihu6-title-desc.txt):
   https://pan.baidu.com/s/1gf49auB

3. [OLD, to be deleted]
   train-zhihu4-only-title-all.txt (single-label):
   https://pan.baidu.com/s/1jI7R4X4

4. [NEW, use this; updated 2018-08-12]
   zhihu-title-desc-multiple-label-v6.txt.zip contains three files:
   https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA

   1. train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
   2. test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
   3. train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

Could you please upload it to a different service? I am having a hard time downloading the data and the word2vec model. Thanks!

You can follow two steps to get the file.
Step 1: click the download button. (screenshot attached)

Step 2: download the file. (screenshot attached)

When I run the TextRNN model, the terminal reports:
IOError: [Errno 2] No such file or directory: '../zhihu-word2vec.bin-100'
There is no link for zhihu-word2vec.bin-100.

You may use zhihu-word2vec-title-desc.bin-100 or a file of your own.
The .bin file is just a word embedding file trained with word2vec.
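As a rough illustration of what such a file contains, here is a toy sketch of the standard word2vec binary layout (a text header "vocab_size dim", then each word followed by a space and dim float32 values). The file name and words are made up for the example; this is not the repo's actual loader.

```python
# Toy sketch of the word2vec binary layout used by files like "*.bin-100":
# header line "vocab_size dim\n", then per entry: word, one space, dim float32s.
import struct

def write_word2vec_bin(path, vectors):
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" " + struct.pack(f"{dim}f", *vec))

def read_word2vec_bin(path):
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        vectors = {}
        for _ in range(vocab_size):
            # read the word byte-by-byte until the separating space
            word = b""
            ch = f.read(1)
            while ch != b" ":
                word += ch
                ch = f.read(1)
            vec = struct.unpack(f"{dim}f", f.read(4 * dim))
            vectors[word.decode("utf-8")] = list(vec)
    return vectors

# write and read back a tiny 100-dimensional embedding
write_word2vec_bin("toy-word2vec.bin-100", {"title": [0.5] * 100, "desc": [-0.5] * 100})
embeddings = read_word2vec_bin("toy-word2vec.bin-100")
```

In practice you would train the vectors with a word2vec implementation (e.g. gensim) on your own corpus and save them in this binary format.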

Where is the file /test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'

And also: train-zhihu6-title-desc.txt

There are two data_util_zhihu.py files, in the folders aa1_data_util and a07_Transformer.
Which one should I import?

What's the data_type?

train, test, _ = load_data(vocabulary_word2index, vocabulary_word2index_label, data_type='train')
TypeError: load_data() got an unexpected keyword argument 'data_type'
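With two data_util_zhihu.py modules in the tree, the TypeError suggests the imported load_data predates the data_type keyword. One way to guard the call is to inspect the signature before calling; the load_data below is a stand-in with the older signature, not the repo's real function:

```python
# Sketch: check whether the imported load_data accepts data_type before using it.
import inspect

def load_data(vocabulary_word2index, vocabulary_word2index_label):
    # stand-in for the older data_util_zhihu.load_data, without data_type
    return [], [], []

if "data_type" in inspect.signature(load_data).parameters:
    train, test, _ = load_data({}, {}, data_type="train")
else:
    # fall back to the older signature that has no data_type keyword
    train, test, _ = load_data({}, {})
```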

I can't see the files at the links mentioned. They give the error below, in Chinese:

Oh, the page you visited does not exist.

Possible reasons:

  1. The wrong address was entered in the address bar.

  2. The link you clicked has expired.

#########################

for zhihu-word2vec-title-desc.bin-100, please find it in:
https://pan.baidu.com/s/1kVgdDD9

for sample data (multi-label. file name:test-zhihu6-title-desc.txt), you can find it in:
https://pan.baidu.com/s/1gf49auB

train-zhihu4-only-title-all.txt (single-label):
https://pan.baidu.com/s/1jI7R4X4

#########################################

Could you please upload them to the sample data folder of https://github.com/brightmart/text_classification?

Never mind. After a few tries the above links worked.

But bin100 is not downloading for some reason.

@pmahend1 Same. The bin100 is not downloading even in China.

@pmahend1 @deatherving
for zhihu-word2vec-title-desc.bin-100, please use this:
https://pan.baidu.com/s/1jIP9e6q

@brightmart Thanks. The file is accessible.

@brightmart
when I run p5_fastTextB_predict.py, it fails with the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'test-zhihu-forpredict-v4only-title.txt'
In addition, where is zhihu-word2vec-multilabel.bin-100?

@brightmart I could download the file now. Thanks 👍

@pmahend1
@brightmart Hi,
After running p8_TextRNN_train.py I got this error:

File "./p8_TextRNN_train.py", line 117, in main
    test_loss, test_acc = do_eval(sess, textRNN, testX, testY, batch_size,vocabulary_index2word_label)
  File "./p8_TextRNN_train.py", line 167, in do_eval
    return eval_loss/float(eval_counter),eval_acc/float(eval_counter)
ZeroDivisionError: float division by zero

How can I solve this issue?
I checked the do_eval function in the training script; there is a for loop in it that never executes.
(screenshots of the traceback attached)
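A likely cause (my guess from the traceback, not a confirmed fix) is that the evaluation loop never runs, for example when the test set is smaller than one batch, so eval_counter stays at zero and the final division fails. A minimal sketch of a guarded do_eval, with the model call stubbed out by dummy numbers:

```python
def do_eval(examples, batch_size):
    """Stubbed-out sketch: accumulate per-batch loss/accuracy, then average."""
    eval_loss, eval_acc, eval_counter = 0.0, 0.0, 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        # ... the real code would run the model on `batch` here;
        # dummy numbers stand in for the session results ...
        eval_loss += 1.0
        eval_acc += 0.5
        eval_counter += 1
    if eval_counter == 0:
        # nothing was evaluated, e.g. an empty test set; avoid dividing by zero
        return 0.0, 0.0
    return eval_loss / eval_counter, eval_acc / eval_counter
```

Checking that testX is actually non-empty (i.e. the data files loaded correctly) is worth doing before training as well.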

The links above to download 'zhihu-word2vec-title-desc.bin-100' are not working.
Please share some working links to download the data.

Thanks

@brightmart please share links to download dataset.

@brightmart Do I need an account on pan.baidu.com to download the dataset? Can you please upload the data to the repo?

No need for an account.

What directory should I put the file 'zhihu-word2vec-title-desc.bin-100' in?
Thanks!

thank you so much

@parahaoer I think you should put it in the same directory as the training script. For example, when you use TextCNN in the directory a02_TextCNN, you run p7_TextCNN_train.py to train the model; at that point, put the file 'zhihu-word2vec-title-desc.bin-100' in that same directory.
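To make the placement explicit, a small helper like the following can check for the embedding next to the training script before training starts (the helper is illustrative, not part of the repo; the file name is the one from this thread):

```python
# Sketch: look for the embedding file in the training script's directory.
import os
import tempfile

def resolve_embedding(script_dir, name="zhihu-word2vec-title-desc.bin-100"):
    """Return the embedding path under script_dir, or None if it is missing."""
    path = os.path.join(script_dir, name)
    return path if os.path.exists(path) else None

# demo in a temporary directory standing in for e.g. a02_TextCNN
demo_dir = tempfile.mkdtemp()
missing = resolve_embedding(demo_dir)  # None: the file is not there yet
open(os.path.join(demo_dir, "zhihu-word2vec-title-desc.bin-100"), "wb").close()
found = resolve_embedding(demo_dir)    # now the path is returned
```

In a real script you would pass something like os.path.dirname(os.path.abspath(\_\_file\_\_)) as script_dir, so the lookup does not depend on the current working directory.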

Could you please upload 'zhihu-word2vec-title-desc.bin-100' as well? The links do not work. I would appreciate a quick response.

@brightmart I load zhihu-word2vec-title-desc.bin-100 as the word-vector file and train-zhihu4-only-title-all.txt as the training file, and set multi_label_flag=false and use_embedding=true.
The models a01_FastText, a03_TextRNN, a04_TextRCNN, a05_HierarchicalAttentionNetwork, and a06_Seq2seqWithAttention all run, but the accuracy is very low, and I don't know why.
For prediction, also with multi_label_flag=false and use_embedding=true, there is more than one predicted label. I need your help. Thanks.

Hi, thanks for your feedback. As long as you can see the training and validation loss decreasing during training, it will be fine. The previously reported F1 score is not a correct indicator of accuracy; I am updating the way the F1 score is computed today.

It is good to see that you got these several models working. Can you commit your version to this repository as a new branch?

Dear Sir:

It is a good project!
Could you please provide the file "zhihu-word2vec-title-desc.bin-100" somewhere?
The link below is out of date too...
Many thanks if you can help.

[NEW, use this; updated 2018-08-12]
zhihu-title-desc-multiple-label-v6.txt.zip contains three files:
https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA
1. train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
2. test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
3. train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

I am using the TextRNN (a03) and cannot find this flag. The downloads aren't working either.

I have changed following in p8_TextRNN_train.py:
tf.app.flags.DEFINE_boolean("use_embedding", False, "whether to use embedding or not.")

but the error is still the same:
IOError: [Errno 2] No such file or directory: 'zhihu-word2vec.bin-100'

I have also changed this flag in a02_TextCNN, since the TextRNN uses code (data_util_zhihu.py) from that part. The error is still the same.

Can you please share the pretrained embeddings or point me to the right place?

edit: This one seems to be up to date: https://pan.baidu.com/s/1jIP9e6q. I am following your download instructions, @brightmart. After step 1 I get a window that tells me to download the netdisk client from Baidu:

(screenshot attached)

The installer is in Chinese, which I don't speak, and there is no English version.
Could anyone who has the file please upload it to another service, like Google Drive, Dropbox, or OneDrive? @deatherving @liangtianxin and anyone else who has it. It would be appreciated.
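One possible explanation for the flag change not helping (a guess, not verified against the repo's code) is that the word2vec file is read unconditionally during data preparation, so use_embedding alone cannot prevent the IOError. A sketch of gating the load on the flag, with the real loader stubbed out and names mirroring this thread:

```python
# Sketch: only touch the word2vec file when use_embedding is set;
# otherwise fall back to randomly initialized vectors.
import random

def load_pretrained(path, vocab):
    # stand-in for the repo's word2vec loader; only called when the flag is set
    raise FileNotFoundError(path)

def get_embeddings(use_embedding, vocab, dim=100,
                   word2vec_path="zhihu-word2vec.bin-100"):
    if use_embedding:
        return load_pretrained(word2vec_path, vocab)  # the file is read only here
    random.seed(0)  # reproducible random init when no pretrained file is used
    return {w: [random.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}

# with use_embedding=False, no file access happens at all
embeddings = get_embeddings(False, ["title", "desc"])
```

If the error persists with the flag off, it is worth searching the data_util code for the hard-coded path and gating that call sameway.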

@brightmart Can't download the file 'test-zhihu-forpredict-title-desc-v6.txt'; please share it on another platform.

Re-generated the data and saved it as a cached file, available for download. Check this section in README.md:

Sample data: cached file

Where is the file /test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'
Me too.

(quoting the earlier comment: "I am using the TextRNN (a03) and cannot find this flag. Neither the downloads are working. [...]")

Hi, I'm facing the same problem. How did you solve it? Thanks in advance.

Hi. Where is the file /test-zhihu-forpredict-title-desc-v6.txt?