brightmart / text_classification

All kinds of text classification models, and more, with deep learning.


sample data, pre-trained word embedding

hpduong opened this issue · comments


I'm getting this issue when I run training on the a08 entity network and a06 seq2seq models.

How can I get or train this file?

zhihu-word2vec-title-desc.bin-100

(screenshot of the error attached)

Also, do you have sample datasets compatible with these models?

1. For zhihu-word2vec-title-desc.bin-100, please find it at:
   https://pan.baidu.com/s/1jIP9e6q

2. [OLD, to be deleted]
   For sample data (multi-label; file name: test-zhihu6-title-desc.txt):
   https://pan.baidu.com/s/1gf49auB

3. [OLD, to be deleted]
   train-zhihu4-only-title-all.txt (single-label):
   https://pan.baidu.com/s/1jI7R4X4

4. [NEW, use this; updated 2018-08-12]
   zhihu-title-desc-multiple-label-v6.txt.zip contains three files:
   https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA

   1. train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
   2. test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
   3. train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

Could you please upload it to a different service? I am having a hard time downloading the data and the word2vec model. Thanks!

You can follow two steps to get the file.
Step 1: click the download button. (screenshot attached)

Step 2: download the file. (screenshot attached)

When I run the TextRNN model, the terminal reports:
IOError: [Errno 2] No such file or directory: '../zhihu-word2vec.bin-100'
There is no link for zhihu-word2vec.bin-100.

You may use zhihu-word2vec-title-desc.bin-100 or a file of your own.
The .bin file is just a word embedding file trained with word2vec.
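As a rough illustration of what such a file contains, here is a toy sketch of the standard word2vec binary layout (a text header "vocab_size dim", then each word followed by a space and dim float32 values). The file name and words are made up for the example; this is not the repo's actual loader.

```python
# Toy sketch of the word2vec binary layout used by files like "*.bin-100":
# header line "vocab_size dim\n", then per entry: word, one space, dim float32s.
import struct

def write_word2vec_bin(path, vectors):
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" " + struct.pack(f"{dim}f", *vec))

def read_word2vec_bin(path):
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        vectors = {}
        for _ in range(vocab_size):
            # read the word byte-by-byte until the separating space
            word = b""
            ch = f.read(1)
            while ch != b" ":
                word += ch
                ch = f.read(1)
            vec = struct.unpack(f"{dim}f", f.read(4 * dim))
            vectors[word.decode("utf-8")] = list(vec)
    return vectors

# write and read back a tiny 100-dimensional embedding
write_word2vec_bin("toy-word2vec.bin-100", {"title": [0.5] * 100, "desc": [-0.5] * 100})
embeddings = read_word2vec_bin("toy-word2vec.bin-100")
```

In practice you would train the vectors with a word2vec implementation (e.g. gensim) on your own corpus and save them in this binary format.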

Where is the file /test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'

And also: train-zhihu6-title-desc.txt

There are two data_util_zhihu.py files, in the folders aa1_data_util and a07_Transformer.
Which one should I import?

What's the data_type?

train, test, _ = load_data(vocabulary_word2index, vocabulary_word2index_label, data_type='train')
TypeError: load_data() got an unexpected keyword argument 'data_type'
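With two data_util_zhihu.py modules in the tree, the TypeError suggests the imported load_data predates the data_type keyword. One way to guard the call is to inspect the signature before calling; the load_data below is a stand-in with the older signature, not the repo's real function:

```python
# Sketch: check whether the imported load_data accepts data_type before using it.
import inspect

def load_data(vocabulary_word2index, vocabulary_word2index_label):
    # stand-in for the older data_util_zhihu.load_data, without data_type
    return [], [], []

if "data_type" in inspect.signature(load_data).parameters:
    train, test, _ = load_data({}, {}, data_type="train")
else:
    # fall back to the older signature that has no data_type keyword
    train, test, _ = load_data({}, {})
```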

I can't see the files at the links mentioned. They give the error below, in Chinese:

Oh, the page you visited does not exist.

Possible reasons:

  1. The wrong address was entered in the address bar.

  2. The link you clicked has expired.

#########################

for zhihu-word2vec-title-desc.bin-100, please find it in:
https://pan.baidu.com/s/1kVgdDD9

for sample data (multi-label. file name:test-zhihu6-title-desc.txt), you can find it in:
https://pan.baidu.com/s/1gf49auB

train-zhihu4-only-title-all.txt (single-label):
https://pan.baidu.com/s/1jI7R4X4

#########################################

Could you please upload them to the sample data folder of https://github.com/brightmart/text_classification?

Never mind. After a few tries the above links worked.

But bin100 is not downloading for some reason.

@pmahend1 Same. The bin100 is not downloading even in China.

@pmahend1 @deatherving
for zhihu-word2vec-title-desc.bin-100, please use this:
https://pan.baidu.com/s/1jIP9e6q

@brightmart Thanks. The file is accessible.

@brightmart
when I run p5_fastTextB_predict.py, it fails with the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'test-zhihu-forpredict-v4only-title.txt'
In addition, where is zhihu-word2vec-multilabel.bin-100?

@brightmart I could download the file now. Thanks 👍

@pmahend1
@brightmart Hi,
After running p8_TextRNN_train.py I got this error:

File "./p8_TextRNN_train.py", line 117, in main
    test_loss, test_acc = do_eval(sess, textRNN, testX, testY, batch_size,vocabulary_index2word_label)
  File "./p8_TextRNN_train.py", line 167, in do_eval
    return eval_loss/float(eval_counter),eval_acc/float(eval_counter)
ZeroDivisionError: float division by zero

How can I solve this issue?
I checked the do_eval function in the training script; there is a for loop in it that never executes.
(screenshots of the traceback attached)
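A likely cause (my guess from the traceback, not a confirmed fix) is that the evaluation loop never runs, for example when the test set is smaller than one batch, so eval_counter stays at zero and the final division fails. A minimal sketch of a guarded do_eval, with the model call stubbed out by dummy numbers:

```python
def do_eval(examples, batch_size):
    """Stubbed-out sketch: accumulate per-batch loss/accuracy, then average."""
    eval_loss, eval_acc, eval_counter = 0.0, 0.0, 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        # ... the real code would run the model on `batch` here;
        # dummy numbers stand in for the session results ...
        eval_loss += 1.0
        eval_acc += 0.5
        eval_counter += 1
    if eval_counter == 0:
        # nothing was evaluated, e.g. an empty test set; avoid dividing by zero
        return 0.0, 0.0
    return eval_loss / eval_counter, eval_acc / eval_counter
```

Checking that testX is actually non-empty (i.e. the data files loaded correctly) is worth doing before training as well.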

The links above to download 'zhihu-word2vec-title-desc.bin-100' are not working.
Please share some working links to download the data.

Thanks

@brightmart please share links to download dataset.

@brightmart Do I need an account on pan.baidu.com to download the dataset? Can you please upload the data to the repo?

No need for an account.

What directory should I put the file 'zhihu-word2vec-title-desc.bin-100' in?
Thanks!

thank you so much

@parahaoer I think you should put it in the same directory as the training script. For example, when you use TextCNN in the directory a02_TextCNN, you run p7_TextCNN_train.py to train the model; at that point, put the file 'zhihu-word2vec-title-desc.bin-100' in that same directory.
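To make the placement explicit, a small helper like the following can check for the embedding next to the training script before training starts (the helper is illustrative, not part of the repo; the file name is the one from this thread):

```python
# Sketch: look for the embedding file in the training script's directory.
import os
import tempfile

def resolve_embedding(script_dir, name="zhihu-word2vec-title-desc.bin-100"):
    """Return the embedding path under script_dir, or None if it is missing."""
    path = os.path.join(script_dir, name)
    return path if os.path.exists(path) else None

# demo in a temporary directory standing in for e.g. a02_TextCNN
demo_dir = tempfile.mkdtemp()
missing = resolve_embedding(demo_dir)  # None: the file is not there yet
open(os.path.join(demo_dir, "zhihu-word2vec-title-desc.bin-100"), "wb").close()
found = resolve_embedding(demo_dir)    # now the path is returned
```

In a real script you would pass something like os.path.dirname(os.path.abspath(\_\_file\_\_)) as script_dir, so the lookup does not depend on the current working directory.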

Could you please upload 'zhihu-word2vec-title-desc.bin-100' as well? The links do not work. I would appreciate a quick response.

@brightmart I load zhihu-word2vec-title-desc.bin-100 as the word-vector file and train-zhihu4-only-title-all.txt as the training file, and set multi_label_flag=false and use_embedding=true.
The models a01_FastText, a03_TextRNN, a04_TextRCNN, a05_HierarchicalAttentionNetwork, and a06_Seq2seqWithAttention all run, but the accuracy is very low, and I don't know why.
For prediction, also with multi_label_flag=false and use_embedding=true, there is more than one predicted label. I need your help. Thanks.

Hi, thanks for your feedback. As long as you can see the training and validation loss decreasing during training, it will be fine. The previously reported F1 score is not a correct indicator of accuracy; I am updating the way the F1 score is computed today.

It is good to see that you got these several models working. Can you commit your version to this repository as a new branch?

Dear Sir:

It is a good project!
Could you please provide the file "zhihu-word2vec-title-desc.bin-100" somewhere?
The link below is out of date too...
Many thanks if you can help.

[NEW, use this; updated 2018-08-12]
zhihu-title-desc-multiple-label-v6.txt.zip contains three files:
https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA
1. train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
2. test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
3. train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

I am using the TextRNN (a03) and cannot find this flag. The downloads aren't working either.

I have changed following in p8_TextRNN_train.py:
tf.app.flags.DEFINE_boolean("use_embedding", False, "whether to use embedding or not.")

but the error is still the same:
IOError: [Errno 2] No such file or directory: 'zhihu-word2vec.bin-100'

I have also changed this flag in a02_TextCNN, since the TextRNN uses code (data_util_zhihu.py) from that part. The error is still the same.

Can you please share the pretrained embeddings or point me to the right place?

edit: This one seems to be up to date: https://pan.baidu.com/s/1jIP9e6q. I am following your download instructions, @brightmart. After step 1 I get a window that tells me to download the netdisk client from Baidu:

(screenshot attached)

The installer is in Chinese, which I don't speak, and there is no English version.
Could anyone who has the file please upload it to another service, like Google Drive, Dropbox, or OneDrive? @deatherving @liangtianxin and anyone else who has it. It would be appreciated.
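One possible explanation for the flag change not helping (a guess, not verified against the repo's code) is that the word2vec file is read unconditionally during data preparation, so use_embedding alone cannot prevent the IOError. A sketch of gating the load on the flag, with the real loader stubbed out and names mirroring this thread:

```python
# Sketch: only touch the word2vec file when use_embedding is set;
# otherwise fall back to randomly initialized vectors.
import random

def load_pretrained(path, vocab):
    # stand-in for the repo's word2vec loader; only called when the flag is set
    raise FileNotFoundError(path)

def get_embeddings(use_embedding, vocab, dim=100,
                   word2vec_path="zhihu-word2vec.bin-100"):
    if use_embedding:
        return load_pretrained(word2vec_path, vocab)  # the file is read only here
    random.seed(0)  # reproducible random init when no pretrained file is used
    return {w: [random.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}

# with use_embedding=False, no file access happens at all
embeddings = get_embeddings(False, ["title", "desc"])
```

If the error persists with the flag off, it is worth searching the data_util code for the hard-coded path and gating that call sameway.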

@brightmart Can't download the file 'test-zhihu-forpredict-title-desc-v6.txt'; please share it on another platform.

Re-generated the data and saved it as a cached file, available for download. Check this section in README.md:

Sample data: cached file

Where is the file /test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'
Me too.

(quoting the earlier comment: "I am using the TextRNN (a03) and cannot find this flag. Neither the downloads are working. [...]")

Hi, I'm facing the same problem. How did you solve it? Thanks in advance.

Hi. Where is the file /test-zhihu-forpredict-title-desc-v6.txt?