askerlee / topicvec

Question in gramcount.pl

utuzenkyo opened this issue

Hi Li, I'm trying to run PSDVec using Japanese Wikipedia text.

How did you deal with the stopwords in top1gram-wiki.txt? I ask because top2gram-wiki.txt is not uploaded to the vectors&corpus Dropbox. And why are there no stopwords in top1gram-reuters.txt? I'd like to know when the stopwords should be deleted.

Sorry if it sounds like a stupid question.
Thank you!

Sorry for coming back to your question so late. Very busy these days... If memory serves, I kept all stopwords because they are actually important for word embeddings. Even "the" can signal that the following word is likely a noun.

I remember top2gram-wiki.txt is a huge file (more than 10 GB?) so it's not uploaded.

I checked and indeed stop words in top1grams-reuters.txt are removed. I forgot why I did that... Maybe in the classification task, I removed stop words as a preprocessing step? But I'm sure when generating word embeddings for Reuters, stop words were kept.
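
To make the distinction concrete, here is a minimal sketch (not code from this repository) of the two preprocessing paths described above: keeping every token when preparing text for embedding training, and dropping stop words only for a classification task. The stop word list and both function names are hypothetical and chosen for illustration; a real list would depend on the language, for example Japanese.

```python
# Hypothetical stop word list; a real one would be language-specific.
STOPWORDS = {"the", "a", "of", "and", "in"}


def tokens_for_embedding(tokens):
    """Keep every token, stop words included, for embedding training."""
    return list(tokens)


def tokens_for_classification(tokens, stopwords=STOPWORDS):
    """Drop stop words as a preprocessing step for classification."""
    return [t for t in tokens if t.lower() not in stopwords]


sentence = "The price of oil rose in the market".split()
embedding_input = tokens_for_embedding(sentence)
classification_input = tokens_for_classification(sentence)
```

With this split, the stop word filter never touches the corpus used to count n-grams for the embeddings; it is applied only to the documents fed to the classifier.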

Thank you so much for your clear explanations! Have a nice day!