askerlee / topicvec

Use another dataset

Diego999 opened this issue

Hi,

I've seen that reuters and rcv1 seem to be hardcoded in the code.

What is the simplest way to use a corpus without any labels (one txt file per document)?

Thank you for your help!

Hi, you could start with file2topic.py. This script takes a single file (specified as a command-line argument) as its input. You could use a loop to process a set of files.
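The loop suggested above could be sketched roughly as follows. This is only a sketch under assumptions: it supposes file2topic.py sits in the working directory and accepts the document path as its only argument, which is all the thread states. The helper names `build_commands` and `run_all`, and the choice of interpreter, are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(corpus_dir, script="file2topic.py", interpreter=sys.executable):
    """Return one command line per .txt file in corpus_dir, in sorted order.

    Assumption: the script takes the document path as its single argument.
    """
    files = sorted(Path(corpus_dir).glob("*.txt"))
    return [[interpreter, script, str(f)] for f in files]


def run_all(corpus_dir, run=subprocess.run, **kwargs):
    """Run the script once per document.

    `run` is injectable so the loop can be dry-run without executing anything.
    """
    for cmd in build_commands(corpus_dir, **kwargs):
        run(cmd, check=True)  # check=True stops at the first failing document
```

Calling `run_all("my_corpus")` would then invoke the script once for each txt file in the hypothetical `my_corpus` directory. If the repository needs a different interpreter than the one running this loop, pass it through the `interpreter` argument.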

Thank you for your answer! I don't know how I missed it! It should be much simpler now to generate the needed files ;)

I have another question. I have the feeling that using only file2topic is enough. Why doesn't your bat script use it, and why does it use topicExp and classEval instead? Is it because they use a labeled dataset?

topicExp.py was used for the experiments in my ACL paper. I did use file2topic.py to get topic embeddings from individual documents. See my arXiv paper "topic cloud" for examples (https://arxiv.org/abs/1702.01520).

Thank you for your answer. I feel a little bit confused: if I want to discover topics (with a fixed K) from a set of documents without any labels, and get the top-n words for each of them, I should use topicExp without -s. Am I right?

Yeah, you could. But if you want to use topicExp.py, you have to write your own corpusLoader class.
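The thread does not show the interface that topicExp.py expects from a corpusLoader class, so the following only sketches the file-reading part of such a loader for an unlabeled corpus with one txt file per document. The class name `UnlabeledDirLoader`, its `load` method, the return shape and the tokenization rule are all hypothetical; they would have to be adapted to match the existing reuters and rcv1 loaders in the repository.

```python
import re
from pathlib import Path


class UnlabeledDirLoader:
    """Hypothetical loader: one .txt file per document, no labels.

    Not the repository's interface; adapt names and return values
    to whatever topicExp.py actually calls.
    """

    def __init__(self, corpus_dir, encoding="utf-8"):
        self.corpus_dir = Path(corpus_dir)
        self.encoding = encoding

    def load(self):
        """Return (names, docs); each doc is a list of lowercase word tokens."""
        names, docs = [], []
        for path in sorted(self.corpus_dir.glob("*.txt")):
            text = path.read_text(encoding=self.encoding)
            # Assumed tokenization: runs of ASCII letters, digits and apostrophes.
            tokens = re.findall(r"[a-z0-9']+", text.lower())
            names.append(path.name)
            docs.append(tokens)
        return names, docs
```

Keeping the file names next to the token lists makes it possible to map each inferred topic proportion back to its source document, since there are no category labels to group by.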