Use another dataset
Diego999 opened this issue
Hi,
I've seen that reuters and rcv1 seem to be hardcoded in the code.
What is the simplest way to use a corpus without any labels (one txt file per document)?
Thank you for your help!
Hi, you could start with file2topic.py. This script takes a single file (specified as a command-line argument) as its input. You could use a loop to process a set of files.
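A minimal sketch of such a loop, assuming file2topic.py sits in the current directory and needs nothing beyond the input file path (any other options it may accept are not covered in this thread). The helper names build_commands and run_all are hypothetical, chosen only for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(corpus_dir, script="file2topic.py"):
    """Return one command per .txt file in corpus_dir, sorted by file name.

    Assumption: the script takes the document path as its only argument.
    """
    txt_files = sorted(Path(corpus_dir).glob("*.txt"))
    return [[sys.executable, script, str(path)] for path in txt_files]


def run_all(corpus_dir, script="file2topic.py"):
    """Run the script once per document, stopping at the first failure."""
    for cmd in build_commands(corpus_dir, script):
        subprocess.run(cmd, check=True)
```

Calling run_all("my_corpus") would then invoke the script once for every .txt file in that (hypothetical) directory. Building the command list separately makes it easy to print and inspect the commands before running anything.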
Thank you for your answer! I don't know how I missed it! It should be much simpler now to generate the needed files ;)
I have another question. I have the feeling that using only file2topic is enough. Why doesn't your bat script use it, and use topicExp and classEval instead? Is it because they work on a labeled dataset?
topicExp.py was used for the experiments in my ACL paper. I did use file2topic.py to get topic embeddings from individual documents. See my arXiv paper "topic cloud" for examples (https://arxiv.org/abs/1702.01520).
Thank you for your answer. I am a little confused: if I want to discover topics (with a fixed K) from a set of documents without any labels, and get the top-n words for each topic, I should use topicExp without -s. Am I right?
Yes, you could. But if you want to use topicExp.py, you have to write your own corpusLoader class.
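The interface that the script expects from a corpus loader is not described in this thread, so the following is only a hypothetical sketch of what a loader for an unlabeled corpus (one txt file per document) might look like. The class name, the load method, and the whitespace tokenization are all assumptions and would need to be adapted to match the existing reuters and rcv1 loaders in the code.

```python
from pathlib import Path


class UnlabeledDirLoader:
    """Hypothetical loader: one .txt file per document, no labels."""

    def __init__(self, corpus_dir, encoding="utf-8"):
        self.corpus_dir = Path(corpus_dir)
        self.encoding = encoding

    def load(self):
        """Return (file names, tokenized documents), sorted by file name."""
        names, docs = [], []
        for path in sorted(self.corpus_dir.glob("*.txt")):
            text = path.read_text(encoding=self.encoding, errors="replace")
            names.append(path.name)
            # Assumption: lowercase whitespace tokenization is good enough here.
            docs.append(text.lower().split())
        return names, docs
```

Since there are no labels, the loader returns only file names and token lists; any category fields that the existing loaders provide would have to be filled with placeholders or dropped.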