Use another dataset

Question

Diego999 opened this issue 7 years ago

Diego Antognini commented 7 years ago

Hi,

I've seen that reuters and rcv1 seem to be hardcoded.

What is the simplest way to use a corpus without any labels (one txt file per document)?

Thank you for your help!

askerlee · Answer 1 · Thu Oct 12 2017 21:36:06 GMT+0800 (China Standard Time)

Hi, you could start with file2topic.py. This script takes a single file (specified as a command-line argument) as input. You could use a loop to process a set of files.
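A minimal sketch of such a loop, written as a small Python wrapper. It assumes file2topic.py accepts the document path as its only positional argument. The thread only says the file is "specified as a command-line argument", so check the script's actual usage first. The names `build_commands` and `run_all` are made up for this illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(corpus_dir, script="file2topic.py"):
    """Return one command (as an argument list) per .txt file in corpus_dir."""
    # Assumption: the script takes the document path as its only argument.
    files = sorted(Path(corpus_dir).glob("*.txt"))
    return [[sys.executable, script, str(path)] for path in files]


def run_all(corpus_dir, script="file2topic.py"):
    """Run the script once per document, stopping at the first failure."""
    for cmd in build_commands(corpus_dir, script):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `run_all("my_corpus")` from the directory containing file2topic.py would then process every txt file in `my_corpus`, one invocation per document.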

Diego Antognini · Answer 2 · Thu Oct 12 2017 21:54:25 GMT+0800 (China Standard Time)

Thank you for your answer! I don't know how I missed it! It should be much simpler now to generate the needed files ;)

Diego Antognini · Answer 3 · Fri Oct 13 2017 23:08:09 GMT+0800 (China Standard Time)

I have another question. I have the feeling that using only file2topic is enough. Why does your bat script use topicExp and classEval instead? Is it because they use a labeled dataset?

askerlee · Answer 4 · Fri Oct 13 2017 23:13:14 GMT+0800 (China Standard Time)

TopicExp.py was used to run the experiments in my ACL paper. I did use file2topic.py to get topic embeddings from individual documents. See my arXiv paper "topic cloud" for examples (https://arxiv.org/abs/1702.01520).

Diego Antognini · Answer 5 · Sat Oct 14 2017 00:11:17 GMT+0800 (China Standard Time)

Thank you for your answer. I'm a little confused: if I want to discover topics (with a fixed K) from a set of documents without any labels, and get the top-n words for each topic, I should use topicExp without -s. Am I right?

askerlee · Answer 6 · Sat Oct 14 2017 16:08:03 GMT+0800 (China Standard Time)

Yeah, you could. But if you want to use topicExp.py, you have to write your own corpusLoader class.
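As a rough starting point, here is a sketch of a loader for an unlabeled corpus with one txt file per document. The thread does not show the interface topicExp.py expects from a corpusLoader, so the class name `UnlabeledDirLoader`, its `load` method, and the return format are all hypothetical and would need adapting to the existing reuters and rcv1 loaders in the repository. The tokenization (lowercase alphanumeric runs) is also just an assumption.

```python
import re
from pathlib import Path


class UnlabeledDirLoader:
    """Hypothetical loader: one .txt file per document, no labels."""

    def __init__(self, corpus_dir):
        self.corpus_dir = Path(corpus_dir)

    def load(self):
        """Return (doc_names, docs_words): file names and lowercase token lists."""
        doc_names, docs_words = [], []
        for path in sorted(self.corpus_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Assumed tokenization: runs of lowercase letters and digits.
            words = re.findall(r"[a-z0-9]+", text.lower())
            if words:  # skip empty documents
                doc_names.append(path.name)
                docs_words.append(words)
        return doc_names, docs_words
```

Keeping the file names alongside the token lists makes it possible to map the inferred topic proportions back to the original documents, since there are no labels to identify them by.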