jalused / TTM

Targeted Topic Model


Targeted Topic Model (TTM)

A package for targeted topic modeling for focused analysis.

  1. A Java implementation of targeted topic modeling;
  2. Intended for focused analysis;
  3. Specify a target (aspect) word to obtain the topics related to that target.

We are glad if the package helps your projects or research. If you use it, please cite our paper using the information below. You are welcome to contact Shuai Wang (shuaiwanghk@gmail.com) if you have any questions.

Publication

Shuai Wang, Zhiyuan Chen, Geli Fei, Bing Liu and Sherry Emery, "Targeted Topic Modeling for Focused Analysis", SIGKDD 2016.

## Input Data Format

(1) docs.txt
Each document (file) in a given (domain) corpus is arranged in the following format.
Line 1: #numOfSent, the number of sentences in one review (always 1 for a tweet from Twitter).
Line 2: dummy field. This is only a placeholder. It is not used for modeling, but it must currently still be present in the raw data. It was used for my debugging and I have not removed it yet; I might fix this and push a new version later.
Line 3 to (3+#numOfSent-1): the content/text of one sentence per line.
(Repeat the above format for every document.)

Example:
3 // number of sentences for review 1
0 // dummy field
1 2 3 // sentence 1 (in review 1)
4 5 // sentence 2 (in review 1)
6 7 8 // sentence 3 (in review 1)
2 // number of sentences for review 2
0 // dummy field
3 4 // sentence 1 (in review 2)
5 8 // sentence 2 (in review 2)
....

Values such as "1 2 3" and "4 5" are word indexes; each index corresponds to a line number in the wordlist.txt file.
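The layout above can be read back with a few lines of Java. The following is a minimal sketch, not code from the TTM package: the class name `DocsFormatReader` and its methods are hypothetical, and it assumes the real docs.txt contains only the numbers (the `//` annotations in the example are explanatory and are not expected in the file).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the TTM package.
class DocsFormatReader {

    // Parses lines laid out as: sentence count, dummy field, then one line per sentence.
    // Each document becomes a list of sentences; each sentence is an array of word indexes.
    static List<List<int[]>> parse(List<String> lines) {
        List<List<int[]>> docs = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String header = lines.get(i).trim();
            if (header.isEmpty()) { // assumption: blank lines between documents are tolerated
                i++;
                continue;
            }
            int numOfSent = Integer.parseInt(header);
            // The dummy field sits at i + 1, the sentences at i + 2 .. i + 1 + numOfSent.
            if (i + 1 + numOfSent >= lines.size()) {
                throw new IllegalArgumentException("Truncated document starting at line " + (i + 1));
            }
            List<int[]> sentences = new ArrayList<>();
            for (int s = 0; s < numOfSent; s++) {
                sentences.add(parseIndexes(lines.get(i + 2 + s)));
            }
            docs.add(sentences);
            i += 2 + numOfSent; // skip count line, dummy line, and all sentence lines
        }
        return docs;
    }

    // Splits one sentence line on whitespace and converts each token to an int.
    static int[] parseIndexes(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] tokens = trimmed.split("\\s+");
        int[] ids = new int[tokens.length];
        for (int k = 0; k < tokens.length; k++) {
            ids[k] = Integer.parseInt(tokens[k]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // The two reviews from the example, without the explanatory annotations.
        List<String> lines = Arrays.asList("3", "0", "1 2 3", "4 5", "6 7 8", "2", "0", "3 4", "5 8");
        List<List<int[]>> docs = parse(lines);
        System.out.println(docs.size() + " documents, first has " + docs.get(0).size() + " sentences");
    }
}
```

To read a real file, the lines could be obtained with `java.nio.file.Files.readAllLines` and passed to `parse`.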

(2) wordlist.txt
a. This is the vocabulary file, which indexes the words in a given domain.
b. Stop words and infrequent words have already been removed.
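As an illustration of how such a vocabulary could be produced, here is a minimal sketch. It is not the preprocessing used by the authors: the class `WordlistSketch`, the `minCount` threshold, the stop-word set, and the `firstIndex` parameter (whether line numbering starts at 0 or 1) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the TTM package.
class WordlistSketch {

    // Keeps words that are not stop words and occur at least minCount times,
    // in order of first appearance. Each kept word would be one line of wordlist.txt.
    static List<String> buildWordlist(List<String[]> sentences, Set<String> stopWords, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] sentence : sentences) {
            for (String word : sentence) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<String> wordlist = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!stopWords.contains(e.getKey()) && e.getValue() >= minCount) {
                wordlist.add(e.getKey());
            }
        }
        return wordlist;
    }

    // Encodes a sentence as space-separated word indexes.
    // Words missing from the wordlist (stop words, infrequent words) are dropped.
    // firstIndex is an assumption: pass 0 or 1 depending on how lines are numbered.
    static String encode(String[] sentence, List<String> wordlist, int firstIndex) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < wordlist.size(); i++) {
            index.put(wordlist.get(i), i + firstIndex);
        }
        StringBuilder sb = new StringBuilder();
        for (String word : sentence) {
            Integer id = index.get(word);
            if (id == null) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> sentences = Arrays.asList(
                new String[]{"the", "battery", "life"},
                new String[]{"battery", "drains"},
                new String[]{"the", "screen", "life"});
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        List<String> wordlist = buildWordlist(sentences, stopWords, 2);
        System.out.println(wordlist);
        System.out.println(encode(sentences.get(0), wordlist, 1));
    }
}
```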

## Model and Program Entry

(1) Parameters and settings
A corresponding model will be saved. The parameters and argument settings are defined in the argument -> ProgramArgument.java file.

Among them, the most important settings are:
a. domainName (the domain/dataset name)
b. targetWord (the keyword of the targeted aspect)
c. tTopicNum (the number of targeted topics)
Please refer to ProgramArgument.java for details.

(2) Single task
The task file is located in task -> Execute TTMwithOneSingleTask, which runs a single task. A corresponding model will also be saved.

(3) Multiple tasks/threads
We also provide a multiple tasks/threads entry so that different aspects can be targeted in parallel. The task file is located in task -> RunTTMwithMultiTasks, which runs multiple tasks.
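The idea of targeting several aspects in parallel can be sketched with a standard thread pool. This is only an illustration and does not reproduce RunTTMwithMultiTasks: the class `MultiTargetSketch`, the method `runAll`, and the placeholder task are hypothetical, and a real task would invoke the TTM model for each target word.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of the TTM package.
class MultiTargetSketch {

    // Runs one task per target word on a fixed-size thread pool
    // and returns the results keyed by target word, in input order.
    static <R> Map<String, R> runAll(List<String> targetWords, Function<String, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<R>> pending = new LinkedHashMap<>();
            for (String target : targetWords) {
                Callable<R> job = () -> task.apply(target);
                pending.put(target, pool.submit(job));
            }
            Map<String, R> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<R>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get()); // blocks until that task finishes
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder task: a real one would train a targeted topic model for the word.
        Map<String, String> results = runAll(
                Arrays.asList("battery", "screen"),
                target -> "finished task for target word '" + target + "'",
                2);
        results.values().forEach(System.out::println);
    }
}
```

Each task here is independent of the others, which is what makes one thread per target word a natural fit.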

## Output File

An output file containing the targeted topic-word distribution will be generated under the data/output folder.

Note that I have rewritten my code with some optimization and restructuring, so the final results might differ slightly from my previous ones.

## Run Demo/Entry File

(1) Run in an IDE.
Two entry files are provided. You should be able to run them once the libraries in the lib folder have been added to the build path. They are:
src -> launcher -> TTMSingleTaskEntry
src -> launcher -> TTMMultipleTasksEntry

(2) Run in a terminal from the command line.
Under the TTM root directory on Windows:

java -cp "bin;lib/*" launcher.TTMSingleTaskEntry

Under the TTM root directory on Unix/Linux:

java -cp "bin:lib/*" launcher.TTMSingleTaskEntry

Have fun!

License:Apache License 2.0