
Verified_Summarization

Initial Steps

Dataset

  • We use the PHEME-RNR dataset, which can be downloaded from here

POS Tagging

$ java -jar twitie_tag.jar models/gate-EN-twitter.model $input_file > $output_file
  • $input_file: file with one tweet per line (text only, space-separated words)
  • $output_file: one line per tweet, containing the space-separated word_TAG pairs produced by TwitIE
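As a quick illustration, here is a minimal Python sketch for reading the tagger output back into (word, tag) pairs. It assumes the usual TwitIE convention of one tweet per line with word_TAG tokens; verify against your own output before relying on it.

def parse_twitie_line(line):
    # Each token looks like "word_TAG"; split on the last underscore
    # so that words containing "_" survive intact.
    pairs = []
    for token in line.strip().split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

# parse_twitie_line("this_DT is_VBZ a_DT tweet_NN")
# -> [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tweet', 'NN')]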

Processing Corpus File

Command:

$ python create-corpus-file.py

Configure the following input variables inside the code:

  • src_data_path: Folder containing files of Source Tweets. In every file, a line contains - DateTime, TweetId, UserId, Tweet Text, Rumor Tag (tab-separated)
  • rep_data_path: Folder containing files of Reply Tweets. In every file, a line contains - DateTime, TweetId, UserId, Tweet Text, Source Tweet Id (tab-separated)
  • src_pos_tag_path: Folder containing Source POS Tags files. In every file, a line contains - TweetId, TwitIE output (word_TAG pairs) (space-separated)
  • rep_pos_tag_path: Folder containing Reply POS Tags files. In every file, a line contains - TweetId, TwitIE output (word_TAG pairs) (space-separated)
  • tentative_path: File containing LIWC list of tentative words
  • certain_path: File containing LIWC list of certainty words
  • negate_path: File containing LIWC list of negation words
  • question_path: File containing list of question words

Configure the following output variables inside the code:

  • corpus_path: Corpus file (will be input to the topic model). Each line in the file contains: TweetId, Content-Words (space-separated), Expression-Words(Space-Separated), TweetType(S/R), Time(0-1)
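For reference, a minimal sketch of reading this corpus file back in. The field order follows the description above; the tab delimiter is an assumption, so adjust it to whatever create-corpus-file.py actually writes.

def read_corpus(corpus_path):
    docs = []
    with open(corpus_path) as f:
        for line in f:
            tweet_id, content, expression, tweet_type, time = \
                line.rstrip("\n").split("\t")
            docs.append({
                "id": tweet_id,
                "content_words": content.split(),        # space-separated
                "expression_words": expression.split(),  # space-separated
                "type": tweet_type,                      # "S" (source) or "R" (reply)
                "time": float(time),                     # normalized to [0, 1]
            })
    return docs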

Content-Expression Topic Model (CETM)

Compile:

$ g++ -std=c++11 topic-model.cpp -o model

Run:

$ ./model K T iter

where:

K:    Number of Content Word Topics
T:    Number of Expression Word Topics
iter: Number of iterations to run the model for

Defaults assumed in the subsequent steps: K = 30, T = 10, iter = 1000.
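For example, with these defaults:

$ ./model 30 10 1000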

Configure the following input variables inside the code:

  • corpus_path: Corpus file created in the previous step (Processing Corpus File).

Configure the following output variables inside the code:

  • destpath: Folder where all the output files will be stored

Description of files inside the destination folder is as follows:

  • c-vocab-mapping.txt: Content words to indices mapping.
  • e-vocab-mapping.txt: Expression words to indices mapping.
  • behavior-mapping.txt: Tweet Type to indices mapping.
  • topic-priors.txt: Prior probability of content topics.
  • expression-priors.txt: Prior probability of expression topics.
  • c-topic-word-distribution.txt: Content Topic to Word Distribution.
  • e-topic-word-distribution.txt: Expression Topic to Word Distribution.
  • topic-behavior-distribution.txt: Topic to Behavior Distribution.
  • table-assignment-status.txt: Table (seating) assignments of the data points.
  • top-c-topic-words.txt: Top 20 words in each content-word topic.
  • top-e-topic-words.txt: Top 20 words in each expression-word topic.
  • e-topic-time-alpha.txt: Expression-Topic-Time Alpha values.
  • e-topic-time-beta.txt: Expression-Topic-Time Beta values.
  • c-topic-time-alpha.txt: Content-Topic-Time Alpha values.
  • c-topic-time-beta.txt: Content-Topic-Time Beta values.
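As a rough sketch, the top words of a content topic can be recovered by combining the mapping and distribution files along the lines below. The exact file layouts are assumptions here (mapping as word/index pairs per line, distribution as one row of probabilities per topic), so inspect the files first; note that the model already writes top-c-topic-words.txt, so this only shows how the files fit together.

import numpy as np

def top_content_words(destpath, topic, n=20):
    # Assumed layout: "word index" per line in c-vocab-mapping.txt.
    idx2word = {}
    with open(destpath + "/c-vocab-mapping.txt") as f:
        for line in f:
            word, idx = line.split()
            idx2word[int(idx)] = word
    # Assumed layout: one space-separated probability row per topic.
    dist = np.loadtxt(destpath + "/c-topic-word-distribution.txt")
    best = np.argsort(dist[topic])[::-1][:n]
    return [idx2word[i] for i in best]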

Computing Tweet Posteriors

Command:

$ python compute-posteriors.py

Configure the following input variables inside the code:

  • basepath: Output folder written by the topic model (destpath above).
  • CORPUS_PATH: Corpus file created earlier (corpus_path above).

Configure the following output variables inside the code:

  • POSTERIOR_PATH: File where posteriors (probability vectors) for each tweet will be stored.
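A minimal sketch of loading the resulting posteriors, assuming one tweet per line in the form tweet_id followed by its probability vector (the actual delimiter and layout are set inside compute-posteriors.py):

def load_posteriors(posterior_path):
    posteriors = {}
    with open(posterior_path) as f:
        for line in f:
            tweet_id, vec = line.rstrip("\n").split("\t", 1)
            posteriors[tweet_id] = [float(x) for x in vec.split()]
    return posteriors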

Verified Tweet Detection using Tree-LSTM

Generating Trees

Command:

$ python generate-trees.py

Configure the following input variables inside the code:

  • datapath: The original dataset folder (download from here)
  • feature_path: File containing input feature vectors for all tweets in the dataset. The file contains two tab-separated columns - tweet_id, features
  • output_path: Path of the folder where you want the generated trees to be stored

Each tree is stored as a dictionary. A sample tree and the corresponding stored dictionary are shown below:

tree = {
        'f': [0.234, ...], 'l': [0, 1], 'c': [
            {'f': [0.109, ...], 'l': [0, 1], 'c': []},
            {'f': [0.712, ...], 'l': [0, 1], 'c': [
                {'f': [0.352, ...], 'l': [0, 1], 'c': []}
            ]},
        ],
    }

Here, f is the input feature vector for each node of the tree, l is the true label of the root of the tree stored as a 2-dimensional one-hot vector (dim-1: verified, dim-2: unverified), and c is the list of children of a node.
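Since the trees are plain nested dictionaries, they can be walked recursively; for example, collecting every node's feature vector (keys as described above):

def collect_features(node, out=None):
    # 'f' holds the node's feature vector, 'c' its list of children.
    if out is None:
        out = []
    out.append(node["f"])
    for child in node["c"]:
        collect_features(child, out)
    return out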

Training and Testing Tree-LSTM

Command:

$ python train-Tree-LSTM.py

Configure the following input variables inside the code:

  • tree_path: Path to the folder containing generated trees (output_path of the last step).
  • IN_FEATURES: Size of the input feature vectors
  • NUM_ITERATIONS: Number of iterations for training
  • BATCH_SIZE: Batch size for training
  • test_set: Disaster events on which you want to test.
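For orientation, below is a compact sketch of a Child-Sum Tree-LSTM cell (Tai et al., 2015) applied recursively to the stored tree dictionaries. This is a generic PyTorch illustration of the technique, not the code in train-Tree-LSTM.py; the hidden size and other details are placeholders.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    # Child-Sum Tree-LSTM: gates computed from the node input and the
    # sum of child hidden states, with one forget gate per child.
    def __init__(self, in_features, hidden):
        super().__init__()
        self.W = nn.Linear(in_features, 3 * hidden)         # i, o, u from input
        self.U = nn.Linear(hidden, 3 * hidden, bias=False)  # i, o, u from children
        self.W_f = nn.Linear(in_features, hidden)           # forget gate, input part
        self.U_f = nn.Linear(hidden, hidden, bias=False)    # forget gate, per child

    def forward(self, x, child_h, child_c):
        # x: (in_features,); child_h, child_c: (num_children, hidden)
        h_tilde = child_h.sum(dim=0)
        i, o, u = (self.W(x) + self.U(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # broadcasts over children
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

def encode(tree, cell, hidden):
    # Recursively encode a stored tree dict; leaves get zero child states.
    if tree["c"]:
        hs, cs = zip(*(encode(child, cell, hidden) for child in tree["c"]))
        child_h, child_c = torch.stack(hs), torch.stack(cs)
    else:
        child_h = child_c = torch.zeros(1, hidden)
    x = torch.tensor(tree["f"], dtype=torch.float32)
    return cell(x, child_h, child_c)

The root's hidden state would then feed a two-way classifier matching the one-hot label l.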
Generating Trees with Stance

To build trees that also carry stance labels, run the tree-generation script again with the additional stance_path input configured.

Command:

$ python generate-trees.py

Configure the following input variables inside the code:

  • datapath: The original dataset folder (download from here)
  • feature_path: File containing input feature vectors for all tweets in the dataset. The file contains two tab-separated columns - tweet_id, features
  • output_path: Path of the folder where you want the generated trees to be stored
  • stance_path: Path of the folder where stance.json is available

Each tree is stored as a dictionary. A sample tree and the corresponding stored dictionary are shown below:

tree = {
        'f': [0.234, ...], 'l': [0, 1], 'stance': [1, 0, 0, 0], 'c': [
            {'f': [0.109, ...], 'l': [0, 1], 'stance': [0, 1, 0, 0], 'c': []},
            {'f': [0.712, ...], 'l': [0, 1], 'stance': [0, 0, 1, 0], 'c': [
                {'f': [0.352, ...], 'l': [0, 1], 'stance': [0, 0, 0, 1], 'c': []}
            ]},
        ],
    }
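Here, stance is the stance label of the node's tweet, stored as a 4-dimensional one-hot vector (one dimension per stance class; in PHEME-style annotations these are typically support, deny, query, and comment, though the exact order is not documented here). The f, l, and c keys are as described above.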

Parsed-trees

Contains all the trees (without stance) generated from the Corpus.txt provided in the CTP folder.

Parsed-trees_stance

Contains all the trees (with stance) generated from the Corpus.txt provided in the CTP folder.
