
Verified_Summarization

Initial Steps

Dataset

  • We use the PHEME-RNR dataset, which can be downloaded from here

POS Tagging

$ java -jar twitie_tag.jar models/gate-EN-twitter.model $input_file > $output_file
  • $input_file: file with one tweet per line (text only, space-separated words)
  • $output_file: one line per tweet, containing the space-separated word_TAG pairs produced by TwitIE
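As a quick illustration, here is a minimal Python sketch for reading the tagger output back into (word, tag) pairs. It assumes the usual TwitIE convention of one tweet per line with word_TAG tokens; verify against your own output before relying on it.

def parse_twitie_line(line):
    # Each token looks like "word_TAG"; split on the last underscore
    # so that words containing "_" survive intact.
    pairs = []
    for token in line.strip().split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

# parse_twitie_line("this_DT is_VBZ a_DT tweet_NN")
# -> [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tweet', 'NN')]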

Processing Corpus File

Command:

$ python create-corpus-file.py

Configure the following input variables inside the code:

  • src_data_path: Folder containing files of Source Tweets. In every file, a line contains - DateTime, TweetId, UserId, Tweet Text, Rumor Tag (tab-separated)
  • rep_data_path: Folder containing files of Reply Tweets. In every file, a line contains - DateTime, TweetId, UserId, Tweet Text, Source Tweet Id (tab-separated)
  • src_pos_tag_path: Folder containing Source POS Tags files. In every file, a line contains - TweetId, TwitIE output (word_TAG pairs) (space-separated)
  • rep_pos_tag_path: Folder containing Reply POS Tags files. In every file, a line contains - TweetId, TwitIE output (word_TAG pairs) (space-separated)
  • tentative_path: File containing LIWC list of tentative words
  • certain_path: File containing LIWC list of certainty words
  • negate_path: File containing LIWC list of negation words
  • question_path: File containing list of question words

Configure the following output variables inside the code:

  • corpus_path: Corpus file (will be input to the topic model). Each line in the file contains: TweetId, Content-Words (space-separated), Expression-Words(Space-Separated), TweetType(S/R), Time(0-1)
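For reference, a minimal sketch of reading this corpus file back in. The field order follows the description above; the tab delimiter is an assumption, so adjust it to whatever create-corpus-file.py actually writes.

def read_corpus(corpus_path):
    docs = []
    with open(corpus_path) as f:
        for line in f:
            tweet_id, content, expression, tweet_type, time = \
                line.rstrip("\n").split("\t")
            docs.append({
                "id": tweet_id,
                "content_words": content.split(),        # space-separated
                "expression_words": expression.split(),  # space-separated
                "type": tweet_type,                      # "S" (source) or "R" (reply)
                "time": float(time),                     # normalized to [0, 1]
            })
    return docs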

Content-Expression Topic Model (CETM)

Compile:

$ g++ -std=c++11 topic-model.cpp -o model

Run:

$ ./model K T iter

where:

K:    Number of Content Word Topics
T:    Number of Expression Word Topics
iter: Number of iterations to run the model for

Defaults assumed in the subsequent steps: K = 30, T = 10, iter = 1000.
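For example, with these defaults:

$ ./model 30 10 1000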

Configure the following input variables inside the code:

  • corpus_path: Corpus file created in the previous step (Processing Corpus File).

Configure the following output variables inside the code:

  • destpath: Folder where all the output files will be stored

Description of files inside the destination folder is as follows:

  • c-vocab-mapping.txt: Content words to indices mapping.
  • e-vocab-mapping.txt: Expression words to indices mapping.
  • behavior-mapping.txt: Tweet Type to indices mapping.
  • topic-priors.txt: Prior probability of content topics.
  • expression-priors.txt: Prior probability of expression topics.
  • c-topic-word-distribution.txt: Content Topic to Word Distribution.
  • e-topic-word-distribution.txt: Expression Topic to Word Distribution.
  • topic-behavior-distribution.txt: Topic to Behavior Distribution.
  • table-assignment-status.txt: Table (seating) assignments of the data points.
  • top-c-topic-words.txt: Top 20 words in each content-word topic.
  • top-e-topic-words.txt: Top 20 words in each expression-word topic.
  • e-topic-time-alpha.txt: Expression-Topic-Time Alpha values.
  • e-topic-time-beta.txt: Expression-Topic-Time Beta values.
  • c-topic-time-alpha.txt: Content-Topic-Time Alpha values.
  • c-topic-time-beta.txt: Content-Topic-Time Beta values.
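As a rough sketch, the top words of a content topic can be recovered by combining the mapping and distribution files along the lines below. The exact file layouts are assumptions here (mapping as word/index pairs per line, distribution as one row of probabilities per topic), so inspect the files first; note that the model already writes top-c-topic-words.txt, so this only shows how the files fit together.

import numpy as np

def top_content_words(destpath, topic, n=20):
    # Assumed layout: "word index" per line in c-vocab-mapping.txt.
    idx2word = {}
    with open(destpath + "/c-vocab-mapping.txt") as f:
        for line in f:
            word, idx = line.split()
            idx2word[int(idx)] = word
    # Assumed layout: one space-separated probability row per topic.
    dist = np.loadtxt(destpath + "/c-topic-word-distribution.txt")
    best = np.argsort(dist[topic])[::-1][:n]
    return [idx2word[i] for i in best]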

Computing Tweet Posteriors

Command:

$ python compute-posteriors.py

Configure the following input variables inside the code:

  • basepath: Output folder written by the topic model (destpath above).
  • CORPUS_PATH: Corpus file created earlier (corpus_path above).

Configure the following output variables inside the code:

  • POSTERIOR_PATH: File where posteriors (probability vectors) for each tweet will be stored.
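A minimal sketch of loading the resulting posteriors, assuming one tweet per line in the form tweet_id followed by its probability vector (the actual delimiter and layout are set inside compute-posteriors.py):

def load_posteriors(posterior_path):
    posteriors = {}
    with open(posterior_path) as f:
        for line in f:
            tweet_id, vec = line.rstrip("\n").split("\t", 1)
            posteriors[tweet_id] = [float(x) for x in vec.split()]
    return posteriors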

Verified Tweet Detection using Tree-LSTM

Generating Trees

Command:

$ python generate-trees.py

Configure the following input variables inside the code:

  • datapath: The original dataset folder (download from here)
  • feature_path: File containing input feature vectors for all tweets in the dataset. The file contains two tab-separated columns - tweet_id, features
  • output_path: Path of the folder where you want the generated trees to be stored

Each tree is stored as a dictionary. A sample tree and the corresponding stored dictionary are shown below:

tree = {
        'f': [0.234, ...], 'l': [0, 1], 'c': [
            {'f': [0.109, ...], 'l': [0, 1], 'c': []},
            {'f': [0.712, ...], 'l': [0, 1], 'c': [
                {'f': [0.352, ...], 'l': [0, 1], 'c': []}
            ]},
        ],
    }

Here, f is the input feature vector for each node of the tree, l is the true label of the root of the tree stored as a 2-dimensional one-hot vector (dim-1: verified, dim-2: unverified), and c is the list of children of a node.
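Since the trees are plain nested dictionaries, they can be walked recursively; for example, collecting every node's feature vector (keys as described above):

def collect_features(node, out=None):
    # 'f' holds the node's feature vector, 'c' its list of children.
    if out is None:
        out = []
    out.append(node["f"])
    for child in node["c"]:
        collect_features(child, out)
    return out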

Training and Testing Tree-LSTM

Command:

$ python train-Tree-LSTM.py

Configure the following input variables inside the code:

  • tree_path: Path to the folder containing generated trees (output_path of the last step).
  • IN_FEATURES: Size of the input feature vectors
  • NUM_ITERATIONS: Number of iterations for training
  • BATCH_SIZE: Batch size for training
  • test_set: Disaster events on which you want to test.
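For orientation, below is a compact sketch of a Child-Sum Tree-LSTM cell (Tai et al., 2015) applied recursively to the stored tree dictionaries. This is a generic PyTorch illustration of the technique, not the code in train-Tree-LSTM.py; the hidden size and other details are placeholders.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    # Child-Sum Tree-LSTM: gates computed from the node input and the
    # sum of child hidden states, with one forget gate per child.
    def __init__(self, in_features, hidden):
        super().__init__()
        self.W = nn.Linear(in_features, 3 * hidden)         # i, o, u from input
        self.U = nn.Linear(hidden, 3 * hidden, bias=False)  # i, o, u from children
        self.W_f = nn.Linear(in_features, hidden)           # forget gate, input part
        self.U_f = nn.Linear(hidden, hidden, bias=False)    # forget gate, per child

    def forward(self, x, child_h, child_c):
        # x: (in_features,); child_h, child_c: (num_children, hidden)
        h_tilde = child_h.sum(dim=0)
        i, o, u = (self.W(x) + self.U(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # broadcasts over children
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

def encode(tree, cell, hidden):
    # Recursively encode a stored tree dict; leaves get zero child states.
    if tree["c"]:
        hs, cs = zip(*(encode(child, cell, hidden) for child in tree["c"]))
        child_h, child_c = torch.stack(hs), torch.stack(cs)
    else:
        child_h = child_c = torch.zeros(1, hidden)
    x = torch.tensor(tree["f"], dtype=torch.float32)
    return cell(x, child_h, child_c)

The root's hidden state would then feed a two-way classifier matching the one-hot label l.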
Generating Trees with Stance

To build trees that also carry stance labels, run the tree-generation script again with the additional stance_path input configured.

Command:

$ python generate-trees.py

Configure the following input variables inside the code:

  • datapath: The original dataset folder (download from here)
  • feature_path: File containing input feature vectors for all tweets in the dataset. The file contains two tab-separated columns - tweet_id, features
  • output_path: Path of the folder where you want the generated trees to be stored
  • stance_path: Path of the folder where stance.json is available

Each tree is stored as a dictionary. A sample tree and the corresponding stored dictionary are shown below:

tree = {
        'f': [0.234, ...], 'l': [0, 1], 'stance': [1, 0, 0, 0], 'c': [
            {'f': [0.109, ...], 'l': [0, 1], 'stance': [0, 1, 0, 0], 'c': []},
            {'f': [0.712, ...], 'l': [0, 1], 'stance': [0, 0, 1, 0], 'c': [
                {'f': [0.352, ...], 'l': [0, 1], 'stance': [0, 0, 0, 1], 'c': []}
            ]},
        ],
    }
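Here, stance is the stance label of the node's tweet, stored as a 4-dimensional one-hot vector (one dimension per stance class; in PHEME-style annotations these are typically support, deny, query, and comment, though the exact order is not documented here). The f, l, and c keys are as described above.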

Parsed-trees

Contains all the trees (without stance) generated from the Corpus.txt provided in the CTP folder.

Parsed-trees_stance

Contains all the trees (with stance) generated from the Corpus.txt provided in the CTP folder.
