dwiajik / twit-macet-mining

Python mining script for Twit Macet

Python mining script for TwitMacet

This repository contains text mining scripts written in Python 3 for the TwitMacet project, hosted at https://twitmacet.dwiajik.com. This README describes some of the files in the repository.

chunk.py

Parameter 1: text file name to be analyzed

This script will do:

  • Analyze the text file line by line
  • Train a POS tagger on pos_tagged_corpus
  • Tag every word with its POS tag
  • Detect noun phrases and adjective phrases
  • Print them line by line
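
The phrase-detection step can be illustrated with a minimal sketch. It assumes the words are already POS-tagged and that nouns and adjectives carry the tags `NN`, `NNP`, and `JJ`; those tag names, the `chunk_phrases` helper, and the sample sentence are hypothetical, since the real tag set comes from pos_tagged_corpus and chunk.py may detect phrases differently.

```python
# Hypothetical tag names; the real tag set comes from pos_tagged_corpus.
NOUN_TAGS = {"NN", "NNP"}
ADJ_TAGS = {"JJ"}


def chunk_phrases(tagged_tokens):
    """Group runs of nouns into noun phrases (NP) and runs of adjectives
    into adjective phrases (ADJP)."""
    phrases = []
    current_words, current_label = [], None
    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            label = "NP"
        elif tag in ADJ_TAGS:
            label = "ADJP"
        else:
            label = None
        # Close the open phrase whenever the phrase type changes.
        if label != current_label and current_words:
            phrases.append((current_label, " ".join(current_words)))
            current_words = []
        if label is not None:
            current_words.append(word)
        current_label = label
    if current_words:
        phrases.append((current_label, " ".join(current_words)))
    return phrases


# Illustrative input: one tagged line.
tagged = [("jalan", "NN"), ("malioboro", "NNP"), ("macet", "JJ"),
          ("parah", "JJ"), ("sekali", "RB")]
for label, phrase in chunk_phrases(tagged):
    print(label, phrase)
```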

classify.py

Parameter 1: text file name to be analyzed and classified

Parameter 2: output CSV file, write without the extension

This script will do:

  • Build classification model (Naive Bayes, SVM, Decision Tree) from tweets_corpus using sklearn library
  • Read file stated in parameter 1 line by line
  • Classify each tweet on each line
  • Save the result in CSV format
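
As a rough sketch of the input and output flow only: the snippet below reads a file line by line, labels each tweet, and writes a CSV file. The `keyword_classifier` function is a stand-in that matches a single keyword, not the trained sklearn models the script builds, and the function names and CSV column names are assumptions.

```python
import csv


def keyword_classifier(tweet):
    """Stand-in for a trained model: labels a tweet by keyword match only."""
    return "traffic" if "macet" in tweet.lower() else "non_traffic"


def classify_file(input_path, output_name, classify=keyword_classifier):
    """Classify each non-empty line of input_path and write <output_name>.csv."""
    output_path = output_name + ".csv"  # parameter 2 is given without the extension
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["tweet", "label"])  # assumed column names
        for line in src:
            tweet = line.strip()
            if tweet:
                writer.writerow([tweet, classify(tweet)])
    return output_path
```

Passing a different `classify` callable, for example the `predict` wrapper of a fitted model, keeps the file handling unchanged.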

classify_evaluate_ten_folds.py

Parameter 1: "balance" or "imbalance". "balance" uses the balanced data set (35,184 tweets); "imbalance" uses the imbalanced data set (110,449 tweets).

This script will do:

  • Read tweets_corpus
  • Run ten-fold cross-validation on the corpus, building the classification models (Naive Bayes, SVM, Decision Tree) with the sklearn library in each iteration
  • Print the results of each iteration
  • Print the final results (average)
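
The fold logic can be sketched in plain Python as follows. This is an illustration of ten-fold cross-validation in general, not the script's own code: the helper names are hypothetical, and a majority-class baseline stands in for the sklearn models. It assumes there are at least as many samples as folds.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def train_majority(train_samples, train_labels):
    """Baseline stand-in for a real model: always predicts the majority label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority


def cross_validate(samples, labels, train_fn, k=10):
    """Train on k-1 folds, score accuracy on the held-out fold, then average."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(model(samples[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores, sum(scores) / len(scores)


# Illustrative data: 15 non-traffic and 5 traffic tweets.
samples = [f"tweet {i}" for i in range(20)]
labels = ["non_traffic"] * 15 + ["traffic"] * 5
scores, mean_accuracy = cross_validate(samples, labels, train_majority)
for fold_number, score in enumerate(scores, start=1):
    print(f"fold {fold_number}: accuracy {score:.2f}")
print(f"average accuracy: {mean_accuracy:.2f}")
```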

count_words.py

Parameter 1: text file name to be analyzed

This script will do:

  • Read the text file
  • Do preprocessing
  • Count the occurrences of each word in the file
  • Print the 50 most frequent words
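
A minimal sketch of these steps using the standard library is shown below. The preprocessing rules here (lowercasing, dropping URLs and @mentions, keeping alphabetic tokens) are assumptions; the README does not say which rules count_words.py applies.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase, drop URLs and @mentions, and keep alphabetic tokens only."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    return re.findall(r"[a-z]+", text)


def top_words(lines, n=50):
    """Return the n most frequent words across all lines as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(preprocess(line))
    return counts.most_common(n)


# Illustrative input; the script reads these lines from the file in parameter 1.
lines = ["Macet di Jalan Solo @user http://t.co/x", "jalan solo macet parah"]
for word, count in top_words(lines):
    print(word, count)
```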

feature_word_list.txt

The 50 most frequent words in tweets_corpus/traffic_tweets_combined.txt were taken, and words not useful as features were removed. This leaves 40 words that are used as features.
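
One way such a word list can be turned into classifier input is a binary presence vector, sketched below. Whether the project uses presence flags or counts is not stated here, so treat this as an assumption; `extract_features` and the three sample words are hypothetical and are not taken from feature_word_list.txt.

```python
def extract_features(tweet, feature_words):
    """Return one 0/1 flag per feature word: 1 if the word occurs in the tweet."""
    tokens = set(tweet.lower().split())
    return [1 if word in tokens else 0 for word in feature_words]


# Illustrative feature words; the real list has 40 entries.
feature_words = ["macet", "lancar", "jalan"]
print(extract_features("Macet parah di Jalan Solo", feature_words))
```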

get_random_tweets.py

Parameter 1: number of random tweets that you want to get

This script will do:

  • Stream tweets from Twitter Streaming API with track filter parameters:
track=['aku', 'mending', 'gak', 'nggak', 'ngga', 'oke', 'tapi', 'tidak']
  • Save tweets in random_tweets.txt

get_tweets.py

Parameter 1: Twitter username

Gets all tweets from a Twitter username (the Twitter REST API returns at most 3,200 tweets) and saves them to [username].txt

init.sh

Bash script to install the required Python interpreter and libraries.

name_list.txt

List of location names in Yogyakarta Province, Indonesia, gathered from several sources.

replacement_word_list.txt

List of abbreviations and their real phrases.
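
A list like this is typically applied as a token-by-token lookup during preprocessing. The sketch below assumes the pairs have already been loaded into a dictionary; the file's actual format is not described here, and the three entries are illustrative examples rather than entries copied from the file.

```python
# Illustrative abbreviation pairs (assumed, not read from replacement_word_list.txt).
REPLACEMENTS = {"jl": "jalan", "yg": "yang", "dgn": "dengan"}


def expand_abbreviations(text, replacements=REPLACEMENTS):
    """Lowercase the text and replace each abbreviated token with its full form."""
    return " ".join(replacements.get(token, token) for token in text.lower().split())


print(expand_abbreviations("Jl Solo yg macet"))
```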

stop_words_list.txt

List of stop words in Bahasa Indonesia. It is not used in the project yet, but may be useful later.

stream_and_classify.py

Parameter 1: "prod" to only save the results to .txt and .csv files; "dev" to also print them to the screen

This script will do:

  • Build classification model (Naive Bayes, SVM, Decision Tree) from tweets_corpus using sklearn library
  • Stream tweets from Yogyakarta Province, Indonesia, via the Twitter Streaming API
  • Classify tweets as "traffic" or "non_traffic"
  • Save to file and print to screen

stream_classify_save_db.py

Parameter 1: "prod" to only save to the MySQL database; "dev" to also print the results to the screen and save them to .txt and .csv files

This script will do:

  • Build classification model (Naive Bayes, SVM, Decision Tree) from tweets_corpus using sklearn library
  • Stream tweets from Yogyakarta Province, Indonesia, via the Twitter Streaming API
  • Classify tweets as "traffic" or "non_traffic"
  • Save to MySQL database
  • Save to file and print to screen (if "dev")

tagged_name_list.txt

List of location names in Yogyakarta Province, Indonesia, tagged with PRFX, B-LOC, and I-LOC tags.
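
The tagging scheme might look like the sketch below, under two assumptions that this README does not confirm: that PRFX marks a leading prefix word such as "Jalan" (street), and that B-LOC and I-LOC follow the usual begin/inside convention for the remaining tokens. The `PREFIXES` set and the `tag_location` helper are hypothetical.

```python
# Hypothetical prefix words; the real list is not documented here.
PREFIXES = {"jalan", "jl"}


def tag_location(name):
    """Tag one location name: leading prefix words get PRFX, the first
    remaining token gets B-LOC, and every later token gets I-LOC."""
    tagged = []
    seen_location = False
    for token in name.split():
        if not seen_location and token.lower() in PREFIXES:
            tagged.append((token, "PRFX"))
        elif not seen_location:
            tagged.append((token, "B-LOC"))
            seen_location = True
        else:
            tagged.append((token, "I-LOC"))
    return tagged


for token, tag in tag_location("Jalan Laksda Adisucipto"):
    print(token, tag)
```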

tweet_object_example.json

Sample tweet object as returned by the Twitter API.
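
For orientation, the sketch below pulls a few commonly used fields out of a tweet object. The field names (`id_str`, `text`, `created_at`, `user.screen_name`, `coordinates`) follow the Twitter REST API v1.1 tweet format; the sample values and the `summarize_tweet` helper are made up for illustration and are not taken from tweet_object_example.json.

```python
import json

# Made-up sample with a small subset of the fields a real tweet object has.
raw = """{
  "created_at": "Mon Jan 01 00:00:00 +0000 2018",
  "id_str": "1",
  "text": "Jalan Solo macet",
  "user": {"screen_name": "example_user"},
  "coordinates": null
}"""


def summarize_tweet(tweet):
    """Extract the fields most relevant to classification and mapping."""
    coordinates = tweet.get("coordinates") or {}  # null when the tweet has no geotag
    return {
        "id": tweet.get("id_str"),
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "lon_lat": coordinates.get("coordinates"),  # [longitude, latitude] if present
    }


print(summarize_tweet(json.loads(raw)))
```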

License: Apache License 2.0


Languages

Python 99.7%, Shell 0.3%