This repository contains text mining scripts written in Python 3 for the TwitMacet project hosted at https://twitmacet.dwiajik.com. This readme explains some of the files in the repository.
chunk.py
Parameter 1: text file name to be analyzed
This script will do the following (see the sketch after this list):
- Analyze the text file line by line
- Train a POS tagger on pos_tagged_corpus
- Tag every word with a POS tag
- Detect noun phrases and adjective phrases
- Print them line by line
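A minimal sketch of this flow using NLTK, assuming pos_tagged_corpus can be loaded as sentences of (word, tag) pairs; the inline training sentences, the NN/JJ tag names, and the chunk grammar are illustrative stand-ins, not the project's actual corpus or tag set:

```python
import sys

import nltk

# Hypothetical training sentences standing in for pos_tagged_corpus.
tagged_sentences = [
    [('jalan', 'NN'), ('kaliurang', 'NN'), ('macet', 'JJ')],
    [('lalu', 'NN'), ('lintas', 'NN'), ('padat', 'JJ')],
]

# Learn a word -> tag mapping, falling back to NN for unseen words.
tagger = nltk.UnigramTagger(tagged_sentences, backoff=nltk.DefaultTagger('NN'))

# Chunk grammar: a noun phrase is one or more nouns, an adjective
# phrase is one or more adjectives.
parser = nltk.RegexpParser(r'''
    NP: {<NN>+}
    AP: {<JJ>+}
''')

with open(sys.argv[1]) as f:  # parameter 1: text file to analyze
    for line in f:
        tokens = line.split()
        if not tokens:
            continue
        tree = parser.parse(tagger.tag(tokens))
        phrases = [' '.join(word for word, tag in subtree.leaves())
                   for subtree in tree.subtrees()
                   if subtree.label() in ('NP', 'AP')]
        print(phrases)
```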
classify.py
Parameter 1: text file name to be analyzed and classified
Parameter 2: output CSV file name, written without the extension
This script will do the following (see the sketch after this list):
- Build classification models (Naive Bayes, SVM, Decision Tree) from tweets_corpus using the sklearn library
- Read the file given in parameter 1 line by line
- Classify the tweet on each line
- Save the results in CSV format
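A minimal sketch of the classify-and-save loop; the inline training pairs stand in for tweets_corpus, and only one of the three models (Naive Bayes) is shown:

```python
import csv
import sys

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for tweets_corpus.
train_texts = ['macet total di jalan kaliurang', 'selamat pagi semuanya']
train_labels = ['traffic', 'non_traffic']

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Parameter 1: input file; parameter 2: output CSV name without extension.
with open(sys.argv[1]) as infile, \
        open(sys.argv[2] + '.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        tweet = line.strip()
        if not tweet:
            continue
        label = model.predict(vectorizer.transform([tweet]))[0]
        writer.writerow([tweet, label])
```

For example, `python3 classify.py tweets.txt result` would write result.csv.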
classify_evaluate_ten_folds.py
Parameter 1: "balance"/"imbalance" -> balance using balanced data 35,184 tweets, imbalance using imbalanced data 110,449 tweets
This script will do:
- Read tweets_corpus
- Conduct ten folds cross validation to the corpus by building classification model (Naive Bayes, SVM, Decision Tree) using sklearn library in each iteration
- Print the results of each iteration
- Print the final results (average)
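A minimal sketch of the ten-fold loop, using a synthetic stand-in corpus and only the Naive Bayes model; the real script reads tweets_corpus and evaluates all three classifiers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in corpus; the real script reads tweets_corpus.
texts = np.array(['macet di jalan nomor %d' % i for i in range(10)] +
                 ['selamat pagi nomor %d' % i for i in range(10)])
labels = np.array(['traffic'] * 10 + ['non_traffic'] * 10)

accuracies = []
folds = KFold(n_splits=10, shuffle=True).split(texts)
for fold, (train, test) in enumerate(folds, 1):
    # Fit the vectorizer and model on the training split only.
    vectorizer = CountVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(texts[train]),
                                labels[train])
    accuracy = accuracy_score(labels[test],
                              model.predict(vectorizer.transform(texts[test])))
    accuracies.append(accuracy)
    print('fold %d: accuracy %.3f' % (fold, accuracy))

print('average accuracy: %.3f' % (sum(accuracies) / len(accuracies)))
```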
count_words.py
Parameter 1: text file name to be analyzed
This script will do the following (see the sketch after this list):
- Read the text file
- Do preprocessing
- Count the occurrences of each word in the file
- Print the 50 most frequent words
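A minimal sketch, assuming preprocessing means lowercasing and stripping punctuation; the real script's preprocessing steps may differ:

```python
import re
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:  # parameter 1: text file to analyze
    for line in f:
        # Crude preprocessing: lowercase and strip non-alphanumerics.
        cleaned = re.sub(r'[^a-z0-9\s]', ' ', line.lower())
        counts.update(cleaned.split())

# Print the 50 most frequent words with their counts.
for word, count in counts.most_common(50):
    print('%s\t%d' % (word, count))
```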
feature_word_list.txt
The 50 most frequent words from tweets_corpus/traffic_tweets_combined.txt, cleaned of unused words. This leaves 40 words that are used as features.
get_random_tweets.py
Parameter 1: number of random tweets to collect
This script will do the following (see the sketch after this list):
- Stream tweets from the Twitter Streaming API with the track filter parameters:
track=['aku', 'mending', 'gak', 'nggak', 'ngga', 'oke', 'tapi', 'tidak']
- Save the tweets to random_tweets.txt
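A minimal sketch using the tweepy 3.x streaming API; the credentials are placeholders, and stopping after N tweets by returning False from on_status is an assumption about how parameter 1 is enforced:

```python
import sys

import tweepy

class RandomTweetListener(tweepy.StreamListener):
    def __init__(self, limit):
        super().__init__()
        self.count = 0
        self.limit = limit

    def on_status(self, status):
        # Append each tweet as a single line of text.
        with open('random_tweets.txt', 'a') as f:
            f.write(status.text.replace('\n', ' ') + '\n')
        self.count += 1
        return self.count < self.limit  # False disconnects the stream

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

stream = tweepy.Stream(auth=auth,
                       listener=RandomTweetListener(int(sys.argv[1])))
stream.filter(track=['aku', 'mending', 'gak', 'nggak', 'ngga',
                     'oke', 'tapi', 'tidak'])
```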
get_tweets.py
Parameter 1: Twitter username
Get all tweets from a Twitter username (the Twitter REST API caps this at the most recent 3,200 tweets) and save them to [username].txt, as sketched below.
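A minimal sketch using tweepy's Cursor pagination over user_timeline; the credentials are placeholders, and the 3,200-tweet cap is enforced by the API itself rather than by the script:

```python
import sys

import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

username = sys.argv[1]  # parameter 1: Twitter username
with open(username + '.txt', 'w') as f:
    # Cursor pages through the timeline until the API's ~3,200-tweet cap.
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name=username, count=200).items():
        f.write(status.text.replace('\n', ' ') + '\n')
```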
init.sh
Bash script to install the required Python interpreter and libraries.
name_list.txt
List of location names in Yogyakarta Province, Indonesia, gathered from several sources.
replacement_word_list.txt
List of abbreviations and the full phrases they stand for.
stop_words_list.txt
List of Indonesian (Bahasa Indonesia) stopwords; not used in the project yet, but it may be useful later.
stream_and_classify.py
*Parameter 1: "prod" to only save result file in *.txt and .csv; "dev" to also print to screen
This script will do:
- Build classification model (Naive Bayes, SVM, Decision Tree) from tweets_corpus using sklearn library
- Stream from Twitter Streaming API tweets of Yogyakarta Province, Indonesia
- Classify tweets to "traffic" and "non_traffic"
- Save to file and print to screen
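A minimal sketch combining the stream and the classifier; the bounding box is a rough approximation of Yogyakarta Province, the inline training pairs stand in for tweets_corpus, and only the Naive Bayes model and the "dev" printing branch are shown:

```python
import sys

import tweepy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for tweets_corpus.
vectorizer = CountVectorizer()
model = MultinomialNB().fit(
    vectorizer.fit_transform(['macet total di jalan', 'selamat pagi semuanya']),
    ['traffic', 'non_traffic'])

class ClassifyListener(tweepy.StreamListener):
    def __init__(self, mode):
        super().__init__()
        self.mode = mode  # "prod" or "dev"

    def on_status(self, status):
        tweet = status.text.replace('\n', ' ')
        label = model.predict(vectorizer.transform([tweet]))[0]
        with open(label + '.txt', 'a') as f:  # traffic.txt / non_traffic.txt
            f.write(tweet + '\n')
        if self.mode == 'dev':
            print(label, tweet)

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

stream = tweepy.Stream(auth=auth, listener=ClassifyListener(sys.argv[1]))
stream.filter(locations=[110.0, -8.2, 110.8, -7.5])  # rough Yogyakarta box
```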
stream_classify_save_db.py
*Parameter 1: "prod" to only save to mysql database; "dev" to also print result to screen, save file in *.txt and .csv
This script will do:
- Build classification model (Naive Bayes, SVM, Decision Tree) from tweets_corpus using sklearn library
- Stream from Twitter Streaming API tweets of Yogyakarta Province, Indonesia
- Classify tweets to "traffic" and "non_traffic"
- Save to MySQL database
- Save to file and print to screen (if "dev")
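A minimal sketch of the database step only, assuming the pymysql driver and a hypothetical `tweets` table with `text` and `category` columns; the streaming and classification parts mirror stream_and_classify.py above:

```python
import pymysql

# Hypothetical connection parameters and table schema.
connection = pymysql.connect(host='localhost', user='twitmacet',
                             password='secret', database='twitmacet')

def save_tweet(tweet, label):
    # Parameterized query; the driver escapes the tweet text safely.
    with connection.cursor() as cursor:
        cursor.execute('INSERT INTO tweets (text, category) VALUES (%s, %s)',
                       (tweet, label))
    connection.commit()

save_tweet('macet parah di ringroad utara', 'traffic')
```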
tagged_name_list.txt
List of location names in Yogyakarta Province, Indonesia, tagged with PRFX, B-LOC, and I-LOC tags.
tweet_object_example.json
Sample tweet object as returned by the Twitter API.