Scraping tweets about koinworks

we got em boys. a complaint (keluhan) classifier. classify which tweets are "complaints to koinworks about late payments"

real questions

What high-level trends can be inferred from Koinworks tweets?
Are there any events that lead to spikes in koinworks Twitter activity?
Which topics are distinct from each other? disease outbreaks?

approach

finding the complaint tweets

scrape the tweets
pretrain embeddings using flair
topic model using dbscan on flair's embeddings
find a topic that describes complaints well
- from wordings
- random sample
relabel the assumed complaint tweets
- involves deleting the tweets that aren't related to complaints
after finding a good complaint dataset, find similar tweets (ones that aren't already in the complaint dataset)
- making sure every tweet ends up labeled as either complaint or not complaint

methods

all these methods for finding the complaint tweets are thematically related.

scraping: twint
topic model: ktrain's get_document_topic
similar texts: cosine similarity (on tfidf trained with ktrain's); ktrain uses one-class classification (svm)
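To make the "similar texts" step concrete, here is a minimal sketch of cosine similarity over tf-idf vectors. It is written with the standard library only and is not ktrain's implementation; the smoothed idf formula, the whitespace tokenizer, and the sample tweets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # smoothed idf (an assumption), so very common terms keep a small weight
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# made-up tweets, only for illustration
tweets = [
    "koinworks telat bayar lagi",
    "telat bayar terus koinworks",
    "pakai kode referal koinworks",
]
vecs = tfidf_vectors(tweets)
sim_complaints = cosine(vecs[0], vecs[1])  # two complaint-like tweets
sim_promo = cosine(vecs[0], vecs[2])       # complaint vs referral tweet
```

The two complaint-like tweets share most of their terms, so they score higher against each other than against the referral tweet.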

twitter search keywords

koinworks
koinwork

pipeline

cleaning
eda something missing here
cluster_topic (choosing the complaint topic) [optional]; check the eda results in check_topics_clustering.ipynb
train classifier (bi-lstm); check the misclassified results and adjust accordingly
serve the model

news site?

kompas?
google? dunno, will do if feeling cute lol

preprocessing with texthero

remove brackets
remove diacritics
remove punctuation
remove numbers?
remove indo stopwords
drop duplicate tweets:
- shady promos -> usually bots
- referral codes -> a single user often shares the same one several times
drop koinworks' own tweets
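The cleaning steps above can be sketched roughly as follows. This is a standard-library stand-in, not texthero itself; the stopword list is a tiny placeholder, the account handle is assumed, and dropping the text inside brackets (rather than just the bracket characters) is also an assumption.

```python
import re
import string
import unicodedata

# tiny placeholder list; a real run would use a full Indonesian stopword list
STOPWORDS = {"yang", "di", "dan", "ke", "ini", "itu"}
OWN_ACCOUNT = "koinworks"  # assumed handle of the official account


def clean(text):
    text = text.lower()
    # remove bracketed text (assumption: the content goes too)
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)
    # remove diacritics: decompose, then drop combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # remove punctuation and numbers
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    text = re.sub(r"\d+", " ", text)
    # remove stopwords and collapse whitespace
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def preprocess(rows):
    """rows: list of (username, tweet). Drops own-account tweets and duplicates."""
    seen = set()
    kept = []
    for username, tweet in rows:
        if username.lower() == OWN_ACCOUNT:
            continue
        cleaned = clean(tweet)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            kept.append((username, cleaned))
    return kept


# made-up rows, only for illustration
rows = [
    ("userA", "Dana di KoinWorks telat cair (3 hari)!!"),
    ("userA", "Dana di KoinWorks telat cair (3 hari)!!"),
    ("koinworks", "Promo baru dari kami"),
    ("userB", "Caf\u00e9 123 dan kode referal"),
]
cleaned_rows = preprocess(rows)
```

Deduplicating on the cleaned text (instead of the raw text) also catches repeated promo or referral tweets that differ only in punctuation or numbers.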

EDA

top 100 most common unigram
top 100 most common bigram
top 100 most common trigram
wordcloud
maybe topic modelling with LDA
distribution of the words that indicate a complaint
visualize
- ~~tfidf~~
- ~~kmeans~~
- flair (pca-ed lol)
- ~~lda~~
- ~~dbscan~~
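For the unigram / bigram / trigram counts in the EDA list, a minimal sketch with `collections.Counter` could look like this. The whitespace tokenizer and the sample tweets are assumptions; it expects text that has already been cleaned.

```python
from collections import Counter


def top_ngrams(texts, n, k=100):
    """Return the k most common word n-grams across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# made-up cleaned tweets, only for illustration
tweets = [
    "dana telat cair",
    "telat cair lagi",
    "dana telat cair terus",
]
top_unigrams = top_ngrams(tweets, 1, k=3)
top_bigrams = top_ngrams(tweets, 2, k=1)
top_trigrams = top_ngrams(tweets, 3, k=1)
```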

labelling

search for a tweet that is definitely a complaint, then use cosine similarity to find similar ones and label those as complaints too; see koinworks_labeled_lda.csv

most likely complaint keywords: ['telat', ] ("telat" means "late")

do as above, but instead of complaints, search for the "good things"
do as above, but search for the non-essential tweets (promo, etc)
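A rough sketch of the seed-then-propagate labelling described above: each unlabelled tweet gets the label of its most similar seed, if the similarity clears a threshold. The seed texts, the label names, the threshold of 0.3, and the use of plain bag-of-words cosine (rather than tf-idf) are all assumptions for illustration.

```python
import math
from collections import Counter

# hypothetical seed tweets, one per label
SEEDS = {
    "complaint": "pembayaran telat terus",
    "positive": "imbal hasil bagus",
    "non_essential": "pakai kode referal",
}


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def propagate(tweets, seeds=SEEDS, threshold=0.3):
    """Label each tweet with its best-matching seed label, or None if too dissimilar."""
    labels = {}
    for tweet in tweets:
        label, score = max(
            ((name, bow_cosine(tweet, seed)) for name, seed in seeds.items()),
            key=lambda pair: pair[1],
        )
        labels[tweet] = label if score >= threshold else None
    return labels


# made-up tweets, only for illustration
labels = propagate(["pembayaran telat lagi", "kode referal saya", "selamat pagi semua"])
```

Tweets that come back as `None` are the ones that still need a manual look.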

frontend

classifier:
- decides whether a tweet is a complaint or not, with an explanation: this is available in ktrain's module
dashboard:
- number of complaints per day
- top complaint keywords
- labels for the graph, in addition to color
search engine:
- find out which cases are similar to the one being searched for
  - this lists the username, the tweet, and the date it was tweeted
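The two dashboard numbers (complaints per day and top complaint keywords) boil down to simple counting once tweets are classified. A minimal sketch, assuming each tweet is a dict with hypothetical `username`, `date`, `text` and `is_complaint` fields:

```python
from collections import Counter
from datetime import date

# made-up, already-classified tweets; the field names are assumptions
TWEETS = [
    {"username": "userA", "date": date(2020, 1, 1), "text": "dana telat cair", "is_complaint": True},
    {"username": "userB", "date": date(2020, 1, 1), "text": "bayar telat lagi", "is_complaint": True},
    {"username": "userC", "date": date(2020, 1, 2), "text": "pakai kode referal", "is_complaint": False},
    {"username": "userD", "date": date(2020, 1, 2), "text": "dana telat cair juga", "is_complaint": True},
]


def daily_complaints(rows):
    """Number of complaint tweets per day."""
    return Counter(row["date"] for row in rows if row["is_complaint"])


def top_keywords(rows, k=10):
    """Most common words across complaint tweets."""
    counts = Counter(
        token for row in rows if row["is_complaint"] for token in row["text"].split()
    )
    return counts.most_common(k)


per_day = daily_complaints(TWEETS)
keywords = top_keywords(TWEETS, k=3)
```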

embeddings

these are pooled document embeddings built on:

flair

pretrain with lm-forward + tweets
make a tweet encoder; the flair model can be downloaded here

fasttext-id

WordEmbeddings('id-crawl')
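"Pooled" here means the per-token vectors of a tweet are combined into a single document vector. A minimal sketch of mean pooling by hand, assuming toy 2-dimensional word vectors; in flair this kind of pooling is what `DocumentPoolEmbeddings` is for, but the code below does not use flair.

```python
def mean_pool(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# toy word vectors, only for illustration (real ones have hundreds of dimensions)
WORD_VECTORS = {
    "dana": [1.0, 0.0],
    "telat": [0.0, 1.0],
    "cair": [1.0, 1.0],
}
doc_vector = mean_pool("dana telat cair".split(), WORD_VECTORS, dim=2)
unknown_vector = mean_pool("halo dunia".split(), WORD_VECTORS, dim=2)
```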

search engine

annoy

~~make tree~~ just do exhaustive search
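Exhaustive search just means scoring the query against every stored vector and keeping the best k, with no index structure. A minimal sketch in plain Python (not annoy's API), assuming the index is a dict of hypothetical tweet ids to embedding vectors and that cosine similarity is the metric:

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def exhaustive_search(query, index, k=3):
    """Return the k (id, cosine similarity) pairs most similar to the query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), key)
        for key, vec in index.items()
    )
    return [(key, score) for score, key in heapq.nlargest(k, scored)]


# toy index, only for illustration
INDEX = {
    "tweet-1": [1.0, 0.0],
    "tweet-2": [1.0, 1.0],
    "tweet-3": [0.0, 1.0],
}
results = exhaustive_search([1.0, 0.1], INDEX, k=2)
```

This is linear in the number of tweets per query, which is usually fine for a dataset of this size.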

classifier

train a tf model / fastai model
onnx / fastinference

milvus

~~milvus~~

make embedding:
- tfidf, done tfidf.pkl
- fasttext
- flairembeddings
  - ~~ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2. no idea why, nothing should be producing 3 dimensions~~ switched to scipy
ended up not using milvus, because it turns out to be a framework that comes bundled with its own rest api

the ids follow 0_koinworks_raw.csv; they were generated as uuid4 to make building the indexer easier
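A small sketch of what assigning uuid4 ids to raw rows can look like, so the indexer can map a search hit back to its tweet. The column names are assumptions, and the CSV is written to an in-memory buffer here rather than to 0_koinworks_raw.csv.

```python
import csv
import io
import uuid


def add_ids(rows):
    """Attach a random uuid4 string to every row."""
    return [{"id": str(uuid.uuid4()), **row} for row in rows]


# made-up rows; the column names are assumptions
raw = [
    {"username": "userA", "tweet": "dana telat cair"},
    {"username": "userB", "tweet": "pakai kode referal"},
]
with_ids = add_ids(raw)

# id -> row lookup, which is what the search side needs
lookup = {row["id"]: row for row in with_ids}

# write to an in-memory CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "username", "tweet"])
writer.writeheader()
writer.writerows(with_ids)
```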

potential complaint topics

lda

1, 29, 7

these topics cover a range of complaints
- cs not replying
- website error
- app error
- funds can't be withdrawn
- tenor suddenly changed

kmeans

k was determined from where the silhouette score converged; see check_topics_clustering.ipynb. on kmeans: dumb random shit. decided not to use it as a clustering method

tested on both tfidf and flair embeddings
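For reference, the silhouette score used to compare values of k can be computed by hand as below. This is a plain-Python sketch with Euclidean distance on toy 2-d points; the notebook may well use a library implementation instead, and the two cluster assignments are hard-coded rather than produced by kmeans.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient for a given cluster assignment (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# toy points forming three obvious groups, only for illustration
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
score_k2 = silhouette(points, [0, 0, 1, 1, 1, 1])
score_k3 = silhouette(points, [0, 0, 1, 1, 2, 2])
```

On this toy data the 3-cluster assignment scores higher than the 2-cluster one, which is the kind of comparison used to pick k.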

dbscan

9, sucks.

top2vec

see /experiments/

blog post ideas

this is for the opening
- meme: top: SHARE KODE KW ("share the KW code")
- meme: bottom: KU TERTYPU OLEH KW ("I got fooled by KW")

extras

the app also disappeared for a while lol. check ids: 517, 529, , check the dates, check the sources
from twitter search, activity peaked at 263 tweets on 04-02-2020 and 09-01-2020
with kmeans, the 3 labels separated nicely right away
right, funded, but the tenor isn't getting paid :)))
- source 2
- source 3
got funded again, woah

svmihar / experimental-koinworks-complaints