we got em boys. keluhan (complaint) classifier: classify which tweets are complaints to Koinworks about late payments ("keluhan terhadap telat bayar ke koinworks").
- What high-level trends can be inferred from Koinworks tweets?
- Are there any events that lead to spikes in Koinworks Twitter activity?
- Which topics are distinct from each other? disease outbreaks?
- scrape the tweets
- pretrain embeddings using Flair
- topic model using DBSCAN on the Flair embeddings
- find a topic that describes complaints well
  - from the wording
  - from a random sample
- relabel the assumed complaint tweets
  - this involves deleting tweets that aren't related to complaints
- after building a good complaint dataset, find similar tweets (ones that aren't already in the complaint dataset)
- make sure every tweet is labeled as either complaint or not complaint
All of these methods for finding complaint tweets are thematically related.
- scraping: twint
- topic model: ktrain's get_document_topic
- similar texts: cosine similarity on TF-IDF vectors (trained with ktrain); ktrain itself uses one-class classification (SVM)
- koinworks
- koinwork
- cleaning
- EDA (something is missing here)
- cluster_topic (choosing the complaint topic)
- [optional] check the EDA result in check_topics_clustering.ipynb
- train classifier (Bi-LSTM); check the misclassified results and adjust accordingly
- serve the model
- kompas?
- google? dunno, will do if feeling cute lol
- remove brackets
- remove diacritics
- remove punctuation
- remove numbers?
- remove Indonesian stopwords
- drop duplicate tweets:
  - unclear promos -> usually bots
  - referral codes -> one user often shares the same code several times
- drop Koinworks' own tweets
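The cleaning steps above could be sketched roughly as follows. This is only a sketch under assumptions: the stopword set is a tiny hypothetical sample, "remove brackets" is read as removing the bracket characters (not their contents), lowercasing is added so duplicates compare equal, and the `username` / `tweet` field names and the `koinworks` account name are assumed rather than taken from the real data.

```python
import re
import string
import unicodedata

# Hypothetical sample; a real run would load a full Indonesian stopword list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text: str) -> str:
    text = text.lower()  # assumption: case is not informative here
    text = re.sub(r"[\[\](){}<>]", " ", text)  # remove bracket characters
    # Remove diacritics: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    text = re.sub(r"\d+", " ", text)  # remove numbers
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def drop_noise(tweets, own_account="koinworks"):
    """Drop the company's own tweets and exact duplicates after cleaning."""
    seen = set()
    kept = []
    for row in tweets:
        if row["username"].lower() == own_account:
            continue
        cleaned = clean_tweet(row["tweet"])
        if cleaned in seen:  # repeated promos / referral codes end up here
            continue
        seen.add(cleaned)
        kept.append({**row, "clean": cleaned})
    return kept
```

Deduplicating on the cleaned text rather than the raw text means reposts that differ only in punctuation or casing are also dropped.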
- top 100 most common unigrams
- top 100 most common bigrams
- top 100 most common trigrams
- wordcloud
- maybe topic modelling with LDA
- distribution of words that indicate complaints
- visualize
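The n-gram counts in the list above can be sketched with the standard library alone. The function name `top_ngrams` and the whitespace tokenization are assumptions for illustration; the input is expected to be already-cleaned tweet text.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=100):
    """Return the k most common word n-grams across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Slide a window of n tokens over the tweet.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


texts = ["dana telat dibayar", "dana telat lagi"]
print(top_ngrams(texts, n=2, k=1))
```

Calling it with `n=1`, `n=2`, and `n=3` covers the unigram, bigram, and trigram lists.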
- tfidf, kmeans
- flair (PCA-ed lol)
- lda, dbscan
- search for tweets that are definitely complaints, then use cosine similarity to find similar ones and label those as complaints too
  check in koinworks_labeled_lda.csv
  most likely complaint keywords: ['telat', ]
- do the same, but instead of complaints, search for the "good things"
- do the same, but search for the non-essential tweets (promos, etc.)
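The seed-and-expand labeling described above might look something like this. It is a pure-Python sketch, not the ktrain-based code: the TF-IDF weighting (smoothed IDF), the `propagate_label` helper, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dict term -> weight) with smoothed IDF."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_label(docs, seed_indices, threshold=0.5):
    """Return indices of the seeds plus every doc close enough to any seed."""
    vectors = tfidf_vectors(docs)
    labeled = set(seed_indices)
    for i, vec in enumerate(vectors):
        if i in labeled:
            continue
        # Compare against the original seeds only, so labels do not chain.
        if any(cosine(vec, vectors[s]) >= threshold for s in seed_indices):
            labeled.add(i)
    return sorted(labeled)


docs = ["dana telat dibayar", "dana telat dibayar lagi", "promo kode referal"]
print(propagate_label(docs, seed_indices=[0]))
```

The same function can be reused for the "good things" and promo tweets by swapping the seed indices; the expanded labels still need a manual check.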
- classifier:
  - determines whether a tweet is a complaint or not, with an explanation: this is available in ktrain's module
- dashboard:
  - number of complaints per day
  - top complaint keywords
  - labels for the graph, in addition to color
- search engine
  - can find which cases are similar to the one being searched for
  - lists the username, the tweet, and the date it was tweeted
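For the dashboard numbers listed above (complaints per day and top complaint keywords), a minimal aggregation sketch is shown here. The row layout (`date`, `tweet`, `label`) and the `"keluhan"` label value are assumptions about how the labeled data might be stored.

```python
from collections import Counter


def daily_complaints(rows, label="keluhan"):
    """Count complaint tweets per date, sorted by date string."""
    counts = Counter(row["date"] for row in rows if row["label"] == label)
    return dict(sorted(counts.items()))


def top_keywords(rows, k=10, label="keluhan"):
    """Most common tokens among complaint tweets (expects cleaned text)."""
    counts = Counter()
    for row in rows:
        if row["label"] == label:
            counts.update(row["tweet"].split())
    return counts.most_common(k)


rows = [
    {"date": "2020-01-01", "tweet": "dana telat", "label": "keluhan"},
    {"date": "2020-01-01", "tweet": "telat lagi", "label": "keluhan"},
    {"date": "2020-01-02", "tweet": "telat terus", "label": "keluhan"},
    {"date": "2020-01-02", "tweet": "promo kode", "label": "promo"},
]
print(daily_complaints(rows))
print(top_keywords(rows, k=1))
```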
these are pooled document embeddings built on:
- pretrain with lm-forward + tweets
- make a tweet encoder; the Flair model can be downloaded here
WordEmbeddings('id-crawl')
just make it treeexhaustive
- train a tf model / fastai model
- onnx / fastinference
- make embedding:
- tfidf, done
tfidf.pkl
- fasttext
- flairembeddings
ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2. No idea why, since nothing should be producing 3 dimensions. Switch to scipy.
- tfidf, done
- ended up not using Milvus, because it turns out to be a framework bundled together with its own REST API
the IDs follow the ones in 0_koinworks_raw.csv
they have already been generated as uuid4
to make building the indexer easier
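Assigning uuid4 IDs so an indexer can look rows up by key could be done along these lines. The in-memory CSV, its column names, and the `add_uuid_column` helper are hypothetical stand-ins; the real IDs live in 0_koinworks_raw.csv.

```python
import csv
import io
import uuid


def add_uuid_column(rows):
    """Give each row a random uuid4 string under the 'id' key."""
    return [{"id": str(uuid.uuid4()), **row} for row in rows]


# Stand-in for the raw CSV (hypothetical columns and content).
raw = io.StringIO("username,tweet\nbudi,dana telat\nsari,app error\n")
rows = add_uuid_column(list(csv.DictReader(raw)))

# A plain dict keyed by ID is enough for an indexer to map hits back to tweets.
index = {row["id"]: row for row in rows}
```

Generating the IDs once and storing them with the raw data keeps every derived file (embeddings, labels, index) joinable on the same key.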
1, 29, 7
- topics cover a range of complaints
  - CS not replying
  - website error
  - app error
  - funds can't be withdrawn
  - tenor suddenly changes
determining k on k-means using the converged silhouette score, see check_topics_clustering.ipynb
results looked basically random, so decided not to use k-means as a clustering method
- tested on both TF-IDF and Flair embeddings
9, sucks.
see /experiments/
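For reference, picking k by silhouette score on k-means, as tried above, can be sketched in pure Python. This is not the notebook's code: the tiny k-means with a deterministic init, the toy 2-D points, and the candidate range of k are all assumptions for illustration; real runs would use the TF-IDF or Flair vectors.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means; init picks evenly spaced input points."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if j != i and labels[j] == labels[i]
        ]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance inside own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in set(labels)
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: three well-separated blobs (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
```

On clean toy blobs like these the score peaks clearly; on noisy tweet embeddings the curve can be flat, which fits the note above about the results looking random.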
- this is for the opening
- meme: top: SHARE KODE KW ("share your KW code")
- meme: bottom: KU TERTYPU OLEH KW ("I got fooled by KW")