svmihar / experimental-koinworks-complaints

A good complaint is a shared complaint.

Scraping tweets about koinworks

we got em boys: a complaint (keluhan) classifier. it classifies which tweets are "complaints about late payments to koinworks"

real questions

  1. What high-level trends can be inferred from Koinworks tweets?
  2. Are there any events that lead to spikes in koinworks Twitter activity?
  3. Which topics are distinct from each other? disease outbreaks?

approach

finding the complaint tweets

  • scrape the tweets
  • pretrain embeddings using flair
  • topic model using dbscan on flair's embedding
  • find a topic that describes complaints well
    • from wordings
    • random sample
  • relabel the tweets assumed to be complaints
    • this involves deleting the tweets that aren't related to complaints
  • after building a good complaint dataset, find similar tweets (ones that aren't in the complaint dataset yet)
    • make sure every tweet ends up labelled as either complaint or not complaint

methods

all of these methods for finding the complaint tweets are thematically related.

  • scraping: twint
  • topic model: ktrain's get_document_topic
  • similar texts: cosine similarity on tfidf vectors (trained with ktrain); ktrain itself uses one-class classification (svm)

twitter search keywords

  • koinworks
  • koinwork

pipeline

  1. cleaning
  2. eda (something is still missing here)
  3. cluster_topic (choosing the complaint topic) [optional]; check the eda results in check_topics_clustering.ipynb
  4. train classifier (bi-lstm); check the misclassified results and adjust accordingly
  5. serve the model

news site?

  • kompas?
  • google? dunno, will do if feeling cute lol

preprocessing with texthero

  • remove brackets
  • remove diacritic
  • remove punctuation
  • remove numbers?
  • remove indo stopwords
  • drop duplicate tweets:
    • spammy promos -> usually bots
    • referral codes -> a single user often shares the same one several times
  • drop koinworks' own tweets
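
The cleaning steps above can be sketched with the standard library alone. This is not texthero's API, only a stand-in that applies the same steps. The stopword list, the `OWN_ACCOUNT` name, and the `username`/`tweet` keys are assumptions made for illustration.

```python
import re
import unicodedata

# Tiny illustrative stopword list (assumption); a real Indonesian list is much longer.
STOPWORDS = {"yang", "dan", "di", "ke", "ini", "itu"}
OWN_ACCOUNT = "koinworks"  # assumed username of the company's own account


def clean_text(text):
    """Lowercase, then strip bracketed text, diacritics, punctuation, digits, stopwords."""
    text = text.lower()
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)|\{[^}]*\}", " ", text)  # bracketed spans
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # diacritics
    text = re.sub(r"[^\w\s]|_", " ", text)  # punctuation
    text = re.sub(r"\d+", " ", text)  # numbers
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def preprocess(tweets):
    """tweets: list of dicts with 'username' and 'tweet' keys (assumed schema)."""
    seen, cleaned = set(), []
    for row in tweets:
        if row["username"].lower() == OWN_ACCOUNT:
            continue  # drop the company's own tweets
        text = clean_text(row["tweet"])
        if not text or text in seen:
            continue  # drop empty rows and duplicates (repeated promos, referral codes)
        seen.add(text)
        cleaned.append({**row, "clean": text})
    return cleaned
```

Deduplicating on the cleaned text rather than the raw text also catches reposts that differ only in punctuation or numbers.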

EDA

  • top 100 most common unigram
  • top 100 most common bigram
  • top 100 most common trigram
  • wordcloud
  • maybe topic modelling with LDA
  • distribution of the words that mark a complaint
  • visualize
    • tfidf
    • kmeans
    • flair (pca-ed lol)
    • lda
    • dbscan
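
The unigram, bigram, and trigram counts listed above can be sketched with `collections.Counter`. The function name `top_ngrams` and the whitespace tokenisation are assumptions for illustration; the notebooks may count n-grams differently.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=100):
    """Return the k most common n-grams across whitespace-tokenised texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zipping n shifted copies of the token list yields consecutive n-grams
        counts.update(" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)
```

Calling it with `n=1`, `n=2`, and `n=3` on the cleaned tweets gives the three top-100 lists.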

labelling

  • search for a tweet that is definitely a complaint, then use cosine similarity to find similar ones and label them as complaints too; check koinworks_labeled_lda.csv

most likely complaint keywords: ['telat', ]

  • do as above, but instead of complaints, search for the "good things"
  • do as above, but search for the non-essential tweets (promo, etc)
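
The seed-and-expand labelling above can be sketched as follows: build tf-idf vectors, pick one tweet that is definitely a complaint, and collect every other tweet whose cosine similarity to it reaches a threshold. This is a pure-Python stand-in, not ktrain's implementation. The smoothed idf formula and the `threshold=0.3` default are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    df = Counter(tok for toks in tokenised for tok in set(toks))
    n = len(docs)
    # smoothed idf, so a term that appears in every doc still gets a positive weight
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1 for tok, cnt in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()} for toks in tokenised]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def expand_labels(docs, seed_index, threshold=0.3):
    """Return indices of docs whose similarity to the seed tweet reaches the threshold."""
    vecs = tfidf_vectors(docs)
    seed = vecs[seed_index]
    return [i for i, v in enumerate(vecs) if i != seed_index and cosine(seed, v) >= threshold]
```

The same function covers the "good things" and promo passes: only the seed tweet changes. The returned candidates still need a manual check before they are trusted as labels.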

frontend

  • classifier:
    • decides whether a tweet is a complaint or not, with an explanation: this is available in ktrain's module
  • dashboard:
    • number of complaints per day
    • top complaint keywords
    • labels for the graphs, in addition to colors
  • search engine
    • shows which cases are similar to the one being searched for
      • lists the username, the tweet, and the date it was tweeted
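
The daily count for the dashboard could be computed along these lines. This is only a sketch: the `(timestamp, is_complaint)` row shape and the `%Y-%m-%d %H:%M:%S` timestamp format are assumptions, not the repo's actual schema.

```python
from collections import Counter
from datetime import datetime


def daily_complaints(rows):
    """rows: iterable of (timestamp_string, is_complaint) pairs (assumed schema)."""
    counts = Counter()
    for stamp, is_complaint in rows:
        if is_complaint:
            day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
            counts[day.isoformat()] += 1
    # sort by ISO date string, which is also chronological order
    return dict(sorted(counts.items()))
```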

embeddings

these are pooled document embeddings built on:

flair

  • pretrain with lm-forward + tweets
  • make a tweet encoder; the flair model can be downloaded here

fasttext-id

  • WordEmbeddings('id-crawl')

search engine

annoy

  • just make the tree exhaustive

classifier

  • train a tf model / fastai model
  • onnx / fastinference

milvus

  • make embedding:
    • tfidf: done, saved as tfidf.pkl
    • fasttext
    • flairembeddings
      • ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2. no idea why, since nothing should be creating a third dimension; switched to scipy
  • ended up not using milvus, because it turns out to be a framework that comes bundled with its own rest api

the ids follow 0_koinworks_raw.csv; they were generated as uuid4 so the indexer is easier to build
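
A minimal sketch of how uuid4 ids can back an indexer. Annoy item ids must be non-negative integers, so each uuid is paired with its row position. The function name `build_id_maps` and the returned structures are hypothetical, not taken from the repo.

```python
import uuid


def build_id_maps(tweets):
    """Give each tweet a uuid4 and keep lookups in both directions."""
    ids = [str(uuid.uuid4()) for _ in tweets]  # ids[pos] is the uuid of row pos
    by_id = dict(zip(ids, tweets))  # uuid -> tweet text
    position_of = {tweet_id: pos for pos, tweet_id in enumerate(ids)}  # uuid -> integer index
    return ids, by_id, position_of
```

A nearest-neighbour index returns integer positions; `ids[pos]` turns each one back into the stable uuid, which stays valid even if rows are reordered later.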

potential complaint topics

lda

1, 29, 7

  • these topics cover a range of complaints
    • cs not replying
    • website error
    • app error
    • funds can't be withdrawn
    • the tenor suddenly changed

kmeans

k was chosen at the point where the silhouette score converged (see check_topics_clustering.ipynb). the kmeans clusters came out looking random, so kmeans was dropped as a clustering method

  • tested both in tfidf, and flair embeddings
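
For reference, here is what a silhouette score measures, written in plain Python. The notebook's actual procedure is not shown in this README, so this is only a sketch of the metric, using Euclidean distance as an assumption. To pick k, one would cluster with several values of k and compare the scores.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient for points (tuples of floats) with cluster labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # average distance from point i to the members of a cluster, excluding i itself
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: distance to own cluster
        b = min(mean_dist(i, members) for other, members in clusters.items() if other != label)
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)
```

Scores near 1 mean tight, well separated clusters. Scores near 0 or below mean the clusters overlap, which is one way a clustering can look random.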

dbscan

9, sucks.

top2vec

see /experiments/

blog post ideas

extras

  • the app also disappeared for a while lol; check ids 517, 529; check the dates, check the source

  • from the twitter search, activity peaked at 263 tweets on 04-02-2020 and 09-01-2020

  • with kmeans, the tweets split neatly into 3 labels right away

  • sure, funded, but the tenor never gets paid :)))

  • funded again, woah
