akirawisnu / text-clustering

Learn about Indonesian text classification and topic modeling

Learning Clustering (Bahasa Indonesia)

Code

Source: http://brandonrose.org/clustering, modified by kirra

Data sources

  1. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  2. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.

Steps

  1. tokenizing and stemming each article (Bahasa Indonesia)
  2. transforming the corpus into vector space using tf-idf
  3. calculating the cosine distance between each pair of documents as a measure of similarity
  4. clustering the documents using the k-means algorithm
  5. using multidimensional scaling to reduce dimensionality within the corpus
  6. plotting the clustering output using matplotlib and mpld3
  7. conducting a hierarchical clustering on the corpus using Ward clustering
  8. plotting a Ward dendrogram
  9. topic modeling using Latent Dirichlet Allocation (LDA)
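
Steps 2 to 4 above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's actual code: the four toy sentences are made up (they are not taken from the Kompas or Tempo corpora), the Indonesian tokenizing and stemming of step 1 is skipped, and the choice of two clusters is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, invented for illustration only (two about the stock
# market, two about football). Real input would be the news articles.
docs = [
    "harga saham naik di bursa efek jakarta",
    "bursa saham jakarta ditutup naik",
    "tim sepak bola indonesia menang di final",
    "pemain sepak bola mencetak gol di final",
]

# Step 2: transform the corpus into a tf-idf weighted vector space.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Step 3: cosine distance is 1 minus cosine similarity; the result is a
# square matrix with one row and one column per document.
dist = 1 - cosine_similarity(tfidf)

# Step 4: cluster the tf-idf vectors with k-means.
# n_clusters=2 is an assumption that suits this toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)

for doc, label in zip(docs, labels):
    print(label, doc)
```

The distance matrix `dist` is the kind of input that the later steps (multidimensional scaling and Ward clustering) typically work from, while k-means runs directly on the tf-idf vectors.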

How to use

  1. download the news corpora (Kompas and Tempo) and extract them to the folder "data"
  2. create a Python virtual environment >>> $ virtualenv env
  3. activate the virtual environment >>> $ source env/bin/activate
  4. install all dependencies >>> $ pip install -r requirements.txt
  5. run Jupyter >>> $ jupyter notebook
  6. open the file "Clustering.ipynb"

Example visualization

[three example images]

Sources for visualization

  1. http://adilmoujahid.com/posts/2015/01/interactive-data-visualization-d3-dc-python-mongodb/
  2. http://bl.ocks.org/lmatteis/efd9be8f472e673eef6ce9d1951256a9
  3. https://bl.ocks.org/bricedev/8b2da06ddef27d94cde9
  4. https://bl.ocks.org/jyucsiro/767539a876836e920e38bc80d2031ba7
  5. https://bl.ocks.org/emeeks/df6ea0128724289337ef
