akirawisnu / text-clustering

Learn about Indonesian text classification and topic modeling

Learning Clustering (Bahasa Indonesia)

Code

Source: http://brandonrose.org/clustering, modified by kirra

Data sources

Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.

Steps

tokenizing and stemming each article (Bahasa Indonesia)
transforming the corpus into vector space using tf-idf
calculating cosine distance between each pair of documents as a measure of similarity
clustering the documents using the k-means algorithm
using multidimensional scaling to reduce dimensionality within the corpus
plotting the clustering output using matplotlib and mpld3
conducting a hierarchical clustering on the corpus using Ward clustering
plotting a Ward dendrogram
topic modeling using Latent Dirichlet Allocation (LDA)
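
The first steps above (tokenizing, tf-idf, cosine distance, k-means) can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library so it runs without the notebook's dependencies; it is not the notebook's code. The function names, the four sample sentences, the smoothed idf formula, and the deterministic choice of the first k documents as starting centroids are all assumptions made for the example, and stemming is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only; no stemming is applied here.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed weighting: raw term count times a smoothed idf.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Both vectors are unit length, so their dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def squared_euclidean(a, b):
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, max_iter=20):
    """Plain k-means; the first k vectors serve as initial centroids (an assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            total = Counter()
            for v in members:
                for t, w in v.items():
                    total[t] += w
            centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels


# Made-up sample sentences: two about the stock market, two about football.
docs = [
    "harga saham naik di bursa",
    "tim sepak bola menang di liga",
    "bursa saham dan harga minyak naik",
    "pemain sepak bola cetak gol di liga",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

With these sample sentences the two stock-market sentences end up in one cluster and the two football sentences in the other. On a real corpus the result depends heavily on the starting centroids, so a library implementation with randomized restarts is usually the better choice.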

How to use

download the news collections (Kompas and Tempo) and extract them to the folder "data"
create a Python virtual environment >>> virtualenv env
activate the virtual environment >>> source env/bin/activate
install all dependencies >>> pip install -r requirements.txt
run Jupyter >>> jupyter notebook
open the file "Clustering.ipynb"

Example visualization

Source for visualization
