jeffzhengye / lightlda

Distributed LDA, takes raw text as input and outputs topic word table.

Light LDA

A modified version of MSR's Light LDA, with added preprocessing scripts.

Usage (assuming you are in lightlda/):

make
cd datasets
tar zxf 20news-train.tgz
python scripts/pipeline.py etc/params.config

Note: Parameters are defined in etc/params.config. Results are written to output/model/${timestamp}/snapshot.word_topic_table.${iteration}${client_id}. To visualize them, run python scripts/parse_word_topic_table.py. The <word-id-file> it takes is output/datablocks/${timestamp}/word_tf.txt.
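
Because each run writes into its own ${timestamp} directory, it can help to locate the newest run's snapshot files before visualizing them. The following is a minimal sketch, not part of the repository: the helper names latest_run_dir and snapshot_files are hypothetical, and it assumes that timestamp directory names sort chronologically as plain strings.

```python
import glob
import os


def latest_run_dir(model_root="output/model"):
    """Return the newest ${timestamp} directory under model_root, or None."""
    runs = [d for d in glob.glob(os.path.join(model_root, "*")) if os.path.isdir(d)]
    # Assumption: timestamp directory names sort chronologically as strings.
    return max(runs, key=os.path.basename) if runs else None


def snapshot_files(run_dir):
    """List the word-topic table snapshots written for one run."""
    pattern = os.path.join(run_dir, "snapshot.word_topic_table.*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("no runs found under output/model")
    else:
        for path in snapshot_files(run):
            print(path)
```

Run from lightlda/, this prints one path per snapshot file of the most recent run. The same approach could be applied to output/datablocks/ to find the matching word_tf.txt.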

Note 2: The machine file defined in etc/params.config only works on cogito, and the whole pipeline assumes a shared filesystem.

Languages

C++ 93.1%, Python 3.8%, C 1.9%, Makefile 1.1%