News Clustering via GibbsLDA++

NewsClustering is a final project developed by a group of three students (Ilaria Ceppa, Marco Grandi and Marco Ponza) for the Information Retrieval course.

The goal of the project was to develop, test, and analyze the results of clustering software that uses GibbsLDA++ to generate clusters of Italian news articles.

The final report is available in this repository (Italian only).

Setting up

The project can be compiled by typing:

make clean
make all

and the help message can be displayed with:

./clusteringLDA --help

Cluster Generation

To run the application on a news dataset, type:

./clusteringLDA [-v] [-a alpha] [-b beta] [-n clusters] [-t terms] [-m size] [-i iter] [-s step] [-o file] [-c clust] [-d string] dataset_file

where:

-v shows the parameter values before running the application;
-a alpha sets the alpha parameter of GibbsLDA++;
-b beta sets the beta parameter of GibbsLDA++;
-n clusters sets the number of clusters to generate;
-t terms sets the number of terms per cluster written to the output file;
-m size sets the minimum cluster size (smaller clusters are removed);
-i iter sets the number of GibbsLDA++ iterations;
-s step sets the number of iterations after which a temporary model is saved;
-o file sets the output file;
-c clust name of the model generated by GibbsLDA++;
-d string selects the preprocessing algorithms NOT to use:
. disables the punctuation filter;
s disables the stopword filter;
w disables shingling;
i disables the idf filter;
m disables cluster-size thresholding;
p disables the document filter.
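
As an illustration, the options above could be combined as in the following sketch. Every value and file name in it is made up for the example and is not a recommended or default setting. It also assumes that the letters accepted by -d can be concatenated into one string (here s and w), which the option list suggests but does not state explicitly.

```shell
#!/bin/sh
# Hypothetical values: none of these come from the project itself.
DATASET="news_dataset.txt"   # assumed dataset file name
OUTPUT="clusters.txt"        # assumed output file name
DISABLED="sw"                # assumed: skip the stopword filter and shingling

# Assemble the command using only the options documented above,
# with the dataset file as the last argument.
CMD="./clusteringLDA -v -a 0.5 -b 0.1 -n 20 -t 10 -m 5 -i 1000 -s 100 -o $OUTPUT -d $DISABLED $DATASET"

# Print the command for inspection instead of running it.
echo "$CMD"
```

Once the values suit the dataset, the printed command can be pasted into the terminal from the directory that contains the compiled clusteringLDA binary.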
