News Clustering via GibbsLDA++

NewsClustering is a final project developed by a group of three students (Ilaria Ceppa, Marco Grandi and Marco Ponza) for the Information Retrieval course.

The goal of the project was to develop, experiment with, and analyze the results of clustering software that uses GibbsLDA++ to generate clusters of Italian news articles.

The final report is available in this repository (Italian only).

Setting up

The project can be compiled by typing:

make clean
make all

and the help message can be displayed with:

./clusteringLDA --help

Cluster Generation

To run the application on a news dataset, type:

./clusteringLDA [-v] [-a alpha] [-b beta] [-n clusters] [-t terms] [-m size] [-i iter] [-s step] [-o file] [-c clust] [-d string] dataset_file

where:

  • -v shows the parameter values before running the application;
  • -a alpha sets the alpha parameter of GibbsLDA++;
  • -b beta sets the beta parameter of GibbsLDA++;
  • -n clusters sets the number of clusters to generate;
  • -t terms sets the number of terms shown in the output file;
  • -m size sets the minimum cluster size (smaller clusters are removed);
  • -i iter sets the number of GibbsLDA++ iterations;
  • -s step sets the number of iterations after which a temporary model is generated;
  • -o file sets the output file;
  • -c clust sets the name of the model generated by GibbsLDA++;
  • -d string sets the preprocessing algorithms NOT to use:
      • . disables the punctuation filter;
      • s disables stopword removal;
      • w disables shingling;
      • i disables the idf filter;
      • m disables cluster-size thresholding;
      • p disables the document filter.
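
As a rough illustration of how the options above combine, the following sketch builds one possible invocation. All numeric values and both file names (news_dataset.txt, clusters.txt) are assumptions chosen for the example; they are not defaults or recommendations from the project. The script only runs the binary if it has already been built, and otherwise prints the command it would have run.

```shell
#!/bin/sh
# Illustrative values only: the numbers and file names below are assumptions,
# not defaults or recommendations taken from the project.
dataset="news_dataset.txt"    # hypothetical input file
output="clusters.txt"         # hypothetical output file

# -v        print the parameter values before running
# -n 20     generate 20 clusters
# -t 10     show 10 terms in the output file
# -m 5      remove clusters smaller than 5
# -i 1000   run 1000 iterations
# -s 100    generate a temporary model every 100 iterations
# -o FILE   write the result to FILE
# -d s      disable stopword removal
set -- -v -n 20 -t 10 -m 5 -i 1000 -s 100 -o "$output" -d s "$dataset"

if [ -x ./clusteringLDA ]; then
    ./clusteringLDA "$@"
else
    echo "clusteringLDA not built yet; would run: ./clusteringLDA $*"
fi
```

Running ./clusteringLDA --help remains the authoritative way to check the accepted values for each option.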
