This is a framework of algorithms for detecting communities in text. It was implemented in Python using the NetworkX and scikit-learn libraries.

Overview

The major features of this framework include:

Implementation of community detection algorithms
- Label Propagation Algorithm
- Greedy Modularity Algorithm
- Girvan-Newman Algorithm
- Edge Betweenness Algorithm (ULRIK, 2008)
Results processed separately and then merged
Sending e-mails to indicate the end of processing
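
As a rough illustration of the algorithms listed above, the sketch below runs the corresponding NetworkX functions on a small toy graph. The graph and the helper as_sorted_lists are assumptions made for this example only; how the framework itself wraps these calls is not shown here.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph (hypothetical): two 4-node cliques joined by one bridge edge (3, 4).
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
G.add_edges_from([(4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)])
G.add_edge(3, 4)


def as_sorted_lists(communities):
    """Convert an iterable of node sets into a sorted list of sorted lists."""
    return sorted(sorted(c) for c in communities)


# Greedy modularity maximisation.
greedy = as_sorted_lists(community.greedy_modularity_communities(G))

# Girvan-Newman yields successively finer partitions; take the first split.
girvan_newman = as_sorted_lists(next(community.girvan_newman(G)))

# Semi-synchronous label propagation.
label_propagation = as_sorted_lists(community.label_propagation_communities(G))

# Edge betweenness centrality; the bridge edge carries the most shortest paths.
edge_betweenness = nx.edge_betweenness_centrality(G)
bridge = max(edge_betweenness, key=edge_betweenness.get)

print("greedy modularity:", greedy)
print("girvan-newman:", girvan_newman)
print("label propagation:", label_propagation)
print("highest edge betweenness:", bridge)
```

On this toy graph the greedy modularity and Girvan-Newman calls both separate the two cliques; the label propagation result depends on how labels settle and is printed rather than assumed.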

Requirements

The codebase is implemented in Python 3.7.9 (64-bit). The package versions used for development are listed below.

networkx          2.5
numpy             1.19.0
pandas            1.1.2
scikit-learn      0.23.2

Datasets

For the experimental evaluations, we used 21 text collections from different domains. These datasets are available at: Sequence of words.

Options

The cluster model is handled by the tct.py script, which provides the following command-line arguments.

  --filename        STR        path to the CSV file to be processed       
  --dir             STR        path to a folder where to process CSV files in batch
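
For readers who want to see how two such arguments can be declared, here is a minimal argparse sketch. It only mirrors the argument names and descriptions from the table above; the function build_parser is a name chosen for this example, and the actual parsing code in tct.py may differ.

```python
import argparse


def build_parser():
    """Build a parser with the two documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="tct.py",
        description="Cluster text collections with community detection.",
    )
    parser.add_argument(
        "--filename", type=str,
        help="path to the CSV file to be processed",
    )
    parser.add_argument(
        "--dir", type=str,
        help="path to a folder where to process CSV files in batch",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--filename", "CSTR.csv"])
print(args.filename, args.dir)
```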

Settings file

The settings file is settings.json, which provides the following adjustment options.

temp_files                  BOOL        defines if the temporary files will be generated to resume execution afterwards
default_output_path         STR         defines the directory of the output files
default_temp_path           STR         defines the directory of the temp files
batch_files_max_size        INT         defines a maximum file size to be processed in a folder
send_mail                   BOOL        defines if the e-mail will be sent after the execution of an experiment
screen_results              BOOL        defines whether the results will be displayed in the output console
config                      STR         defines the default configuration file that defines the algorithms and their parameters to be executed
delete_temp_folder          BOOL        defines if the temporary files folder will be deleted after the experiments are finished
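
To make the option types concrete, the sketch below builds a settings dictionary with the keys listed above and serialises it to JSON. Every value is a placeholder assumed for this example (paths, the size limit and its unit, and the configuration file name are not taken from the project).

```python
import json

# All values are hypothetical placeholders; only the keys come from the table above.
settings = {
    "temp_files": True,                # BOOL
    "default_output_path": "output/",  # STR
    "default_temp_path": "temp/",      # STR
    "batch_files_max_size": 1000000,   # INT (unit is not stated in the documentation)
    "send_mail": False,                # BOOL
    "screen_results": True,            # BOOL
    "config": "config.json",           # STR
    "delete_temp_folder": True,        # BOOL
}

# Serialise to the text that would be stored in settings.json, then read it back.
text = json.dumps(settings, indent=4)
loaded = json.loads(text)
print(text)
```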

Each configuration file is a JSON file that lists the algorithms and parameters to be executed and provides the following adjustment options.

network_type               STR        defines the type of networks to be generated
proximity_measure          STR        defines the measurement of distances used in the k-NN network
number_of_neighbours       LIST       defines a list of k values to be executed on the k-NN network
algorithm                  STR        defines the name of the algorithm to be executed
weight                     BOOL       defines if the algorithm used will include the weights of each network relation
max_iterations             INT        defines the maximum number of iterations of each algorithm to be executed

For more specific algorithm parameters, refer to the standard parameters of NetworkX.
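
As a rough illustration of how options such as proximity_measure, number_of_neighbours and weight could drive the construction of a k-NN network and a community detection run, consider the sketch below. The configuration values ("knn", "cosine", "label_propagation"), the toy documents and the helper build_knn_graph are assumptions made for this example; the framework's own implementation may differ.

```python
import networkx as nx
from networkx.algorithms import community
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical configuration; only the key names come from the documentation.
config = {
    "network_type": "knn",
    "proximity_measure": "cosine",
    "number_of_neighbours": [1, 2],
    "algorithm": "label_propagation",
    "weight": True,
}

# Toy text collection (hypothetical).
docs = [
    "graph community detection network",
    "network community modularity graph",
    "label propagation on graph network",
    "football match goal score",
    "football team score goal",
    "match team football league",
]


def build_knn_graph(vectors, k, metric):
    """Connect every document to its k nearest neighbours under the given metric."""
    # Ask for k + 1 neighbours because each point is usually its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(vectors)
    distances, indices = nn.kneighbors(vectors)
    graph = nx.Graph()
    graph.add_nodes_from(range(vectors.shape[0]))
    for i, (row_d, row_i) in enumerate(zip(distances, indices)):
        for d, j in zip(row_d, row_i):
            if i != int(j):
                # Edge weight is a similarity: 1 - distance, clipped at zero.
                graph.add_edge(i, int(j), weight=max(0.0, 1.0 - float(d)))
    return graph


vectors = TfidfVectorizer().fit_transform(docs)
weight_key = "weight" if config["weight"] else None

results = {}
for k in config["number_of_neighbours"]:
    graph = build_knn_graph(vectors, k, config["proximity_measure"])
    # Asynchronous label propagation; the seed fixes the random node order.
    results[k] = list(community.asyn_lpa_communities(graph, weight=weight_key, seed=0))
    print(k, [sorted(c) for c in results[k]])
```

Building the graph edge by edge from the neighbour indices avoids relying on sparse-matrix conversion helpers whose names changed between NetworkX releases.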

Results

The results can be found at Clustering Algorithms Data. For more information on the methodology used to obtain these results, see the article:

Propagação de Rótulos em Redes para o Agrupamento de Textos. Sawada and Rossi, 2020 [Paper].

Examples

Here is an example of using the framework to process a text collection:

$ python tct.py CSTR.csv

To execute the files in batch, you can pass the directory of the files to be processed:

$ python tct.py term-frequency
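
A batch run has to decide which files in the folder to process. The sketch below shows one way this could be done, keeping only CSV files up to a size limit in the spirit of the batch_files_max_size setting. The helper select_batch_files and the assumption that the limit is expressed in bytes are illustrative and not taken from the project.

```python
from pathlib import Path


def select_batch_files(directory, max_size_bytes):
    """Return the CSV files in `directory` whose size does not exceed the limit.

    Assumption: the limit is in bytes; the documentation does not state the unit.
    """
    selected = []
    for path in sorted(Path(directory).glob("*.csv")):
        if path.is_file() and path.stat().st_size <= max_size_bytes:
            selected.append(path)
    return selected


if __name__ == "__main__":
    # "term-frequency" is the folder name used in the batch example above.
    for csv_path in select_batch_files("term-frequency", 1000000):
        print(csv_path)
```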
