This is a framework of algorithms for detecting communities in text. It was implemented in Python using the NetworkX and scikit-learn libraries.
The major features of this framework include:
- Implementation of community detection algorithms
  - Label Propagation Algorithm
  - Greedy Modularity Algorithm
  - Girvan-Newman Algorithm
  - Edge Betweenness Algorithm (ULRIK, 2008)
- Results processed separately and then merged
- Sending e-mails to indicate the end of processing
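To make the first feature more concrete, here is a minimal pure-Python sketch of the label propagation idea. It is not the framework's code, which relies on NetworkX. The `label_propagation` function, the adjacency-dictionary input, and the toy graph are hypothetical, and ties are broken by taking the largest label (a simplification, since the usual algorithm breaks ties at random) so that the result is reproducible.

```python
from collections import Counter


def label_propagation(adjacency, max_iterations=100):
    """Give each node the most frequent label among its neighbours until stable."""
    labels = {node: node for node in adjacency}  # every node starts in its own community
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[neighbour] for neighbour in adjacency[node])
            if not counts:
                continue  # an isolated node keeps its own label
            best = max(counts.values())
            # Simplification: break ties with the largest label instead of randomly.
            new_label = max(label for label, count in counts.items() if count == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return sorted(communities.values(), key=min)


# Toy graph (hypothetical): two 4-node cliques joined by the single edge 3-4.
graph = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
communities = label_propagation(graph)
print(communities)
```

On this toy graph the two cliques end up as separate communities, because each node has more neighbours inside its clique than across the bridge.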
The codebase is implemented in Python 3.7.9 (64-bit). The package versions used for development are listed below.
networkx 2.5
numpy 1.19.0
pandas 1.1.2
scikit-learn 0.23.2
For the experimental evaluations, we used 21 text collections from different domains. These datasets are available at: Sequence of words.
The cluster model is handled by the tct.py script, which provides the following command-line arguments.
--filename STR path to the CSV file to be processed
--dir STR path to a folder of CSV files to be processed in batch
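As a rough illustration of how these two arguments could be parsed, here is a minimal sketch using Python's standard `argparse` module. It is not taken from tct.py: the `build_parser` helper is hypothetical, and treating the two options as mutually exclusive is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the two documented options."""
    parser = argparse.ArgumentParser(prog="tct.py")
    # Assumption: exactly one of the two options is given per run.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--filename", type=str,
                       help="path to the CSV file to be processed")
    group.add_argument("--dir", type=str,
                       help="path to a folder of CSV files to be processed in batch")
    return parser


single = build_parser().parse_args(["--filename", "CSTR.csv"])
batch = build_parser().parse_args(["--dir", "term-frequency"])
print(single.filename, batch.dir)
```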
The framework settings are defined in settings.json, which provides the following options.
temp_files BOOL defines whether temporary files are generated so that execution can be resumed later
default_output_path STR defines the directory of the output files
default_temp_path STR defines the directory of the temp files
batch_files_max_size INT defines a maximum file size to be processed in a folder
send_mail BOOL defines if the e-mail will be sent after the execution of an experiment
screen_results BOOL defines whether the results will be displayed in the output console
config STR defines the default configuration file, which specifies the algorithms to be executed and their parameters
delete_temp_folder BOOL defines if the temporary files folder will be deleted after the experiments are finished
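The options above could be loaded and type-checked along the following lines. This is a sketch only: the key names and types come from the list above, while every default value, the `load_settings` helper, and the choice to fall back to defaults for missing keys are assumptions made for illustration.

```python
import json

# Hypothetical defaults: only the key names and value types come from the list above.
DEFAULT_SETTINGS = {
    "temp_files": True,
    "default_output_path": "output",
    "default_temp_path": "temp",
    "batch_files_max_size": 100,
    "send_mail": False,
    "screen_results": True,
    "config": "config.json",
    "delete_temp_folder": False,
}


def load_settings(text):
    """Parse settings JSON text, fill in missing keys, and check value types."""
    settings = {**DEFAULT_SETTINGS, **json.loads(text)}
    for key, default in DEFAULT_SETTINGS.items():
        # Exact type comparison, so that True/False is not accepted for an INT option.
        if type(settings[key]) is not type(default):
            raise TypeError(f"{key} must be of type {type(default).__name__}")
    return settings


settings = load_settings('{"send_mail": true, "default_output_path": "results"}')
print(settings["send_mail"], settings["default_output_path"])
```

In practice the text would come from reading settings.json rather than from a literal string.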
Each configuration file is a JSON file that lists the algorithms and parameters to be executed. It provides the following options.
network_type STR defines the type of networks to be generated
proximity_measure STR defines the distance measure used to build the k-NN network
number_of_neighbours LIST defines a list of k values to be executed on the k-NN network
algorithm STR defines the name of the algorithm to be executed
weight BOOL defines if the algorithm used will include the weights of each network relation
max_iterations INT defines the maximum number of iterations of each algorithm to be executed
For more specific algorithm parameters, see the standard parameters of the corresponding NetworkX functions.
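The sketch below shows how such a configuration might drive the construction of one k-NN network per value of `number_of_neighbours`. The key names come from the list above, but every value in `CONFIG_TEXT` (including `"knn"`, `"cosine"` and `"label_propagation"`) is an illustrative guess, and `cosine_distance` and `knn_edges` are hypothetical pure-Python helpers, not the framework's NetworkX and scikit-learn code.

```python
import json
import math

# Hypothetical configuration: key names follow the list above, values are guesses.
CONFIG_TEXT = """
{
  "network_type": "knn",
  "proximity_measure": "cosine",
  "number_of_neighbours": [1, 2],
  "algorithm": "label_propagation",
  "weight": true,
  "max_iterations": 100
}
"""


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def knn_edges(vectors, k, distance, use_weights=True):
    """Link each document to its k nearest documents; return undirected edges."""
    edges = {}
    for i, u in enumerate(vectors):
        ranked = sorted((distance(u, v), j) for j, v in enumerate(vectors) if j != i)
        for dist, j in ranked[:k]:
            # Edge weight is the similarity, or 1.0 when weights are disabled.
            edges[(min(i, j), max(i, j))] = 1.0 - dist if use_weights else 1.0
    return edges


config = json.loads(CONFIG_TEXT)
# Toy term-frequency vectors (hypothetical): documents 0-1 and 2-3 share vocabulary.
docs = [[2, 0, 1], [3, 0, 1], [0, 2, 2], [0, 3, 2]]
for k in config["number_of_neighbours"]:
    edges = knn_edges(docs, k, cosine_distance, use_weights=config["weight"])
    print(k, sorted(edges))
```

Each resulting edge set would then be handed to the community detection algorithm named in `algorithm`.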
The results can be found at Clustering Algorithms Data. For more information on the methodology used to obtain these results, see the article:
Propagação de Rótulos em Redes para o Agrupamento de Textos. Sawada and Rossi, 2020 [Paper].
Here is an example of using the framework to process a text collection:
$ python tct.py --filename CSTR.csv
To execute the files in batch, you can pass the directory of the files to be processed:
$ python tct.py --dir term-frequency
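For batch runs, file selection could look roughly like the sketch below, which keeps only the CSV files in a folder that do not exceed a size limit (compare the `batch_files_max_size` setting). The `select_batch_files` helper is hypothetical, and measuring the limit in bytes is an assumption, since the unit of that setting is not stated here.

```python
import os
import tempfile


def select_batch_files(folder, max_size_bytes):
    """Return the CSV files in `folder` no larger than `max_size_bytes`, sorted by name."""
    selected = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        is_csv = name.lower().endswith(".csv") and os.path.isfile(path)
        if is_csv and os.path.getsize(path) <= max_size_bytes:
            selected.append(path)
    return selected


# Demonstration on a throwaway folder with two CSV files and one text file.
with tempfile.TemporaryDirectory() as folder:
    for name, size in [("a.csv", 10), ("b.csv", 500), ("notes.txt", 10)]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(b"x" * size)
    picked = [os.path.basename(p) for p in select_batch_files(folder, max_size_bytes=100)]
print(picked)
```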