This is the code for the IEEE SMC 2021 paper: "A Simple and Effective Usage of Self-supervised Contrastive Learning for Text Clustering" [paper]
- main.py : The main script for this project, including the default BERT model, the loss function, and the clustering accuracy metric (see the contrastive-loss sketch after this list)
- load_data.py : Loads the sup (supervised) and unsup (unsupervised) data
- models.py : Model classes for a general transformer (from Pytorchic BERT's code) and an LSTM model
- train.py : A custom training class (Trainer class)
- utils
  - configuration.py : Sets a configuration from a json file
  - checkpoint.py : Functions to load a model from a tensorflow file (from Pytorchic BERT's code)
  - optim.py : Optimizer (BERTAdam class) (from Pytorchic BERT's code)
  - tokenization.py : Tokenizers adopted from the original Google BERT code
  - utils.py : Custom utility functions adopted from Pytorchic BERT's code
- Dataprocessing_util
  - backtranslation.py : Back-translation script
  - fewshotprocessing.py : Few-shot preprocessing script
  - Reuters_processing.py : Preprocessing script for the Reuters dataset
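Since the paper's core idea is self-supervised contrastive learning, here is a minimal PyTorch sketch of an NT-Xent-style contrastive loss over two augmented views of the same batch of texts (e.g. original vs. back-translated). The function name `nt_xent_loss` and the temperature default are illustrative assumptions, not necessarily the exact loss implemented in main.py:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over two views; positives are the same index across views."""
    batch = z1.size(0)
    # L2-normalize so the dot products below are cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, d)
    sim = z @ z.t() / temperature                        # (2B, 2B) similarity matrix
    # A sample must never match itself, so mask the diagonal out.
    sim.fill_diagonal_(float("-inf"))
    # Row i (< B) pairs with row i + B, and vice versa.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Feeding it the embeddings of each text and its back-translation pulls each positive pair together while pushing apart every other sample in the batch.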
First, download the pre-trained BERT_base model from Google's BERT repository. After extracting it, place the pre-trained BERT_Base_Uncased model in the /BERT_Base_Uncased directory and the datasets in /data.
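For reference, a minimal Python way to fetch and unpack the checkpoint (assuming the standard uncased_L-12_H-768_A-12 download link from the google-research/bert README; the target directory name follows the one above):

```python
import urllib.request
import zipfile

# Standard download link from Google's BERT repository (assumption: link unchanged).
URL = "https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"

archive, _ = urllib.request.urlretrieve(URL, "uncased_L-12_H-768_A-12.zip")
with zipfile.ZipFile(archive) as zf:
    zf.extractall("BERT_Base_Uncased")  # checkpoint lands in /BERT_Base_Uncased
```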
Prepare the datasets: Reuters, 20 Newsgroups, StackOverflow, and SearchSnippets.
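Of these, 20 Newsgroups can be pulled straight from scikit-learn; Reuters, StackOverflow, and SearchSnippets have to be obtained and preprocessed separately (the Dataprocessing_util scripts above cover back-translation, few-shot splits, and Reuters). A quick sketch for 20 Newsgroups:

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch all posts; strip metadata that would leak label information.
newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
texts, labels = newsgroups.data, newsgroups.target
print(f"{len(texts)} documents, {labels.max() + 1} classes")
```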
Please install the required packages.
Then run the main.py script; you can choose different models within it.
All evaluation results will be printed to the screen, and you can also save them.
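The clustering accuracy reported on screen is conventionally computed by optimally matching predicted cluster ids to gold labels with the Hungarian algorithm. Below is a self-contained sketch of that standard metric (a reimplementation for illustration, not necessarily the exact code in main.py):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: counts[i, j] = #samples in cluster i with gold label j.
    counts = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        counts[p, t] += 1
    # Hungarian algorithm maximizes matched counts (minimizes their negation).
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size

# Example: clusters [1,1,0,0] match labels [0,0,1,1] perfectly after relabeling.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```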
Thanks to the reference code of UDA and Pytorchic BERT, which made this implementation possible.