This is the code for the IEEE SMC 2021 paper: "A Simple and Effective Usage of Self-supervised Contrastive Learning for Text Clustering" [paper]
- main.py : The main script for this project, including the default BERT model, the loss function, and the clustering accuracy metric (see the contrastive-loss sketch after this list)
- load_data.py : Loads the sup (supervised) and unsup (unsupervised) data
- models.py : Model classes for a general transformer (from Pytorchic BERT's code) and an LSTM model
- train.py : A custom training class (Trainer class)
- utils
  - configuration.py : Sets a configuration from a json file
  - checkpoint.py : Functions to load a model from a tensorflow file (from Pytorchic BERT's code)
  - optim.py : Optimizer (BERTAdam class) (from Pytorchic BERT's code)
  - tokenization.py : Tokenizers adopted from the original Google BERT code
  - utils.py : Custom utility functions adopted from Pytorchic BERT's code
- Dataprocessing_util
  - backtranslation.py : Back-translation script
  - fewshotprocessing.py : Few-shot preprocessing script
  - Reuters_processing.py : Preprocessing script for the Reuters dataset
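Since the paper's core idea is self-supervised contrastive learning, here is a minimal PyTorch sketch of an NT-Xent-style contrastive loss over two augmented views of the same batch of texts (e.g. original vs. back-translated). The function name `nt_xent_loss` and the temperature default are illustrative assumptions, not necessarily the exact loss implemented in main.py:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over two views; positives are the same index across views."""
    batch = z1.size(0)
    # L2-normalize so the dot products below are cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, d)
    sim = z @ z.t() / temperature                        # (2B, 2B) similarity matrix
    # A sample must never match itself, so mask the diagonal out.
    sim.fill_diagonal_(float("-inf"))
    # Row i (< B) pairs with row i + B, and vice versa.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Feeding it the embeddings of each text and its back-translation pulls each positive pair together while pushing apart every other sample in the batch.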
First, download the pre-trained BERT_base model from Google's BERT repository. After extracting it, place the pre-trained BERT_Base_Uncased model in the /BERT_Base_Uncased directory and the datasets in /data.
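For reference, a minimal Python way to fetch and unpack the checkpoint (assuming the standard uncased_L-12_H-768_A-12 download link from the google-research/bert README; the target directory name follows the one above):

```python
import urllib.request
import zipfile

# Standard download link from Google's BERT repository (assumption: link unchanged).
URL = "https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"

archive, _ = urllib.request.urlretrieve(URL, "uncased_L-12_H-768_A-12.zip")
with zipfile.ZipFile(archive) as zf:
    zf.extractall("BERT_Base_Uncased")  # checkpoint lands in /BERT_Base_Uncased
```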
Prepare the datasets: Reuters, 20 Newsgroups, StackOverflow, and SearchSnippets.
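Of these, 20 Newsgroups can be pulled straight from scikit-learn; Reuters, StackOverflow, and SearchSnippets have to be obtained and preprocessed separately (the Dataprocessing_util scripts above cover back-translation, few-shot splits, and Reuters). A quick sketch for 20 Newsgroups:

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch all posts; strip metadata that would leak label information.
newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
texts, labels = newsgroups.data, newsgroups.target
print(f"{len(texts)} documents, {labels.max() + 1} classes")
```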
Please install the required packages.
Then run the main.py script; you can choose different models within it.
All evaluation results will be printed to the screen, and you can also save them.
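The clustering accuracy reported on screen is conventionally computed by optimally matching predicted cluster ids to gold labels with the Hungarian algorithm. Below is a self-contained sketch of that standard metric (a reimplementation for illustration, not necessarily the exact code in main.py):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: counts[i, j] = #samples in cluster i with gold label j.
    counts = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        counts[p, t] += 1
    # Hungarian algorithm maximizes matched counts (minimizes their negation).
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size

# Example: clusters [1,1,0,0] match labels [0,0,1,1] perfectly after relabeling.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```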
Thanks to the reference code of UDA and Pytorchic BERT, which made this implementation possible.