boscoj2008 / SCL-for-Text-Clustering

Self-supervised contrastive learning for long and short text clustering

This is the code for the IEEE SMC 2021 paper "A Simple and Effective Usage of Self-supervised Contrastive Learning for Text Clustering" [paper]

Overview

  • main.py : The main script for this project, including the default BERT model, the loss function, and the clustering accuracy metric (a sketch of both follows this list)
  • load_data.py : Loads the supervised (sup) and unsupervised (unsup) data
  • models.py : Model classes for a general transformer (from Pytorchic BERT's code) and an LSTM model
  • train.py : A custom training class (Trainer class)
  • utils
    • configuration.py : Sets the configuration from a JSON file
    • checkpoint.py : Functions to load a model from a TensorFlow checkpoint file (from Pytorchic BERT's code)
    • optim.py : Optimizer (BERTAdam class) (from Pytorchic BERT's code)
    • tokenization.py : Tokenizers adopted from the original Google BERT's code
    • utils.py : Custom utility functions adopted from Pytorchic BERT's code
  • Dataprocessing_util
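
Below is a minimal sketch of the two pieces named above: an NT-Xent-style contrastive loss (the standard self-supervised contrastive objective) and the unsupervised clustering accuracy metric. The names `nt_xent_loss` and `clustering_accuracy` are illustrative rather than the repo's actual identifiers, and the exact loss used in main.py may differ in detail.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of embedding pairs z1, z2 of shape (N, d),
    where z1[i] and z2[i] are two views of the same text."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                     float('-inf'))                      # a sample cannot match itself
    # The positive for row i is its other view: i + N (first half) or i - N (second half)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: the best one-to-one mapping between predicted
    cluster IDs and ground-truth labels, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    d = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                  # co-occurrence counts
    rows, cols = linear_sum_assignment(cost.max() - cost)  # maximize matched counts
    return cost[rows, cols].sum() / y_pred.size
```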

Prerequisites

- Download the pre-trained BERT model

First, download the pre-trained BERT_base model from Google's BERT repository. Once it is unpacked, the pre-trained BERT_Base_Uncased model lives in the /BERT_Base_Uncased directory, alongside /data. A sketch of the download step follows.
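
One way to fetch the checkpoint, assuming the standard BERT-Base Uncased download URL from Google's BERT repository; the target directory name follows this README.

```python
import urllib.request
import zipfile

# Standard Google BERT-Base Uncased checkpoint (URL assumed from the BERT repository)
URL = "https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"

urllib.request.urlretrieve(URL, "uncased_L-12_H-768_A-12.zip")
with zipfile.ZipFile("uncased_L-12_H-768_A-12.zip") as zf:
    # Extracts bert_config.json, vocab.txt, and the TensorFlow checkpoint files
    zf.extractall("BERT_Base_Uncased")
```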

Prepare the datasets: Reuters, 20 Newsgroups, StackOverflow, and SearchSnippets.

Example usage

Please install the required packages.

Then run the main.py script. An example invocation is shown below.

You can choose different models in the main.py script.

All evaluation results are printed to the screen, and you can also save them.
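
For example (the requirements.txt file name is an assumption; use whatever dependency list the repo provides):

```bash
pip install -r requirements.txt  # dependencies (file name assumed)
python main.py                   # train and cluster with the default BERT model
```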

Acknowledgement

Thanks to the UDA and Pytorchic BERT codebases, which this implementation draws on.
