JustToBeYourself / SimCSE-Pytorch

An implementation of SimCSE + ESimCSE on Chinese datasets

SimCSE Implementation

Unsupervised + supervised SimCSE for Chinese, PyTorch version

SimCSE: https://arxiv.org/pdf/2104.08821.pdf
ESimCSE: https://arxiv.org/pdf/2109.04380.pdf

1. Dataset: STS-B (uploaded)

directory: data/SNS-B/

2. Environment

torch==1.8.2
transformers==4.12.3

GPU: 3060 Ti, 8 GB
Because GPU memory is limited, batch_size is set very small.
If your GPU memory allows, try increasing batch_size to get better results.
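
Batch size matters because the unsupervised SimCSE objective treats the other sentences in a batch as negatives, so a small batch gives each sentence only a few negatives. The sketch below illustrates that objective in plain Python. It is not the PyTorch code in train.py; the function names are hypothetical, and the temperature default of 0.05 is the value reported in the SimCSE paper, assumed here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_unsup_loss(view_a, view_b, temperature=0.05):
    """Illustrative in-batch contrastive loss.

    view_a[i] and view_b[i] are assumed to be two embeddings of sentence i
    obtained from two forward passes with different dropout masks.
    Every view_b[j] with j != i serves as a negative for sentence i.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_denominator - logits[i]
    return total / n
```

With a batch of N sentences, each sentence is contrasted against N - 1 negatives, which is one plausible reason a larger batch_size helps when memory permits.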

3. How to run

SimCSE: python train.py
ESimCSE: python ESimCSE_train.py
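
As background on how the two scripts differ: according to the ESimCSE paper, ESimCSE builds positive pairs by randomly repeating some words in a sentence, so that the two views differ in length, and it adds momentum-encoded negatives. A rough sketch of the word-repetition step follows. It is not taken from ESimCSE_train.py; the function name, the `rng` parameter, and the `dup_rate` default are illustrative assumptions.

```python
import random


def word_repetition(tokens, dup_rate=0.32, rng=None):
    """Return a copy of `tokens` in which a random subset is repeated once.

    Hypothetical helper: at most max(2, int(dup_rate * len(tokens)))
    positions are chosen, and each chosen token is emitted twice in a row.
    """
    rng = rng or random.Random()
    n = len(tokens)
    if n == 0:
        return []
    max_dup = max(2, int(dup_rate * n))
    # Number of tokens to repeat, capped by the sentence length.
    dup_len = min(rng.randint(0, max_dup), n)
    dup_positions = set(rng.sample(range(n), dup_len))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_positions:
            out.append(tok)  # repeat the selected token once
    return out
```

The repeated sentence keeps the original word order and meaning while changing its length, which is the property the paper relies on.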

4. Results (unsupervised)
The Spearman correlation coefficient for each model is shown below:

Model        Unsupervised
Bert_base    0.538
SimCSE       0.692
ESimCSE      0.707
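
For readers unfamiliar with the metric, the Spearman coefficient is the Pearson correlation between the ranks of the predicted similarities and the ranks of the gold labels. The following dependency-free sketch shows the computation; it is not this repository's evaluation code, and the function names are hypothetical. Library routines such as `scipy.stats.spearmanr` compute the same quantity.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation between predicted scores and gold labels."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

In an STS-style evaluation, `predicted` would typically hold the cosine similarity of each sentence pair's embeddings and `gold` the human-annotated similarity scores.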

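Note: the unsupervised SimCSE in the original paper is trained on English, using 1 million sentences sampled from Wikipedia. The evaluation in this project is run on the Chinese STS-B dataset (uploaded), and the results are compared against those reported on Su Jianlin's Scientific Spaces (科学空间) blog. The SimCSE result is consistent with his.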
The above is provided for reference. Writing code takes effort, so if you find this useful, please give it a star.

License:MIT License


Languages

Language:Python 100.0%