chinese_sentence_embeddings

Chinese sentence embeddings: bert_avg, bert_whitening, sbert, consert, simcse, esimcse

This repo mainly reruns the code from zhoujx4/NLP-Series-sentence-embeddings: sentence encoding, sentence embeddings, semantic similarity: BERT_avg, BERT_whitening, SBERT, SimCSE (github.com). The pretrained model used is hfl/chinese-roberta-wwm-ext. For sup_simcse (SNLI), run data_utils.py first to generate the processed SNLI data; the raw data can be downloaded from https://github.com/pluto-junzeng/CNSD.
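
For orientation, here is a minimal sketch of what the bert_avg baseline computes: encode a batch with the pretrained checkpoint and pool the token states into one vector per sentence. It assumes the Hugging Face transformers library and the hub id hfl/chinese-roberta-wwm-ext; the `encode` helper is illustrative, not the repo's code, and while the summary table below lists pooling=cls for this run, the sketch shows the mean-pooling variant the name suggests.

```python
# Illustrative bert_avg baseline: mean-pool RoBERTa token states into
# sentence vectors. Not the repo's exact code.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext").eval()

def encode(sentences, max_len=64):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)        # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)         # mean over real tokens

emb = encode(["今天天气不错", "今天天气很好"])
print(torch.cosine_similarity(emb[0:1], emb[1:2]))
```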

| model | dev | test |
| --- | --- | --- |
| bert_avg | Pearson: 0.254920 | Spearman: 0.205940 |
| bert_whitening | Pearson: 0.758313 | Spearman: 0.688894 |
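
bert_whitening requires no training: it post-processes the pooled vectors with a whitening transform (subtract the mean, then multiply by a kernel built from the SVD of the covariance matrix), which is what lifts it so far above bert_avg. Below is a sketch of the standard BERT-whitening transform, assuming numpy and an (N, H) embedding matrix; the repo's implementation may differ in details.

```python
# Standard BERT-whitening (a sketch): map embeddings so their covariance
# becomes the identity, then L2-normalize.
import numpy as np

def compute_kernel_bias(vecs):
    # vecs: (N, H) sentence embeddings from the base encoder
    mu = vecs.mean(axis=0, keepdims=True)        # (1, H)
    cov = np.cov(vecs.T)                         # (H, H) covariance
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))       # whitening kernel W
    return kernel, -mu

def transform(vecs, kernel, bias):
    vecs = (vecs + bias) @ kernel                # (x - mu) @ W
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```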

sbert (dev):
2022-04-06 17:43:18 - Cosine-Similarity :	Pearson: 0.8251	Spearman: 0.8257
2022-04-06 17:43:18 - Manhattan-Distance:	Pearson: 0.7989	Spearman: 0.8093
2022-04-06 17:43:18 - Euclidean-Distance:	Pearson: 0.7980	Spearman: 0.8084
2022-04-06 17:43:18 - Dot-Product-Similarity:	Pearson: 0.7874	Spearman: 0.7956

sbert (test):
2022-04-06 17:43:25 - Cosine-Similarity :	Pearson: 0.7919	Spearman: 0.7837
2022-04-06 17:43:25 - Manhattan-Distance:	Pearson: 0.7706	Spearman: 0.7694
2022-04-06 17:43:25 - Euclidean-Distance:	Pearson: 0.7705	Spearman: 0.7690
2022-04-06 17:43:25 - Dot-Product-Similarity:	Pearson: 0.7550	Spearman: 0.7536
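
The four Pearson/Spearman rows in each block are the standard output of sentence-transformers' EmbeddingSimilarityEvaluator. Here is a sketch of running such an evaluation; the model path, data path, and tab-separated file layout are assumptions:

```python
# Evaluate a trained model on STS-B-style data (sentence1 \t sentence2 \t score).
import logging
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)

model = SentenceTransformer("output/sbert")          # hypothetical model path

s1, s2, gold = [], [], []
with open("STS-B/test.txt", encoding="utf-8") as f:  # hypothetical data path
    for line in f:
        a, b, score = line.rstrip("\n").split("\t")
        s1.append(a); s2.append(b); gold.append(float(score) / 5.0)  # scale to [0, 1]

EmbeddingSimilarityEvaluator(s1, s2, gold)(model)    # logs the four similarity rows
```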

unsup_consert (dev):
2022-04-06 17:58:23 - Cosine-Similarity :	Pearson: 0.7788	Spearman: 0.7808
2022-04-06 17:58:23 - Manhattan-Distance:	Pearson: 0.7497	Spearman: 0.7684
2022-04-06 17:58:23 - Euclidean-Distance:	Pearson: 0.7503	Spearman: 0.7691
2022-04-06 17:58:23 - Dot-Product-Similarity:	Pearson: 0.7572	Spearman: 0.7629

unsup_consert (test):
2022-04-06 17:58:27 - Cosine-Similarity :	Pearson: 0.7325	Spearman: 0.7241
2022-04-06 17:58:27 - Manhattan-Distance:	Pearson: 0.7126	Spearman: 0.7118
2022-04-06 17:58:27 - Euclidean-Distance:	Pearson: 0.7128	Spearman: 0.7116
2022-04-06 17:58:27 - Dot-Product-Similarity:	Pearson: 0.7182	Spearman: 0.7086
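
ConSERT builds positive pairs by augmenting at the token-embedding layer (token shuffling, token/feature cutoff, dropout) before a contrastive loss. A sketch of one of those augmentations, token shuffling, assuming a (B, L, H) embedding tensor; not the repo's exact code:

```python
# Token shuffling (one ConSERT augmentation): permute the real tokens of
# each sequence, leaving padding positions where they are.
import torch

def token_shuffle(embeds, attention_mask):
    # embeds: (B, L, H) token embeddings; attention_mask: (B, L) 0/1 mask
    shuffled = embeds.clone()
    for i in range(embeds.size(0)):
        n = int(attention_mask[i].sum())     # number of real tokens
        perm = torch.randperm(n)
        shuffled[i, :n] = embeds[i, perm]    # permute only the real tokens
    return shuffled
```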

unsup_simcse (dev):
2022-04-06 18:28:59 - Cosine-Similarity :	Pearson: 0.7835	Spearman: 0.7867
2022-04-06 18:28:59 - Manhattan-Distance:	Pearson: 0.7715	Spearman: 0.7875
2022-04-06 18:28:59 - Euclidean-Distance:	Pearson: 0.7717	Spearman: 0.7875
2022-04-06 18:28:59 - Dot-Product-Similarity:	Pearson: 0.7783	Spearman: 0.7822

unsup_simcse (test):
2022-04-06 18:29:03 - Cosine-Similarity :	Pearson: 0.7539	Spearman: 0.7454
2022-04-06 18:29:03 - Manhattan-Distance:	Pearson: 0.7414	Spearman: 0.7424
2022-04-06 18:29:03 - Euclidean-Distance:	Pearson: 0.7420	Spearman: 0.7432
2022-04-06 18:29:03 - Dot-Product-Similarity:	Pearson: 0.7550	Spearman: 0.7450
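
Unsupervised SimCSE encodes each batch twice, so the two dropout masks produce two views of every sentence, and then applies an InfoNCE loss with in-batch negatives. A loss sketch with the paper's temperature of 0.05; not the repo's exact code:

```python
# Unsupervised SimCSE loss: the i-th rows of z1 and z2 are the same
# sentence under two dropout masks; everything off-diagonal is a negative.
import torch
import torch.nn.functional as F

def simcse_unsup_loss(z1, z2, temp=0.05):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temp                        # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)           # diagonal entries are positives
```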

sup_simcse (dev):
2022-04-07 13:58:43 - Cosine-Similarity :	Pearson: 0.6292	Spearman: 0.6529
2022-04-07 13:58:43 - Manhattan-Distance:	Pearson: 0.6544	Spearman: 0.6555
2022-04-07 13:58:43 - Euclidean-Distance:	Pearson: 0.6514	Spearman: 0.6531
2022-04-07 13:58:43 - Dot-Product-Similarity:	Pearson: 0.6299	Spearman: 0.6548

sup_simcse (test):
2022-04-07 13:58:47 - Cosine-Similarity :	Pearson: 0.6400	Spearman: 0.6550
2022-04-07 13:58:47 - Manhattan-Distance:	Pearson: 0.6684	Spearman: 0.6584
2022-04-07 13:58:47 - Euclidean-Distance:	Pearson: 0.6631	Spearman: 0.6542
2022-04-07 13:58:47 - Dot-Product-Similarity:	Pearson: 0.6351	Spearman: 0.6398
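
Supervised SimCSE consumes the SNLI triplets that data_utils.py produces: (premise, entailment, contradiction), with the contradiction acting as a hard negative. A loss sketch in the same style as the unsupervised one; the encoding of the three sentences into (B, H) tensors is assumed:

```python
# Supervised SimCSE loss over (anchor, positive, hard-negative) triplets.
import torch
import torch.nn.functional as F

def simcse_sup_loss(anchor, positive, negative, temp=0.05):
    a = F.normalize(anchor, dim=-1)               # premise embeddings
    p = F.normalize(positive, dim=-1)             # entailment embeddings
    n = F.normalize(negative, dim=-1)             # contradiction embeddings
    sim = torch.cat([a @ p.T, a @ n.T], dim=1) / temp   # (B, 2B)
    labels = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return F.cross_entropy(sim, labels)
```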

unsup_esimcse (dev):
2022-04-07 10:18:45 - Cosine-Similarity :	Pearson: 0.7881	Spearman: 0.7901
2022-04-07 10:18:45 - Manhattan-Distance:	Pearson: 0.7738	Spearman: 0.7912
2022-04-07 10:18:45 - Euclidean-Distance:	Pearson: 0.7743	Spearman: 0.7921
2022-04-07 10:18:45 - Dot-Product-Similarity:	Pearson: 0.7822	Spearman: 0.7854

unsup_esimcse (test):
2022-04-07 10:18:49 - Cosine-Similarity :	Pearson: 0.7467	Spearman: 0.7393
2022-04-07 10:18:49 - Manhattan-Distance:	Pearson: 0.7324	Spearman: 0.7382
2022-04-07 10:18:49 - Euclidean-Distance:	Pearson: 0.7322	Spearman: 0.7384
2022-04-07 10:18:49 - Dot-Product-Similarity:	Pearson: 0.7452	Spearman: 0.7379
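
ESimCSE differs from unsupervised SimCSE mainly in how the positive view is built: it repeats a few words instead of relying on dropout alone, and it keeps a momentum-encoder queue of extra negatives (omitted here). A sketch of the word-repetition step over token ids, with the 0.32 duplication cap as an assumed hyperparameter; not the repo's exact code:

```python
# ESimCSE word repetition: duplicate a random subset of tokens so the
# positive view differs in length, countering the length bias of
# dropout-only positives.
import random

def word_repetition(token_ids, dup_rate=0.32):
    n = len(token_ids)
    dup_len = random.randint(0, int(dup_rate * n))   # how many tokens to repeat
    dup_pos = set(random.sample(range(n), dup_len))
    out = []
    for i, tok in enumerate(token_ids):
        out.append(tok)
        if i in dup_pos:
            out.append(tok)                          # repeat this token once
    return out
```
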
| model | Chinese-STS-B-dev | Chinese-STS-B-test | training params |
| --- | --- | --- | --- |
| bert_avg | 0.2549 | 0.2059 | batch_size=32, max_len=64, pooling=cls |
| bert_whitening | 0.7583 | 0.6888 | / |
| sbert | 0.8257 | 0.7837 | batch_size=32, max_len=64, epoch=2, lr=2e-5 |
| unsup_consert | 0.7808 | 0.7241 | batch_size=32, max_len=64, epoch=2, lr=2e-5 |
| unsup_simcse | 0.7867 | 0.7454 | batch_size=32, max_len=64, epoch=2, lr=2e-5 |
| sup_simcse (SNLI) | 0.6529 | 0.6550 | batch_size=32, max_len=64, epoch=1, lr=2e-5 |
| unsup_esimcse | 0.7901 | 0.7393 | batch_size=32, max_len=64, epoch=4, lr=2e-5 |

(The sup_simcse (SNLI) result seems off.)

Other repos worth referencing:

- NLP-model/model/model/Torch_model/SimCSE-Chinese at main · zhengyanzhao1997/NLP-model (github.com)
- vdogmcgee/SimCSE-Chinese-Pytorch: a reproduction of SimCSE for Chinese, supervised + unsupervised (github.com)
- KwangKa/SIMCSE_unsup: a PyTorch implementation of unsupervised SimCSE for Chinese (github.com)

It turns out the code in all of these is pretty much the same. = =
