Whoolly / sentence_bert_chinese


Pretrain sentence transformers on Chinese text matching datasets.

Training Data

The training data covers the STS (Semantic Textual Similarity) task and the NLI (Natural Language Inference) task.
Most of the data comes from CLUEDatasetSearch.

Training Details

Following the paper, the model is first trained for 1 epoch on the NLI data and then for 2 epochs on the STS data; a sketch of this schedule follows the command below.
The base BERT checkpoints come from ymcui/Chinese-BERT-wwm: RTB3 (small size) and RoBERTa_wwm_ext (bert-base size).

```
# Modify the data path in training_src/train.py
python train.py
```
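
For reference, a minimal sketch of this two-stage schedule with the sentence_transformers fit API. The checkpoint name hfl/rbt3 and the toy in-memory pairs are our assumptions; the actual data loading lives in training_src/train.py.

```
# Sketch of the 1-epoch-NLI then 2-epoch-STS schedule (assumptions noted above)
from torch.utils.data import DataLoader
from sentence_transformers import models, SentenceTransformer, InputExample, losses

word_embedding_model = models.Transformer('hfl/rbt3')  # assumed RTB3 checkpoint name
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Stage 1: one epoch on NLI pairs with a 3-way softmax classification loss
nli_examples = [InputExample(texts=['前提句', '假设句'], label=0)]  # toy data
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(model,
                              sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                              num_labels=3)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1)

# Stage 2: two epochs on STS pairs with a cosine-similarity regression loss
sts_examples = [InputExample(texts=['句子一', '句子二'], label=0.8)]  # toy data
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=2)
```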

Getting the Models

Use Huggingface-Transformers:

| model | model_name |
| --- | --- |
| rtb3 | imxly/sentence_rtb3 |
| roberta_wwm_ext | imxly/sentence_roberta_wwm_ext |
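
If you prefer plain Huggingface Transformers over the sentence_transformers route shown in the next section, here is a minimal loading sketch; the hand-rolled mean pooling mirrors the Pooling module used below, and the example sentence is our own.

```
# Sketch: load the checkpoint with plain Transformers and mean-pool by hand
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('imxly/sentence_rtb3')
model = AutoModel.from_pretrained('imxly/sentence_rtb3')

inputs = tokenizer(['公积金贷款能贷多久'], return_tensors='pt', padding=True)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Average over non-padding tokens to get one vector per sentence
mask = inputs['attention_mask'].unsqueeze(-1)             # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
```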

How to use

```
pip install sentence_transformers
```

```
import numpy as np
from sentence_transformers import models, SentenceTransformer

model_name = 'imxly/sentence_rtb3'
word_embedding_model = models.Transformer(model_name)
# Mean-pool the token embeddings to get the sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


def evaluate(model, s1, s2):
    '''
    Compute the cosine similarity of the two sentences.
    '''
    v1 = model.encode(s1)
    v2 = model.encode(s2)
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    return v1.dot(v2)


s1 = '公积金贷款能贷多久'
s2 = '公积金贷款的期限'
print(evaluate(model, s1, s2))
```
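
To score many pairs at once, model.encode also accepts a list of sentences. A small sketch; the batch_cosine helper is our own, not part of the library:

```
import numpy as np

def batch_cosine(model, sents_a, sents_b):
    # Each row of a / b is the embedding of one sentence
    a = model.encode(sents_a)
    b = model.encode(sents_b)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)  # cosine of each aligned pair

print(batch_cosine(model, ['公积金贷款能贷多久'], ['公积金贷款的期限']))
```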
