KR-SBERT

A pretrained Korean-specific Sentence-BERT model (Reimers and Gurevych 2019) developed by the Computational Linguistics Lab at Seoul National University.

How to use the KR-SBERT model in Python

Usage

We recommend Python 3.6 or higher and sentence-transformers v2.2.0 or higher.

>>> from sentence_transformers import SentenceTransformer, util
>>> model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS') 
>>> sentences = ['잠이 옵니다', '졸음이 옵니다', '기차가 옵니다'] # 'I am getting sleepy', 'I am getting drowsy', 'A train is coming'
>>> vectors = model.encode(sentences) # encode sentences into vectors
>>> similarities = util.cos_sim(vectors, vectors) # compute similarity between sentence vectors
>>> print(similarities)
tensor([[1.0000, 0.6577, 0.2732],
        [0.6577, 1.0000, 0.2730],
        [0.2732, 0.2730, 1.0000]])

You can see that the sentence '잠이 옵니다' ('I am getting sleepy') is more similar to '졸음이 옵니다' ('I am getting drowsy'; cosine similarity 0.6577) than to '기차가 옵니다' ('A train is coming'; cosine similarity 0.2732).
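
These numbers can be checked by hand from the vectors returned above. The helper below is a minimal sketch of the cosine formula that util.cos_sim applies pairwise; it assumes the vectors variable from the snippet above (model.encode returns a NumPy array by default).

import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of u and v divided by the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))  # ~0.6577 ('잠이 옵니다' vs. '졸음이 옵니다')
print(cosine(vectors[0], vectors[2]))  # ~0.2732 ('잠이 옵니다' vs. '기차가 옵니다')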

Model description

Pre-trained BERT model

Using a pre-trained BERT model, a sentence is segmented into WordPiece tokens, and the contextualized output vectors of those tokens are mean-pooled into a single sentence vector. We use KR-BERT-V40K, a variant of KR-BERT.
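
For illustration, the pooling step can be written out with the plain Hugging Face transformers API. This is a minimal sketch of mean pooling over WordPiece token vectors, assuming the published snunlp/KR-SBERT-V40K-klueNLI-augSTS checkpoint loads with AutoModel; it is not the exact internals of sentence-transformers.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
bert = AutoModel.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

def mean_pool(token_vectors, attention_mask):
    # Average the contextualized token vectors, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (token_vectors * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

encoded = tokenizer(['잠이 옵니다'], padding=True, return_tensors='pt')
with torch.no_grad():
    output = bert(**encoded)
sentence_vector = mean_pool(output.last_hidden_state, encoded['attention_mask'])  # shape (1, hidden_size)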

Fine-tuning SBERT model

For fine-tuning, the two sentence vectors produced by a siamese network are fed into a classifier, and the shared BERT weights are updated to reflect the relation between the two sentences. We fine-tune the SBERT model first on the KLUE-NLI dataset and then on the augmented KorSTS dataset.
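
A rough sketch of this two-stage recipe with the sentence-transformers training API follows. The base checkpoint name 'snunlp/KR-BERT-V40K' and the toy examples are illustrative assumptions, not the project's actual training script.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build an SBERT model from a BERT encoder plus mean pooling.
word_embedding = models.Transformer('snunlp/KR-BERT-V40K')  # assumed checkpoint name
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding, pooling])

# Stage 1: NLI with a softmax classifier over the pair of sentence vectors.
nli_examples = [InputExample(texts=['잠이 옵니다', '졸음이 옵니다'], label=0)]  # toy pair
nli_loss = losses.SoftmaxLoss(
    model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(DataLoader(nli_examples, shuffle=True, batch_size=16), nli_loss)],
          epochs=1)

# Stage 2: STS regression with cosine-similarity loss (labels scaled to [0, 1]).
sts_examples = [InputExample(texts=['잠이 옵니다', '졸음이 옵니다'], label=0.9)]  # toy pair
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(DataLoader(sts_examples, shuffle=True, batch_size=16), sts_loss)],
          epochs=1)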

Augmenting KorSTS dataset

We augment the KorSTS dataset using the in-domain strategy proposed by Thakur et al. (2021): new sentence pairs sampled from the gold data are labeled with a cross-encoder, and these "silver" pairs are added to the training set.
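
A minimal sketch of that idea: sample unlabeled pairs from the in-domain sentences, score them with a cross-encoder, and keep the scored pairs as silver training data. The cross-encoder checkpoint path is a placeholder, not a published model.

import random
from itertools import combinations
from sentence_transformers.cross_encoder import CrossEncoder

gold_sentences = ['잠이 옵니다', '졸음이 옵니다', '기차가 옵니다']  # toy in-domain sentences

# Sample unlabeled sentence pairs from the in-domain data.
candidate_pairs = list(combinations(gold_sentences, 2))
sampled_pairs = random.sample(candidate_pairs, k=2)

# Soft-label them with a cross-encoder trained on the gold STS data.
cross_encoder = CrossEncoder('path/to/korean-sts-cross-encoder')  # placeholder
silver_scores = cross_encoder.predict(sampled_pairs)
silver_examples = [(s1, s2, float(score)) for (s1, s2), score in zip(sampled_pairs, silver_scores)]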

Test datasets

The script below scores each sentence pair in a paraphrase test set (columns sent1, sent2, and paraphrase) with the model:

from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm
import pandas as pd

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
data = pd.read_csv('path/to/the/dataset')

# Encode both sentence columns into sentence vectors.
vec1 = model.encode(data['sent1'].tolist(), show_progress_bar=True, batch_size=32)
vec2 = model.encode(data['sent2'].tolist(), show_progress_bar=True, batch_size=32)

# Cosine similarity between each pair of vectors, stored as a plain float.
data['similarity'] = [
    util.cos_sim(v1, v2).item()
    for v1, v2 in tqdm(zip(vec1, vec2), total=len(data))
]

data[['paraphrase', 'similarity']].to_csv('paraphrase-results.csv')
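
As a follow-up, if the paraphrase column holds binary gold labels, the similarities can be thresholded into paraphrase predictions and scored; the 0.5 cut-off below is illustrative, not a tuned value.

# Threshold the cosine similarities into binary paraphrase predictions.
threshold = 0.5  # illustrative cut-off, not tuned
data['prediction'] = (data['similarity'] > threshold).astype(int)
accuracy = (data['prediction'] == data['paraphrase']).mean()
print(f'Paraphrase detection accuracy: {accuracy:.4f}')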

Application for document classification

Tutorial in Google Colab: https://colab.research.google.com/drive/1S6WSjOx9h6Wh_rX1Z2UXwx9i_uHLlOiM
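
In outline, sentence vectors can serve as document features for a downstream classifier. The sketch below uses scikit-learn and toy data and only approximates that approach; the notebook's details may differ.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

# Toy documents and labels for illustration only.
train_docs = ['잠이 옵니다', '졸음이 옵니다', '기차가 옵니다']
train_labels = [0, 0, 1]

# Encode documents into sentence vectors and fit a simple classifier on them.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(model.encode(train_docs), train_labels)

print(classifier.predict(model.encode(['버스가 옵니다'])))  # 'A bus is coming'; likely label 1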

Model                          Accuracy
KR-SBERT-Medium-NLI-STS        0.8400
KR-SBERT-V40K-NLI-STS          0.8400
KR-SBERT-V40K-NLI-augSTS       0.8511
KR-SBERT-V40K-klueNLI-augSTS   0.8628

Dependencies

Python 3.6 or higher, sentence-transformers 2.2.0 or higher, and pandas and tqdm for the evaluation script above.

Citation

@misc{kr-sbert,
  author = {Park, Suzi and Hyopil Shin},
  title = {KR-SBERT: A Pre-trained Korean-specific Sentence-BERT model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-SBERT}}
}

About

KoRean based SBERT pre-trained models (KR-SBERT) for PyTorch

License: MIT