Ko-Sentence-BERT

🌷 Korean SentenceBERT: Sentence Embeddings using Siamese BERT-Networks, built with ETRI KoBERT and the Kakao Brain KorNLU datasets

Installation

  • ETRI KorBERT only works with transformers 2.4.1 ~ 2.8.0, while Sentence-BERT requires transformers 3.1.0 or higher, so the libraries have been modified to bridge the gap.
  • Because the huggingface transformers, sentence-transformers, and tokenizers library code is patched directly, using a virtual environment is recommended.
  • The Docker image used is available on Docker Hub.
  • The models were trained with ETRI KoBERT; this repository does not distribute ETRI KoBERT itself.
  • A version built on SKT KoBERT has been released in a separate repository.
git clone https://github.com/BM-K/KoSentenceBERT.git
python -m venv .KoSBERT
. .KoSBERT/bin/activate
pip install -r requirements.txt
  • Move the transformer, tokenizers, and sentence_transformers directories into .KoSBERT/lib/python3.7/site-packages/.
  • The ETRI_KoBERT model and tokenizer must be present inside the KoSentenceBERT directory.
  • The ETRI model and tokenizer are loaded as in the following example (a fragment from the patched sentence-transformers model code, hence the self. references):
from transformers import BertModel
from ETRI_tok.tokenization_etri_eojeol import BertTokenizer

self.auto_model = BertModel.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch')
self.tokenizer = BertTokenizer.from_pretrained('./ETRI_KoBERT/003_bert_eojeol_pytorch/vocab.txt', do_lower_case=False)

Train Models

  • λͺ¨λΈ ν•™μŠ΅μ„ μ›ν•˜μ‹œλ©΄ KoSentenceBERT 디렉토리 μ•ˆμ— KorNLUDatasets이 μ‘΄μž¬ν•˜μ—¬μ•Ό ν•©λ‹ˆλ‹€.
  • STS ν•™μŠ΅ μ‹œ λͺ¨λΈ ꡬ쑰에 맞게 데이터λ₯Ό μˆ˜μ •ν•˜μ—¬ μ‚¬μš©ν•˜μ˜€μœΌλ©°, 데이터와 ν•™μŠ΅ 방법은 μ•„λž˜μ™€ κ°™μŠ΅λ‹ˆλ‹€ :

    KoSentenceBERT/KorNLUDatasets/KorSTS/tune_test.tsv

    A sample of the STS test dataset
python training_nli.py      # Train on NLI data only
python training_sts.py      # Train on STS data only
python con_training_sts.py  # Train on NLI data, then fine-tune on STS data
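
As a rough sketch of what the STS fine-tuning stage in con_training_sts.py amounts to (the file path, STS-B-style column layout, batch size, and epoch count below are illustrative assumptions, not the repository's exact settings):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the NLI-trained model and fine-tune on KorSTS
# (output directory names match the table in the next section).
model = SentenceTransformer('./output/training_nli_ETRI_KoBERT-003_bert_eojeol')

train_samples = []
with open('./KorNLUDatasets/KorSTS/sts-train.tsv', encoding='utf-8') as f:
    next(f)  # skip the header row
    for line in f:
        cols = line.strip().split('\t')
        # Assumed layout: gold score in column 4, sentence pair in columns 5-6;
        # normalise the 0-5 score to [0, 1] as CosineSimilarityLoss expects
        train_samples.append(InputExample(texts=[cols[5], cols[6]],
                                          label=float(cols[4]) / 5.0))

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=4,
          output_path='./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')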

Pre-Trained Models

The pooling mode is the MEAN strategy, and trained models are saved to the output directory; a configuration sketch follows the table below.

Directory | Training method
training_nli_ETRI_KoBERT-003_bert_eojeol | Only Train NLI
training_sts_ETRI_KoBERT-003_bert_eojeol | Only Train STS
training_nli_sts_ETRI_KoBERT-003_bert_eojeol | STS + NLI
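
A minimal sketch of how the MEAN pooling strategy is typically wired up in sentence-transformers, assuming the patched models.Transformer can load the ETRI checkpoint used above (the exact arguments in the training scripts may differ):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('./ETRI_KoBERT/003_bert_eojeol_pytorch')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,   # MEAN-strategy
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])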

Performance

Seed fixed; evaluated on the test set.

Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman
NLI | 67.96 | 70.45 | 71.06 | 70.48 | 71.17 | 70.51 | 64.87 | 63.04
STS | 80.43 | 79.99 | 78.18 | 78.03 | 78.13 | 77.99 | 73.73 | 73.40
STS + NLI | 80.10 | 80.42 | 79.14 | 79.28 | 79.08 | 79.22 | 74.46 | 74.16
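
These correlations are the ones sentence-transformers' EmbeddingSimilarityEvaluator reports. A hedged sketch of reproducing the evaluation (the test-file path and STS-B-style column layout are assumptions to verify against the actual data):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('./output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol')

sentences1, sentences2, scores = [], [], []
with open('./KorNLUDatasets/KorSTS/sts-test.tsv', encoding='utf-8') as f:
    next(f)  # skip the header row
    for line in f:
        cols = line.strip().split('\t')
        scores.append(float(cols[4]) / 5.0)  # normalise 0-5 gold score to [0, 1]
        sentences1.append(cols[5])
        sentences2.append(cols[6])

# Reports Pearson/Spearman for cosine, Euclidean, Manhattan, and dot-product
# similarity, i.e. the columns of the table above.
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
evaluator(model, output_path='./output')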

Application Examples

  • Below are a few examples of how the generated sentence embeddings can be used in downstream applications.
  • All examples use the STS + NLI pretrained model.

Semantic Search

SemanticSearch.py finds the sentences in a corpus that are most similar to a given sentence.
First, embeddings are generated for every sentence in the corpus.

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.',
          'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.',
          'κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.',
          'ν•œ λ‚¨μžκ°€ 말을 탄닀.',
          'ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.',
          '두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.',
          'μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.',
          'μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.',
           '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.',
           'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
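
As a design note, torch.topk can replace the np.argpartition step above; a minimal equivalent, assuming the same cos_scores tensor:

import torch

top_results = torch.topk(cos_scores, k=top_k)
for score, idx in zip(top_results.values, top_results.indices):
    print(corpus[idx], "(Score: %.4f)" % score)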
        


The results are as follows:

========================


Query: ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.

Top 5 most similar sentences in corpus:
ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€. (Score: 0.7557)
ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€. (Score: 0.6464)
ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€. (Score: 0.2565)
ν•œ λ‚¨μžκ°€ 말을 탄닀. (Score: 0.2333)
두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€. (Score: 0.1792)


========================


Query: 고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.

Top 5 most similar sentences in corpus:
μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€. (Score: 0.6732)
μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€. (Score: 0.3401)
두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€. (Score: 0.1037)
ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€. (Score: 0.0617)
κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€. (Score: 0.0466)


=======================


Query: μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.

Top 5 most similar sentences in corpus:
μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€. (Score: 0.7164)
두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€. (Score: 0.3216)
μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€. (Score: 0.2071)
ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€. (Score: 0.1089)
ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€. (Score: 0.0724)

Clustering

Clustering.py shows an example of clustering similar sentences based on sentence-embedding similarity.
As before, embeddings are first computed for each sentence.

from sentence_transformers import SentenceTransformer

model_path = './output/training_nli_sts_ETRI_KoBERT-003_bert_eojeol'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.',
          'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.',
          'κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.',
          'ν•œ λ‚¨μžκ°€ 말을 탄닀.',
          'ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.',
          '두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.',
          'μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.',
          'μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.',
          'ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.',
          '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.',
          'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")

The results are as follows:

Cluster  1
['두 λ‚¨μžκ°€ 수레λ₯Ό 숲 μ†μœΌλ‘œ λ°€μ—ˆλ‹€.', 'μΉ˜νƒ€ ν•œ λ§ˆλ¦¬κ°€ 먹이 λ’€μ—μ„œ 달리고 μžˆλ‹€.', 'μΉ˜νƒ€κ°€ λ“€νŒμ„ κ°€λ‘œ 질러 먹이λ₯Ό μ«“λŠ”λ‹€.']

Cluster  2
['ν•œ λ‚¨μžκ°€ 말을 탄닀.', 'ν•œ λ‚¨μžκ°€ λ‹΄μœΌλ‘œ 싸인 λ•…μ—μ„œ 백마λ₯Ό 타고 μžˆλ‹€.']

Cluster  3
['ν•œ λ‚¨μžκ°€ μŒμ‹μ„ λ¨ΉλŠ”λ‹€.', 'ν•œ λ‚¨μžκ°€ λΉ΅ ν•œ 쑰각을 λ¨ΉλŠ”λ‹€.', 'ν•œ λ‚¨μžκ°€ νŒŒμŠ€νƒ€λ₯Ό λ¨ΉλŠ”λ‹€.']

Cluster  4
['κ·Έ μ—¬μžκ°€ 아이λ₯Ό λŒλ³Έλ‹€.', 'ν•œ μ—¬μžκ°€ λ°”μ΄μ˜¬λ¦°μ„ μ—°μ£Όν•œλ‹€.']

Cluster  5
['μ›μˆ­μ΄ ν•œ λ§ˆλ¦¬κ°€ λ“œλŸΌμ„ μ—°μ£Όν•œλ‹€.', '고릴라 μ˜μƒμ„ μž…μ€ λˆ„κ΅°κ°€κ°€ λ“œλŸΌμ„ μ—°μ£Όν•˜κ³  μžˆλ‹€.']

Downstream Tasks Demo




Citing

KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}

@article{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    journal= "arXiv preprint arXiv:2004.09813",
    month = "04",
    year = "2020",
    url = "http://arxiv.org/abs/2004.09813",
}
