monologg / GoEmotions-Korean

Korean version of GoEmotions Dataset 😍😒😱

GoEmotions-Korean

The GoEmotions dataset translated into Korean, then used to train KoELECTRA

Updates

June 19, 2020 - With Transformers v2.9.1, special tokens such as [NAME] and [RELIGION] that were added at training time were not applied when the model was loaded again in a pipeline. This issue was fixed in Transformers v2.11.0.

Feb 9, 2021 - Trained again with KoELECTRA-v1 and KoELECTRA-v3 on Transformers v3.5.1 and uploaded the new models.

GoEmotions

A dataset of 58,000 Reddit comments labeled with 28 emotions

  • admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral
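
Because a single comment can carry more than one emotion, a multi-label setup typically represents each target as a multi-hot vector. The sketch below illustrates that idea only. Using the listing order above as the index order is an assumption, and `to_multi_hot` is a hypothetical helper that is not taken from this repository.

```python
# The 28 labels in the order listed above. Treating this order as the
# index order is an assumption made for illustration.
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
LABEL_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}


def to_multi_hot(labels):
    """Turn a list of emotion names into a 0/1 vector of length 28."""
    vector = [0] * len(EMOTIONS)
    for name in labels:
        vector[LABEL_TO_ID[name]] = 1
    return vector


# A comment tagged with two emotions sets two positions to 1.
example = to_multi_hot(["fear", "nervousness"])
print(sum(example))  # 2
```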

Requirements

  • torch==1.7.1
  • transformers==3.5.1
  • googletrans==2.4.1
  • attrdict==2.0.1
$ pip3 install -r requirements.txt

Translated Data

🚨 Because the data comes from Reddit comments, the quality of the translated output is poor. 🚨

  • The Korean data was generated with pygoogletrans.
    • Because pygoogletrans v2.4.1 has not been published to PyPI, installing the library directly from its repository is recommended (this is specified in requirements.txt).
  • A 1.5-second delay was inserted between API calls.
    • Since a single request accepts at most 5,000 characters, sentences were joined with \r\n and sent together as one input.
  • A zero-width space (U+200B) inside a sentence caused the translation to fail, so these characters were removed.
  • The translated data is already in the data directory. To run the translation yourself, use the commands below.
$ bash download_original_data.sh
$ pip3 install git+git://github.com/ssut/py-googletrans
$ python3 tranlate_data.py
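
The batching described above can be sketched roughly as follows. This is not the code in tranlate_data.py: `make_batches` and `translate_all` are hypothetical names, `translate` stands in for whatever translation client is used, and splitting the result on \r\n assumes the translator preserves the separators.

```python
import time

ZERO_WIDTH_SPACE = "\u200b"
MAX_CHARS = 5000      # per-request character limit mentioned above
SEPARATOR = "\r\n"    # separator used to join sentences into one request


def clean(sentence):
    """Remove zero-width spaces, which reportedly make translation fail."""
    return sentence.replace(ZERO_WIDTH_SPACE, "")


def make_batches(sentences, max_chars=MAX_CHARS):
    """Greedily pack sentences into joined strings of at most max_chars.

    A single sentence longer than max_chars still becomes its own batch.
    """
    batches, current, current_len = [], [], 0
    for sentence in map(clean, sentences):
        # Characters this sentence adds, including a separator if needed.
        added = len(sentence) + (len(SEPARATOR) if current else 0)
        if current and current_len + added > max_chars:
            batches.append(SEPARATOR.join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        batches.append(SEPARATOR.join(current))
    return batches


def translate_all(sentences, translate, delay=1.5):
    """Call `translate` (hypothetical callable: str -> str) once per batch."""
    results = []
    for batch in make_batches(sentences):
        # Assumes the translated text keeps the separators intact.
        results.extend(translate(batch).split(SEPARATOR))
        time.sleep(delay)  # pause between API calls
    return results
```

Packing greedily keeps the number of requests low, which matters when every call is followed by a fixed delay.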

Tokenizer

  • The dataset contains the special tokens [NAME] and [RELIGION], so they were assigned to [unused0] and [unused1] in vocab.txt, respectively.
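
Reusing the unused slots means the vocabulary size and every other token id stay the same. A minimal sketch of that substitution is shown below. `assign_special_tokens` is a hypothetical helper and the five-entry vocabulary is a toy example, not the real vocab.txt.

```python
# Mapping from placeholder entries to the dataset's special tokens,
# following the assignment described above.
SPECIAL_TOKENS = {"[unused0]": "[NAME]", "[unused1]": "[RELIGION]"}


def assign_special_tokens(vocab_lines, mapping=SPECIAL_TOKENS):
    """Return a copy of the vocab with placeholder entries renamed.

    Positions are preserved, so every other token keeps its id.
    """
    return [mapping.get(token, token) for token in vocab_lines]


# Toy vocabulary for illustration only; the real vocab.txt is much larger.
vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "[CLS]"]
new_vocab = assign_special_tokens(vocab)
print(new_vocab)
# ['[PAD]', '[NAME]', '[RELIGION]', '[UNK]', '[CLS]']
```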

Train & Evaluation

  • Multi-label classification with a sigmoid output (threshold set to 0.3)
    • See ElectraForMultiLabelClassification in model.py
  • Settings can be changed in the json files under the config directory.
$ python3 run_goemotions.py --config_file koelectra-base.json
$ python3 run_goemotions.py --config_file koelectra-small.json
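
To make the thresholding step concrete, the sketch below applies a sigmoid to each logit independently and keeps every label whose score exceeds 0.3. It is written in plain Python rather than torch and is not the code in model.py. `predict_labels`, the strict `>` comparison, and the example logits are assumptions for illustration.

```python
import math

THRESHOLD = 0.3  # value stated above


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, label_names, threshold=THRESHOLD):
    """Return (labels, scores) for every class scoring above the threshold.

    Each class is scored on its own, so zero, one, or several labels
    can be returned for the same input.
    """
    scores = [sigmoid(z) for z in logits]
    picked = [(n, s) for n, s in zip(label_names, scores) if s > threshold]
    return [n for n, _ in picked], [s for _, s in picked]


# Hypothetical logits for three classes.
labels, scores = predict_labels([2.0, -0.5, -3.0], ["joy", "love", "anger"])
print(labels)  # ['joy', 'love']
```

Unlike a softmax, the scores do not need to sum to 1, which is why two labels can both pass the threshold.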

Results

Results are measured by Macro F1 (best result).

| Macro F1 (%)       | Dev   | Test  |
| ------------------ | ----- | ----- |
| KoELECTRA-small-v1 | 39.99 | 41.02 |
| KoELECTRA-base-v1  | 42.18 | 44.03 |
| KoELECTRA-small-v3 | 40.27 | 40.85 |
| KoELECTRA-base-v3  | 42.85 | 42.28 |
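
For reference, Macro F1 computes an F1 score per label and then averages them with equal weight, so rare emotions count as much as frequent ones. The sketch below shows that computation on multi-hot rows. It is not the evaluation code of this repository, and counting labels with no positives as F1 = 0 is a convention chosen here.

```python
def macro_f1(y_true, y_pred):
    """Macro F1 over multi-hot rows: unweighted mean of per-label F1."""
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Labels with no true or predicted positives count as 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels


# Toy example with two labels and three samples.
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [1, 1], [0, 1]]
print(macro_f1(y_true, y_pred))  # 0.75
```

In the toy example the first label has F1 = 0.5 and the second has F1 = 1.0, giving a macro average of 0.75.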

Pipeline

  • A new MultiLabelPipeline class was written to support inference for multi-label classification.
  • The models are uploaded to Huggingface s3.
    • monologg/koelectra-small-v1-goemotions
    • monologg/koelectra-base-v1-goemotions
    • monologg/koelectra-small-v3-goemotions
    • monologg/koelectra-base-v3-goemotions
from multilabel_pipeline import MultiLabelPipeline
from transformers import ElectraTokenizer
from model import ElectraForMultiLabelClassification
from pprint import pprint


tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-goemotions")
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-v3-goemotions")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

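texts = [
    "전혀 재미 있지 않습니다 ...",  # "Not fun at all ..."
    "나는 “지금 가장 큰 두려움은 내 상자 안에 사는 것” 이라고 말했다.",  # 'I said, "My biggest fear right now is living inside my box."'
    "곱창... 한시간반 기다릴 맛은 아님!",  # "Gopchang... not worth an hour-and-a-half wait!"
    "애정하는 공간을 애정하는 사람들로 채울때",  # "When you fill a space you love with people you love"
    "너무 좋아",  # "I like it so much"
    "딥러닝을 짝사랑중인 학생입니다!",  # "I'm a student with a one-sided crush on deep learning!"
    "마음이 급해진다.",  # "I'm getting impatient."
    "아니 진짜 다들 미쳤나봨ㅋㅋㅋ",  # "Seriously, everyone must have gone crazy lol"
    "개노잼"  # "So boring" (slang)
]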

pprint(goemotions(texts))

# Output
[{'labels': ['disapproval'], 'scores': [0.97151965]},
 {'labels': ['fear'], 'scores': [0.9519822]},
 {'labels': ['disapproval', 'neutral'], 'scores': [0.452921, 0.5345312]},
 {'labels': ['love'], 'scores': [0.8750478]},
 {'labels': ['admiration'], 'scores': [0.93127275]},
 {'labels': ['love'], 'scores': [0.9093589]},
 {'labels': ['nervousness', 'neutral'], 'scores': [0.76960915, 0.33462417]},
 {'labels': ['disapproval'], 'scores': [0.95657086]},
 {'labels': ['annoyance', 'disgust'], 'scores': [0.39240348, 0.7896941]}]

License: Apache License 2.0

